Echo Nest Remix API Overview
As of this writing (15 March 2009), this is a description of the state of the art of the Remix API. However, it is a fast-moving target and some things are likely to change. This overview and tutorial is provided as a high-level tourist’s guide, and makes no warranty as to the ongoing validity of the approach described herein.
This is a document designed for developers. Although the introduction and most of the first half will be accessible to many, it goes into technical detail that not everyone has a taste for. A document written for a general audience describes an example program made with Remix, and some mixes made with it.
I do not offer installation instructions. For them, please see the state of the art on the project wiki. Similarly, I cannot be available to answer questions personally. To raise issues, see the project issues page and/or the project mailing list.
Since there has been a little confusion: Remix is not my project, but I am a very active external contributor. Some of my research activities are based on Remix, and some of the more unexpected features (in particular the fluent interface-like properties) in the framework reflect that fact.
The Echo Nest Remix API (“Remix,” for short) is an open-source programming framework written in Python. It provides numerous classes and methods to deal with recorded songs, analyze them, inspect and manipulate the resulting analyses, and manipulate audio based on that information.
Remix has been used for several web-based audio “toys” (for lack of a better term): creating automatic, beat-matched medleys of songs, adding cowbell or jingle bells to songs, or synchronizing and switching between two videos based on the audio. These examples only go so far in terms of explaining what Remix is really capable of, though.
Remix is a sophisticated tool to allow you to quickly, expressively, and intuitively chop up existing audio content and create new content based on the old. It allows you to reach inside the music, and let the music’s own musical qualities be your — or your computer’s — guide in finding something new in the old. By using Remix’s knowledge of a given song’s structure, you can render the familiar strange, or the strange slightly more familiar-sounding. You can create countless parameterized variations of a given song — or one of near-limitless length — that respect or desecrate the original, or land on any of countless steps in between.
My favorite way of thinking of Remix is that it makes each song its own API: each song offers queries into its own features, and can return any number of transformed versions of itself, all of which are sensitive to the original song’s musical features.
Much of the internal sophistication of Remix is provided by The Echo Nest Analyze API (“Analyze”), which is implemented as a web service that ingests MP3 files and then provides detailed information about the internal content of the songs. Because the majority of Remix’s data model is determined by Analyze’s output, it is best to examine first what the Analyze API provides.
Understanding the Analyze API
Analyze performs efficient, perception-based signal processing, simulating how people hear sounds (accounting for temporal and pitch masking, for example) in order to extract higher-level musical knowledge about a sound. From the Remix point of view, one of the most useful outputs is a three-level, hierarchical rhythmic analysis of a given song: beats, bars, and tatums.
- Beats are the easiest to understand: tap your finger or toe along with a song, and you’re most likely following the beat.
- Bars are regular groupings of beats: they’re the building blocks of most patterns within a song. If you dance to a song, you’re likely to repeat yourself every bar; if you change dance patterns, it is likely to be at the start of a bar.
- Tatums are subdivisions of beats, usually the fastest regular event happening within a song, commonly with two or four in a beat. If you try to tap your finger as fast as you can in time with a song, you are likely hitting the tatums.
The above three terms have more precise musical definitions that refer to musical notation, but since there is no musical score created or used at any point in the analysis process, I have tried to describe them in terms of the average music listener.
The rhythm hierarchy
The interesting thing about music in general and the Analyze output specifically is that bars, beats, and tatums form a hierarchy, and (in all but the most complex music) the parent in the hierarchy groups the same number of children throughout. In other words, in most cases bars contain four beats, and each beat contains two tatums throughout a song.
Analyze isn’t perfect: tricky rhythms will confuse it, and bars, especially, don’t always appear regularly. This is normal, and somewhat predictable, as educated musical listeners will disagree on how beats in complex songs should be grouped, and even some of the most banal pop songs will break regular beat patterns through syncopation or other variations in order to provide interest. As a result, not all analyses contain bars, and sometimes bars will stretch for unusually long lengths (such as nineteen beats) before Analyze re-finds a regular pattern to lock into.
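To make the hierarchy concrete, here is a minimal sketch that groups beats under bars by time containment and counts the children of each bar. The (start, duration) pairs are made-up values rather than real Analyze output, and children_of is a hypothetical helper, not part of Remix. The last bar is deliberately irregular, to mirror the kind of imperfection described above.

```python
from collections import Counter

# Hypothetical (start, duration) pairs in seconds; not real Analyze output.
bars = [(0.0, 2.0), (2.0, 2.0), (4.0, 2.5)]
beats = [(i * 0.5, 0.5) for i in range(13)]  # thirteen beats of 0.5 s each


def children_of(parent, candidates):
    """Return the candidates whose start time falls inside the parent."""
    start, duration = parent
    return [c for c in candidates if start <= c[0] < start + duration]


# Count how many beats each bar contains.
beats_per_bar = [len(children_of(bar, beats)) for bar in bars]

# The most frequent count is a reasonable guess at the regular grouping.
most_common_count, _ = Counter(beats_per_bar).most_common(1)[0]
```

With these values the first two bars hold four beats each and the last holds five, so the most common grouping is four.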
There are two other ways Analyze divides a song: Sections and Segments.
- Sections reflect large-scale changes in the song’s sound, whether in drumbeat, instrumentation, or harmonic structure. Although this is often described as dividing a song into a verse-chorus-bridge structure, real results are rarely that clear-cut.
- Segments are the most elemental portions of a song. They do not precisely align with beats or tatums, but rather are representative of some sort of ‘event’ in the song, such as a note or a drum beat.
Since there are likely many instruments playing many notes that overlap, the practical result of this part of the analysis is that Segments capture the starts of notes and drum beats. Because each Segment is elemental in this way, containing one thing, Segments also act as containers for further sonic analysis. They contain information on the evolution of the loudness over the course of the segment, the overall pitch content (with the relative loudness of each step of the scale), and the timbre of the segment.
Timbre is perhaps the hardest-to-explain concept in Analyze, but it turns out to be very useful in practical applications of Remix. It is the characteristic ‘sound’ of a musical instrument, independent of pitch or loudness. For the purposes of Analyze, timbre is reduced to twelve component elements: some reflecting the spectral content of the sound (such as the relative loudness of the treble, mid-range and bass frequencies in the segment), some representing the evolution of the loudness of the segment over time (is it a sharp attack that dies away, or does it hold steady over the lifetime of the segment?), and others that are combinations of the two factors (a sharp attack may start with high frequencies and die down to low frequencies only).
The global view
By combining this low-level information with knowledge about how music works, Analyze also offers estimates about global features of the entire song, such as time signature, key signature, and tempo. As with any case of reducing thousands of complex, evolving values to a single number, there exist some edge cases in particularly irregular or complex sonic material where Analyze cannot come up with a single, simple value for some of these. Again, this is normal, and likely reflects reality as well.
Analyze starts with a user uploading an MP3 file to its servers. The analysis itself happens on The Echo Nest’s centralized servers and usually takes far less time to complete than it takes to upload the file from a domestic broadband connection. Once the analysis is complete, simple web (HTTP) requests identifying the file that was uploaded get back answers in XML documents.
Since we now understand the basics of the Analyze API, we can turn to the Remix API and how it derives its power from Analyze. Remix is a Python framework that provides objects with many methods for manipulating sound information. The first thing you do with a Python framework is import it into an easily-accessible namespace:
import echonest.audio as audio
While we are importing libraries, we may as well import a couple more helper functions from other files. We import them into the main namespace for convenience’s and readability’s sake:
from echonest.selection import fall_on_the
from echonest.sorting import duration
Remix exposes most of its powerful functionality through a single class: LocalAudioFile, which combines the simple audio processing of the AudioData class with an interface to Analyze and further filtering capabilities contained within AudioAnalysis. The LocalAudioFile class is initialized with a filename pointing to an MP3 file. The system then checks to see if Analyze already knows about that file. If not, it uploads the file and waits for an analysis to finish. Either way, the analysis corresponding to the file is exposed through the LocalAudioFile.analysis accessor, and is of the AudioAnalysis class described at length in the following section. The LocalAudioFile class then loads all of the sound data from the MP3 file and retains it for remixing, and returns the object with rich data.
song = audio.LocalAudioFile("mySong.mp3")
Getting an analysis
The next object of interest, AudioAnalysis, is the part of the framework that talks with Analyze and does the work of converting the output of that web-based API into local objects that can be conveniently and efficiently manipulated. Practically speaking, in most cases we will access an AudioAnalysis object through the LocalAudioFile.analysis accessor.
When created, AudioAnalysis takes a song file as input and uploads the file for analysis. Once the analysis is finished on the servers, the AudioAnalysis object for the song in the song file is returned for further use. An AudioAnalysis object for a given song contains many sub-objects, each representing a request to the Analyze API. After the initial call is made, the value is stored locally so that any delay in communications with the server is avoided on later accesses. For example, in order to resolve a call to AudioAnalysis.tempo, the AudioAnalysis object behind the scenes calls the get_tempo Analyze method, takes in the XML data, and converts that into an ordered pair of floating point values: one for the estimated tempo of the piece overall, in beats per minute, and another as a measure of how confident Analyze is of the estimate, from zero to one.
tempo = song.analysis.tempo # returns something like (120.0, 0.89)
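The fetch-once-then-store behavior can be sketched in a few lines. This is only an illustration of the caching idea, not the way AudioAnalysis is actually written: CachedAnalysis and fake_fetch are hypothetical names, and the fake fetch function stands in for a real request to the server.

```python
class CachedAnalysis:
    """Toy stand-in for an analysis object; not the Remix implementation."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable that simulates a remote request
        self._cache = {}     # values already retrieved, keyed by name

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]


calls = []  # records every simulated request, so repeats are visible


def fake_fetch(name):
    """Pretend to ask the server for one value (hypothetical data)."""
    calls.append(name)
    return {"tempo": (120.0, 0.89)}[name]


analysis = CachedAnalysis(fake_fetch)
first = analysis.tempo   # triggers the simulated request
second = analysis.tempo  # answered from the local cache
```

After both accesses, calls holds a single entry: the second lookup never reaches the simulated server.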
Exploring the AudioQuantum
Some calls to Analyze return a complex series of information, and as a result many sub-objects are created locally to contain that information. In the cases of beats, tatums, bars, and sections, a call like AudioAnalysis.beats returns an AudioQuantumList containing many AudioQuantums. An AudioQuantum is a small unit of the song, identified by the start time (in seconds, relative to the start of the song) and its duration. Given these two pieces of information, along with the original song file to which they refer, you can extract the sound information of the particular beat of interest, and export it to a new sound file. Although AudioQuantum has gained new sub-objects and methods over time, this basic mode of operation — using a start time and duration, collecting the sound information from the original file, and using that information to create a new file — is the source of much of Remix’s convenience.
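The start-plus-duration idea is simple enough to sketch without Remix at all. The following assumes audio held as a flat list of samples and a known sample rate. The function names are hypothetical, and Remix's own audio handling may differ in its details.

```python
SAMPLE_RATE = 44100  # assumed default; real files may use other rates


def sample_range(start, duration, sample_rate=SAMPLE_RATE):
    """Convert a start time and duration in seconds to sample indices."""
    first = int(round(start * sample_rate))
    last = int(round((start + duration) * sample_rate))
    return first, last


def extract(samples, quantum, sample_rate=SAMPLE_RATE):
    """Return the samples covered by one (start, duration) quantum."""
    first, last = sample_range(quantum[0], quantum[1], sample_rate)
    return samples[first:last]


def concatenate(samples, quanta, sample_rate=SAMPLE_RATE):
    """Join the audio of several quanta, in the order given."""
    out = []
    for quantum in quanta:
        out.extend(extract(samples, quantum, sample_rate))
    return out


# Ten seconds of fake "audio" at ten samples per second, for easy reading.
samples = list(range(100))
one_piece = extract(samples, (1.0, 0.5), sample_rate=10)
joined = concatenate(samples, [(0.0, 0.2), (2.0, 0.3)], sample_rate=10)
```

Rounding to the nearest sample, rather than truncating, keeps small floating point errors in the times from dropping a sample at a boundary.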
Some of the additional information that has been added to AudioQuantum over time includes links back to the containing AudioQuantumList and identification of the type of musical rhythmic unit it represents. These two additional pieces of information allow you to get a sense of the musical context of a given beat, tatum, or bar. You can step forward or back to the next or previous beat; you can step up or down the musical hierarchy from a beat, to the bar that contains it, or get a list of the tatums it contains; or you can get a sense of where the musical unit lies, either within its containing unit (e.g., “beat 2 of 4” within its given bar), or within the entire song (e.g., “measure 42 of 122”).
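As a rough sketch of how such context can be derived from nothing more than start times and durations, consider the following. The function position_in_bar and the data are hypothetical and are not the Remix methods themselves, which work from the links each AudioQuantum keeps to its containing list.

```python
def position_in_bar(beat, bars, beats):
    """Return (position, count) of a beat within its bar, counting from 1.

    Returns None if no bar contains the beat, which can happen when an
    analysis has no bars.
    """
    for bar_start, bar_duration in bars:
        bar_end = bar_start + bar_duration
        if bar_start <= beat[0] < bar_end:
            siblings = [b for b in beats if bar_start <= b[0] < bar_end]
            return siblings.index(beat) + 1, len(siblings)
    return None


def position_in_song(beat, beats):
    """Return (position, count) of a beat within the whole song."""
    return beats.index(beat) + 1, len(beats)


# Hypothetical (start, duration) pairs: two bars of four beats each.
bars = [(0.0, 2.0), (2.0, 2.0)]
beats = [(i * 0.5, 0.5) for i in range(8)]

local = position_in_bar(beats[5], bars, beats)  # beat starting at 2.5 s
absolute = position_in_song(beats[5], beats)
```

Here the beat starting at 2.5 seconds is beat 2 of 4 within its bar, and beat 6 of 8 within the song.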
AudioSegment is a sub-class of AudioQuantum containing much more information — particularly the rich Pitch and Timbre information returned by Analyze. This rich information leads to a lot of the hidden power in Remix, enabling one to combine, order, and modify AudioQuanta according to their sonic similarity or differences.
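One simple way to measure sonic similarity is the Euclidean distance between two segments' timbre vectors. The sketch below assumes that approach purely for illustration. The Segment class, the helper functions, and the numbers are all hypothetical, and Remix may well compare segments differently.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a segment; not the AudioSegment class."""
    start: float
    duration: float
    timbre: List[float]  # twelve timbre components, as described above


def timbre_distance(a, b):
    """Euclidean distance between the timbre vectors of two segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a.timbre, b.timbre)))


def most_similar(reference, candidates):
    """Return the candidate whose timbre is closest to the reference."""
    return min(candidates, key=lambda seg: timbre_distance(reference, seg))


# Made-up values, chosen only to keep the arithmetic easy to follow.
ref = Segment(0.0, 0.2, [0.0] * 12)
near = Segment(0.2, 0.3, [1.0] + [0.0] * 11)
far = Segment(0.5, 0.1, [3.0, 4.0] + [0.0] * 10)

closest = most_similar(ref, [far, near])
```

The same distance function could serve as a sort key, ordering every segment in a song by how much it sounds like a chosen reference.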
AudioQuantumLists and automatic data filtering
The AudioQuantumList is a specialization of the Python list that has additional methods. One method, AudioQuantumList.that, is for searching within the list for AudioQuanta that satisfy different criteria, expressed as an input function. It returns another AudioQuantumList with the subset of AudioQuanta for which the input function returns a non-empty value. By convention, to reinforce the filter-like intentions of the that method, functions designed as input to it are named as if they were a part of a sentence. For example, a typical way of calling that is:
beats = song.analysis.beats
ones = beats.that(fall_on_the(1))
In this case, fall_on_the() is a helper function that takes one argument (a beat position), and (taking advantage of some sophisticated properties of Python) returns a new function that actually performs the filtering. Many of the pre-defined helper functions are constructed this way, being functions that return other functions, but they are designed to be maximally expressive and useful in a way that hides their internal sophistication.
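The "function that returns a function" pattern is an ordinary Python closure. Here is a minimal sketch of the idea, using a hypothetical falls_on helper and beats represented as plain dictionaries. The real fall_on_the operates on AudioQuantum objects, and its internals are not shown here.

```python
def falls_on(position):
    """Build a predicate that is true for beats at the given bar position.

    Hypothetical helper for illustration; not the Remix fall_on_the.
    """
    def predicate(beat):
        # The inner function remembers `position` from the enclosing call.
        return beat["position"] == position
    return predicate


# Eight made-up beats, numbered 1 to 4 within each of two bars.
beats = [{"start": i * 0.5, "position": i % 4 + 1} for i in range(8)]

is_downbeat = falls_on(1)  # a new function, specialized to position 1
downbeats = list(filter(is_downbeat, beats))
downbeat_starts = [b["start"] for b in downbeats]
```

Because falls_on(1) is itself a function, it can be handed to any filtering routine that expects a one-argument test.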
Similarly, you can automatically re-order an AudioQuantumList by using the ordered_by method. Given a list of elements, it successively applies the input function to each element, and returns output ordered by the returned value of that function (from low to high, by default). As with the convention begun with that, the helper functions are usually named to allow for maximum comprehension and expressiveness:
short_to_long = song.analysis.segments.ordered_by(duration)
…returns all of the segments in the song, but with the shortest segments first, and the longest segments last.
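For readers curious how a list can grow methods like these, the following sketch subclasses the built-in list. QuantumList is a toy stand-in rather than the actual AudioQuantumList, quanta are reduced to (start, duration) tuples, and the descending parameter is my own assumption about how a reversed ordering might be requested.

```python
class QuantumList(list):
    """Toy list subclass for illustration; not AudioQuantumList itself."""

    def that(self, predicate):
        """Keep only the items for which the predicate is true."""
        return QuantumList(q for q in self if predicate(q))

    def ordered_by(self, key, descending=False):
        """Return a new list sorted by the value the key function returns."""
        return QuantumList(sorted(self, key=key, reverse=descending))


def duration(quantum):
    """Sort key: the duration of a (start, duration) tuple."""
    return quantum[1]


# Made-up segments as (start, duration) pairs, in seconds.
segments = QuantumList([(0.0, 0.30), (0.30, 0.12), (0.42, 0.25)])

short_to_long = segments.ordered_by(duration)

# Both methods return a QuantumList, so calls can be chained.
long_first = segments.that(lambda q: duration(q) > 0.2).ordered_by(
    duration, descending=True
)
```

Returning the same list type from every method is what makes this chained, sentence-like style possible.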
Incidentally, all of the helper functions are provided in their own small modules (currently echonest.selection and echonest.sorting, functions from which were imported at the start) that can be included in any program that needs them.
Returning to Audio
Now that we have some familiarity with Analyze, and how the information it returns can be further analyzed, navigated, chopped up, and reordered through Remix, it is time to learn how to connect this back to the original audio. Recall that LocalAudioFile contains the functionality (is a sub-class) of AudioData, which loads the original audio into the instantiated LocalAudioFile object.
Given that feature-rich object, you can then explore the analysis as you like, typically collecting interesting AudioQuanta together in an AudioQuantumList, reordering it and further filtering through it or intermediate results as necessary. Once you have a sequence of bars, beats, and tatums that you would like to listen to, you can call a function that creates a new AudioData object, and then save the object by encoding a new MP3:
out = audio.getpieces(song, ones)
out.encode("myNewSong.mp3")
Putting It All Together
If we set aside small, illustrative example code lines, we now have a program that looks like:
import echonest.audio as audio
from echonest.selection import fall_on_the

song = audio.LocalAudioFile("mySong.mp3")
beats = song.analysis.beats
ones = beats.that(fall_on_the(1))
out = audio.getpieces(song, ones)
out.encode("myNewSong.mp3")
This is a perfectly reasonable and runnable example file (alternatively, it could be typed into an interactive Python session), and performs much of what the example file listing “ones.py” performs: it selects the first beat of each measure, and then collects the actual audio and saves it to a file.
In this brief API overview, I hope I have given a taste of the power and terse expressivity offered by combining a powerful programming language with sophisticated machine listening techniques. Remix code is easy to get started with once you get it installed. I personally feel Remix has almost unlimited potential as an expressive tool: these are early days in its life, and we have only begun to explore what it is capable of.
If you are comfortable with this material, you may be interested in the API documentation for Analyze. There is now some generated documentation of the Remix API python libraries. There is a lengthy worked example explaining an application of Remix in detail, with audio examples and excessive metaphor.
I am Adam Lindsay. This is one of my websites, but there are others. I was the first contributor outside of The Echo Nest to commit to the public, open-source version of Remix. I am in the process of trying to explain all that I have done with Remix, and this page was one of the first by-products.