Update README.md
README.md
## Experiments

There were two other methods considered for this task: a basic n-gram language model and Meta's Llama-2-70b-chat-hf.
Both of these methods were accessible and offer different approaches to the task. Ultimately, neither of them felt appropriate.

Llama-2 struggled to understand the task and to provide consistent results. The main approach was to use prompt engineering to attempt few-shot generation of the tokenized midi data.
Various initialization prompts were tried, and the following prompt was used:

*You are a midi generator, and only respond with tokens representing midi data. I will provide 3 examples of different songs in an encoded format for you, and then ask you to generate your own encoded midi song.*
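For context, the few-shot setup looked roughly like the sketch below. This is a minimal illustration using the `transformers` text-generation pipeline; the example songs, prompt formatting, and generation settings here are assumptions for illustration, not the exact code that was run.

```python
# Minimal sketch of the few-shot prompting approach (assumed setup, not the exact code used).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",  # gated model; requires access approval and a lot of GPU memory
    device_map="auto",
)

system_prompt = (
    "You are a midi generator, and only respond with tokens representing midi data. "
    "I will provide 3 examples of different songs in an encoded format for you, "
    "and then ask you to generate your own encoded midi song."
)

# Placeholder examples; the real experiments pasted in encoded songs from the dataset.
examples = [
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START ... BAR_END TRACK_END",
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START ... BAR_END TRACK_END",
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START ... BAR_END TRACK_END",
]

# Llama-2 chat formatting: system prompt in <<SYS>> tags, user turn wrapped in [INST] ... [/INST].
prompt = (
    f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    + "\n\n".join(f"Example {i + 1}:\n{ex}" for i, ex in enumerate(examples))
    + "\n\nNow generate your own 4-bar midi song in the JSFakes style. [/INST]"
)

output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.8)
print(output[0]["generated_text"])
```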

This prompt was the only instance where Llama-2 responded with an answer that even resembled a correct one. Interestingly enough, this prompt resulted in the model explaining the encoded example. An excerpt of that response is below:

*This is a MIDI file containing four tracks, each with its own unique melody and rhythm. Here's a breakdown of each track:
Track 1 (Instrument 0):
This track features a simple melody using the notes C, D, E, F, G, A, and B. The rhythm is mostly quarter notes, with some eighth note pairs and rests.*

However, after this, the model went off on a tangent, saying the rest of the examples all played "A, B, C, D, E, F, G" repeatedly, which is incorrect.
The model was also not asked to explain the examples. I did get a generation in the style of the provided examples after providing about 10 examples,
but I couldn't get more than one such generation to work. Most responses from Llama went like this:

*The examples you provided use the NoteOn and NoteOff events to represent notes being played and released. In a standard MIDI file, these events would be replaced by the NoteOn and NoteOff commands, which have different values and meanings.*

Of all the attempts, I did get Llama to generate the following:

PIECE_START
STYLE=JSFAKES
GENRE=JSFAKES
TRACK_START
INST=0
BAR_START
NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60
NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62
NOTE_ON=64 TIME_DELTA=4 NOTE_OFF=64
NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65
BAR_END
TRACK_END

This follows the correct format! However, this "song" is simply 4 notes, each played as a quarter note, at the same time. This was the result of asking for a "4-bar midi song in the JSFakes style".
Regardless of the prompting used, Llama could not produce an output that matched the criteria, so it was not used for this demo.
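For anyone curious what such an output sounds like, tokens in this format can be rendered back to a standard MIDI file. The sketch below is an assumed decoder, not the project's actual code: it treats TIME_DELTA as a count of sixteenth-note steps (so TIME_DELTA=4 is a quarter note, matching the n-gram observations later in this section) and uses the `pretty_midi` package.

```python
# Hypothetical decoder for the token format shown above (illustrative only).
import pretty_midi

TOKENS = (
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START "
    "NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 "
    "NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62 "
    "NOTE_ON=64 TIME_DELTA=4 NOTE_OFF=64 "
    "NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65 "
    "BAR_END TRACK_END"
).split()

SIXTEENTH = 0.125  # seconds per sixteenth note at an assumed 120 bpm

pm = pretty_midi.PrettyMIDI()
instrument = pretty_midi.Instrument(program=0)  # INST=0 -> piano
time = 0.0
note_starts = {}  # pitch -> time the note was turned on

for token in TOKENS:
    if token.startswith("NOTE_ON="):
        note_starts[int(token.split("=")[1])] = time
    elif token.startswith("TIME_DELTA="):
        time += int(token.split("=")[1]) * SIXTEENTH
    elif token.startswith("NOTE_OFF="):
        pitch = int(token.split("=")[1])
        instrument.notes.append(
            pretty_midi.Note(velocity=90, pitch=pitch, start=note_starts.pop(pitch), end=time)
        )

pm.instruments.append(instrument)
pm.write("llama_generation.mid")
```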

The other method, using a basic n-gram model trained on the dataset, performed better.
This method generates encoded midi data correctly, unlike the Llama-2 model.
You can find the code for this model in the same [Google Colab notebook](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=jzKXNr4eFrpA) as the training for the gpt-2 model.
This method uses a count-based approach and can be configured for any number of n-grams (a rough sketch of the counting approach follows the list below). Both bi-gram and tri-gram configurations generate similar results.
The vocabulary size ends up being 114, which makes sense, since the language used for the encoded midi is pretty limited. Some fun things to mention:

- TIME_DELTA=4 is the most common n-gram. This makes sense, as most notes are quarter notes in the training data, and this token appears almost every time a note is played.
- TIME_DELTA=2 is the second most common. This also makes sense, as these are eighth notes.
- PIECE_START, PIECE_END, STYLE=JSFAKES, and GENRE=JSFAKES are the least common. These only appear once in each example.
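The counting approach itself is simple. The sketch below is a minimal, assumed version of a count-based n-gram generator over the encoded tokens; the notebook's actual implementation and sampling details may differ.

```python
# Minimal count-based n-gram generator over encoded midi tokens (illustrative sketch).
import random
from collections import Counter, defaultdict

def train_ngram(token_sequences, n=2):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for tokens in token_sequences:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i : i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

def generate(counts, n=2, max_tokens=200):
    """Sample each next token proportionally to its count, starting from PIECE_START."""
    output = ["PIECE_START"]
    while len(output) < max_tokens and output[-1] != "PIECE_END":
        context = tuple(output[-(n - 1):]) if n > 1 else ()
        candidates = counts.get(context)
        if not candidates:
            break
        tokens, weights = zip(*candidates.items())
        output.append(random.choices(tokens, weights=weights)[0])
    return " ".join(output)

# Toy usage; the real model was trained on the full encoded dataset.
corpus = [
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START "
    "NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 BAR_END TRACK_END PIECE_END".split()
]
bigram_counts = train_ngram(corpus, n=2)
print(generate(bigram_counts, n=2))
```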

When testing the generations from the n-gram model, most generations sounded exactly the same, with one or two notes changing between generations.
I'm not entirely sure why this is, but I suspect it has to do with the actual generation method call. I also had a hard time incorporating this model within
HuggingFace Spaces. The gpt-2 model was easy to upload to the site and use with a few lines of code. Its generations are also much more diverse,
making it more enjoyable to mess around with. Between the usability and the variety between generations, the gpt-2 model was selected for the demo.
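As an illustration of how little code the gpt-2 route needs, a Spaces-style demo can be as short as the sketch below. The model id and Gradio wiring are placeholders for illustration, not the demo's exact code.

```python
# Rough sketch of a Gradio demo around a fine-tuned gpt-2 token generator
# (model id and settings are placeholders, not the actual demo code).
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/gpt2-jsfakes")  # hypothetical checkpoint

def generate_song(prompt):
    # Continue the encoded midi tokens from whatever prompt the user supplies.
    result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=1.0)
    return result[0]["generated_text"]

demo = gr.Interface(fn=generate_song, inputs="text", outputs="text")
demo.launch()
```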
## Limitations