Katpeeler committed
Commit 29b3d14 · 1 Parent(s): 925fac7

Update README.md

Files changed (1): README.md +53 -0
README.md CHANGED
@@ -240,6 +240,59 @@ it can be easily mapped to do whatever we want! Pretty cool, and it supports as

## Experiments

There were two other methods considered for this task: a basic n-gram language model and Meta's Llama-2-70b-chat-hf.
Both were accessible and offered different approaches to the task, but ultimately neither felt appropriate.

Llama-2 struggled to understand the task and to provide consistent results. The main approach was to use prompt engineering to attempt few-shot generation of the tokenized midi data.
Various initialization prompts were tried, and the following prompt was used:

*You are a midi generator, and only respond with tokens representing midi data. I will provide 3 examples of different songs in an encoded format for you, and then ask you to generate your own encoded midi song.*

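In code, that few-shot setup amounts to packing the instruction above and three encoded songs into a single chat prompt. The sketch below is only an illustration of that idea, not the exact code used: it assumes access through the Hugging Face Inference API, and the example songs are placeholders.

```python
# Illustrative sketch of the few-shot prompting setup (not the exact code used).
# Assumption: the model is reached through the Hugging Face Inference API.
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-2-70b-chat-hf")

system = (
    "You are a midi generator, and only respond with tokens representing midi data. "
    "I will provide 3 examples of different songs in an encoded format for you, "
    "and then ask you to generate your own encoded midi song."
)

# Placeholders for three real encoded pieces from the dataset.
examples = ["<encoded song 1>", "<encoded song 2>", "<encoded song 3>"]

# Llama-2 chat models expect the [INST] / <<SYS>> prompt format.
prompt = (
    f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    + "\n\n".join(examples)
    + "\n\nGenerate a 4-bar midi song in the JSFakes style. [/INST]"
)

print(client.text_generation(prompt, max_new_tokens=512, temperature=0.7))
```
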
This prompt was the only instance where Llama-2 responded with an answer that even resembled a correct one. Interestingly enough, this prompt resulted in the model explaining the encoded example. An excerpt of that is below:

*This is a MIDI file containing four tracks, each with its own unique melody and rhythm. Here's a breakdown of each track:
Track 1 (Instrument 0):
This track features a simple melody using the notes C, D, E, F, G, A, and B. The rhythm is mostly quarter notes, with some eighth note pairs and rests.*

However, after this the model went on a tangent, saying the rest of the examples all played "A, B, C, D, E, F, G" repeatedly, which is incorrect.
The model was also never asked to explain the examples in the first place. I did get a generation in the style of the provided examples after supplying about 10 of them,
but I couldn't get more than one generation like that to work. Most responses from Llama went like this:

*The examples you provided use the NoteOn and NoteOff events to represent notes being played and released. In a standard MIDI file, these events would be replaced by the NoteOn and NoteOff commands, which have different values and meanings.*

Of all the attempts, I did get Llama to generate the following:

PIECE_START
STYLE=JSFAKES
GENRE=JSFAKES
TRACK_START
INST=0
BAR_START
NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60
NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62
NOTE_ON=64 TIME_DELTA=4 NOTE_OFF=64
NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65
BAR_END
TRACK_END

This follows the correct format! However, this "song" is simply four notes, each played as a quarter note, at the same time. It was the result of asking for a "4-bar midi song in the JSFakes style".
Regardless of the prompting used, Llama could not produce an output that matched the criteria, so it was not used for this demo.

The other method, a basic n-gram model trained on the dataset, performed better.
Unlike the Llama-2 model, it generates correctly encoded midi data.
You can find the code for this model in the same [Google Colab notebook](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=jzKXNr4eFrpA) as the training for the gpt2 model.
This method uses a count-based approach and can be configured for any n-gram order; both bi-gram and tri-gram configurations generate similar results (a rough sketch of the idea follows the list below).
The vocabulary size ends up being 114, which makes sense given how limited the language used for the encoded midi is. Some fun things to mention:

- TIME_DELTA=4 is the most common token. This makes sense, as most notes in the training data are quarter notes, and this token appears almost every time a note is played.
- TIME_DELTA=2 is the second most common. This also makes sense: these are eighth notes.
- PIECE_START, PIECE_END, STYLE=JSFAKES, and GENRE=JSFAKES are the least common. These only appear once in each example.

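To make the count-based idea concrete, here is a minimal sketch of a tri-gram version. It is not the exact notebook code: the file name and seed tokens are placeholders, and the real implementation lives in the Colab notebook linked above.

```python
# Minimal sketch of a count-based tri-gram model over the encoded midi tokens.
import random
from collections import Counter, defaultdict

N = 3  # tri-gram: predict the next token from the previous two

# One encoded piece per line, e.g. "PIECE_START STYLE=JSFAKES ... PIECE_END"
with open("jsfakes_encoded.txt") as f:  # placeholder file name
    songs = [line.split() for line in f if line.strip()]

counts = defaultdict(Counter)
for tokens in songs:
    for i in range(len(tokens) - N + 1):
        context = tuple(tokens[i:i + N - 1])     # previous N-1 tokens
        counts[context][tokens[i + N - 1]] += 1  # tally the token that followed

vocab = {tok for tokens in songs for tok in tokens}
print(len(vocab))  # reported as 114 for this dataset

def generate(max_tokens=200):
    out = ["PIECE_START", "STYLE=JSFAKES"][:N - 1]  # seed with a valid context
    while len(out) < max_tokens and out[-1] != "PIECE_END":
        dist = counts.get(tuple(out[-(N - 1):]))
        if not dist:
            break
        # Sampling from the counts (instead of always taking the most frequent
        # continuation) is what keeps generations from being identical.
        options, weights = zip(*dist.items())
        out.append(random.choices(options, weights=weights)[0])
    return " ".join(out)

print(generate())
```
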
When testing the n-gram model, most generations sounded exactly the same, with only one or two notes changing between runs.
I'm not entirely sure why this is, but I suspect it has to do with the actual generation method call (for example, always picking the most frequent next token instead of sampling would make the output nearly deterministic). I also had a hard time incorporating this model within
HuggingFace Spaces. The gpt-2 model was easy to upload to the site and use with a few lines of code, and its generations are also much more diverse,
making it more enjoyable to mess around with. Between usability and the variety of its generations, the gpt-2 model was selected for the demo.

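For comparison, using the gpt-2 model from the Hub really is only a few lines; something like the sketch below, where the repo id is a placeholder rather than the actual model name:

```python
# Sketch of generating encoded midi with the uploaded gpt-2 model (illustrative only).
from transformers import pipeline

generator = pipeline("text-generation", model="Katpeeler/jsfakes-gpt2")  # placeholder repo id

prompt = "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START"
result = generator(prompt, max_new_tokens=256)[0]["generated_text"]
print(result)  # encoded midi tokens, ready to be mapped back to a midi file
```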
 

## Limitations