Update README.md
README.md
## Experiments

There were two other methods considered for this task: a basic n-gram language model and Meta's Llama-2-70b-chat-hf.
Both of these methods were accessible and offer different approaches to the task. Ultimately, neither of them felt appropriate.

Llama-2 struggled to understand the task and to provide consistent results. The main approach was to use prompt engineering to attempt few-shot generation of the tokenized midi data.
Various initialization prompts were tried, and the following prompt was used:

*You are a midi generator, and only respond with tokens representing midi data. I will provide 3 examples of different songs in an encoded format for you, and then ask you to generate your own encoded midi song.*
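For context, the few-shot setup looked roughly like the sketch below. This is a minimal illustration using the `transformers` text-generation pipeline; the example songs, prompt formatting, and generation settings here are assumptions for illustration, not the exact code that was run.

```python
# Minimal sketch of the few-shot prompting approach (assumed setup, not the exact code used).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-70b-chat-hf",  # gated model; requires access approval and a lot of GPU memory
    device_map="auto",
)

system_prompt = (
    "You are a midi generator, and only respond with tokens representing midi data. "
    "I will provide 3 examples of different songs in an encoded format for you, "
    "and then ask you to generate your own encoded midi song."
)

# Placeholder examples; the real experiments pasted in encoded songs from the dataset.
examples = [
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START ... BAR_END TRACK_END",
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START ... BAR_END TRACK_END",
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START ... BAR_END TRACK_END",
]

# Llama-2 chat formatting: system prompt in <<SYS>> tags, user turn wrapped in [INST] ... [/INST].
prompt = (
    f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    + "\n\n".join(f"Example {i + 1}:\n{ex}" for i, ex in enumerate(examples))
    + "\n\nNow generate your own 4-bar midi song in the JSFakes style. [/INST]"
)

output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.8)
print(output[0]["generated_text"])
```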

This prompt was the only instance where Llama-2 responded with an answer that even resembled a correct one. Interestingly enough, this prompt resulted in the model explaining the encoded example. An excerpt of that response is below:

*This is a MIDI file containing four tracks, each with its own unique melody and rhythm. Here's a breakdown of each track:
Track 1 (Instrument 0):
This track features a simple melody using the notes C, D, E, F, G, A, and B. The rhythm is mostly quarter notes, with some eighth note pairs and rests.*

However, after this, the model went off on a tangent, saying the rest of the examples all played "A, B, C, D, E, F, G" repeatedly, which is incorrect.
The model was also not asked to explain the examples. I did get a generation in the style of the provided examples after providing about 10 examples,
but I couldn't get more than one such generation to work. Most responses from Llama went like this:

*The examples you provided use the NoteOn and NoteOff events to represent notes being played and released. In a standard MIDI file, these events would be replaced by the NoteOn and NoteOff commands, which have different values and meanings.*

Of all the attempts, I did get Llama to generate the following:

PIECE_START
STYLE=JSFAKES
GENRE=JSFAKES
TRACK_START
INST=0
BAR_START
NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60
NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62
NOTE_ON=64 TIME_DELTA=4 NOTE_OFF=64
NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65
BAR_END
TRACK_END

This follows the correct format! However, this "song" is simply 4 notes, each played as a quarter note, at the same time. This was the result of asking for a "4-bar midi song in the JSFakes style".
Regardless of the prompting used, Llama could not produce an output that matched the criteria, so it was not used for this demo.
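For anyone curious what such an output sounds like, tokens in this format can be rendered back to a standard MIDI file. The sketch below is an assumed decoder, not the project's actual code: it treats TIME_DELTA as a count of sixteenth-note steps (so TIME_DELTA=4 is a quarter note, matching the n-gram observations later in this section) and uses the `pretty_midi` package.

```python
# Hypothetical decoder for the token format shown above (illustrative only).
import pretty_midi

TOKENS = (
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START "
    "NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 "
    "NOTE_ON=62 TIME_DELTA=4 NOTE_OFF=62 "
    "NOTE_ON=64 TIME_DELTA=4 NOTE_OFF=64 "
    "NOTE_ON=65 TIME_DELTA=4 NOTE_OFF=65 "
    "BAR_END TRACK_END"
).split()

SIXTEENTH = 0.125  # seconds per sixteenth note at an assumed 120 bpm

pm = pretty_midi.PrettyMIDI()
instrument = pretty_midi.Instrument(program=0)  # INST=0 -> piano
time = 0.0
note_starts = {}  # pitch -> time the note was turned on

for token in TOKENS:
    if token.startswith("NOTE_ON="):
        note_starts[int(token.split("=")[1])] = time
    elif token.startswith("TIME_DELTA="):
        time += int(token.split("=")[1]) * SIXTEENTH
    elif token.startswith("NOTE_OFF="):
        pitch = int(token.split("=")[1])
        instrument.notes.append(
            pretty_midi.Note(velocity=90, pitch=pitch, start=note_starts.pop(pitch), end=time)
        )

pm.instruments.append(instrument)
pm.write("llama_generation.mid")
```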

The other method, using a basic n-gram model trained on the dataset, performed better.
This method generates encoded midi data correctly, unlike the Llama-2 model.
You can find the code for this model in the same [Google Colab notebook](https://colab.research.google.com/drive/1uvv-ChthIrmEJMBOVyL7mTm4dcf4QZq7#scrollTo=jzKXNr4eFrpA) as the training for the gpt-2 model.
This method uses a count-based approach and can be configured for any number of n-grams (a rough sketch of the counting approach follows the list below). Both bi-gram and tri-gram configurations generate similar results.
The vocabulary size ends up being 114, which makes sense, since the language used for the encoded midi is pretty limited. Some fun things to mention:

- TIME_DELTA=4 is the most common n-gram. This makes sense, as most notes are quarter notes in the training data, and this token appears almost every time a note is played.
- TIME_DELTA=2 is the second most common. This also makes sense, as these are eighth notes.
- PIECE_START, PIECE_END, STYLE=JSFAKES, and GENRE=JSFAKES are the least common. These only appear once in each example.
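The counting approach itself is simple. The sketch below is a minimal, assumed version of a count-based n-gram generator over the encoded tokens; the notebook's actual implementation and sampling details may differ.

```python
# Minimal count-based n-gram generator over encoded midi tokens (illustrative sketch).
import random
from collections import Counter, defaultdict

def train_ngram(token_sequences, n=2):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for tokens in token_sequences:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i : i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

def generate(counts, n=2, max_tokens=200):
    """Sample each next token proportionally to its count, starting from PIECE_START."""
    output = ["PIECE_START"]
    while len(output) < max_tokens and output[-1] != "PIECE_END":
        context = tuple(output[-(n - 1):]) if n > 1 else ()
        candidates = counts.get(context)
        if not candidates:
            break
        tokens, weights = zip(*candidates.items())
        output.append(random.choices(tokens, weights=weights)[0])
    return " ".join(output)

# Toy usage; the real model was trained on the full encoded dataset.
corpus = [
    "PIECE_START STYLE=JSFAKES GENRE=JSFAKES TRACK_START INST=0 BAR_START "
    "NOTE_ON=60 TIME_DELTA=4 NOTE_OFF=60 BAR_END TRACK_END PIECE_END".split()
]
bigram_counts = train_ngram(corpus, n=2)
print(generate(bigram_counts, n=2))
```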

When testing the generations from the n-gram model, most generations sounded exactly the same, with one or two notes changing between generations.
I'm not entirely sure why this is, but I suspect it has to do with the actual generation method call. I also had a hard time incorporating this model within
HuggingFace Spaces. The gpt-2 model was easy to upload to the site and use with a few lines of code. Its generations are also much more diverse,
making it more enjoyable to mess around with. Between the usability and the variety between generations, the gpt-2 model was selected for the demo.
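As an illustration of how little code the gpt-2 route needs, a Spaces-style demo can be as short as the sketch below. The model id and Gradio wiring are placeholders for illustration, not the demo's exact code.

```python
# Rough sketch of a Gradio demo around a fine-tuned gpt-2 token generator
# (model id and settings are placeholders, not the actual demo code).
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="your-username/gpt2-jsfakes")  # hypothetical checkpoint

def generate_song(prompt):
    # Continue the encoded midi tokens from whatever prompt the user supplies.
    result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=1.0)
    return result[0]["generated_text"]

demo = gr.Interface(fn=generate_song, inputs="text", outputs="text")
demo.launch()
```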
## Limitations