|
--- |
|
license: apache-2.0 |
|
inference: false |
|
pipeline_tag: audio-to-audio |
|
--- |
|
|
|
# Perceiver AR symbolic audio model |
|
|
|
This model is a [Perceiver AR](https://arxiv.org/abs/2202.07765) symbolic audio model (134M parameters) pretrained on |
|
the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset for 27 epochs (157M tokens). It uses [rotary position embeddings](https://arxiv.org/abs/2104.09864)
|
for relative position encoding. It is a [training example](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#giantmidi-piano) |
|
of the [perceiver-io](https://github.com/krasserm/perceiver-io) library. |
|
|
|
## Model description |
|
|
|
Perceiver AR is a simple extension of a plain decoder-only transformer such as GPT-2. A core building block
|
of both is the *decoder layer* consisting of a self-attention layer followed by a position-wise MLP. Self-attention uses |
|
a causal attention mask. |
|
|
|
Perceiver AR additionally cross-attends to a longer prefix of the input sequence in its first attention layer. This layer

is a hybrid self- and cross-attention layer: self-attention is over the last n positions of the input sequence, with a

causal attention mask, and cross-attention is from the last n positions to the first m positions. The length of the input

sequence is m + n. This allows Perceiver AR to process a much larger context than decoder-only transformers that are

based on self-attention only.
|
|
|
<p align="center"> |
|
<img src="https://krasserm.github.io/img/2023-01-23/perceiver-ar.png" alt="Perceiver AR" width="600"/><br/> |
|
<i>Fig. 1</i>. Attention in Perceiver AR with m=8 prefix tokens and n=3 latent tokens. |
|
</p>
|
|
|
The output of the hybrid attention layer consists of n latent arrays corresponding to the last n tokens of the input sequence.
|
These are further processed by a stack of L-1 decoder layers where the total number of attention layers is L. A final |
|
layer (not shown in Fig. 1) predicts the target token for each latent position. The weights of the final layer are |
|
shared with the input embedding layer. Except for the initial cross-attention to the prefix sequence, a Perceiver AR |
|
is architecturally identical to a decoder-only transformer. |
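
To make the attention pattern concrete, here is a minimal sketch (not the perceiver-io implementation; the `hybrid_attention_mask` helper is purely illustrative) that builds the boolean attention mask of the hybrid layer for the configuration shown in Fig. 1:

```python
import torch

def hybrid_attention_mask(m: int, n: int) -> torch.Tensor:
    """Boolean mask of shape (n, m + n); True means "may attend".

    Row i corresponds to latent position i (input position m + i): it attends
    to all m prefix positions (cross-attention) and causally to the latent
    positions 0..i (self-attention).
    """
    prefix = torch.ones(n, m, dtype=torch.bool)              # cross-attention to the prefix
    latent = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal self-attention over latents
    return torch.cat([prefix, latent], dim=1)

print(hybrid_attention_mask(m=8, n=3).int())  # m=8 prefix tokens, n=3 latent tokens (Fig. 1)
```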
|
|
|
## Model training |
|
|
|
The model was [trained](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#giantmidi-piano) with |
|
the task of symbolic audio modeling on the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset |
|
for 27 epochs (157M tokens). This dataset consists of [MIDI](https://en.wikipedia.org/wiki/MIDI) files, tokenized using the |
|
approach from the [Perceiver AR paper](https://arxiv.org/pdf/2202.07765.pdf), which is described |
|
in detail in Section A.2 of [Huang et al (2019)](https://arxiv.org/abs/1809.04281). |
|
All hyperparameters are summarized in the [training script](https://github.com/krasserm/perceiver-io/blob/main/examples/training/sam/giantmidi/train.sh). |
|
The context length was set to 6144 tokens with 2048 latent positions, resulting in a maximum prefix length of 4096. The
|
actual prefix length per example was randomly chosen between 0 and 4096. Training was done with [PyTorch Lightning](https://www.pytorchlightning.ai/index.html) |
|
and the resulting checkpoint was converted to this 🤗 model with a library-specific [conversion utility](#checkpoint-conversion). |
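
As a rough illustration of the prefix randomization (a sketch only, not the actual data pipeline of the training script; the token ids and vocabulary size below are placeholders):

```python
import torch

max_seq_len = 6144                          # model context length
num_latents = 2048                          # latent positions (training targets)
max_prefix_len = max_seq_len - num_latents  # 4096

tokens = torch.randint(0, 388, (10_000,))   # placeholder token ids for one tokenized MIDI file
prefix_len = int(torch.randint(0, max_prefix_len + 1, (1,)))  # random prefix length in [0, 4096]
example = tokens[: prefix_len + num_latents]                  # prefix + latent positions
```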
|
|
|
## Intended use and limitations |
|
|
|
This model can be used for audio generation with a user-defined initial number of latent tokens. It mainly serves to

demonstrate how to train Perceiver AR models with the [perceiver-io library](https://github.com/krasserm/perceiver-io).

To improve the quality of the generated audio samples, a much larger training dataset than

[GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) would be required.
|
|
|
## Usage examples |
|
|
|
To use this model, you first need to [install](https://github.com/krasserm/perceiver-io/blob/main/README.md#installation)

the `perceiver-io` library with the `audio` extra.
|
|
|
```shell |
|
pip install "perceiver-io[audio]"
|
``` |
|
|
|
Then the model can be used with PyTorch. Either use the model directly to generate MIDI files: |
|
|
|
```python |
|
import torch |
|
|
|
from perceiver.model.audio.symbolic import PerceiverSymbolicAudioModel |
|
from perceiver.data.audio.midi_processor import decode_midi, encode_midi |
|
from pretty_midi import PrettyMIDI |
|
|
|
repo_id = "krasserm/perceiver-ar-sam-giant-midi" |
|
|
|
model = PerceiverSymbolicAudioModel.from_pretrained(repo_id) |
|
|
|
prompt_midi = PrettyMIDI("prompt.mid") |
|
prompt = torch.tensor(encode_midi(prompt_midi)).unsqueeze(0)  # encode prompt MIDI into a (1, seq_len) batch of token ids
|
|
|
output = model.generate(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0) |
|
|
|
output_midi = decode_midi(output[0].cpu().numpy())  # decode generated token ids back into a PrettyMIDI object
|
type(output_midi) |
|
``` |
|
``` |
|
pretty_midi.pretty_midi.PrettyMIDI |
|
``` |
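
The returned `PrettyMIDI` object can be written to a MIDI file with `pretty_midi`'s `write` method (the filename is arbitrary):

```python
output_midi.write("generated_audio.mid")
```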
|
|
|
use a `symbolic-audio-generation` pipeline to generate MIDI output:
|
|
|
```python |
|
from transformers import pipeline |
|
from pretty_midi import PrettyMIDI |
|
from perceiver.model.audio import symbolic # auto-class registration |
|
|
|
repo_id = "krasserm/perceiver-ar-sam-giant-midi" |
|
|
|
prompt = PrettyMIDI("prompt.mid") |
|
audio_generator = pipeline("symbolic-audio-generation", model=repo_id) |
|
|
|
output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0) |
|
type(output["generated_audio_midi"]) |
|
``` |
|
``` |
|
pretty_midi.pretty_midi.PrettyMIDI |
|
``` |
|
|
|
or generate WAV output by rendering the MIDI symbols with [fluidsynth](https://www.fluidsynth.org/) (note: fluidsynth must be installed

for the following example to work):
|
|
|
```python |
|
from transformers import pipeline |
|
from pretty_midi import PrettyMIDI |
|
from perceiver.model.audio import symbolic # auto-class registration |
|
|
|
repo_id = "krasserm/perceiver-ar-sam-giant-midi" |
|
|
|
prompt = PrettyMIDI("prompt.mid") |
|
audio_generator = pipeline("symbolic-audio-generation", model=repo_id) |
|
|
|
output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0, render=True) |
|
|
|
with open("generated_audio.wav", "wb") as f: |
|
f.write(output["generated_audio_wav"]) |
|
``` |
|
|
|
## Audio samples |
|
|
|
The following (hand-picked) audio samples were generated using various prompts from the validation subset of |
|
the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset. The input prompts are |
|
not included in the audio output. |
|
|
|
<table> |
|
<tr> |
|
<th>Audio sample</th> |
|
<th>Top-K</th> |
|
<th>Top-p</th> |
|
<th>Temperature</th> |
|
<th>Prefix length</th> |
|
<th>Latents</th> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio controls> |
|
<source src="https://martin-krasser.com/perceiver/data/midi/01_nehrlich_continuation.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="vertical-align: top;">-</td> |
|
<td style="vertical-align: top;">0.95</td> |
|
<td style="vertical-align: top;">0.95</td> |
|
<td style="vertical-align: top;">4096</td> |
|
<td style="vertical-align: top;">1</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio controls> |
|
<source src="https://martin-krasser.com/perceiver/data/midi/02_eduardo_continuation.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="vertical-align: top;">-</td> |
|
<td style="vertical-align: top;">0.95</td> |
|
<td style="vertical-align: top;">1.0</td> |
|
<td style="vertical-align: top;">4096</td> |
|
<td style="vertical-align: top;">64</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio controls> |
|
<source src="https://martin-krasser.com/perceiver/data/midi/03_membree_continuation.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="vertical-align: top;">-</td> |
|
<td style="vertical-align: top;">0.95</td> |
|
<td style="vertical-align: top;">1.0</td> |
|
<td style="vertical-align: top;">1024</td> |
|
<td style="vertical-align: top;">1</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio controls> |
|
<source src="https://martin-krasser.com/perceiver/data/midi/04_membree_continuation.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="vertical-align: top;">15</td> |
|
<td style="vertical-align: top;">-</td> |
|
<td style="vertical-align: top;">1.0</td> |
|
<td style="vertical-align: top;">4096</td> |
|
<td style="vertical-align: top;">16</td> |
|
</tr> |
|
<tr> |
|
<td> |
|
<audio controls> |
|
<source src="https://martin-krasser.com/perceiver/data/midi/05_kinscella_continuation.wav" type="audio/wav"> |
|
Your browser does not support the audio element. |
|
</audio> |
|
</td> |
|
<td style="vertical-align: top;">-</td> |
|
<td style="vertical-align: top;">0.95</td> |
|
<td style="vertical-align: top;">1.0</td> |
|
<td style="vertical-align: top;">4096</td> |
|
<td style="vertical-align: top;">1</td> |
|
</tr> |
|
</table> |
|
|
|
## Checkpoint conversion |
|
|
|
The `krasserm/perceiver-ar-sam-giant-midi` model was created from a training checkpoint with:
|
|
|
```python |
|
from perceiver.model.audio.symbolic import convert_checkpoint |
|
|
|
convert_checkpoint( |
|
save_dir="krasserm/perceiver-ar-sam-giant-midi", |
|
ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/sam/version_1/checkpoints/epoch=027-val_loss=1.944.ckpt", |
|
push_to_hub=True, |
|
) |
|
``` |
|
|
|
## Citation |
|
|
|
```bibtex |
|
@inproceedings{hawthorne2022general, |
|
  title={General-purpose, long-context autoregressive modeling with {Perceiver AR}},
|
author={Hawthorne, Curtis and Jaegle, Andrew and Cangea, C{\u{a}}t{\u{a}}lina and Borgeaud, Sebastian and Nash, Charlie and Malinowski, Mateusz and Dieleman, Sander and Vinyals, Oriol and Botvinick, Matthew and Simon, Ian and others}, |
|
booktitle={International Conference on Machine Learning}, |
|
pages={8535--8558}, |
|
year={2022}, |
|
organization={PMLR} |
|
} |
|
``` |