---
license: mit
tags:
- art
- music
- midi
- emotion
- clip
- multimodal
---

# ARIA - Artistic Rendering of Images into Audio

ARIA is a multimodal AI model that generates MIDI music based on the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from images, then generates emotionally appropriate music with conditional MIDI generation.

## Model Description

- **Developed by:** Vincent Amato
- **Model type:** Multimodal (Image-to-MIDI) Generation
- **Language(s):** English
- **License:** MIT
- **Parent Model:** Uses CLIP for image encoding and midi-emotion for music generation
- **Repository:** [GitHub](https://github.com/vincentamato/aria)

### Model Architecture

ARIA consists of two main components:

1. A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
2. A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values

The model offers three conditioning modes, sketched in code after this list:

- `continuous_concat`: Emotions as a continuous vector concatenated to every token
- `continuous_token`: Emotions as continuous vectors prepended to the token sequence
- `discrete_token`: Emotions quantized into discrete tokens
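To make the differences concrete, here is a minimal PyTorch sketch of the conditioning step in isolation. It illustrates the idea behind each mode rather than reproducing the midi-emotion implementation: the function name, the linear projection in `continuous_token`, and the assumed `[-1, 1]` value range and bin count are all placeholders.

```python
import torch

def condition(tok_emb: torch.Tensor, valence: float, arousal: float,
              mode: str, n_bins: int = 5):
    """Sketch of the three conditioning modes (illustrative, not the
    midi-emotion code). tok_emb: (seq_len, d_model) token embeddings;
    valence/arousal are assumed to lie in [-1, 1]."""
    emotion = torch.tensor([valence, arousal])                # shape (2,)

    if mode == "continuous_concat":
        # Append the same 2-d emotion vector to every token embedding.
        tiled = emotion.expand(tok_emb.size(0), -1)           # (seq_len, 2)
        return torch.cat([tok_emb, tiled], dim=-1)            # (seq_len, d_model + 2)

    if mode == "continuous_token":
        # Map the emotion vector to model width and prepend it as one
        # extra "token" (the projection layer here is a stand-in).
        proj = torch.nn.Linear(2, tok_emb.size(-1))
        return torch.cat([proj(emotion).unsqueeze(0), tok_emb], dim=0)

    if mode == "discrete_token":
        # Quantize each dimension into n_bins buckets; the resulting ids
        # would then be embedded like ordinary vocabulary tokens.
        edges = torch.linspace(-1.0, 1.0, n_bins + 1)[1:-1]   # inner bin edges
        return (torch.bucketize(torch.tensor(valence), edges).item(),
                torch.bucketize(torch.tensor(arousal), edges).item())

    raise ValueError(f"unknown conditioning mode: {mode}")
```

### Usage

The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:

- `model.pt`: The trained model weights
- `mappings.pt`: Token mappings for MIDI generation
- `model_config.pt`: Model configuration

Additionally, `image_encoder.pt` contains the CLIP-based image emotion encoder. A loading sketch follows.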
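This card does not pin down the inference API, so the snippet below only shows how the checkpoint files listed above fit together. The directory layout and the `predict_emotions`/`generate` entry points are hypothetical stand-ins for the scripts in the GitHub repository.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# One directory per conditioning variant (layout assumed, file names from above).
variant = "continuous_concat"  # or "continuous_token" / "discrete_token"
model_config = torch.load(f"{variant}/model_config.pt", map_location=device)
mappings = torch.load(f"{variant}/mappings.pt", map_location=device)
weights = torch.load(f"{variant}/model.pt", map_location=device)
image_encoder = torch.load("image_encoder.pt", map_location=device)

# Hypothetical pipeline; the real entry points live in the GitHub repository:
# valence, arousal = predict_emotions(image_encoder, "artwork.jpg")
# midi = generate(weights, model_config, mappings, valence, arousal)
# midi.save("output.mid")
```

Note that `torch.load` restores whatever was serialized, so whether `model.pt` holds a plain state dict or a full module depends on how the checkpoints were saved; the repository's generation scripts are the authoritative reference.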
## Intended Use

This model is designed for:

- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between visual and musical domains
- Creative applications in art and music generation

### Limitations

- Music generation quality depends on the emotional interpretation of input images
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the two-dimensional valence-arousal space

## Training Data

The model combines two data sources:

1. Image encoder: the ArtBench dataset with emotional annotations
2. MIDI generation: the Lakh-Spotify dataset as processed by the midi-emotion project

## Attribution

This project builds upon:

- **midi-emotion** by Serkan Sulun et al. ([GitHub](https://github.com/serkansulun/midi-emotion))
  - Paper: "Symbolic music generation conditioned on continuous-valued emotions" ([IEEE Access](https://ieeexplore.ieee.org/document/9762257))
  - Citation: S. Sulun, M. E. P. Davies and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022
- **CLIP** by OpenAI for the base image encoder architecture

## License

This model is released under the MIT License. However, usage of the midi-emotion component must comply with its GPL-3.0 license.