---
license: mit
tags:
- art
- music
- midi
- emotion
- clip
- multimodal
---

# ARIA - Artistic Rendering of Images into Audio

ARIA is a multimodal AI model that generates MIDI music from the emotional content of artwork. A CLIP-based image encoder extracts emotional valence and arousal from an image, and a transformer-based MIDI generation model conditions on those values to produce emotionally matching music.

## Model Description

- **Developed by:** Vincent Amato
- **Model type:** Multimodal (Image-to-MIDI) Generation
- **Language(s):** English
- **License:** MIT
- **Parent Model:** Uses CLIP for image encoding and midi-emotion for music generation
- **Repository:** [GitHub](https://github.com/vincentamato/aria)

### Model Architecture

ARIA consists of two main components:

1. A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
2. A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values

The model offers three different conditioning modes (sketched after this list):

- `continuous_concat`: Emotions as continuous vectors concatenated to all tokens
- `continuous_token`: Emotions as continuous vectors prepended to the sequence
- `discrete_token`: Emotions quantized into discrete tokens

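Below is a minimal, hypothetical sketch of how a (valence, arousal) pair could enter the token stream under each mode. The model width, bin count, projection layers, and function names are illustrative assumptions and do not mirror the midi-emotion code.

```python
# Hypothetical illustration only: dimensions, bin counts, and module names are
# assumptions made for this sketch, not the actual midi-emotion implementation.
import torch
import torch.nn as nn

D_MODEL = 512                                        # assumed model width
N_BINS = 5                                           # assumed bins per emotion axis

emotion_proj = nn.Linear(2, D_MODEL)                 # used by continuous_token
bin_embedding = nn.Embedding(2 * N_BINS, D_MODEL)    # used by discrete_token

def condition(tokens_emb, valence, arousal, mode):
    """tokens_emb: (seq_len, D_MODEL) embedded MIDI tokens; valence/arousal in [-1, 1]."""
    emotion = torch.tensor([valence, arousal], dtype=torch.float32)
    if mode == "continuous_concat":
        # Append the raw 2-d emotion vector to every token embedding.
        expanded = emotion.expand(tokens_emb.size(0), 2)
        return torch.cat([tokens_emb, expanded], dim=-1)      # (seq_len, D_MODEL + 2)
    if mode == "continuous_token":
        # Project the emotion vector to model width and prepend it as one extra token.
        return torch.cat([emotion_proj(emotion).unsqueeze(0), tokens_emb], dim=0)
    if mode == "discrete_token":
        # Quantize each axis into N_BINS bins and prepend the two bin embeddings.
        boundaries = torch.linspace(-1, 1, N_BINS + 1)[1:-1]
        bins = torch.bucketize(emotion, boundaries)           # bin index per axis
        ids = bins + torch.tensor([0, N_BINS])                # separate vocab range per axis
        return torch.cat([bin_embedding(ids), tokens_emb], dim=0)
    raise ValueError(f"unknown mode: {mode}")

# Example: condition a dummy 16-token sequence on a calm, positive emotion.
out = condition(torch.randn(16, D_MODEL), valence=0.6, arousal=-0.3, mode="continuous_token")
print(out.shape)  # torch.Size([17, 512])
```

The actual conditioning logic lives in the midi-emotion repository linked under Attribution.
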
### Usage

The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:

- `model.pt`: The trained model weights
- `mappings.pt`: Token mappings for MIDI generation
- `model_config.pt`: Model configuration

Additionally, `image_encoder.pt` contains the CLIP-based image emotion encoder. A minimal loading sketch follows.

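The sketch below shows one way these checkpoints could be loaded with plain PyTorch. The per-variant folder layout and the contents of each file are assumptions for illustration; refer to the GitHub repository for the actual inference pipeline.

```python
# Illustrative loading sketch; the folder names and checkpoint contents shown
# here are assumptions, so consult the ARIA repository for the real API.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
variant = "continuous_concat"   # or "continuous_token" / "discrete_token"

# CLIP-based emotion encoder that predicts (valence, arousal) from an image.
image_encoder = torch.load("image_encoder.pt", map_location=device)

# Artifacts for the chosen MIDI-generation variant.
model_config = torch.load(f"{variant}/model_config.pt", map_location=device)
mappings = torch.load(f"{variant}/mappings.pt", map_location=device)   # MIDI token <-> id tables
weights = torch.load(f"{variant}/model.pt", map_location=device)       # trained generator weights

# Whether model.pt holds a full nn.Module or a state_dict depends on how it
# was saved; inspect the objects before wiring up generation.
print(type(image_encoder), type(model_config), type(mappings), type(weights))
```
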
## Intended Use

This model is designed for:

- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between visual and musical domains
- Creative applications in art and music generation

### Limitations

- Music generation quality depends on the emotional interpretation of input images
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the valence-arousal space

## Training Data

The model combines:

1. Image encoder: Uses ArtBench with emotional annotations
2. MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project

## Attribution

This project builds upon:

- **midi-emotion** by Serkan Sulun et al. ([GitHub](https://github.com/serkansulun/midi-emotion))
  - Paper: "Symbolic music generation conditioned on continuous-valued emotions" ([IEEE Access](https://ieeexplore.ieee.org/document/9762257))
  - Citation: S. Sulun, M. E. P. Davies, and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," IEEE Access, vol. 10, pp. 44617-44626, 2022.
- **CLIP** by OpenAI for the base image encoder architecture

## License

This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.