	---
	library_name: transformers
	license: apache-2.0
	base_model: t5-base
	tags:
	- text2text-generation
	- music
	- spotify
	- audio-features
	- t5
	language:
	- en
	datasets:
	- custom
	metrics:
	- mae
	- mse
	- correlation
	---

	# T5 Spotify Features Generator

	A fine-tuned T5-base model that generates Spotify audio features from natural language music descriptions.

	## Model Details

	### Model Description

	This model converts natural language descriptions of music preferences into Spotify audio feature values. For example, "energetic dance music for a party" becomes `"danceability": 0.9, "energy": 0.9, "valence": 0.9`.

	- Developed by: afsagag
	- Model type: Text-to-Text Generation (T5)
	- Language(s): English
	- License: Apache-2.0
	- Finetuned from model: [t5-base](https://huggingface.co/t5-base)

	### Model Sources

	- Repository: https://huggingface.co/afsagag/t5-spotify-features-generator

	## Uses

	### Direct Use

	Generate Spotify audio features from music descriptions for:
	- Music recommendation systems
	- Playlist generation
	- Music discovery applications
	- Audio feature prediction research

	```python
	from transformers import T5ForConditionalGeneration, T5Tokenizer
	import torch

	# Load model and tokenizer
	model = T5ForConditionalGeneration.from_pretrained("afsagag/t5-spotify-features-generator")
	tokenizer = T5Tokenizer.from_pretrained("afsagag/t5-spotify-features-generator")

	def generate_spotify_features(prompt, model, tokenizer):
	    """Return the generated audio-feature string for a natural language prompt."""
	    # The model expects inputs prefixed with "prompt: "
	    input_text = f"prompt: {prompt}"
	    input_ids = tokenizer(input_text, return_tensors="pt", max_length=256, truncation=True).input_ids

	    # Deterministic beam search; gradients are not needed for inference
	    with torch.no_grad():
	        outputs = model.generate(
	            input_ids,
	            max_length=256,
	            num_beams=4,
	            early_stopping=True,
	            do_sample=False,
	            pad_token_id=tokenizer.pad_token_id,
	            eos_token_id=tokenizer.eos_token_id
	        )

	    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	    return result

	# Example usage
	prompt = "I need energetic dance music for a party"
	features = generate_spotify_features(prompt, model, tokenizer)
	print(features) # Output: "danceability": 0.9, "energy": 0.9, "valence": 0.9
	```

	### Out-of-Scope Use

	- Generating actual audio or music files
	- Non-English music descriptions (model trained on English only)
	- Precise music recommendation without human oversight
	- Applications requiring guaranteed JSON format output

	## Bias, Risks, and Limitations

	- Training Data Bias: Reflects patterns in the training dataset and may not represent all musical styles or cultural contexts
	- JSON Format Issues: May occasionally generate incomplete JSON objects
	- Subjective Features: Audio features like "valence" and "energy" are subjective and may not align with all listeners' perceptions
	- Western Music Bias: Training focused on Western musical concepts and terminology

	### Recommendations

	- Validate generated features against expected ranges
	- Use as a starting point rather than definitive feature values
	- Consider cultural and stylistic diversity when applying to diverse music catalogs
	- Implement post-processing to ensure valid JSON output if required
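
The post-processing recommended above could look roughly like the sketch below. It assumes the output is a comma-separated run of `"feature": value` pairs, as in the examples in this card, possibly without enclosing braces. The helper name `parse_features` and the regex fallback are illustrative assumptions, not part of the released model.

```python
import json
import re

# Matches complete '"name": number' pairs that are followed by a comma or a
# closing brace, so a number cut off mid-token is not accepted.
_PAIR = re.compile(r'"([A-Za-z_]+)"\s*:\s*(-?\d+(?:\.\d+)?)(?=\s*[,}])')


def parse_features(raw: str) -> dict:
    """Parse generated text into a {feature: float} dict (hypothetical helper)."""
    text = raw.strip()
    # The example outputs in this card omit the braces, so add them if missing.
    if not text.startswith("{"):
        text = "{" + text
    if not text.endswith("}"):
        text = text + "}"
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fallback for truncated output: keep only the complete pairs.
        parsed = {name: float(value) for name, value in _PAIR.findall(text)}
    # Keep numeric values only (bool is excluded because it subclasses int).
    return {
        key: float(value)
        for key, value in parsed.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


features = parse_features('"danceability": 0.9, "energy": 0.9, "valence": 0.9')
print(features["energy"])
```

A truncated string such as `"danceability": 0.9, "ener` yields only the pairs that were fully generated, which callers can then check for missing features.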

	## Training Details

	### Training Data

	Custom dataset of 4,206 examples pairing natural language music descriptions with Spotify audio features:
	- Training set: 3,364 examples
	- Validation set: 421 examples
	- Test set: 421 examples

	### Training Procedure

	#### Training Hyperparameters

	- Training epochs: 5
	- Learning rate: 2e-4
	- Batch size: 32 (train), 16 (eval)
	- Gradient accumulation steps: 2
	- LR scheduler: Cosine with 5% warmup
	- Max sequence length: 256 tokens
	- Training regime: bf16 mixed precision
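
As a rough worked example of what these settings imply, the sketch below derives the effective batch size, the approximate number of optimizer steps, and a cosine schedule with linear warmup. It assumes a single device, that the batch size of 32 is per step before accumulation, and that the last partial batch counts as a step. Exact step counts depend on the training loop actually used, which this card does not specify.

```python
import math

TRAIN_EXAMPLES = 3364   # training set size from this card
BATCH_SIZE = 32         # assumed to be the per-step batch on one device
GRAD_ACCUM_STEPS = 2
EPOCHS = 5
WARMUP_RATIO = 0.05
PEAK_LR = 2e-4

effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS                # 64 examples per update
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)  # 53 updates per epoch
total_steps = steps_per_epoch * EPOCHS                         # 265 updates overall
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)           # 14 warmup updates


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the learning rate reaches its peak of 2e-4 after about 14 updates and decays to zero by update 265.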

	#### Speeds, Sizes, Times

	- Training time: ~58 minutes
	- Final training loss: 0.5579
	- Model size: ~892MB

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	The 421-example test split, drawn from the same distribution as the training data: natural language music descriptions paired with Spotify audio features.

	#### Metrics

	- Mean Absolute Error (MAE) between predicted and actual feature values
	- Mean Squared Error (MSE) for regression accuracy
	- Pearson correlation coefficients for individual features
	- Valid JSON ratio for output format correctness
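
The metrics listed above can be computed along the following lines. This is a minimal standard-library sketch, not the evaluation script used for this model. In particular, the definition of "valid JSON" here (the output parses once wrapped in braces) is an assumption, and the function names are illustrative.

```python
import json
import math


def mae(pred, true):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def mse(pred, true):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def pearson(pred, true):
    """Pearson correlation; returns NaN if either sequence is constant."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    if std_p == 0 or std_t == 0:
        return float("nan")
    return cov / (std_p * std_t)


def valid_json_ratio(outputs):
    """Fraction of outputs that parse as JSON once wrapped in braces (assumed definition)."""
    def is_valid(text):
        text = text.strip()
        if not text.startswith("{"):
            text = "{" + text + "}"
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(is_valid(o) for o in outputs) / len(outputs)


# Per-feature usage: compare predicted and reference values for one feature.
predicted_energy = [0.9, 0.2, 0.7]
reference_energy = [0.8, 0.3, 0.7]
print(mae(predicted_energy, reference_energy), pearson(predicted_energy, reference_energy))
```

MAE, MSE, and correlation are computed per feature, over the examples where both the prediction and the reference contain that feature.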

	### Results

	On representative prompts, the generated values track the mood and activity described:

	| Prompt | Generated Features |
	|--------|-------------------|
	| "I need energetic dance music for a party" | `"danceability": 0.9, "energy": 0.9, "valence": 0.9` |
	| "Play calm acoustic songs for studying" | `"acousticness": 0.8, "energy": 0.2, "valence": 0.2` |
	| "Upbeat music for working out" | `"danceability": 0.7, "energy": 0.8, "valence": 0.7` |
	| "Relaxing instrumental background music" | `"acousticness": 0.3, "energy": 0.2, "instrumentalness": 0.8, "valence": 0.2` |
	| "Happy pop music for driving" | `"danceability": 0.8, "energy": 0.8, "valence": 0.8` |

	## Technical Specifications

	### Model Architecture and Objective

	- Base Architecture: T5 (Text-To-Text Transfer Transformer)
	- Model Size: t5-base (220M parameters)
	- Objective: Sequence-to-sequence generation of audio features from text descriptions
	- Input Format: `"prompt: {natural_language_description}"`
	- Output Format: JSON-style `"feature": value` pairs separated by commas (the examples in this card show no enclosing braces)

	### Compute Infrastructure

	#### Hardware

	- GPU with CUDA support
	- Mixed precision training (bf16)

	#### Software

	- PyTorch with CUDA
	- Transformers library
	- Datasets library for data processing

	## Spotify Audio Features Reference

	The model generates these Spotify audio features:

	- danceability (0.0-1.0): How suitable a track is for dancing
	- energy (0.0-1.0): Perceptual measure of intensity and power
	- valence (0.0-1.0): Musical positivity (happy vs sad)
	- acousticness (0.0-1.0): Confidence measure of acoustic nature
	- instrumentalness (0.0-1.0): Predicts absence of vocals
	- speechiness (0.0-1.0): Presence of spoken words
	- liveness (0.0-1.0): Presence of live audience
	- loudness (dB): Overall loudness, typically -60 to 0 dB
	- tempo (BPM): Estimated beats per minute
	- duration_ms: Track duration in milliseconds
	- key (0-11): Musical key (C=0, C♯/D♭=1, etc.)
	- mode (0-1): Modality (0=minor, 1=major)
	- time_signature (3-7): Time signature
	- popularity (0-100): Spotify popularity score
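
The ranges above can be turned into a simple sanity check for generated values. The sketch below is illustrative: `FEATURE_RANGES` and `check_features` are hypothetical names, loudness uses the typical range quoted above rather than a hard limit, and tempo and duration_ms are left unbounded because this card states no range for them.

```python
# Bounds taken from the reference list above; None means no stated bound.
FEATURE_RANGES = {
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "valence": (0.0, 1.0),
    "acousticness": (0.0, 1.0),
    "instrumentalness": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
    "liveness": (0.0, 1.0),
    "loudness": (-60.0, 0.0),  # typical range, not a hard limit
    "tempo": None,
    "duration_ms": None,
    "key": (0, 11),
    "mode": (0, 1),
    "time_signature": (3, 7),
    "popularity": (0, 100),
}


def check_features(features: dict) -> list:
    """Return a list of problems found; an empty list means every value passed."""
    problems = []
    for name, value in features.items():
        if name not in FEATURE_RANGES:
            problems.append(f"unknown feature: {name}")
            continue
        bounds = FEATURE_RANGES[name]
        if bounds is None:
            continue  # no range stated for this feature
        low, high = bounds
        if not low <= value <= high:
            problems.append(f"{name}={value} outside [{low}, {high}]")
    return problems


print(check_features({"danceability": 0.9, "energy": 1.4}))
```

Reporting problems instead of silently clamping lets the caller decide whether to clip a value, regenerate, or discard the output.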

	## Citation

	```bibtex
	@misc{t5-spotify-features-generator,
	author = {afsagag},
	title = {T5 Spotify Features Generator: Fine-tuned T5 for Music Feature Prediction from Natural Language},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/afsagag/t5-spotify-features-generator}}
	}
	```

	## Model Card Authors

	afsagag

	## Model Card Contact

	Contact through Hugging Face profile: [@afsagag](https://huggingface.co/afsagag)