---
license: mit
language: en
tags:
- llm
- music
- multimodal
- midi
- phi-3
- question-answering
- optical-music-recognition
model-index:
- name: Phi-3-MusiX
  results: []
datasets:
- puar-playground/MusiXQA
pipeline_tag: image-text-to-text
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---

# Phi-3-MusiX 🎵

**Phi-3-MusiX** is a LoRA adapter for [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) that targets symbolic music understanding from scanned music sheets, MIDI files, and structured annotations.
It equips Phi-3 with the ability to perform symbolic music reasoning and answer questions about scanned music sheets and MIDI content.

- Source code: [GitHub](https://github.com/puar-playground/MusiXQA)
- Dataset: [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA)
- Paper: [arXiv](https://arxiv.org/abs/2506.23009)

---

## Inference

```
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def load_img(img_dir):
    """Load an RGB image from a local path or an HTTP(S) URL."""
    if img_dir.startswith(('http://', 'https://')):
        response = requests.get(img_dir, timeout=30)
        response.raise_for_status()  # fail early on a bad download
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(img_dir).convert('RGB')
    return image


# load the base model and processor, then attach the LoRA adapter
model = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct', device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model.load_adapter('puar-playground/Phi-3-MusiX')

# placeholders: replace with your own music sheet image (path or URL) and question
img_dir = 'path/to/music_sheet.png'
question_string = 'your question about the music sheet'

prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'

# setup message
messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]

# load image from dir
image = load_img(img_dir)

prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")

# greedy decoding; temperature has no effect when do_sample=False, so it is omitted
generation_args = {
    "max_new_tokens": 500,
    "do_sample": False,
}

with torch.no_grad():
    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(model_answer)
```
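
When asking several questions about the same sheet, it may help to factor the prompt construction out of the snippet above. The following is a minimal sketch, not part of the released code: `build_messages` and its `image_index` parameter are hypothetical names, and the template simply mirrors the `USER: ... ASSISTANT:` string and `<|image_1|>` tag used in the inference example.

```python
def build_messages(question: str, image_index: int = 1) -> list[dict[str, str]]:
    """Wrap a question in the prompt template used by the inference example.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Same template as the inference snippet, including the trailing period.
    prompt = f"USER: Answer the question:\n{question}. ASSISTANT:"
    # Phi-3-vision refers to attached images with <|image_N|> placeholders.
    return [{"role": "user", "content": f"<|image_{image_index}|>\n{prompt}"}]


messages = build_messages("What is the key signature of this piece")
print(messages[0]["content"])
```

The returned list can be passed to `processor.tokenizer.apply_chat_template` exactly as `messages` is in the example above. Note that the template appends a period after the question, so a question that already ends in punctuation will carry both marks.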

## 🧪 Training Data

The model is trained on the [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA) dataset, which includes four QA sets.

Each entry in the dataset includes:
- A scanned music sheet image
- Its structured metadata (`metadata.json`)
- A MIDI file
- A QA pair targeting music understanding
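
As a rough illustration of how one entry could be represented in code, the sketch below groups the four components into a single record. This is an assumption-laden example, not the dataset's actual schema: the class `MusiXQAEntry`, its field names, the file names, and the sample question and answer are all made up for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class MusiXQAEntry:
    """Hypothetical container for one entry; field names are illustrative."""

    image_path: Path     # scanned music sheet image
    metadata_path: Path  # structured metadata (metadata.json)
    midi_path: Path      # MIDI file
    question: str        # question half of the QA pair
    answer: str          # answer half of the QA pair

    def to_prompt(self) -> str:
        # Mirrors the USER/ASSISTANT template from the inference example.
        return f"USER: Answer the question:\n{self.question}. ASSISTANT:"


# Made-up values, only to show how the record is used.
entry = MusiXQAEntry(
    image_path=Path("sheet.png"),
    metadata_path=Path("metadata.json"),
    midi_path=Path("piece.mid"),
    question="How many bars are in this piece",
    answer="16",
)
print(entry.to_prompt())
```

Check the dataset card for the real file layout and column names before relying on any of these fields.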

---

## 🎓 Reference

If you use this model or the MusiXQA dataset in your work, please cite it using the following reference:
```
@misc{chen2025musixqaadvancingvisualmusic,
      title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
      author={Jian Chen and Wenye Ma and Penghang Liu and Wei Wang and Tengwei Song and Ming Li and Chenguang Wang and Jiayu Qin and Ruiyi Zhang and Changyou Chen},
      year={2025},
      eprint={2506.23009},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.23009},
}
```