---
license: mit
language: en
tags:
- llm
- music
- multimodal
- midi
- phi-3
- question-answering
- optical-music-recognition
model-index:
- name: Phi-3-MusiX
  results: []
datasets:
- puar-playground/MusiXQA
pipeline_tag: image-text-to-text
base_model:
- microsoft/Phi-3-vision-128k-instruct
library_name: peft
---

# Phi-3-MusiX 🎵

**Phi-3-MusiX** is a LoRA adapter for [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) that enables understanding of symbolic music in the form of scanned music sheets, MIDI files, and structured annotations.
The adapter equips Phi-3 with the ability to perform symbolic music reasoning and answer questions about scanned music sheets and MIDI content.
- Source code: [GitHub](https://github.com/puar-playground/MusiXQA)
- Dataset: [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA)
- Paper: [arXiv](https://arxiv.org/abs/2506.23009)

---

## Inference
```
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def load_img(img_dir):
    """Load an image from a local path or a URL."""
    if img_dir.startswith('http://') or img_dir.startswith('https://'):
        response = requests.get(img_dir)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(img_dir).convert('RGB')
    return image


# Load the base model and processor, then attach the LoRA adapter (requires `peft`).
model = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct', device_map="cuda", trust_remote_code=True, torch_dtype="auto")
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model.load_adapter('puar-playground/Phi-3-MusiX')

# Inputs: a question about the sheet and a local path or URL of the scanned music sheet image.
question_string = 'What is the key signature of this piece?'  # example question
img_dir = 'path/to/music_sheet.png'                           # local path or URL

prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'

# Set up the chat message; <|image_1|> marks where the image is inserted.
messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]

# Load the image from the path or URL.
image = load_img(img_dir)

prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")

generation_args = {
    "max_new_tokens": 500,
    "do_sample": False,  # greedy decoding; temperature is ignored when sampling is off
}

with torch.no_grad():
    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Strip the input tokens and decode only the newly generated answer.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(model_answer)
```
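If you prefer to work with `peft` directly, for example to fold the LoRA weights into the base model before deployment, the adapter can also be loaded with `PeftModel`. This is a minimal sketch, not taken from the repository; whether `merge_and_unload` applies cleanly to this custom (`trust_remote_code`) vision model should be verified in your environment.

```
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, attach the adapter via peft, then merge the LoRA weights.
base = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct', device_map="cuda", trust_remote_code=True, torch_dtype="auto")
model = PeftModel.from_pretrained(base, 'puar-playground/Phi-3-MusiX')
model = model.merge_and_unload()  # returns a plain model with LoRA deltas folded in
```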



## 🧪 Training Data

The model is trained on the [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA) dataset, which includes four QA sets. Each entry in the dataset includes the following (a loading sketch is shown after the list):

- A scanned music sheet image
- Its structured metadata (`metadata.json`)
- A MIDI file
- QA pairs targeting music understanding
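A quick way to inspect the data is to pull it from the Hub with the `datasets` library. The snippet below is a sketch; the split name is an assumption, so check the keys of a sample against the dataset card rather than relying on a particular schema.

```
from datasets import load_dataset

# Split name 'train' is an assumption; see the MusiXQA dataset card for the actual splits.
ds = load_dataset('puar-playground/MusiXQA', split='train')
print(ds[0].keys())  # inspect the available fields (image, metadata, MIDI, QA pairs)
```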

---

## 🎓 Reference
If you use this model or the MusiXQA dataset in your work, please cite the following reference:
```
@misc{chen2025musixqaadvancingvisualmusic,
      title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models}, 
      author={Jian Chen and Wenye Ma and Penghang Liu and Wei Wang and Tengwei Song and Ming Li and Chenguang Wang and Jiayu Qin and Ruiyi Zhang and Changyou Chen},
      year={2025},
      eprint={2506.23009},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.23009}, 
}
```