	---
	license: mit
	language: en
	tags:
	- llm
	- music
	- multimodal
	- midi
	- phi-3
	- question-answering
	- optical-music-recognition
	model-index:
	- name: Phi-3-MusiX
	  results: []
	datasets:
	- puar-playground/MusiXQA
	pipeline_tag: image-text-to-text
	base_model:
	- microsoft/Phi-3-vision-128k-instruct
	library_name: peft
	---

	# Phi-3-MusiX 🎵

	Phi-3-MusiX is a LoRA adapter for [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) that targets symbolic music understanding from scanned music sheets, MIDI files, and structured annotations.
	With the adapter loaded, Phi-3 can reason about symbolic music and answer questions about scanned music sheets and MIDI content.
	- Source code: [GitHub](https://github.com/puar-playground/MusiXQA)
	- Dataset: [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA)
	- Paper: [arXiv](https://arxiv.org/abs/2506.23009)

	---

	## Inference
	```python
	from io import BytesIO

	import requests
	import torch
	from PIL import Image
	from transformers import AutoModelForCausalLM, AutoProcessor


	def load_img(img_dir):
	    """Load an RGB image from a local path or an http(s) URL."""
	    if img_dir.startswith(('http://', 'https://')):
	        response = requests.get(img_dir)
	        response.raise_for_status()
	        image = Image.open(BytesIO(response.content)).convert('RGB')
	    else:
	        image = Image.open(img_dir).convert('RGB')
	    return image


	# example inputs (placeholders, replace with your own image and question)
	img_dir = 'path/or/url/to/music_sheet.png'
	question_string = 'Identify the key signature of this piece'

	# load the base model and processor, then attach the LoRA adapter
	model = AutoModelForCausalLM.from_pretrained('microsoft/Phi-3-vision-128k-instruct', device_map="cuda", trust_remote_code=True, torch_dtype="auto")
	processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
	model.load_adapter('puar-playground/Phi-3-MusiX')

	prompt = f'USER: Answer the question:\n{question_string}. ASSISTANT:'

	# setup message; <|image_1|> marks where the first image is inserted
	messages = [{"role": "user", "content": f"<|image_1|>\n{prompt}"}]

	# load image from dir
	image = load_img(img_dir)

	prompt_in = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = processor(prompt_in, [image], return_tensors="pt").to("cuda")

	# greedy decoding
	generation_args = {
	    "max_new_tokens": 500,
	    "do_sample": False,
	}

	with torch.no_grad():
	    generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

	# remove input tokens so only the newly generated answer is decoded
	generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
	model_answer = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
	print(model_answer)
	```
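
When asking several questions about the same sheet, it may help to keep the prompt template in one place. The sketch below is not part of the released code: `build_messages` is a hypothetical helper name, and it assumes the adapter expects the same `USER: ... ASSISTANT:` wrapper and `<|image_1|>` placeholder used in the inference snippet. It only builds the chat message and does not call the model.

```python
from typing import Dict, List


def build_messages(question: str, image_index: int = 1) -> List[Dict[str, str]]:
    """Build the single-turn chat message used in the inference snippet.

    Hypothetical helper (an assumption, not from the MusiXQA repository):
    it reproduces the string template shown above and nothing more.
    """
    prompt = f"USER: Answer the question:\n{question}. ASSISTANT:"
    # The image placeholder goes first, followed by the wrapped question.
    return [{"role": "user", "content": f"<|image_{image_index}|>\n{prompt}"}]


questions = [
    "Identify the key signature of this piece",
    "Identify the time signature of this piece",
]
# One message list per question; each can be passed to apply_chat_template.
all_messages = [build_messages(q) for q in questions]
```

Each element of `all_messages` can then take the place of `messages` in the inference snippet, with the same image passed to the processor every time.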



	## 🧪 Training Data

	The model is trained on the [MusiXQA](https://huggingface.co/datasets/puar-playground/MusiXQA) dataset, which includes four QA sets.

	Each entry in the dataset includes:
	- A scanned music sheet image
	- Its structured metadata (`metadata.json`)
	- A MIDI file
	- QA pairs targeting music understanding

	---

	## 🎓 Reference
	If you use this model or the MusiXQA dataset in your work, please cite the following reference:
	```
	@misc{chen2025musixqaadvancingvisualmusic,
	title={MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models},
	author={Jian Chen and Wenye Ma and Penghang Liu and Wei Wang and Tengwei Song and Ming Li and Chenguang Wang and Jiayu Qin and Ruiyi Zhang and Changyou Chen},
	year={2025},
	eprint={2506.23009},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2506.23009},
	}
	```