Files changed (2)
  1. README.md +41 -217
  2. modeling_phi4mm.py +1 -1
README.md CHANGED
@@ -44,7 +44,7 @@ widget:
44
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
45
  - messages:
46
  - role: user
47
- content: Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation.
48
  library_name: transformers
49
  paper: arxiv.org/abs/2503.01743
50
  ---
@@ -145,8 +145,6 @@ With Phi-4-multimodal-instruct, a single new open model has been trained across
145
  It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
146
 
147
  ## Model Quality
148
- <details>
149
- <summary>Click to view details</summary>
150
 
151
  To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (See Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:
152
 
@@ -264,7 +262,6 @@ BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very
264
 
265
  ![alt text](./figures/multi_image.png)
266
 
267
- </details>
268
 
269
  ## Usage
270
 
@@ -391,9 +388,6 @@ If it is a square image, the resolution would be around (8*448 by 8*448). For mu
391
 
392
  After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.
393
 
394
- <details>
395
- <summary>Click to view details</summary>
396
-
397
  ```python
398
  import requests
399
  import torch
@@ -473,35 +467,33 @@ response = processor.batch_decode(
473
  )[0]
474
  print(f'>>> Response\n{response}')
475
  ```
476
- </details>
477
 
478
- More inference examples can be found [**here**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_inference_phi4mm.py).
479
-
480
- ### vLLM inference
481
-
482
- Users can start a server with this command:
483
 
484
- ```bash
485
- python -m vllm.entrypoints.openai.api_server --model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code --max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 --limit-mm-per-prompt audio=3,image=3 --max-loras 2 --lora-modules speech=<path to speech lora folder> vision=<path to vision lora folder>
486
- ```
487
 
488
- The speech LoRA and vision LoRA folders are within the Phi-4-multimodal-instruct folder downloaded by vLLM; you can also use the following script to find them:
489
 
490
- ```python
491
- from huggingface_hub import snapshot_download
492
- model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")
493
- speech_lora_path = model_path+"/speech-lora"
494
- vision_lora_path = model_path+"/vision-lora"
495
- ```
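
Once the server is running, requests can be sent with any OpenAI-compatible client. The following is a minimal sketch using the `openai` Python package; it assumes the server above is listening locally on the default port 8000 and that the LoRA modules were registered under the names `speech` and `vision` as in the command shown earlier (the image URL is only an example).

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes it listens on the default port 8000 and that the vision LoRA was
# registered under the name "vision" via --lora-modules.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vision",  # name of the registered vision LoRA module
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://www.ilankelman.org/stopsigns/australia.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The speech LoRA can be selected the same way by passing `model="speech"`.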
496
 
497
  ## Training
498
 
499
- ### Fine-tuning
500
-
501
- Basic examples of supervised fine-tuning (SFT) are provided for [**speech**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_speech.py) and [**vision**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_vision.py), respectively.
502
-
503
- An example of [**how to extend speech recognition to a new language**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct#appendix-b-fine-tuning-korean-speech) is also provided.
504
-
505
  ### Model
506
 
507
  + **Architecture:** Phi-4-multimodal-instruct has 5.6B parameters and is a multimodal transformer model. The model has the pretrained Phi-4-Mini-Instruct as the backbone language model, along with advanced vision and speech encoders and adapters.<br>
@@ -535,53 +527,11 @@ Phi-4-multimodal-instruct's training data includes a wide variety of sources, to
535
  Focus was placed on the quality of data that could potentially improve the reasoning ability of the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a Premier League game on a particular day might be good training data for large foundation models, but such information was removed from the Phi-4-multimodal-instruct training data to leave more of the model's small capacity for reasoning. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data.
536
  The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
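
For illustration only, the n-gram overlap check described above might look like the following sketch; the whitespace tokenizer, n-gram length, and threshold are assumptions rather than the settings actually used.

```python
# Illustrative sketch of n-gram based decontamination (not the actual pipeline).
# The whitespace tokenizer, n-gram length, and threshold are assumptions.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_texts: list, n: int = 13, threshold: float = 0.1) -> bool:
    sample_ngrams = ngrams(sample, n)
    if not sample_ngrams:
        return False
    benchmark_ngrams = set().union(*(ngrams(t, n) for t in benchmark_texts))
    # Flag the sample when the share of matching n-grams exceeds the threshold.
    match_ratio = len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams)
    return match_ratio >= threshold
```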
537
 
538
- ### Software
539
- * [PyTorch](https://github.com/pytorch/pytorch)
540
- * [Transformers](https://github.com/huggingface/transformers)
541
- * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
542
- * [Accelerate](https://huggingface.co/docs/transformers/main/en/accelerate)
543
- * [soundfile](https://github.com/bastibe/python-soundfile)
544
- * [pillow](https://github.com/python-pillow/Pillow)
545
-
546
- ### Hardware
547
- Note that by default, the Phi-4-multimodal-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
548
- * NVIDIA A100
549
- * NVIDIA A6000
550
- * NVIDIA H100
551
-
552
- If you want to run the model on:
553
- * NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with _attn_implementation="eager"
554
-
555
-
556
- ## Responsible AI Considerations
557
- <details>
558
- <summary>Click to view detail descriptions</summary>
559
 
560
- Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
561
- + Quality of Service: The Phi models are trained primarily on English language content across text, speech, and visual inputs, with some additional multilingual coverage. Performance may vary significantly across different modalities and languages:
562
- + Text: Languages other than English will experience reduced performance, with varying levels of degradation across different non-English languages. English language varieties with less representation in the training data may perform worse than standard American English.
563
- + Speech: Speech recognition and processing show similar language-based performance patterns, with optimal performance for standard American English accents and pronunciations. Other English accents, dialects, and non-English languages may experience lower recognition accuracy and response quality. Background noise, audio quality, and speaking speed can further impact performance.
564
- + Vision: Visual processing capabilities may be influenced by cultural and geographical biases in the training data. The model may show reduced performance when analyzing images containing text in non-English languages or visual elements more commonly found in non-Western contexts. Image quality, lighting conditions, and composition can also affect processing accuracy.
565
- + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards.
566
- + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
567
- + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case.
568
- + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
569
- + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses.
570
- + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift.
571
- + Inference of Sensitive Attributes: The Phi 4 models can sometimes attempt to infer sensitive attributes (such as personality characteristics, country of origin, gender, etc.) from the users’ voices when specifically asked to do so. Phi-4-multimodal-instruct is not designed or intended to be used as a biometric categorization system to categorize individuals based on their biometric data to deduce or infer their race, political opinions, trade union membership, religious or philosophical beliefs, sex life, or sexual orientation. This behavior can be easily and efficiently mitigated at the application level by a system message.
572
-
573
- Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi 4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
574
-
575
- + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
576
- + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
577
- + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
578
- + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
579
- + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
580
- </details>
581
 
582
  ## Safety
583
- <details>
584
- <summary>Click to view detail descriptions</summary>
585
 
586
  The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.
587
 
@@ -596,7 +546,24 @@ To assess model safety in scenarios involving both text and images, Microsoft's
596
  ### Audio Safety Evaluation
597
 
598
  In addition to extensive red teaming, the Safety of the model was assessed through three distinct evaluations. First, as performed with Text and Vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to Speech prompts. Second, [Microsoft's Speech Fairness evaluation](https://speech.microsoft.com/portal/responsibleai/assess) was run to verify that Speech-To-Text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, medical condition, etc...) from the voice of a user.
599
- </details>
600
 
601
  ## License
602
  The model is licensed under the [MIT license](./LICENSE).
@@ -686,146 +653,3 @@ The model was evaluated across a breadth of public and internal benchmarks to un
686
  + Red Team:
687
  + Responses to prompts provided by AI Red Team at Microsoft
688
  </details>
689
-
690
-
691
- ## Appendix B: Fine-tuning Korean speech
692
-
693
- <details>
694
- <summary>Click to view detail descriptions</summary>
695
-
696
- ### Overview and Datasets
697
-
698
- Phi-4-multimodal was not originally designed for the Korean speech-to-text task, but it can be fine-tuned for this task using your own data or public Korean speech datasets.
699
-
700
- We have fine-tuned the Phi-4-multimodal model for Korean speech-to-text using the following datasets:
701
-
702
- - kresnik/zeroth_korean
703
- - mozilla-foundation/common_voice_17_0 (Used Korean speech only)
704
- - PolyAI/minds14 (Used Korean speech only)
705
- Custom dataset. The speech was a mix of fast and slow speech (technical blog contents and presentations that the author has posted), with some modulation using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a rough sketch of this kind of augmentation is shown after the list.
706
-
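
As a rough illustration of the modulation applied to the custom dataset, an augmentation pipeline with [audiomentations](https://github.com/iver56/audiomentations) could look like the sketch below; the transforms, parameters, and file names are assumptions, and the script linked above is the one actually used.

```python
# Rough illustration of audio modulation with audiomentations; the transforms,
# parameters, and file names are assumptions (the actual script is linked above).
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
])

samples, sample_rate = sf.read("korean_speech.wav", dtype="float32")  # hypothetical 16 kHz input
augmented = augment(samples=samples, sample_rate=sample_rate)
sf.write("korean_speech_aug.wav", augmented, sample_rate)
```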
707
- In total, 35K samples were used. Each sample is a pair of Korean speech and its transcription. The audio was sampled at 16 kHz.
708
-
709
- You can download the fine-tuned model [here](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech). Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). The results are not production-quality, as the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
710
-
711
- ### Requirements
712
- With Python 3.10, the following packages are required; an A100 or H100 GPU is recommended.
713
- ```
714
- torch==2.6.0
715
- transformers==4.48.2
716
- accelerate==1.4.0
717
- soundfile==0.13.1
718
- pillow==11.1.0
719
- scipy==1.15.2
720
- torchvision==0.21.0
721
- backoff==2.2.1
722
- peft==0.14.0
723
- datasets==3.3.2
724
- pandas==2.2.3
725
- flash_attn==2.7.4.post1
726
- evaluate==0.4.3
727
- sacrebleu==2.5.1
728
- ```
729
-
730
- ### Training
731
- The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
732
-
733
- The fine-tuning script and command line are basically the same as [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-main-py), but you need to prepare your own dataset. Also, to unfreeze the audio encoder, please refer to the code snippet below, which is taken from [the fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing).
734
-
735
- ```python
736
- with accelerator.local_main_process_first():
737
- processor = AutoProcessor.from_pretrained(
738
- "microsoft/Phi-4-multimodal-instruct",
739
- trust_remote_code=True,
740
- )
741
- model = create_model(
742
- args.model_name_or_path,
743
- use_flash_attention=args.use_flash_attention,
744
- )
745
-
746
- def unfreeze_speech_components(model):
747
- """Directly target verified components from your debug logs"""
748
- # 1. Audio Embed Module (confirmed exists)
749
- audio_embed = model.model.embed_tokens_extend.audio_embed
750
-
751
- # 2. Entire Audio Encoder (simplified)
752
- audio_encoder = audio_embed.encoder # Direct access
753
-
754
- # 3. Audio Projection (from debug logs)
755
- audio_projection = audio_embed.audio_projection
756
-
757
- # Unfreeze ONLY these 3 components
758
- for component in [audio_embed, audio_encoder, audio_projection]:
759
- for param in component.parameters():
760
- param.requires_grad = True
761
- return model
762
-
763
- model = unfreeze_speech_components(model)
764
-
765
- # Verify unfrozen parameters
766
- trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
767
- print(f"Trainable parameters: {trainable_params:,}")
768
-
769
- # After unfreezing
770
- encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
771
- proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())
772
-
773
- assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
774
- assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
775
- print("Components properly unfrozen ✅")
776
- ```
777
-
778
- An example command to run the fine-tuning script is as follows:
779
- ```bash
780
- python main.py
781
- ```
782
-
783
- The latest version of the model currently uploaded was fine-tuned by **unfreezing the audio encoder**, and the ASR performance was significantly improved compared to the baseline LoRA adapter-based fine-tuning.
784
- Comparing full fine-tuning with LoRA fine-tuning, the CER on the zeroth test set is **1.61%** vs. 2.72%, and the WER is **3.54%** vs. 7.19%, respectively. Please refer to the [Experimental Settings and Results](#experimental-settings-and-results) for more details.
785
-
786
- ### Experimental Settings and Results
787
- The purpose of this benchmarking setup is to evaluate the model's basic performance on Korean speech and audio understanding tasks. We did this for automatic speech recognition and automatic speech translation, with the test data drawn from the datasets and samples below.
788
-
789
- Evaluation was done on the following datasets:
790
- + ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) and WER (Word Error Rate) on [zeroth-test set (457 samples)](https://huggingface.co/datasets/kresnik/zeroth_korean).
791
- + AST (Automatic Speech Translation): Evaluated with BLEU score on [fleurs ko <-> en speech translation test set (270 samples)](https://huggingface.co/datasets/seastar105/fleurs_ko_en_test).
792
-
793
- The evaluation script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
794
-
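
As a rough sketch of how these metrics could be computed with the `evaluate` and `sacrebleu` packages from the requirements above (the linked evaluation script is the authoritative version; the predictions and references below are placeholders, and the CER/WER metrics additionally pull in `jiwer`):

```python
# Rough sketch of metric computation; the linked evaluation script is authoritative.
# `predictions` and `references` are placeholder strings produced elsewhere.
import evaluate
import sacrebleu

predictions = ["안녕하세요 만나서 반갑습니다"]    # hypothetical model outputs
references = ["안녕하세요, 만나서 반갑습니다."]  # hypothetical ground-truth transcripts

cer = evaluate.load("cer").compute(predictions=predictions, references=references)
wer = evaluate.load("wer").compute(predictions=predictions, references=references)
bleu = sacrebleu.corpus_bleu(predictions, [references]).score  # for the AST directions

print(f"CER: {cer:.4f}  WER: {wer:.4f}  BLEU: {bleu:.2f}")
```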
795
- We used [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) as the baseline to improve upon, as it showed significant performance improvement after 1 epoch. Note that the baseline was trained on the [22K-sample Zeroth Korean speech dataset](https://huggingface.co/datasets/kresnik/zeroth_korean) for 1 epoch. Starting from this baseline, we conducted additional experiments with our 35K training samples under the following scenarios:
796
-
797
- + [Case 1] LoRA finetune (1 epoch): LoRA adapter-based fine-tuning for 1 epoch
798
- + [Case 2] LoRA finetune (4 epochs): LoRA adapter-based fine-tuning for 4 epochs
799
- + [Case 3] Unfreeze audio encoder finetune (4 epochs): Full fine-tuning for 4 epochs.
800
-
801
- The results of the experiments are as follows:
802
- + CER and WER for zeroth-test set (Lower is better)
803
- + Case 1's CER and WER are 3.80% and 11.52%, respectively, which are better than the baseline (7.02% and 17.31%).
804
- + Case 2's CER and WER are 2.72% and 7.19%, respectively, which are better than Case 1.
805
- + Case 3's CER and WER are 1.61% and 3.54%, respectively, which are the best among the cases.
806
-
807
- + BLEU score for fleurs ko <-> en speech translation test set (Higher is better)
808
- + Case 1's result does not improve on the baseline. In particular, the BLEU score for fleurs-ko2en-cot decreases compared to the baseline.
809
- + Case 2's result is slightly better than Case 1 and is the best among the cases.
810
- + Case 3's result does not improve on the baseline or Case 2.
811
-
812
- | Model | zeroth (CER) | zeroth (WER) | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
813
- |--------------------------------|-------------|-------------|--------------|------------------|--------------|------------------|
814
- | original | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
815
- | Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
816
- | LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
817
- | LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
818
- | Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |
819
-
820
- ## Cautions
821
-
822
- Note that this model is for PoC/experimental purposes only and is not intended to be used in production. More high-quality data, tuning, ablation studies, and experiments are needed.
823
-
824
- The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. Thus, if you are interested in Korean speech-to-text, this model can be a good starting point.
825
-
826
- ## References
827
-
828
- - https://huggingface.co/microsoft/Phi-4-multimodal-instruct
829
- - https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor
830
-
831
- </details>
 
44
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
45
  - messages:
46
  - role: user
47
+ content: Can you provide ways to eat combinations of bananas and dragonfruits?
48
  library_name: transformers
49
  paper: arxiv.org/abs/2503.01743
50
  ---
 
145
  It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
146
 
147
  ## Model Quality
 
 
148
 
149
  To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (See Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:
150
 
 
262
 
263
  ![alt text](./figures/multi_image.png)
264
 
 
265
 
266
  ## Usage
267
 
 
388
 
389
  After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.
390
 
391
  ```python
392
  import requests
393
  import torch
 
467
  )[0]
468
  print(f'>>> Response\n{response}')
469
  ```
 
470
 
471
+ ## Responsible AI Considerations
472
 
473
+ Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
474
+ + Quality of Service: The Phi models are trained primarily on English language content across text, speech, and visual inputs, with some additional multilingual coverage. Performance may vary significantly across different modalities and languages:
475
+ + Text: Languages other than English will experience reduced performance, with varying levels of degradation across different non-English languages. English language varieties with less representation in the training data may perform worse than standard American English.
476
+ + Speech: Speech recognition and processing show similar language-based performance patterns, with optimal performance for standard American English accents and pronunciations. Other English accents, dialects, and non-English languages may experience lower recognition accuracy and response quality. Background noise, audio quality, and speaking speed can further impact performance.
477
+ + Vision: Visual processing capabilities may be influenced by cultural and geographical biases in the training data. The model may show reduced performance when analyzing images containing text in non-English languages or visual elements more commonly found in non-Western contexts. Image quality, lighting conditions, and composition can also affect processing accuracy.
478
+ + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards.
479
+ + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
480
+ + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case.
481
+ + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
482
+ + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses.
483
+ + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift.
484
+ + Inference of Sensitive Attributes: The Phi 4 models can sometimes attempt to infer sensitive attributes (such as personality characteristics, country of origin, gender, etc.) from the users’ voices when specifically asked to do so. Phi-4-multimodal-instruct is not designed or intended to be used as a biometric categorization system to categorize individuals based on their biometric data to deduce or infer their race, political opinions, trade union membership, religious or philosophical beliefs, sex life, or sexual orientation. This behavior can be easily and efficiently mitigated at the application level by a system message.
485
+
486
+ Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi 4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
487
 
488
+ + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
489
+ + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
490
+ + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
491
+ + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
492
+ + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
493
 
494
 
495
  ## Training
496
 
497
  ### Model
498
 
499
  + **Architecture:** Phi-4-multimodal-instruct has 5.6B parameters and is a multimodal transformer model. The model has the pretrained Phi-4-Mini-Instruct as the backbone language model, along with advanced vision and speech encoders and adapters.<br>
 
527
  Focus was placed on the quality of data that could potentially improve the reasoning ability of the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a Premier League game on a particular day might be good training data for large foundation models, but such information was removed from the Phi-4-multimodal-instruct training data to leave more of the model's small capacity for reasoning. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data.
528
  The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
529
 
530
+ ### Fine-tuning
531
 
532
+ Basic examples of supervised fine-tuning (SFT) are provided for [speech](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_speech.py) and [vision](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_vision.py), respectively.
533
 
534
  ## Safety
 
 
535
 
536
  The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.
537
 
 
546
  ### Audio Safety Evaluation
547
 
548
  In addition to extensive red teaming, the Safety of the model was assessed through three distinct evaluations. First, as performed with Text and Vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to Speech prompts. Second, [Microsoft's Speech Fairness evaluation](https://speech.microsoft.com/portal/responsibleai/assess) was run to verify that Speech-To-Text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, medical condition, etc...) from the voice of a user.
549
+
550
+
551
+ ## Software
552
+ * [PyTorch](https://github.com/pytorch/pytorch)
553
+ * [Transformers](https://github.com/huggingface/transformers)
554
+ * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
555
+ * [Accelerate](https://huggingface.co/docs/transformers/main/en/accelerate)
556
+ * [soundfile](https://github.com/bastibe/python-soundfile)
557
+ * [pillow](https://github.com/python-pillow/Pillow)
558
+
559
+ ## Hardware
560
+ Note that by default, the Phi-4-multimodal-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
561
+ * NVIDIA A100
562
+ * NVIDIA A6000
563
+ * NVIDIA H100
564
+
565
+ If you want to run the model on:
566
+ * NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with _attn_implementation="eager"
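
For instance, a minimal sketch (the dtype and device placement are assumptions, and generation arguments are omitted):

```python
# Minimal sketch for GPUs without flash attention support (e.g. V100);
# the dtype and device placement are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    _attn_implementation="eager",  # instead of the default flash attention
).to("cuda")
```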
567
 
568
  ## License
569
  The model is licensed under the [MIT license](./LICENSE).
 
653
  + Red Team:
654
  + Responses to prompts provided by AI Red Team at Microsoft
655
  </details>
modeling_phi4mm.py CHANGED
@@ -2096,7 +2096,7 @@ class Phi4MMForCausalLM(Phi4MMPreTrainedModel, GenerationMixin):
2096
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
2097
 
2098
  if isinstance(input_mode, torch.Tensor):
2099
- # len(input_mode) == num_beams in beam search, and all elements of input_mode should have the same value
2100
  input_mode = input_mode[0].item()
2101
  input_mode = InputMode(input_mode)
2102
 
 
2096
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
2097
 
2098
  if isinstance(input_mode, torch.Tensor):
2099
+ assert len(input_mode) == 1
2100
  input_mode = input_mode[0].item()
2101
  input_mode = InputMode(input_mode)
2102