Files changed (2)
  1. README.md +41 -217
  2. modeling_phi4mm.py +1 -1
README.md CHANGED
@@ -44,7 +44,7 @@ widget:
44
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
45
  - messages:
46
  - role: user
47
- content: Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation.
48
  library_name: transformers
49
  paper: arxiv.org/abs/2503.01743
50
  ---
@@ -145,8 +145,6 @@ With Phi-4-multimodal-instruct, a single new open model has been trained across
145
  It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
146
 
147
  ## Model Quality
148
- <details>
149
- <summary>Click to view details</summary>
150
 
151
  To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (See Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:
152
 
@@ -264,7 +262,6 @@ BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very
264
 
265
  ![alt text](./figures/multi_image.png)
266
 
267
- </details>
268
 
269
  ## Usage
270
 
@@ -391,9 +388,6 @@ If it is a square image, the resolution would be around (8*448 by 8*448). For mu
391
 
392
  After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.
393
 
394
- <details>
395
- <summary>Click to view details</summary>
396
-
397
  ```python
398
  import requests
399
  import torch
@@ -473,35 +467,33 @@ response = processor.batch_decode(
473
  )[0]
474
  print(f'>>> Response\n{response}')
475
  ```
476
- </details>
477
 
478
- More inference examples can be found [**here**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_inference_phi4mm.py).
479
-
480
- ### vLLM inference
481
-
482
- Users can start a server with this command:
483
 
484
- ```bash
485
- python -m vllm.entrypoints.openai.api_server --model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code --max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 --limit-mm-per-prompt audio=3,image=3 --max-loras 2 --lora-modules speech=<path to speech lora folder> vision=<path to vision lora folder>
486
- ```
487
 
488
- The speech LoRA and vision LoRA folders are within the Phi-4-multimodal-instruct folder downloaded by vLLM; you can also use the following script to find them:
489
 
490
- ```python
491
- from huggingface_hub import snapshot_download
492
- model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")
493
- speech_lora_path = model_path+"/speech-lora"
494
- vision_lora_path = model_path+"/vision-lora"
495
- ```
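
Once the server is running, requests can be sent with any OpenAI-compatible client. The following is a minimal sketch using the `openai` Python package; it assumes the server above is listening locally on the default port 8000 and that the LoRA modules were registered under the names `speech` and `vision` as in the command shown earlier (the image URL is only an example).

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes it listens on the default port 8000 and that the vision LoRA was
# registered under the name "vision" via --lora-modules.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vision",  # name of the registered vision LoRA module
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://www.ilankelman.org/stopsigns/australia.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The speech LoRA can be selected the same way by passing `model="speech"`.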
496
 
497
  ## Training
498
 
499
- ### Fine-tuning
500
-
501
- Basic examples of supervised fine-tuning (SFT) are provided for [**speech**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_speech.py) and [**vision**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_vision.py), respectively.
502
-
503
- An example of [**how to extend speech recognition to a new language**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct#appendix-b-fine-tuning-korean-speech) is also provided.
504
-
505
  ### Model
506
 
507
  + **Architecture:** Phi-4-multimodal-instruct has 5.6B parameters and is a multimodal transformer model. The model has the pretrained Phi-4-Mini-Instruct as the backbone language model, along with advanced vision and speech encoders and adapters.<br>
@@ -535,53 +527,11 @@ Phi-4-multimodal-instruct's training data includes a wide variety of sources, to
535
  Focus was placed on the quality of data that could potentially improve the reasoning ability of the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a Premier League game on a particular day might be good training data for large foundation models, but such information was removed from the Phi-4-multimodal-instruct training data to leave more of the model's small capacity for reasoning. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data.
536
  The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
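
For illustration only, the n-gram overlap check described above might look like the following sketch; the whitespace tokenizer, n-gram length, and threshold are assumptions rather than the settings actually used.

```python
# Illustrative sketch of n-gram based decontamination (not the actual pipeline).
# The whitespace tokenizer, n-gram length, and threshold are assumptions.
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_texts: list, n: int = 13, threshold: float = 0.1) -> bool:
    sample_ngrams = ngrams(sample, n)
    if not sample_ngrams:
        return False
    benchmark_ngrams = set().union(*(ngrams(t, n) for t in benchmark_texts))
    # Flag the sample when the share of matching n-grams exceeds the threshold.
    match_ratio = len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams)
    return match_ratio >= threshold
```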
537
 
538
- ### Software
539
- * [PyTorch](https://github.com/pytorch/pytorch)
540
- * [Transformers](https://github.com/huggingface/transformers)
541
- * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
542
- * [Accelerate](https://huggingface.co/docs/transformers/main/en/accelerate)
543
- * [soundfile](https://github.com/bastibe/python-soundfile)
544
- * [pillow](https://github.com/python-pillow/Pillow)
545
-
546
- ### Hardware
547
- Note that by default, the Phi-4-multimodal-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
548
- * NVIDIA A100
549
- * NVIDIA A6000
550
- * NVIDIA H100
551
-
552
- If you want to run the model on:
553
- * NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with _attn_implementation="eager"
554
-
555
-
556
- ## Responsible AI Considerations
557
- <details>
558
- <summary>Click to view detail descriptions</summary>
559
 
560
- Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
561
- + Quality of Service: The Phi models are trained primarily on English language content across text, speech, and visual inputs, with some additional multilingual coverage. Performance may vary significantly across different modalities and languages:
562
- + Text: Languages other than English will experience reduced performance, with varying levels of degradation across different non-English languages. English language varieties with less representation in the training data may perform worse than standard American English.
563
- + Speech: Speech recognition and processing show similar language-based performance patterns, with optimal performance for standard American English accents and pronunciations. Other English accents, dialects, and non-English languages may experience lower recognition accuracy and response quality. Background noise, audio quality, and speaking speed can further impact performance.
564
- + Vision: Visual processing capabilities may be influenced by cultural and geographical biases in the training data. The model may show reduced performance when analyzing images containing text in non-English languages or visual elements more commonly found in non-Western contexts. Image quality, lighting conditions, and composition can also affect processing accuracy.
565
- + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards.
566
- + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
567
- + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case.
568
- + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
569
- + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses.
570
- + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift.
571
- + Inference of Sensitive Attributes: The Phi 4 models can sometimes attempt to infer sensitive attributes (such as personality characteristics, country of origin, gender, etc.) from the users’ voices when specifically asked to do so. Phi-4-multimodal-instruct is not designed or intended to be used as a biometric categorization system to categorize individuals based on their biometric data to deduce or infer their race, political opinions, trade union membership, religious or philosophical beliefs, sex life, or sexual orientation. This behavior can be easily and efficiently mitigated at the application level by a system message.
572
-
573
- Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi 4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
574
-
575
- + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
576
- + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
577
- + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
578
- + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
579
- + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
580
- </details>
581
 
582
  ## Safety
583
- <details>
584
- <summary>Click to view detail descriptions</summary>
585
 
586
  The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.
587
 
@@ -596,7 +546,24 @@ To assess model safety in scenarios involving both text and images, Microsoft's
596
  ### Audio Safety Evaluation
597
 
598
  In addition to extensive red teaming, the Safety of the model was assessed through three distinct evaluations. First, as performed with Text and Vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to Speech prompts. Second, [Microsoft's Speech Fairness evaluation](https://speech.microsoft.com/portal/responsibleai/assess) was run to verify that Speech-To-Text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, medical condition, etc...) from the voice of a user.
599
- </details>
600
 
601
  ## License
602
  The model is licensed under the [MIT license](./LICENSE).
@@ -686,146 +653,3 @@ The model was evaluated across a breadth of public and internal benchmarks to un
686
  + Red Team:
687
  + Responses to prompts provided by AI Red Team at Microsoft
688
  </details>
689
-
690
-
691
- ## Appendix B: Fine-tuning Korean speech
692
-
693
- <details>
694
- <summary>Click to view detail descriptions</summary>
695
-
696
- ### Overview and Datasets
697
-
698
- Phi-4-multimodal was not originally designed for the Korean speech-to-text task, but it can be fine-tuned for this task using your own data or public Korean speech datasets.
699
-
700
- We have fine-tuned the Phi-4-multimodal model for Korean speech-to-text using the following datasets:
701
-
702
- - kresnik/zeroth_korean
703
- - mozilla-foundation/common_voice_17_0 (Used Korean speech only)
704
- - PolyAI/minds14 (Used Korean speech only)
705
- Custom dataset. The speech was a mix of fast and slow speech (technical blog contents and presentations that the author has posted), with some modulation using [audiomentations](https://github.com/iver56/audiomentations) and [this script](https://github.com/daekeun-ml/azure-genai-utils/blob/main/azure_genai_utils/stt/augment.py); a rough sketch of this kind of augmentation is shown after the list.
706
-
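
As a rough illustration of the modulation applied to the custom dataset, an augmentation pipeline with [audiomentations](https://github.com/iver56/audiomentations) could look like the sketch below; the transforms, parameters, and file names are assumptions, and the script linked above is the one actually used.

```python
# Rough illustration of audio modulation with audiomentations; the transforms,
# parameters, and file names are assumptions (the actual script is linked above).
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
])

samples, sample_rate = sf.read("korean_speech.wav", dtype="float32")  # hypothetical 16 kHz input
augmented = augment(samples=samples, sample_rate=sample_rate)
sf.write("korean_speech_aug.wav", augmented, sample_rate)
```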
707
- In total, 35K samples were used. Each sample is a pair of Korean speech and its transcription. The audio was sampled at 16 kHz.
708
-
709
- You can download the fine-tuned model [here](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech). Please refer to the Jupyter notebook and video clips in the [demo folder](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech/tree/main/demos). The results are not production-quality, as the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
710
-
711
- ### Requirements
712
- With Python 3.10, the following packages are required; an A100 or H100 GPU is recommended.
713
- ```
714
- torch==2.6.0
715
- transformers==4.48.2
716
- accelerate==1.4.0
717
- soundfile==0.13.1
718
- pillow==11.1.0
719
- scipy==1.15.2
720
- torchvision==0.21.0
721
- backoff==2.2.1
722
- peft==0.14.0
723
- datasets==3.3.2
724
- pandas==2.2.3
725
- flash_attn==2.7.4.post1
726
- evaluate==0.4.3
727
- sacrebleu==2.5.1
728
- ```
729
-
730
- ### Training
731
- The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
732
-
733
- The fine-tuning script and command line are basically the same as [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-main-py), but you need to prepare your own dataset. Also, to unfreeze the audio encoder, please refer to the code snippet below, which is taken from [the fine-tuning Colab notebook](https://colab.research.google.com/drive/1JAQdpX3BtIgDmTLlnHgstKfGw7HjSfej?usp=sharing).
734
-
735
- ```python
736
- with accelerator.local_main_process_first():
737
- processor = AutoProcessor.from_pretrained(
738
- "microsoft/Phi-4-multimodal-instruct",
739
- trust_remote_code=True,
740
- )
741
- model = create_model(
742
- args.model_name_or_path,
743
- use_flash_attention=args.use_flash_attention,
744
- )
745
-
746
- def unfreeze_speech_components(model):
747
- """Directly target verified components from your debug logs"""
748
- # 1. Audio Embed Module (confirmed exists)
749
- audio_embed = model.model.embed_tokens_extend.audio_embed
750
-
751
- # 2. Entire Audio Encoder (simplified)
752
- audio_encoder = audio_embed.encoder # Direct access
753
-
754
- # 3. Audio Projection (from debug logs)
755
- audio_projection = audio_embed.audio_projection
756
-
757
- # Unfreeze ONLY these 3 components
758
- for component in [audio_embed, audio_encoder, audio_projection]:
759
- for param in component.parameters():
760
- param.requires_grad = True
761
- return model
762
-
763
- model = unfreeze_speech_components(model)
764
-
765
- # Verify unfrozen parameters
766
- trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
767
- print(f"Trainable parameters: {trainable_params:,}")
768
-
769
- # After unfreezing
770
- encoder_params = list(model.model.embed_tokens_extend.audio_embed.encoder.parameters())
771
- proj_params = list(model.model.embed_tokens_extend.audio_embed.audio_projection.parameters())
772
-
773
- assert any(p.requires_grad for p in encoder_params), "Encoder params frozen!"
774
- assert any(p.requires_grad for p in proj_params), "Projection params frozen!"
775
- print("Components properly unfrozen ✅")
776
- ```
777
-
778
- An example command to run the fine-tuning script is as follows:
779
- ```bash
780
- python main.py
781
- ```
782
-
783
- The latest version of the model currently uploaded was fine-tuned by **unfreezing the audio encoder**, and the ASR performance was significantly improved compared to the baseline LoRA adapter-based fine-tuning.
784
- Comparing full fine-tuning with LoRA fine-tuning, the CER on the zeroth test set is **1.61%** vs. 2.72%, and the WER is **3.54%** vs. 7.19%, respectively. Please refer to the [Experimental Settings and Results](#experimental-settings-and-results) for more details.
785
-
786
- ### Experimental Settings and Results
787
- The purpose of this benchmarking setup is to evaluate the model's basic performance on Korean speech and audio understanding tasks. We did this for automatic speech recognition and automatic speech translation, with the test data drawn from the datasets and samples below.
788
-
789
- Evaluation was done on the following datasets:
790
- + ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) and WER (Word Error Rate) on [zeroth-test set (457 samples)](https://huggingface.co/datasets/kresnik/zeroth_korean).
791
- + AST (Automatic Speech Translation): Evaluated with BLEU score on [fleurs ko <-> en speech translation test set (270 samples)](https://huggingface.co/datasets/seastar105/fleurs_ko_en_test).
792
-
793
- The evaluation script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
794
-
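
As a rough sketch of how these metrics could be computed with the `evaluate` and `sacrebleu` packages from the requirements above (the linked evaluation script is the authoritative version; the predictions and references below are placeholders, and the CER/WER metrics additionally pull in `jiwer`):

```python
# Rough sketch of metric computation; the linked evaluation script is authoritative.
# `predictions` and `references` are placeholder strings produced elsewhere.
import evaluate
import sacrebleu

predictions = ["안녕하세요 만나서 반갑습니다"]    # hypothetical model outputs
references = ["안녕하세요, 만나서 반갑습니다."]  # hypothetical ground-truth transcripts

cer = evaluate.load("cer").compute(predictions=predictions, references=references)
wer = evaluate.load("wer").compute(predictions=predictions, references=references)
bleu = sacrebleu.corpus_bleu(predictions, [references]).score  # for the AST directions

print(f"CER: {cer:.4f}  WER: {wer:.4f}  BLEU: {bleu:.2f}")
```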
795
- We used [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) as the baseline to improve upon, as it showed significant performance improvement after 1 epoch. Note that the baseline was trained on the [22K-sample Zeroth Korean speech dataset](https://huggingface.co/datasets/kresnik/zeroth_korean) for 1 epoch. Starting from this baseline, we conducted additional experiments with our 35K training samples under the following scenarios:
796
-
797
- + [Case 1] LoRA finetune (1 epoch): LoRA adapter-based fine-tuning for 1 epoch
798
- + [Case 2] LoRA finetune (4 epochs): LoRA adapter-based fine-tuning for 4 epochs
799
- + [Case 3] Unfreeze audio encoder finetune (4 epochs): Full fine-tuning for 4 epochs.
800
-
801
- The results of the experiments are as follows:
802
- + CER and WER for zeroth-test set (Lower is better)
803
- + Case 1's CER and WER are 3.80% and 11.52%, respectively, which are better than the baseline (7.02% and 17.31%).
804
- + Case 2's CER and WER are 2.72% and 7.19%, respectively, which are better than Case 1.
805
- + Case 3's CER and WER are 1.61% and 3.54%, respectively, which are the best among the cases.
806
-
807
- + BLEU score for fleurs ko <-> en speech translation test set (Higher is better)
808
- + Case 1's result does not improve on the baseline. In particular, the BLEU score for fleurs-ko2en-cot decreases compared to the baseline.
809
- + Case 2's result is slightly better than Case 1 and is the best among the cases.
810
- + Case 3's result does not improve on the baseline or Case 2.
811
-
812
- | Model | zeroth (CER) | zeroth (WER) | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
813
- |--------------------------------|-------------|-------------|--------------|------------------|--------------|------------------|
814
- | original | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
815
- | Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
816
- | LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
817
- | LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
818
- | Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |
819
-
820
- ## Cautions
821
-
822
- Note that this model is for PoC/experimental purposes only and is not intended to be used in production. More high-quality data, tuning, ablation studies, and experiments are needed.
823
-
824
- The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. Thus, if you are interested in Korean speech-to-text, this model can be a good starting point.
825
-
826
- ## References
827
-
828
- - https://huggingface.co/microsoft/Phi-4-multimodal-instruct
829
- - https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor
830
-
831
- </details>
 
44
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
45
  - messages:
46
  - role: user
47
+ content: Can you provide ways to eat combinations of bananas and dragonfruits?
48
  library_name: transformers
49
  paper: arxiv.org/abs/2503.01743
50
  ---
 
145
  It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!
146
 
147
  ## Model Quality
 
 
148
 
149
  To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (See Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:
150
 
 
262
 
263
  ![alt text](./figures/multi_image.png)
264
 
 
265
 
266
  ## Usage
267
 
 
388
 
389
  After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.
390
 
391
  ```python
392
  import requests
393
  import torch
 
467
  )[0]
468
  print(f'>>> Response\n{response}')
469
  ```
 
470
 
471
+ ## Responsible AI Considerations
472
 
473
+ Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
474
+ + Quality of Service: The Phi models are trained primarily on English language content across text, speech, and visual inputs, with some additional multilingual coverage. Performance may vary significantly across different modalities and languages:
475
+ + Text: Languages other than English will experience reduced performance, with varying levels of degradation across different non-English languages. English language varieties with less representation in the training data may perform worse than standard American English.
476
+ + Speech: Speech recognition and processing show similar language-based performance patterns, with optimal performance for standard American English accents and pronunciations. Other English accents, dialects, and non-English languages may experience lower recognition accuracy and response quality. Background noise, audio quality, and speaking speed can further impact performance.
477
+ + Vision: Visual processing capabilities may be influenced by cultural and geographical biases in the training data. The model may show reduced performance when analyzing images containing text in non-English languages or visual elements more commonly found in non-Western contexts. Image quality, lighting conditions, and composition can also affect processing accuracy.
478
+ + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards.
479
+ + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
480
+ + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case.
481
+ + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
482
+ + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses.
483
+ + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift.
484
+ + Inference of Sensitive Attributes: The Phi 4 models can sometimes attempt to infer sensitive attributes (such as personality characteristics, country of origin, gender, etc.) from the users’ voices when specifically asked to do so. Phi-4-multimodal-instruct is not designed or intended to be used as a biometric categorization system to categorize individuals based on their biometric data to deduce or infer their race, political opinions, trade union membership, religious or philosophical beliefs, sex life, or sexual orientation. This behavior can be easily and efficiently mitigated at the application level by a system message.
485
+
486
+ Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi 4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
487
 
488
+ + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
489
+ + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
490
+ + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
491
+ + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
492
+ + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
493
 
494
 
495
  ## Training
496
 
497
  ### Model
498
 
499
  + **Architecture:** Phi-4-multimodal-instruct has 5.6B parameters and is a multimodal transformer model. The model has the pretrained Phi-4-Mini-Instruct as the backbone language model, along with advanced vision and speech encoders and adapters.<br>
 
527
  Focus was placed on the quality of data that could potentially improve the reasoning ability of the model, and the publicly available documents were filtered to contain a preferred level of knowledge. As an example, the result of a Premier League game on a particular day might be good training data for large foundation models, but such information was removed from the Phi-4-multimodal-instruct training data to leave more of the model's small capacity for reasoning. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data.
528
  The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
529
 
530
+ ### Fine-tuning
531
 
532
+ Basic examples of supervised fine-tuning (SFT) are provided for [speech](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_speech.py) and [vision](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/sample_finetune_vision.py), respectively.
533
 
534
  ## Safety
 
 
535
 
536
  The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.
537
 
 
546
  ### Audio Safety Evaluation
547
 
548
  In addition to extensive red teaming, the Safety of the model was assessed through three distinct evaluations. First, as performed with Text and Vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to Speech prompts. Second, [Microsoft's Speech Fairness evaluation](https://speech.microsoft.com/portal/responsibleai/assess) was run to verify that Speech-To-Text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, medical condition, etc...) from the voice of a user.
549
+
550
+
551
+ ## Software
552
+ * [PyTorch](https://github.com/pytorch/pytorch)
553
+ * [Transformers](https://github.com/huggingface/transformers)
554
+ * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
555
+ * [Accelerate](https://huggingface.co/docs/transformers/main/en/accelerate)
556
+ * [soundfile](https://github.com/bastibe/python-soundfile)
557
+ * [pillow](https://github.com/python-pillow/Pillow)
558
+
559
+ ## Hardware
560
+ Note that by default, the Phi-4-multimodal-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
561
+ * NVIDIA A100
562
+ * NVIDIA A6000
563
+ * NVIDIA H100
564
+
565
+ If you want to run the model on:
566
+ * NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.from_pretrained() with _attn_implementation="eager"
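
For instance, a minimal sketch (the dtype and device placement are assumptions, and generation arguments are omitted):

```python
# Minimal sketch for GPUs without flash attention support (e.g. V100);
# the dtype and device placement are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    _attn_implementation="eager",  # instead of the default flash attention
).to("cuda")
```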
567
 
568
  ## License
569
  The model is licensed under the [MIT license](./LICENSE).
 
653
  + Red Team:
654
  + Responses to prompts provided by AI Red Team at Microsoft
655
  </details>
modeling_phi4mm.py CHANGED
@@ -2096,7 +2096,7 @@ class Phi4MMForCausalLM(Phi4MMPreTrainedModel, GenerationMixin):
2096
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
2097
 
2098
  if isinstance(input_mode, torch.Tensor):
2099
- # len(input_mode) == num_beams in beam search, and all elements of input_mode should have the same value
2100
  input_mode = input_mode[0].item()
2101
  input_mode = InputMode(input_mode)
2102
 
 
2096
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
2097
 
2098
  if isinstance(input_mode, torch.Tensor):
2099
+ assert len(input_mode) == 1
2100
  input_mode = input_mode[0].item()
2101
  input_mode = InputMode(input_mode)
2102