---
datasets:
- kresnik/zeroth_korean
metrics:
- bleu
- cer
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-mm-inst-zeroth-kor
  results:
  - task:
      type: speech-to-text-translation
    dataset:
      type: seastar105/fleurs_ko_en_test
      name: fleurs (ko-en test intersection)
    metrics:
    - type: bleu
      name: ko2en
      value: 7.07
    - type: bleu
      name: ko2en-cot
      value: 9.19
    - type: bleu
      name: en2ko (ko-mecab)
      value: 13.08
    - type: bleu
      name: en2ko-cot (ko-mecab)
      value: 9.35
  - task:
      type: automatic-speech-recognition
    dataset:
      type: kresnik/zeroth_korean
      name: zeroth_korean test
    metrics:
    - type: cer
      name: test CER
      value: 7.02
language:
- ko
---

This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the [kresnik/zeroth_korean](https://huggingface.co/datasets/kresnik/zeroth_korean) dataset for only 1 epoch. The fine-tuning script is [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-main-py), adapted from the example in the Phi-4 repository.

The model was trained for only 174 steps on the zeroth train set; the main purpose is to check whether training on Korean ASR alone can transfer to other speech tasks (e.g., speech-to-text translation).

## Evaluation

ASR was evaluated on the zeroth test set (CER, lower is better) and speech translation on fleurs ko <-> en (BLEU, higher is better). The evaluation script is [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py); evaluation used a single A40 GPU.

| Model | zeroth-test (CER) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|----------|------------|--------------|------------------|--------------|------------------|
| original | 195.92 | 5.62 | 2.45 | 6.87 | 4.35 |
| finetune (this model) | 7.02 | 7.07 | 9.19 | 13.08 | 9.35 |
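The linked gist is the authoritative implementation of these metrics. As a minimal sketch only, assuming the Hugging Face `evaluate` library (its `cer` module needs `jiwer`, and sacrebleu's `ko-mecab` tokenizer needs the mecab-ko extras, e.g. `pip install sacrebleu[ko]`), the same scores could be computed roughly like this:

```python
# Minimal metric sketch, NOT the linked evaluation script.
# Assumes the `evaluate` library with its `cer` and `sacrebleu` modules.
import evaluate

cer_metric = evaluate.load("cer")
bleu_metric = evaluate.load("sacrebleu")

# ASR: character error rate between model transcripts and gold transcripts
cer = cer_metric.compute(
    predictions=["모델이 출력한 전사"],  # placeholder hypothesis
    references=["정답 전사"],            # placeholder reference
)

# en2ko translation: corpus BLEU with Korean mecab tokenization
bleu = bleu_metric.compute(
    predictions=["모델 번역 출력"],      # placeholder hypothesis
    references=[["정답 번역"]],          # one list of references per prediction
    tokenize="ko-mecab",
)
print(f"CER: {cer:.4f}, BLEU: {bleu['score']:.2f}")
```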
## Example script

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"

generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

max_new_tokens = 256

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompts are from the technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# "몬토 킬은 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"

# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_ko_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# "가장 쉽게 접근 가능한 식물 자원은 잎과 légumes에서 접근 가능한 단백질이었을 것이다가요 하지만 이것들은 고형상 동물처럼 우리에게 소화하기 어렵습니다만 그것들이 끓여 있다면요"
```
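For the two CoT prompts, the model returns the transcript and the translation in a single string, joined by the `<sep>` marker requested in the prompt. A sketch of running and parsing a ko2en CoT request follows; note that the `ko_audio` column name is an assumption by analogy with `en_audio` above, so check the `seastar105/fleurs_ko_en_test` schema before relying on it:

```python
# AST with CoT, KO -> EN: reuses model, processor, prompts, and ast_ds from above.
# NOTE: "ko_audio" is an assumed column name (by analogy with "en_audio").
item = ast_ds[0]
audio = (item["ko_audio"]["array"], item["ko_audio"]["sampling_rate"])
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# Expected shape: "<korean transcript> <sep> <english translation>"
transcript, _, translation = response.partition("<sep>")
print("transcript:", transcript.strip())
print("translation:", translation.strip())
```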