---
license: cc-by-4.0
language:
- en
- fr
library_name: moshi
tags:
- audio
- automatic-speech-recognition
---
# Model Card for Kyutai STT

See also the [project page](https://kyutai.org/next/stt)
and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR).
Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript,
our model starts to output the transcript as soon as a few seconds of audio become available.

## Model Details

The model architecture is a Transformer that consumes audio tokenized by Mimi (see [the Moshi paper](https://arxiv.org/abs/2410.00037)) and outputs text tokens.
The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
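
To make these numbers concrete, the following plain-Python sketch works out what the 12.5 Hz frame rate implies for a streaming client; the constants come from the paragraph above and the helper name is illustrative only:

```python
# Arithmetic implied by the numbers above: 12.5 Hz frames, 32 tokens per frame.
FRAME_RATE_HZ = 12.5     # audio frames per second
TOKENS_PER_FRAME = 32    # audio tokens per frame

frame_duration_ms = 1000 / FRAME_RATE_HZ                     # 80 ms of audio per frame
audio_tokens_per_second = FRAME_RATE_HZ * TOKENS_PER_FRAME   # 400 audio tokens per second

def frame_start_seconds(frame_index: int) -> float:
    """Start time of a given audio frame within the stream."""
    return frame_index / FRAME_RATE_HZ

print(frame_duration_ms, audio_tokens_per_second, frame_start_seconds(25))
# 80.0 400.0 2.0
```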

We release two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

## Model Description

Kyutai STT is a decoder-only model for streaming speech-to-text.
It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the text stream based on the speech stream.
The text stream is shifted w.r.t. the audio stream to allow the model to predict text tokens based on the input audio.
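
As a toy illustration of this shift (not the actual training code; the token values and `PAD` marker are made up):

```python
# Toy illustration of the delayed-streams idea: pad the front of the text
# stream so each text token is predicted only after the matching audio frame
# (plus the delay) has been consumed. All values are illustrative.
FRAME_RATE_HZ = 12.5
text_delay_seconds = 0.5                                  # kyutai/stt-1b-en_fr
delay_frames = round(text_delay_seconds * FRAME_RATE_HZ)  # ~6 frames

audio_frames = [f"audio_{i}" for i in range(12)]
text_tokens = [f"text_{i}" for i in range(12)]

# At step t the model consumes audio_frames[t] and predicts the text token
# aligned with audio frame t - delay_frames (PAD while the delay elapses).
shifted_text = ["PAD"] * delay_frames + text_tokens[: len(audio_frames) - delay_frames]
assert len(shifted_text) == len(audio_frames)
```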

* Developed by: Kyutai
* Model type: Streaming Speech-to-Text transcription
* Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`
* License: Model weights are licensed under CC-BY 4.0
* Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/)

## Uses

### Direct Use

The model can be used for streaming speech-to-text.
It is robust to noisy conditions and was found to perform well on audio as long as 2 hours with no additional changes.
The model produces transcripts with capitalization and punctuation.
The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset.
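
In code, that recovery is a one-liner; a minimal sketch (the function name is illustrative, and frame indices are assumed to start at 0):

```python
# Timestamp recovery as described above: frame time minus the text delay.
FRAME_RATE_HZ = 12.5

def token_timestamp_seconds(frame_index: int, text_delay_seconds: float) -> float:
    """Approximate audio time of a text token emitted at `frame_index`."""
    return max(0.0, frame_index / FRAME_RATE_HZ - text_delay_seconds)

# For kyutai/stt-2.6b-en (2.5 s delay), a token emitted at frame 100
# (8.0 s into the stream) corresponds to audio around 5.5 s.
print(token_timestamp_seconds(100, 2.5))  # 5.5
```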

## How to Get Started with the Model

See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).
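
For orientation, here is a minimal PyTorch sketch of streaming inference with the `moshi` package. It is not the official example: the loader entry points (`CheckpointInfo.from_hf_repo`, `get_mimi`, `get_moshi`, `get_text_tokenizer`) and the padding-token handling are assumptions based on recent `moshi` releases, so treat the repository's scripts as the authoritative reference.

```python
# Hedged sketch only: the loader/API names below are assumptions based on
# recent `moshi` releases; see the GitHub repository for maintained scripts.
import torch
from moshi.models import loaders, LMGen

device = "cuda" if torch.cuda.is_available() else "cpu"

info = loaders.CheckpointInfo.from_hf_repo("kyutai/stt-2.6b-en")
mimi = info.get_mimi(device=device)          # Mimi codec: 24 kHz audio in, 12.5 Hz frames out
text_tokenizer = info.get_text_tokenizer()   # tokenizer for the text stream
lm = info.get_moshi(device=device)
lm_gen = LMGen(lm, temp=0, temp_text=0)      # greedy decoding for transcription

# Placeholder input: 2 seconds of silence; replace with real 24 kHz mono audio.
pcm = torch.zeros(1, 1, mimi.frame_size * 25, device=device)

pieces = []
with torch.no_grad(), mimi.streaming(1), lm_gen.streaming(1):
    for start in range(0, pcm.shape[-1], mimi.frame_size):
        audio_tokens = mimi.encode(pcm[..., start : start + mimi.frame_size])
        text_tokens = lm_gen.step(audio_tokens)
        if text_tokens is not None:
            tok = text_tokens[0, 0, 0].item()
            if tok not in (0, 3):  # assumption: 0 and 3 are padding/special tokens
                pieces.append(text_tokenizer.id_to_piece(tok))

print("".join(pieces).replace("▁", " "))  # SentencePiece marks spaces with ▁
```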

## Training Details

### Training Data

Pretraining stage: For both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content.
For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).

For `kyutai/stt-2.6b-en`:

- Finetuning stage: We then finetune the model on a collection of public datasets with ground-truth transcripts. This dataset contains 24,000 hours of audio.

- Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio. The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1,000 hours) and (b) synthesizing dialogs (22,000 hours).

For `kyutai/stt-1b-en_fr`:

- Finetuning stage: We finetune on the Fisher dataset of 2,000 hours of English audio, plus proprietary data (1,000 hours in English, 600 hours in French).

### Compute Infrastructure

Pretraining and finetuning were done with 48 and 16 H100 Nvidia GPUs, respectively.

## Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez