Sanchit Gandhi

sanchit-gandhi

AI & ML interests

Open-Source Speech

Recent Activity

updated a Space about 1 month ago
kakao-enterprise/vits
updated a model 6 months ago
openai/whisper-large-v3

Organizations

ESPnet, XTREME-S, Whisper fine-tuning sprint, Centre for Vision, Speech and Signal Processing - University of Surrey, Whisper Fine-Tuning Event, Internal Data & Models for Speech Recognition Event, Speech Recognition Community Event Version 2, Speech Seq2Seq Experiments, Speechbox, SpeechColab, Linguistic Data Consortium, Whisper Distillation, University of Edinburgh - Centre for Speech Technology Research, ESC Benchmark, End-to-End Speech Benchmark, Music Gen Sprint, Kakao Enterprise, USCD REACH, TTS Eval (OLD), diarizers-community, TTS AGI, Sweet Dream(Booth)s

sanchit-gandhi's activity

replied to their post 8 months ago
  1. Yes, it should be language agnostic.
  2. You would need to repeat fine-tuning your model, this time in a way that preserves timestamps. If you have timestamps in your target data, you can continue using them. If you don't have timestamps in your data, you can try training with LoRA. Using LoRA reduces the amount of catastrophic forgetting, so even though we don't have timestamps in our fine-tuning data, the model remembers how to make timestamped predictions. You can see a guide on LoRA fine-tuning using the PEFT library here. Note that you want to run inference in half/full precision (not 8-bit), as outlined here (see the sketch below).
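
A minimal sketch of what that inference setup could look like, assuming a LoRA adapter trained with PEFT (the adapter repo id is a placeholder):

```python
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model in half precision (not 8-bit) for inference
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3", torch_dtype=torch.float16
).to("cuda")

# "your-username/whisper-large-v3-lora" is a placeholder for your trained adapter
model = PeftModel.from_pretrained(model, "your-username/whisper-large-v3-lora")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
```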

Note that the original post is a hypothesis for why timestamps reduce hallucinations. It would need to be tested and evaluated to confirm whether these findings hold more generally!

reacted to ylacombe's post with 🔥 10 months ago
Yesterday, we released Parler-TTS and Data-Speech, a fully open-source reproduction of the work from the paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations (2402.01912)

Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.).

https://huggingface.co/collections/parler-tts/parler-tts-fully-open-source-high-quality-tts-models-66164ad285ba03e8ffde214c

Parler-TTS Mini v0.1 is the first iteration of the Parler-TTS model, trained on 10k hours of narrated audiobooks. It generates high-quality speech with features that can be controlled using a simple text prompt (e.g. gender, background noise, speaking rate, pitch and reverberation).

To further improve the prosody and naturalness of the speech, we're scaling up the training data to 50k hours of speech. The v1 release of the model will be trained on this data and will ship with inference optimisations, such as flash attention and torch compile.

parler-tts/parler_tts_mini_v0.1
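
For a feel of the API, here is a minimal sketch following the standard Parler-TTS usage pattern (the prompt and description texts are illustrative):

```python
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
# The description controls the speaking style (gender, pitch, rate, noise, etc.)
description = "A female speaker delivers her words expressively, with clear audio quality."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Generate the waveform conditioned on both the text prompt and the description
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```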

Data-Speech can be used for annotating speech characteristics in a large-scale setting.

parler-tts/open-source-speech-datasets-annotated-using-data-speech-661648ffa0d3d76bfa23d534

This work is both scalable and easily modifiable and will hopefully help the TTS research community explore new ways of conditioning speech synthesis.

All of the datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.
reacted to their post with ❤️ 11 months ago
posted an update 11 months ago
Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers' "chunked" algorithm.
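
In Transformers, that corresponds to something like the following sketch (the model choice and chunk length are illustrative):

```python
import torch
from transformers import pipeline

# Chunked long-form transcription: the audio is split into 30s windows
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    chunk_length_s=30,
)

# return_timestamps=True makes the model predict segment-level timestamps,
# which empirically reduces hallucinated repetitions
result = asr("audio.mp3", return_timestamps=True)
print(result["text"])
print(result["chunks"])  # list of {"timestamp": (start, end), "text": ...}
```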

But why does this work?

My interpretation is that forcing the model to predict timestamps is at odds with hallucination. Suppose you have the transcription:
The cat sat on the on the on the mat.

Here, "on the" is a repeated hallucination. If we ask the model to predict timestamps, then each "on the" has to contribute to the overall segment-level timing, e.g.:
<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it's impossible to fit three copies of "on the" within the time allocated to the segment, so the probability of this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with the highest probability:
<|0.00|> The cat sat on the mat.<|5.02|>

In this sense, the end timestamp is the opposite of the initial timestamp constraint described in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356) → it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same: using timestamps to improve the probability of more realistic sequences.

Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?