Evaluation Dataset Size and Details

#3
by saeedzou - opened

Hi,

Thanks for creating this space. I would like to know more about the subsets used for evaluation, in particular whether the Common Voice evaluation uses the full test set.
Also, why does vhdm/whisper-large-fa-v1 have such a high WER? Is it because of hallucinations, or does it generate empty transcripts?
In addition, I would like to recommend adding a new benchmark, PartAI/PSRB. The given link contains only 1 hour of the full 10-hour benchmark, but the authors might be willing to share the full dataset for this benchmark space.
