This paper focuses on developing translation models and related applications
for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj,
Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada,
Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili,
Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi,
Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu,
Telugu, and Urdu. Achieving this requires parallel and other types of corpora
for all 36 × 36 language pairs, as well as addressing challenges such as script
variation, phonetic differences, and syntactic diversity. For instance, languages such as
Kashmiri and Sindhi, which use multiple scripts, demand script normalization
for alignment, while low-resource languages such as Khasi and Santali require
synthetic data augmentation to ensure sufficient coverage and quality.
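
To make the scale of the pairwise corpus requirement concrete, the following
minimal sketch counts the cells of the full 36-by-36 grid, the directed
translation pairs (source differs from target), and the unordered pairs that a
single bidirectional parallel corpus could serve. The identifiers in LANGS are
hypothetical placeholders rather than the codes used in this work, and the
sketch assumes each of the 36 entries is treated as one unit.

```python
from itertools import combinations, permutations

# Hypothetical placeholder identifiers for the 36 languages; the actual
# language codes used in the paper are not specified here.
LANGS = [f"lang{i:02d}" for i in range(1, 37)]


def directed_pairs(langs):
    """Return all ordered (source, target) pairs with source != target."""
    return list(permutations(langs, 2))


def unordered_pairs(langs):
    """Return all unordered pairs; one parallel corpus can serve both directions."""
    return list(combinations(langs, 2))


n = len(LANGS)
full_grid = n * n                    # every cell, including source == target
directed = directed_pairs(LANGS)     # translation directions to be covered
unordered = unordered_pairs(LANGS)   # distinct parallel corpora needed

print(f"grid cells: {full_grid}")
print(f"directed pairs: {len(directed)}")
print(f"unordered pairs: {len(unordered)}")
```

Under these assumptions, the grid has 1296 cells, of which 1260 are directed
translation pairs, backed by 630 distinct unordered pairs.
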
To address these challenges, this work proposes corpus creation strategies
that leverage existing resources, develop parallel datasets, generate
domain-specific corpora, and apply synthetic data techniques.
Additionally, it evaluates machine translation across various dimensions,
including standard and discourse-level translation, domain-specific
translation, reference-based and reference-free evaluation, error analysis, and
automatic post-editing. By integrating these elements, the study establishes a
comprehensive framework to improve machine translation quality and enable
better cross-lingual communication in India's linguistically diverse ecosystem.