---
license: other
license_name: krutrim-community-license-agreement-version-1.0
license_link: LICENSE.md
language:
- hi
- bn
- ta
- te
- gu
- or
- en
- as
- ml
- mr
- kn
pipeline_tag: image-text-to-text
---
# Chitrarth: Bridging Vision and Language for a Billion People
[Paper Link👁️](https://arxiv.org/abs/2502.15392)
[![Static Badge](https://img.shields.io/badge/Huggingface-Chitrarth-yellow?logo=huggingface)](https://huggingface.co/krutrim-ai-labs/chitrarth)	[![Static Badge](https://img.shields.io/badge/Github-Chitrarth-green?logo=github)](https://github.com/ola-krutrim/Chitrarth)	[![Static Badge](https://img.shields.io/badge/Krutrim_Cloud-Chitrarth-orange?logo=)](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=chitrarth&artifactType=model)	[![Static Badge](https://img.shields.io/badge/Krutrim_AI_Labs-Chitrarth-blue?logo=)](https://ai-labs.olakrutrim.com/models/Chitrarth-1)

## 1. Introduction

Chitrarth (Chitra: Image; Artha: Meaning) is a multilingual Vision-Language Model (VLM) that integrates a state-of-the-art multilingual Large Language Model (LLM) with a vision module. The model is trained primarily on multilingual image-text data and is designed to work across 10 prominent Indian languages (Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese) as well as English.

[![Chitrarth](https://img.youtube.com/vi/TmzEweLIgsc/0.jpg)](https://www.youtube.com/watch?v=TmzEweLIgsc)

## 2. Model Summary

### Key Features
- **Model:** Krutrim-1 as the base LLM, SigLIP as the visual encoder with a 2-layer MLP
- **Languages Supported:** 10 Indic languages (Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese) as well as English
- **Usage:** General purpose VLM

![model](assets/model.png)
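
To make the role of the 2-layer MLP concrete, here is a minimal, dependency-free sketch of how such a projector could map visual-encoder patch features into the LLM embedding space. It is an illustration under stated assumptions, not the released implementation: the class name `MLPProjector`, the toy dimensions, the GELU activation, and the hidden width equal to the LLM width are all hypothetical choices in the style of LLaVA-1.5 projectors.

```python
import math
import random


def gelu(x: float) -> float:
    """GELU activation computed with the error function (assumed activation)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, n_in, n_out):
    """Randomly initialise one linear layer (illustrative weights only)."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


class MLPProjector:
    """Hypothetical 2-layer MLP: vision_dim -> llm_dim -> llm_dim."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(rng, vision_dim, llm_dim)
        self.w2, self.b2 = init_linear(rng, llm_dim, llm_dim)

    def __call__(self, patches):
        projected = []
        for patch in patches:
            hidden = [gelu(h) for h in linear(patch, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


# Toy sizes chosen for readability; real encoders and LLMs are far wider.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
projector = MLPProjector(VISION_DIM, LLM_DIM)
patch_features = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(NUM_PATCHES)]
image_tokens = projector(patch_features)
print(len(image_tokens), len(image_tokens[0]))  # prints: 4 16
```

Each image patch becomes one vector of the LLM's embedding width, so the projected patches can be placed alongside text token embeddings in the LLM input sequence.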


## 3. API Platform
Visit [Chitrarth Online](https://cloud.olakrutrim.com/console/inference-service?section=models&modelName=Krutrim&artifactName=chitrarth&artifactType=model) to access the model via the web interface. 


## 4. Inference code


```
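# Clone the repository and create an isolated Python 3.10 environment
git clone https://github.com/ola-krutrim/Chitrarth.git
conda create --name chitrarth python=3.10
conda activate chitrarth

# Install the package in editable mode from the repository root
cd Chitrarth
pip install -e .

# Run inference on an image with a text query
python chitrarth/inference.py --model-path "krutrim-ai-labs/chitrarth" --image-file "assets/govt_school.jpeg" --query "Explain the image."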
```
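
If you want to ask several questions about the same image, the command above can be scripted. The sketch below only builds the argument list from the three flags shown in this section (`--model-path`, `--image-file`, `--query`); the helper name `build_inference_command` and the Hindi query are hypothetical additions for illustration, and the script is assumed to be launched from the repository root.

```python
import shlex
import sys


def build_inference_command(
    image_file,
    query,
    model_path="krutrim-ai-labs/chitrarth",
    script="chitrarth/inference.py",
):
    """Return the argv list for one inference call (hypothetical helper)."""
    return [
        sys.executable,
        script,
        "--model-path", model_path,
        "--image-file", image_file,
        "--query", query,
    ]


# Example queries; the Hindi one means "Describe this image."
queries = ["Explain the image.", "इस चित्र का वर्णन करें।"]
commands = [build_inference_command("assets/govt_school.jpeg", q) for q in queries]

for cmd in commands:
    # Print a copy-pastable shell command. To execute instead, call
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string keeps queries containing spaces or quotes intact without manual shell escaping.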

## 5. Evaluation Results


![model](assets/radar.png)

Performance against state-of-the-art (SOTA) VLMs on different academic multimodal tasks. Our model consistently outperforms IDEFICS 2 (7B) and PALO (7B) on different benchmarks while remaining competitive on TextVQA and VizWiz.

We introduce **BharatBench**, a comprehensive evaluation benchmark suite designed for **10 under-resourced Indic languages** across **3 tasks**. The performance of **Chitrarth** on the BharatBench Evaluation framework sets a strong baseline for future research in this domain. Our model is unique in its ability to handle all included languages.

Below are the performance results of **Chitrarth** on BharatBench across three evaluation tasks: **POPE**, **LLaVA-Bench**, and **MMVet**.

| **Language**   | **POPE** | **LLaVA-Bench** | **MMVet** |
|----------------|----------|-----------------|-----------|
| **Telugu**     | 79.9     | 54.8            | 43.76     |
| **Hindi**      | 78.68    | 51.5            | 38.85     |
| **Bengali**    | 83.24    | 53.7            | 33.24     |
| **Malayalam**  | 85.29    | 55.5            | 25.36     |
| **Kannada**    | 85.52    | 58.1            | 46.19     |
| **Assamese**   | 55.59    | 59.1            | 37.29     |
| **Tamil**      | 83.28    | 58.3            | 34.31     |
| **Marathi**    | 79.17    | 52.8            | 40.96     |
| **Gujarati**   | 84.75    | 55.9            | 39.03     |
| **Odia**       | 82.03    | 62.8            | 19.67     |
| **English**    | 87.63    | 67.9            | 30.49     |
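
For a quick summary of the table, the following sketch computes an unweighted mean across the 11 listed languages and the highest- and lowest-scoring language for each task. The scores are copied from the table above; the aggregate figures are derived here for convenience and are not numbers reported by the authors.

```python
from statistics import mean

# Scores copied from the table above, in the order (POPE, LLaVA-Bench, MMVet).
SCORES = {
    "Telugu": (79.9, 54.8, 43.76),
    "Hindi": (78.68, 51.5, 38.85),
    "Bengali": (83.24, 53.7, 33.24),
    "Malayalam": (85.29, 55.5, 25.36),
    "Kannada": (85.52, 58.1, 46.19),
    "Assamese": (55.59, 59.1, 37.29),
    "Tamil": (83.28, 58.3, 34.31),
    "Marathi": (79.17, 52.8, 40.96),
    "Gujarati": (84.75, 55.9, 39.03),
    "Odia": (82.03, 62.8, 19.67),
    "English": (87.63, 67.9, 30.49),
}
TASKS = ("POPE", "LLaVA-Bench", "MMVet")

summary = {}
for index, task in enumerate(TASKS):
    # Pull out one column of the table as {language: score}.
    column = {language: row[index] for language, row in SCORES.items()}
    summary[task] = {
        "mean": round(mean(column.values()), 2),
        "best": max(column, key=column.get),
        "worst": min(column, key=column.get),
    }

for task, stats in summary.items():
    print(f"{task}: mean={stats['mean']:.2f}, best={stats['best']}, worst={stats['worst']}")
```

An unweighted mean treats every language equally regardless of how many speakers it has or how much evaluation data it covers, so it should be read as a rough summary only.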

## 6. License
This code repository and the model weights are licensed under the [Krutrim Community License](LICENSE.md).

## 7. Citation

```
@inproceedings{
  khan2024chitrarth,
  title={Chitrarth: Bridging Vision and Language for a Billion People},
  author={Shaharukh Khan and Ayush Tarun and Abhinav Ravi and Ali Faraz and Praveen Kumar Pokala and Anagha Bhangare and Raja Kolla and Chandra Khatri and Shubham Agarwal},
  booktitle={NeurIPS Multimodal Algorithmic Reasoning},
  year={2024},
}
```

## 8. Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.

## 9. Acknowledgement

Chitrarth is built with reference to the code of the following projects: [Transformers](https://github.com/huggingface/transformers) and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!