---
base_model: CohereForAI/aya-vision-8b
inference: false
library_name: transformers
language:
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ar
- el
- fa
- pl
- id
- cs
- he
- hi
- nl
- ro
- ru
- tr
- uk
- vi
license: cc-by-nc-4.0
extra_gated_prompt: >-
  By submitting this form, you agree to the [License
  Agreement](https://cohere.com/c4ai-cc-by-nc-license) and acknowledge that the
  information you provide will be collected, used, and shared in accordance with
  Cohere’s [Privacy Policy](https://cohere.com/privacy). You’ll receive email
  updates about C4AI and Cohere research, events, products and services. You can
  unsubscribe at any time.
extra_gated_fields:
  Name: text
  Affiliation: text
  Country: country
  I agree to use this model for non-commercial use ONLY: checkbox
pipeline_tag: image-text-to-text
---

# Model Card for Aya Vision 8B

<img src="aya-vision-8B.png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

**C4AI Aya Vision 8B** is an open-weights research release of an 8-billion-parameter model with advanced capabilities optimized for a variety of vision-language use cases, including OCR, captioning, visual reasoning, summarization, question answering, code, and more.
It is a multilingual model trained to excel in 23 languages across both vision and language tasks.

This model card corresponds to the 8-billion-parameter version of the Aya Vision model. We also released a 32-billion-parameter version, which you can find [here](https://huggingface.co/CohereForAI/aya-vision-32B).

- Developed by: [Cohere For AI](https://cohere.for.ai/)
- Point of Contact: Cohere For AI: [cohere.for.ai](https://cohere.for.ai/)
- License: [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license); use also requires adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy)
- Model: c4ai-aya-vision-8b
- Model Size: 8 billion parameters
- Context length: 16K

## Try it: Aya Vision in Action

Before downloading the weights, you can try Aya Vision chat in the [Cohere playground](https://dashboard.cohere.com/playground/chat) or in our dedicated [Hugging Face Space](https://huggingface.co/spaces/CohereForAI/aya_expanse) for interactive exploration.

## WhatsApp Integration

You can also talk to Aya Vision through the popular messaging service WhatsApp. Use this [link](https://wa.me/14313028498) to open a WhatsApp chatbox with Aya Vision.

If you don’t have WhatsApp installed on your machine, you may need to install it first; if you have it on your phone, you can follow the on-screen instructions to link your phone with WhatsApp Web.
By the end, you should see a text window that you can use to chat with the model.
More details about our WhatsApp integration are available [here](https://docs.cohere.com/v2/docs/aya#aya-expanse-integration-with-whatsapp).

## Example Notebook

You can also check out the following [notebook](https://colab.research.google.com/github/cohere-ai/cohere-developer-experience/blob/main/notebooks/guides/aya_vision_intro.ipynb) to see how to use Aya Vision for different use cases.

## How to Use Aya Vision

Please install `transformers` from the source repository that includes the necessary changes for this model:

```python
# pip install 'git+https://github.com/huggingface/transformers.git@v4.49.0-AyaVision'
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereForAI/aya-vision-8b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the aya-vision chat template
messages = [
    {"role": "user",
     "content": [
         {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
         # Hindi: "What does the text written in the image say?"
         {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
     ]},
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
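
For multiple prompts at once, the same chat-template API can be used with a batch of conversations. The sketch below is illustrative rather than definitive: it reuses `model` and `processor` from the snippet above and assumes that `processor.apply_chat_template` accepts a list of conversations when `padding=True` is set.

```python
# Illustrative batched-generation sketch; reuses `model` and `processor` from above.
# Assumes apply_chat_template accepts a list of conversations when padding=True.
batch_messages = [
    [{"role": "user",
      "content": [
          {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
          {"type": "text", "text": "What does the text in the image say?"},
      ]}],
    [{"role": "user",
      "content": [
          {"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
          {"type": "text", "text": "Which monument is shown in this image?"},
      ]}],
]

batch_inputs = processor.apply_chat_template(
    batch_messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

batch_tokens = model.generate(**batch_inputs, max_new_tokens=300, do_sample=True, temperature=0.3)

# Strip the (padded) prompt portion and decode only the newly generated tokens.
generated = batch_tokens[:, batch_inputs.input_ids.shape[1]:]
print(processor.tokenizer.batch_decode(generated, skip_special_tokens=True))
```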

You can also use the model directly through the transformers `pipeline` abstraction:

```python
from transformers import pipeline

pipe = pipeline(model="CohereForAI/aya-vision-8b", task="image-text-to-text", device_map="auto")

# Format message with the aya-vision chat template
messages = [
    {"role": "user",
     "content": [
         {"type": "image", "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo="},
         # Turkish: "Which monument is shown in this image?"
         {"type": "text", "text": "Bu resimde hangi anıt gösterilmektedir?"},
     ]},
]
outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)

print(outputs)
```
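
With `return_full_text=False`, `outputs` should contain only the newly generated text. Assuming the list-of-dicts format with a `generated_text` field that transformers generation pipelines typically return, the answer string can be pulled out like this:

```python
# Assumes the pipeline returns a list of dicts with a "generated_text" field.
print(outputs[0]["generated_text"])
```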

## Model Details

**Input:** Model accepts input text and images.

**Output:** Model generates text.

**Model Architecture:** This is a vision-language model that combines a multilingual language model, based on [C4AI Command R7B](https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024) and further post-trained with the [Aya Expanse recipe](https://arxiv.org/abs/2412.04261), with the [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.

**Image Processing:** We use **169 visual tokens** to encode an image tile with a resolution of **364x364 pixels**. Input images of arbitrary size are mapped to the nearest supported resolution based on their aspect ratio. Aya Vision uses up to 12 input tiles plus a thumbnail (resized to 364x364), for a maximum of 2,197 image tokens.
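
As a rough, illustrative calculation of the numbers above (the exact resizing and tiling heuristics live inside the processor, so the helper below is an approximation, not the actual algorithm):

```python
# Back-of-the-envelope image-token budget for Aya Vision's tiling scheme:
# 169 visual tokens per 364x364 tile, at most 12 tiles plus one thumbnail tile.
TOKENS_PER_TILE = 169
TILE_SIZE = 364
MAX_TILES = 12

def approx_image_tokens(width: int, height: int) -> int:
    """Approximate token count: ceil-divide each side into 364px tiles,
    cap at 12 tiles, and add one thumbnail tile."""
    tiles_w = -(-width // TILE_SIZE)   # ceiling division
    tiles_h = -(-height // TILE_SIZE)
    tiles = min(tiles_w * tiles_h, MAX_TILES)
    return (tiles + 1) * TOKENS_PER_TILE  # +1 for the thumbnail

print(approx_image_tokens(728, 364))    # 2 tiles + thumbnail -> 3 * 169 = 507 tokens
print(approx_image_tokens(4000, 3000))  # capped at 12 tiles + thumbnail -> 13 * 169 = 2197 tokens
```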

**Languages covered:** The model has been trained on 23 languages: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, Chinese (Simplified and Traditional), Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.

**Context length:** Aya Vision 8B supports a context length of 16K.

For more details about how the model was trained, check out [our blogpost](https://huggingface.co/blog/aya-vision).

## Evaluation

We evaluated Aya Vision 8B against [Pangea 7B](https://huggingface.co/neulab/Pangea-7B), [Llama-3.2 11B Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision), [Molmo-D 7B](https://huggingface.co/allenai/Molmo-7B-D-0924), [Qwen2.5-VL 7B](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), [Pixtral 12B](https://huggingface.co/mistralai/Pixtral-12B-2409), and [Gemini Flash 1.5 8B](https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/) using the [Aya Vision Benchmark](https://huggingface.co/datasets/CohereForAI/AyaVisionBench) and [m-WildVision](https://huggingface.co/datasets/CohereForAI/m-WildVision).
Win rates were determined using claude-3-7-sonnet-20250219 as a judge, chosen for its superior judging performance compared to other models.

We also evaluated Aya Vision 8B’s performance on text-only input against the same models using [m-ArenaHard](https://huggingface.co/datasets/CohereForAI/m-ArenaHard), a challenging open-ended generation evaluation, with win rates measured using gpt-4o-2024-11-20 as a judge.

<!-- <img src="Aya_Vision_8B_Combined_Win_Rates.png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/> -->
<img src="AyaVision8BWinRates(AyaVisionBench).png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
<img src="AyaVision8BWinRates(m-WildVision).png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
<img src="Aya_Vision_8BvsPangea(AyaVisionBench).png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
<img src="EfficiencyvsPerformance.png" width="650" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

### Model Card Contact

For errors or additional questions about details in this model card, contact [email protected].

### Terms of Use

We hope that the release of this model will make community-based research efforts more accessible by putting the weights of a highly performant 8-billion-parameter vision-language model in the hands of researchers all over the world.

This model is governed by a [CC-BY-NC](https://cohere.com/c4ai-cc-by-nc-license) License with an acceptable use addendum, and also requires adhering to [C4AI's Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).