---
language: en
tags:
- codestral
- vision-language
- code-generation
- multimodal
- mlx
license: other
library_name: mlx
inference: false
license_name: mnpl
license_link: https://mistral.ai/licences/MNPL-0.1.md
---

# Codestral-ViT

A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it pairs a CLIP vision encoder with the Codestral language model.

## Overview

Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can:
- Generate code from text descriptions
- Understand and explain code from screenshots
- Suggest improvements to code based on visual context
- Process multiple images with advanced tiling strategies
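
The tiling strategy isn't spelled out in this card, so here is a minimal sketch of what overlapping crops could look like; `tile_image` is a hypothetical helper (not part of this package), and the tile size and overlap values are assumptions:

```python
from typing import List

from PIL import Image

def tile_image(img: Image.Image, tile: int = 224, overlap: int = 32) -> List[Image.Image]:
    """Split an image into overlapping tile x tile crops.

    Hypothetical helper: the card only states that overlapping crops and
    tiling are supported; the tile size and overlap here are assumptions.
    """
    stride = tile - overlap
    width, height = img.size
    crops = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            box = (left, top, min(left + tile, width), min(top + tile, height))
            crops.append(img.crop(box))  # edge crops may be smaller than `tile`
    return crops
```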

## Technical Details

- **Base Models:**
  - Language: Codestral-22B (4-bit quantized)
  - Vision: CLIP ViT-Large/14
  - Framework: MLX (Apple Silicon)

- **Architecture:**
  - Vision encoder processes images into 512-dim embeddings
  - Learned projection layer maps vision features to language space
  - Dynamic RoPE scaling for 32K context window
  - Support for overlapping image crops and tiling
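
As a rough illustration of the projection step, a minimal MLX sketch follows; the 512-dim input matches the figure above, while the `VisionProjector` name and the 6144-dim output are assumptions, not the actual implementation:

```python
import mlx.core as mx
import mlx.nn as nn

class VisionProjector(nn.Module):
    """Sketch of the learned vision-to-language projection described above.

    The 512-dim input matches the card; the 6144-dim output is an assumed
    hidden size for the language model, not a confirmed value.
    """
    def __init__(self, vision_dim: int = 512, hidden_dim: int = 6144):
        super().__init__()
        self.proj = nn.Linear(vision_dim, hidden_dim)

    def __call__(self, image_embeds: mx.array) -> mx.array:
        # (num_images, 512) -> (num_images, hidden_dim); the projected
        # embeddings are spliced into the token sequence via special tokens
        return self.proj(image_embeds)
```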

- **Input Processing:**
  - Images: 224x224 pixels, CLIP normalization
  - Text: Up to 32,768 tokens
  - Special tokens for image-text fusion
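
Preprocessing along these lines might look like the sketch below; the mean/std values are OpenAI CLIP's published normalization constants, and `preprocess` is a hypothetical helper rather than this package's API:

```python
import numpy as np
import mlx.core as mx
from PIL import Image

# OpenAI CLIP's published normalization constants; assumed to apply here.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(img: Image.Image) -> mx.array:
    """Resize to 224x224 and apply CLIP channel-wise normalization."""
    img = img.convert("RGB").resize((224, 224), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32) / 255.0  # (224, 224, 3), in [0, 1]
    arr = (arr - CLIP_MEAN) / CLIP_STD
    return mx.expand_dims(mx.array(arr), axis=0)  # add a leading batch dimension
```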

## Example Usage

```python
from PIL import Image
from src.model import MultimodalCodestral

model = MultimodalCodestral()

# Code generation from screenshot
image = Image.open("code_screenshot.png")
response = model.generate_with_images(
    prompt="Explain this code and suggest improvements",
    images=[image]
)

# Multiple image processing
images = [Image.open(f) for f in ["img1.png", "img2.png"]]
response = model.generate_with_images(
    prompt="Compare these code implementations",
    images=images
)
```

## Capabilities

- **Code Understanding:**
  - Analyzes code structure from screenshots
  - Identifies patterns and anti-patterns
  - Suggests contextual improvements

- **Image Processing:**
  - Handles multiple image inputs
  - Supports various image formats
  - Advanced crop and resize strategies

- **Generation Features:**
  - Context-aware code completion
  - Documentation generation
  - Code refactoring suggestions
  - Bug identification and fixes

## Requirements

- Apple Silicon hardware (M1/M2/M3)
- 32GB+ RAM recommended
- MLX framework
- Python 3.8+

## Limitations

- Apple Silicon only (no CPU/CUDA support)
- Memory intensive for large images/codebases
- Visual understanding bounded by CLIP's capabilities
- Generation quality depends on input clarity

## License

This model is released under the Mistral AI Non-Production License (MNPL). See [license details](https://mistral.ai/licences/MNPL-0.1.md).

## Citation

```bibtex
@software{codestral-vit,
  author = {Mike Casale},
  title = {Codestral-ViT: A Vision-Language Model for Code Generation},
  year = {2023},
  publisher = {Hugging Face},
  url = {https://huggingface.co/casale-xyz/codestral-vit}
}
```