|
--- |
|
language: en |
|
tags: |
|
- codestral |
|
- vision-language |
|
- code-generation |
|
- multimodal |
|
- mlx |
|
license: other |
|
library_name: mlx |
|
inference: false |
|
license_name: mnpl |
|
license_link: https://mistral.ai/licences/MNPL-0.1.md |
|
--- |
|
|
|
# Codestral-ViT |
|
|
|
A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it pairs CLIP's visual encoder with Codestral's code generation capabilities.
|
|
|
## Overview |
|
|
|
Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can: |
|
- Generate code from text descriptions |
|
- Understand and explain code from screenshots |
|
- Suggest improvements to code based on visual context |
|
- Process multiple images with advanced tiling strategies |
|
|
|
## Technical Details |
|
|
|
- **Base Models:** |
|
- Language: Codestral-22B (4-bit quantized) |
|
- Vision: CLIP ViT-Large/14 |
|
- Framework: MLX (Apple Silicon) |
|
|
|
- **Architecture:** |
|
- Vision encoder processes images into 512-dim embeddings |
|
- Learned projection layer maps vision features to language space |
|
- Dynamic RoPE scaling for 32K context window |
|
- Support for overlapping image crops and tiling |
|
|
|
- **Input Processing:** |
|
- Images: 224x224 pixels, CLIP normalization |
|
- Text: Up to 32,768 tokens |
|
- Special tokens for image-text fusion |
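The preprocessing and projection steps above can be sketched as follows. This is a minimal NumPy illustration, not the model's actual implementation: the normalization constants are CLIP's published values, while the language-space width (6144, assumed here from Codestral-22B's hidden size) and the random weights are placeholders for the learned projection.

```python
import numpy as np

# CLIP's published channel-wise normalization constants.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess(image):
    """Normalize a 224x224x3 float image in [0, 1] the way CLIP expects."""
    return (image - CLIP_MEAN) / CLIP_STD

def project_to_language_space(vision_features, weight, bias):
    """Learned linear projection: 512-dim vision features -> language embedding."""
    return vision_features @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a resized screenshot
pixels = preprocess(image)               # shape (224, 224, 3)

features = rng.standard_normal(512)      # stand-in for CLIP's image embedding
weight = rng.standard_normal((512, 6144)) * 0.02  # assumed language width
bias = np.zeros(6144)
embedding = project_to_language_space(features, weight, bias)
print(embedding.shape)                   # (6144,)
```

In the real model this projection is a trained layer and the result is interleaved with text tokens via the special fusion tokens mentioned above.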
|
|
|
## Example Usage |
|
|
|
```python |
|
from PIL import Image |
|
from src.model import MultimodalCodestral |
|
|
|
model = MultimodalCodestral() |
|
|
|
# Code generation from screenshot |
|
image = Image.open("code_screenshot.png") |
|
response = model.generate_with_images( |
|
prompt="Explain this code and suggest improvements", |
|
images=[image] |
|
) |
|
|
|
# Multiple image processing |
|
images = [Image.open(f) for f in ["img1.png", "img2.png"]] |
|
response = model.generate_with_images( |
|
prompt="Compare these code implementations", |
|
images=images |
|
) |
|
``` |
|
|
|
## Capabilities |
|
|
|
- **Code Understanding:** |
|
- Analyzes code structure from screenshots |
|
- Identifies patterns and anti-patterns |
|
- Suggests contextual improvements |
|
|
|
- **Image Processing:** |
|
- Handles multiple image inputs |
|
- Supports various image formats |
|
- Advanced crop and resize strategies |
|
|
|
- **Generation Features:** |
|
- Context-aware code completion |
|
- Documentation generation |
|
- Code refactoring suggestions |
|
- Bug identification and fixes |
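The overlapping-crop strategy can be illustrated with a small helper that computes crop boxes for a larger image. This is a hedged sketch: the tile size matches the 224x224 input resolution above, but the overlap value and edge handling are illustrative assumptions, not the model's actual tiling code.

```python
def tile_coords(width, height, tile=224, overlap=32):
    """Return (left, top, right, bottom) boxes of overlapping tile-sized crops.

    Assumes the image is at least tile x tile; smaller images would need to
    be resized first (an illustrative policy, not the model's).
    """
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Guarantee the right and bottom edges are covered by a final crop.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]

boxes = tile_coords(448, 300)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can then be passed to `Image.crop` and the resulting tiles encoded individually, so wide or tall screenshots are not squashed into a single 224x224 frame.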
|
|
|
## Requirements |
|
|
|
- Apple Silicon hardware (M1/M2/M3) |
|
- 32GB+ RAM recommended |
|
- MLX framework |
|
- Python 3.8+ |
|
|
|
## Limitations |
|
|
|
- Apple Silicon only (no CPU/CUDA support) |
|
- Memory intensive for large images/codebases |
|
- Visual understanding bounded by CLIP's capabilities |
|
- Generation quality depends on input clarity |
|
|
|
## License |
|
|
|
This model is released under the Mistral AI Non-Production License (MNPL). See [license details](https://mistral.ai/licences/MNPL-0.1.md).
|
|
|
## Citation |
|
|
|
```bibtex |
|
@software{codestral-vit, |
|
author = {Mike Casale}, |
|
title = {Codestral-ViT: A Vision-Language Model for Code Generation}, |
|
year = {2023}, |
|
publisher = {Hugging Face}, |
|
url = {https://huggingface.co/casale-xyz/codestral-vit} |
|
} |
|
``` |