|
--- |
|
language: en |
|
tags: |
|
- codestral |
|
- vision-language |
|
- code-generation |
|
- multimodal |
|
- mlx |
|
license: other |
|
library_name: mlx |
|
inference: false |
|
license_name: mnpl |
|
license_link: https://mistral.ai/licences/MNPL-0.1.md |
|
--- |
|
|
|
# Codestral-ViT |
|
|
|
A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it pairs CLIP's visual encoder with Codestral's code generation capabilities.
|
|
|
## Overview |
|
|
|
Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can: |
|
- Generate code from text descriptions |
|
- Understand and explain code from screenshots |
|
- Suggest improvements to code based on visual context |
|
- Process multiple images with advanced tiling strategies |
|
|
|
## Technical Details |
|
|
|
- **Base Models:** |
|
- Language: Codestral-22B (4-bit quantized) |
|
- Vision: CLIP ViT-Large/14 |
|
- Framework: MLX (Apple Silicon) |
|
|
|
- **Architecture:** |
|
- Vision encoder processes images into 512-dim embeddings |
|
- Learned projection layer maps vision features to language space |
|
- Dynamic RoPE scaling for 32K context window |
|
- Support for overlapping image crops and tiling |
|
|
|
- **Input Processing:** |
|
- Images: 224x224 pixels, CLIP normalization |
|
- Text: Up to 32,768 tokens |
|
- Special tokens for image-text fusion |
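The preprocessing and projection steps above can be sketched as follows. This is a minimal NumPy illustration, not the model's actual implementation: the normalization constants are CLIP's published values, while the language-space width (6144, assumed here from Codestral-22B's hidden size) and the random weights are placeholders for the learned projection.

```python
import numpy as np

# CLIP's published channel-wise normalization constants.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess(image):
    """Normalize a 224x224x3 float image in [0, 1] the way CLIP expects."""
    return (image - CLIP_MEAN) / CLIP_STD

def project_to_language_space(vision_features, weight, bias):
    """Learned linear projection: 512-dim vision features -> language embedding."""
    return vision_features @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # stand-in for a resized screenshot
pixels = preprocess(image)               # shape (224, 224, 3)

features = rng.standard_normal(512)      # stand-in for CLIP's image embedding
weight = rng.standard_normal((512, 6144)) * 0.02  # assumed language width
bias = np.zeros(6144)
embedding = project_to_language_space(features, weight, bias)
print(embedding.shape)                   # (6144,)
```

In the real model this projection is a trained layer and the result is interleaved with text tokens via the special fusion tokens mentioned above.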
|
|
|
## Example Usage |
|
|
|
```python |
|
from PIL import Image |
|
from src.model import MultimodalCodestral |
|
|
|
model = MultimodalCodestral() |
|
|
|
# Code generation from screenshot |
|
image = Image.open("code_screenshot.png") |
|
response = model.generate_with_images( |
|
prompt="Explain this code and suggest improvements", |
|
images=[image] |
|
) |
|
|
|
# Multiple image processing |
|
images = [Image.open(f) for f in ["img1.png", "img2.png"]] |
|
response = model.generate_with_images( |
|
prompt="Compare these code implementations", |
|
images=images |
|
) |
|
``` |
|
|
|
## Capabilities |
|
|
|
- **Code Understanding:** |
|
- Analyzes code structure from screenshots |
|
- Identifies patterns and anti-patterns |
|
- Suggests contextual improvements |
|
|
|
- **Image Processing:** |
|
- Handles multiple image inputs |
|
- Supports various image formats |
|
- Advanced crop and resize strategies |
|
|
|
- **Generation Features:** |
|
- Context-aware code completion |
|
- Documentation generation |
|
- Code refactoring suggestions |
|
- Bug identification and fixes |
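The overlapping-crop strategy can be illustrated with a small helper that computes crop boxes for a larger image. This is a hedged sketch: the tile size matches the 224x224 input resolution above, but the overlap value and edge handling are illustrative assumptions, not the model's actual tiling code.

```python
def tile_coords(width, height, tile=224, overlap=32):
    """Return (left, top, right, bottom) boxes of overlapping tile-sized crops.

    Assumes the image is at least tile x tile; smaller images would need to
    be resized first (an illustrative policy, not the model's).
    """
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Guarantee the right and bottom edges are covered by a final crop.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]

boxes = tile_coords(448, 300)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can then be passed to `Image.crop` and the resulting tiles encoded individually, so wide or tall screenshots are not squashed into a single 224x224 frame.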
|
|
|
## Requirements |
|
|
|
- Apple Silicon hardware (M1/M2/M3) |
|
- 32GB+ RAM recommended |
|
- MLX framework |
|
- Python 3.8+ |
|
|
|
## Limitations |
|
|
|
- Apple Silicon only (no CPU/CUDA support) |
|
- Memory intensive for large images/codebases |
|
- Visual understanding bounded by CLIP's capabilities |
|
- Generation quality depends on input clarity |
|
|
|
## License |
|
|
|
This model is released under the Mistral AI Non-Production License (MNPL). See [license details](https://mistral.ai/licences/MNPL-0.1.md).
|
|
|
## Citation |
|
|
|
```bibtex |
|
@software{codestral-vit, |
|
author = {Mike Casale}, |
|
title = {Codestral-ViT: A Vision-Language Model for Code Generation}, |
|
year = {2023}, |
|
publisher = {Hugging Face}, |
|
url = {https://huggingface.co/casale-xyz/codestral-vit} |
|
} |
|
``` |