Image-Text-to-Text
Transformers
Safetensors
vqa
vlm
mehmetkeremturkcan committed
Commit 27eca9b · verified · 1 Parent(s): c532902

Update README.md

Files changed (1): README.md (+7 -12)
README.md CHANGED
@@ -18,37 +18,32 @@ tags:
  <img src="https://github.com/mkturkcan/femtovlm/blob/main/assets/logo.png?raw=true" width="180" />
  </p>
  <h1 align="center">
- <p>mehmetkeremturkcan/DeepSeek-LLaVA-Instruct</p>
+ <p>mehmetkeremturkcan/FemtoVLM-Tiny</p>
  </h1>
  <h3 align="center">
- <p>DeepSeer: Vision Language Models with Reasoning</p>
+ <p>FemtoVLM: Tiniest Vision Language Models</p>
  </h3>
 
- Vision language models with chain-of-thought reasoning are just starting to emerge. This is a proof-of-concept to train a vision model with thinking-enabled chat templates based on DeepSeek-R1 models.
-
- Note that this model will not always use thinking tokens, due to the current lack of high-quality CoT data in non-science contexts.
+ FemtoVLM is the smallest visual question answering/captioning model in the world. It accepts image and text inputs and produces text outputs. Designed for efficiency, FemtoVLM can answer questions about images and describe visual content, and its lightweight architecture makes it suitable for on-device applications while maintaining strong performance.
 
+ FemtoVLM comes in four sizes: 116M (femto), 143M (tiny), 160M (base), and 225M (dino). All models are trained for image captioning and question answering in real-world contexts. FemtoVLM cannot perform optical character recognition (OCR), multi-turn question answering, or scientific question answering.
  ## Setup
  ```bash
  pip install git+https://github.com/facebookresearch/schedule_free.git
  pip install peft
  git clone https://github.com/mkturkcan/seers.git
  cd seers/seers/
- git clone https://huggingface.co/mehmetkeremturkcan/DeepSeek-LLaVA-Instruct
+ git clone https://huggingface.co/mehmetkeremturkcan/FemtoVLM-Tiny
  ```
  ## Test
  In the seers/seers folder, run
  ```bash
- python predict_llava.py
+ python femtovlm_inference.py
  ```
 
  ## Train
 
  [seers](https://github.com/mkturkcan/seers) training code is public! Run
  ```bash
- python train_cot_mixed.py
+ python femtovlm_train.py
  ```
-
- ## Training Details
- This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) on the [5CD-AI/LLaVA-CoT-o1-Instruct](https://huggingface.co/datasets/5CD-AI/LLaVA-CoT-o1-Instruct) dataset.
- It has been trained using [seers](https://github.com/mkturkcan/seers).
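
The README's supported test path is the `femtovlm_inference.py` script shown in the diff above. For orientation only, here is a minimal sketch of what calling the checkpoint through plain `transformers` auto classes could look like; it assumes the FemtoVLM-Tiny architecture loads through `AutoProcessor`/`AutoModelForImageTextToText` (possibly requiring `trust_remote_code=True`), which this card does not confirm, and the image URL and prompt are placeholders.

```python
# Hedged sketch, not the card's official inference path (that is femtovlm_inference.py).
# Assumes the checkpoint loads through the generic image-text-to-text auto classes.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mehmetkeremturkcan/FemtoVLM-Tiny"
processor = AutoProcessor.from_pretrained(model_id)            # may need trust_remote_code=True
model = AutoModelForImageTextToText.from_pretrained(model_id)  # same caveat as above

# Placeholder image and question; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Describe this image."

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```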
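
Similarly, the Setup section installs schedule_free and peft before `femtovlm_train.py` is run. The sketch below only illustrates how those two dependencies are typically wired together (a LoRA-wrapped model optimized with schedule-free AdamW); the target modules, hyperparameters, and loop are illustrative assumptions, not the configuration used by the seers training script.

```python
# Hedged illustration of the two Setup dependencies working together; the real
# training logic lives in femtovlm_train.py in the seers repository.
from peft import LoraConfig, get_peft_model
from schedulefree import AdamWScheduleFree

def fit(model, dataloader, epochs=1, lr=1e-4):
    # Placeholder LoRA config; real target modules depend on the architecture.
    lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)

    optimizer = AdamWScheduleFree(model.parameters(), lr=lr)
    model.train()
    optimizer.train()  # schedule-free optimizers must be put in train mode explicitly

    for _ in range(epochs):
        for batch in dataloader:  # batches are assumed to include labels
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    optimizer.eval()  # switch back to eval mode before saving or evaluating
    model.eval()
    return model
```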