Image-Text-to-Text
Transformers
Safetensors
vqa
vlm
mehmetkeremturkcan committed
Commit 27eca9b · verified · 1 Parent(s): c532902

Update README.md

Files changed (1): README.md (+7 -12)
README.md CHANGED
@@ -18,37 +18,32 @@ tags:
  <img src="https://github.com/mkturkcan/femtovlm/blob/main/assets/logo.png?raw=true" width="180" />
  </p>
  <h1 align="center">
- <p>mehmetkeremturkcan/DeepSeek-LLaVA-Instruct</p>
+ <p>mehmetkeremturkcan/FemtoVLM-Tiny</p>
  </h1>
  <h3 align="center">
- <p>DeepSeer: Vision Language Models with Reasoning</p>
+ <p>FemtoVLM: Tiniest Vision Language Models</p>
  </h3>
 
- Vision language models with chain-of-thought reasoning are just starting to emerge. This is a proof-of-concept to train a vision model with thinking-enabled chat templates based on DeepSeek-R1 models.
-
- Note that this model will not always use thinking tokens, due to the current lack of high-quality CoT data in non-science contexts.
+ FemtoVLM is the smallest visual question answering/captioning model in the world. It accepts image and text inputs and produces text outputs. Designed for efficiency, FemtoVLM can answer questions about images and describe visual content, and its lightweight architecture makes it suitable for on-device applications while maintaining strong performance.
 
+ FemtoVLM comes in four sizes: 116M (femto), 143M (tiny), 160M (base), and 225M (dino). All models are trained for image captioning and question answering in real-world contexts. FemtoVLM cannot perform optical character recognition (OCR), multi-turn question answering, or scientific question answering.
  ## Setup
  ```bash
  pip install git+https://github.com/facebookresearch/schedule_free.git
  pip install peft
  git clone https://github.com/mkturkcan/seers.git
  cd seers/seers/
- git clone https://huggingface.co/mehmetkeremturkcan/DeepSeek-LLaVA-Instruct
+ git clone https://huggingface.co/mehmetkeremturkcan/FemtoVLM-Tiny
  ```
  ## Test
  In the seers/seers folder, run
  ```bash
- python predict_llava.py
+ python femtovlm_inference.py
  ```
 
  ## Train
 
  [seers](https://github.com/mkturkcan/seers) training code is public! Run
  ```bash
- python train_cot_mixed.py
+ python femtovlm_train.py
  ```
-
- ## Training Details
- This model is a fine-tuned version of [deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) on the [5CD-AI/LLaVA-CoT-o1-Instruct](https://huggingface.co/datasets/5CD-AI/LLaVA-CoT-o1-Instruct) dataset.
- It has been trained using [seers](https://github.com/mkturkcan/seers).
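
The README's supported test path is the `femtovlm_inference.py` script shown in the diff above. For orientation only, here is a minimal sketch of what calling the checkpoint through plain `transformers` auto classes could look like; it assumes the FemtoVLM-Tiny architecture loads through `AutoProcessor`/`AutoModelForImageTextToText` (possibly requiring `trust_remote_code=True`), which this card does not confirm, and the image URL and prompt are placeholders.

```python
# Hedged sketch, not the card's official inference path (that is femtovlm_inference.py).
# Assumes the checkpoint loads through the generic image-text-to-text auto classes.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "mehmetkeremturkcan/FemtoVLM-Tiny"
processor = AutoProcessor.from_pretrained(model_id)            # may need trust_remote_code=True
model = AutoModelForImageTextToText.from_pretrained(model_id)  # same caveat as above

# Placeholder image and question; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Describe this image."

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```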
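
Similarly, the Setup section installs schedule_free and peft before `femtovlm_train.py` is run. The sketch below only illustrates how those two dependencies are typically wired together (a LoRA-wrapped model optimized with schedule-free AdamW); the target modules, hyperparameters, and loop are illustrative assumptions, not the configuration used by the seers training script.

```python
# Hedged illustration of the two Setup dependencies working together; the real
# training logic lives in femtovlm_train.py in the seers repository.
from peft import LoraConfig, get_peft_model
from schedulefree import AdamWScheduleFree

def fit(model, dataloader, epochs=1, lr=1e-4):
    # Placeholder LoRA config; real target modules depend on the architecture.
    lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)

    optimizer = AdamWScheduleFree(model.parameters(), lr=lr)
    model.train()
    optimizer.train()  # schedule-free optimizers must be put in train mode explicitly

    for _ in range(epochs):
        for batch in dataloader:  # batches are assumed to include labels
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    optimizer.eval()  # switch back to eval mode before saving or evaluating
    model.eval()
    return model
```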