Update README.md
README.md
CHANGED
|
@@ -9,19 +9,21 @@ language:
|
|
| 9 |
|
| 10 |
Below is the model card of LLaVa-NeXT-Video model 7b, which is copied from the original Llava model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).
|
| 11 |
|
| 12 |
-
Check out also the Google Colab demo to run Llava on a free-tier Google Colab instance: [](https://colab.research.google.com/drive/
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
-
## Model details
|
| 18 |
|
| 19 |
**Model type:**
|
| 20 |
<br>
|
| 21 |
-
LLaVA-Next-Video is an open-source chatbot trained by fine-tuning LLM on multimodal instruction-following data.
|
| 22 |
<br>
|
| 23 |
Base LLM: lmsys/vicuna-7b-v1.5
|
| 24 |
|
| 25 |
**Model date:**
|
| 26 |
<br>
|
| 27 |
LLaVA-Next-Video-7B was trained in April 2024.
|
|
@@ -31,7 +33,24 @@ LLaVA-Next-Video-7B was trained in April 2024.
|
|
| 31 |
https://github.com/LLaVA-VL/LLaVA-NeXT
|
| 32 |
|
| 33 |
|
| 34 |
-
##
|
| 35 |
|
| 36 |
First, make sure to have `transformers >= 4.42.0`.
|
| 37 |
The model supports multi-visual and multi-prompt generation, meaning that you can pass multiple images/videos in your prompt. Also make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the `<image>` or `<video>` token at the location where you want to query images/videos:
|
|
@@ -39,17 +58,12 @@ The model supports multi-visual and multi-prompt generation. Meaning that you ca
|
|
| 39 |
Below is an example script to run generation in `float16` precision on a GPU device:
|
| 40 |
|
| 41 |
```python
|
| 42 |
-
import requests
|
| 43 |
-
from PIL import Image
|
| 44 |
import av
|
| 45 |
import torch
|
| 46 |
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
|
| 47 |
|
| 48 |
model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
|
| 49 |
|
| 50 |
-
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
|
| 51 |
-
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
| 52 |
-
|
| 53 |
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
|
| 54 |
model_id,
|
| 55 |
torch_dtype=torch.float16,
|
|
@@ -82,7 +96,7 @@ prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
|
|
| 82 |
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
|
| 83 |
container = av.open(video_path)
|
| 84 |
|
| 85 |
-
# sample uniformly 8 frames from the video
|
| 86 |
total_frames = container.streams.video[0].frames
|
| 87 |
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
|
| 88 |
clip = read_video_pyav(container, indices)
|
|
@@ -97,6 +111,12 @@ print(processor.decode(output[0][2:], skip_special_tokens=True))
|
|
| 97 |
To generate from images, use the code below after loading the model as shown above:
|
| 98 |
|
| 99 |
```python
|
| 100 |
raw_image = Image.open(requests.get(image_file, stream=True).raw)
|
| 101 |
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
|
| 102 |
|
|
@@ -149,11 +169,12 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
|
|
| 149 |
).to(0)
|
| 150 |
```
|
| 151 |
|
| 152 |
-
|
|
|
|
| 153 |
Llama 2 is licensed under the LLAMA 2 Community License,
|
| 154 |
Copyright (c) Meta Platforms, Inc. All Rights Reserved.
|
| 155 |
|
| 156 |
-
## Intended use
|
| 157 |
**Primary intended uses:**
|
| 158 |
<br>
|
| 159 |
The primary use of LLaVA is research on large multimodal models and chatbots.
|
|
@@ -162,20 +183,27 @@ The primary use of LLaVA is research on large multimodal models and chatbots.
|
|
| 162 |
<br>
|
| 163 |
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
|
| 164 |
|
| 165 |
-
## Training dataset
|
| 166 |
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
- 158K GPT-generated multimodal instruction-following data.
|
| 170 |
-
- 500K academic-task-oriented VQA data mixture.
|
| 171 |
-
- 50K GPT-4V data mixture.
|
| 172 |
-
- 40K ShareGPT data.
|
| 173 |
-
|
| 174 |
-
### Video
|
| 175 |
-
- 100K VideoChatGPT-Instruct.
|
| 176 |
-
|
| 177 |
-
## Evaluation dataset
|
| 178 |
-
A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.
|
| 179 |
|
| 180 |
|
| 181 |
|
|
|
|
| 9 |
|
| 10 |
Below is the model card of LLaVa-NeXT-Video model 7b, which is copied from the original Llava model card that you can find [here](https://huggingface.co/liuhaotian/llava-v1.5-13b).
|
| 11 |
|
| 12 |
+
Also check out the Google Colab demo to run LLaVA on a free-tier Google Colab instance: [Open in Colab](https://colab.research.google.com/drive/1CZggLHrjxMReG-FNOmqSOdi4z7NPq6SO?usp=sharing)
|
| 13 |
|
| 14 |
+
Disclaimer: The team releasing LLaVa-NeXT-Video did not write a model card for this model, so this model card has been written by the Hugging Face team.
|
| 15 |
|
| 16 |
+
## Model details
|
|
|
|
| 17 |
|
| 18 |
**Model type:**
|
| 19 |
<br>
|
| 20 |
+
LLaVA-Next-Video is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. The model is built on top of LLaVa-NeXT by tuning on a mix of video and image data. Videos were uniformly sampled to 32 frames per clip.
|
| 21 |
<br>
|
| 22 |
Base LLM: lmsys/vicuna-7b-v1.5
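
The model-type description above says training videos were uniformly sampled to 32 frames per clip. As a rough illustration of what uniform frame sampling means, here is a minimal sketch. The helper name `uniform_frame_indices` is hypothetical, and it uses `np.linspace` so the number of indices is fixed; this is an assumption for illustration, not the authors' training code.

```python
import numpy as np


def uniform_frame_indices(total_frames: int, num_frames: int = 32) -> np.ndarray:
    """Return evenly spaced frame indices in [0, total_frames).

    Hypothetical helper for illustration only.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # A clip shorter than the requested count yields every frame once.
    num_frames = min(num_frames, total_frames)
    # endpoint=False keeps every index strictly below total_frames.
    return np.linspace(0, total_frames, num_frames, endpoint=False).astype(int)


# Example: pick 8 evenly spaced frames from a 64-frame clip.
indices = uniform_frame_indices(64, num_frames=8)
```

The resulting indices could then be passed to a frame reader to decode only the selected frames.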
|
| 23 |
|
| 24 |
+
<img src="http://drive.google.com/uc?export=view&id=1fVg-r5MU3NoHlTpD7_lYPEBWH9R8na_4">
|
| 25 |
+
|
| 26 |
+
|
| 27 |
**Model date:**
|
| 28 |
<br>
|
| 29 |
LLaVA-Next-Video-7B was trained in April 2024.
|
|
|
|
| 33 |
https://github.com/LLaVA-VL/LLaVA-NeXT
|
| 34 |
|
| 35 |
|
| 36 |
+
## Training dataset
|
| 37 |
+
|
| 38 |
+
### Image
|
| 39 |
+
- 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
|
| 40 |
+
- 158K GPT-generated multimodal instruction-following data.
|
| 41 |
+
- 500K academic-task-oriented VQA data mixture.
|
| 42 |
+
- 50K GPT-4V data mixture.
|
| 43 |
+
- 40K ShareGPT data.
|
| 44 |
+
|
| 45 |
+
### Video
|
| 46 |
+
- 100K VideoChatGPT-Instruct.
|
| 47 |
+
|
| 48 |
+
## Evaluation dataset
|
| 49 |
+
A collection of 4 benchmarks, including 3 academic VQA benchmarks and 1 captioning benchmark.
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
## How to use the model
|
| 54 |
|
| 55 |
First, make sure to have `transformers >= 4.42.0`.
|
| 56 |
The model supports multi-visual and multi-prompt generation, meaning that you can pass multiple images/videos in your prompt. Also make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the `<image>` or `<video>` token at the location where you want to query images/videos:
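
The template described above can be assembled with plain string formatting. The sketch below is only an illustration: `build_prompt` is a hypothetical helper, not part of `transformers`, and it assumes each visual placeholder goes on its own line before the question.

```python
def build_prompt(question: str, num_images: int = 0, num_videos: int = 0) -> str:
    # Hypothetical helper: one placeholder token per visual input,
    # each on its own line ahead of the question text.
    placeholders = ["<image>"] * num_images + ["<video>"] * num_videos
    body = "\n".join(placeholders + [question])
    # Single-turn template: "USER: xxx\nASSISTANT:"
    return f"USER: {body}\nASSISTANT:"


image_prompt = build_prompt("What are these?", num_images=1)
video_prompt = build_prompt("Why is this video funny?", num_videos=1)
```

The number of `<image>` and `<video>` tokens in the prompt should match the number of images and videos passed to the processor.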
|
|
|
|
| 58 |
Below is an example script to run generation in `float16` precision on a GPU device:
|
| 59 |
|
| 60 |
```python
|
| 61 |
import av
|
| 62 |
import torch
|
| 63 |
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
|
| 64 |
|
| 65 |
model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
|
| 66 |
|
| 67 |
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
|
| 68 |
model_id,
|
| 69 |
torch_dtype=torch.float16,
|
|
|
|
| 96 |
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
|
| 97 |
container = av.open(video_path)
|
| 98 |
|
| 99 |
+
# uniformly sample 8 frames from the video; more can be sampled for longer videos
|
| 100 |
total_frames = container.streams.video[0].frames
|
| 101 |
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
|
| 102 |
clip = read_video_pyav(container, indices)
|
|
|
|
| 111 |
To generate from images, use the code below after loading the model as shown above:
|
| 112 |
|
| 113 |
```python
|
| 114 |
+
import requests
|
| 115 |
+
from PIL import Image
|
| 116 |
+
|
| 117 |
+
prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
|
| 118 |
+
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
| 119 |
+
|
| 120 |
raw_image = Image.open(requests.get(image_file, stream=True).raw)
|
| 121 |
inputs_image = processor(prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)
|
| 122 |
|
|
|
|
| 169 |
).to(0)
|
| 170 |
```
|
| 171 |
|
| 172 |
+
|
| 173 |
+
## License
|
| 174 |
Llama 2 is licensed under the LLAMA 2 Community License,
|
| 175 |
Copyright (c) Meta Platforms, Inc. All Rights Reserved.
|
| 176 |
|
| 177 |
+
## Intended use
|
| 178 |
**Primary intended uses:**
|
| 179 |
<br>
|
| 180 |
The primary use of LLaVA is research on large multimodal models and chatbots.
|
|
|
|
| 183 |
<br>
|
| 184 |
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
|
| 185 |
|
|
|
|
| 186 |
|
| 187 |
+
## Citation
|
| 188 |
+
If you find our paper and code useful in your research, please cite:
|
| 189 |
|
| 190 |
+
```BibTeX
|
| 191 |
+
@misc{zhang2024llavanextvideo,
|
| 192 |
+
title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
|
| 193 |
+
url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
|
| 194 |
+
author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
|
| 195 |
+
month={April},
|
| 196 |
+
year={2024}
|
| 197 |
+
}
|
| 198 |
+
```
|
| 199 |
|
| 200 |
+
```BibTeX
|
| 201 |
+
@misc{liu2024llavanext,
|
| 202 |
+
title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
|
| 203 |
+
url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
|
| 204 |
+
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
|
| 205 |
+
month={January},
|
| 206 |
+
year={2024}
|
| 207 |
+
}
|
| 208 |
+
```
|
| 209 |
|