KnutJaegersberg and lbourdois committed
Commit 7c96cdc · verified · 1 Parent(s): 488268b

Improve language tag (#1)

- Improve language tag (3f7f71544b709ef9c761ab9d8f8fcea7f38eb76d)

Co-authored-by: Loïck BOURDOIS <[email protected]>

Files changed (1): README.md (+586, −574)

---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- google/paligemma-3b-mix-448
- Qwen/Qwen2.5-0.5B-Instruct
- google/siglip-so400m-patch14-384
base_model_relation: merge
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
tags:
- eagle
- VLM
---

# Eagle-2

[\[📂 GitHub\]](https://github.com/NVlabs/EAGLE) [\[📜 Eagle2 Tech Report\]](https://github.com/NVlabs/EAGLE/blob/main/Eagle2/Eagle2_report.pdf)
[\[🗨️ Chat Demo\]](http://eagle-vlm.xyz/) [\[🤗 HF Demo\]](TODO)
## Introduction

We are thrilled to release our latest Eagle2 series of Vision-Language Models. Open-source Vision-Language Models (VLMs) have made significant strides in narrowing the gap with proprietary models. However, critical details about data strategies and implementation are often missing, limiting reproducibility and innovation. In this project, we focus on VLM post-training from a data-centric perspective, sharing insights into building effective data strategies from scratch. By combining these strategies with robust training recipes and model design, we introduce Eagle2, a family of performant VLMs. Our work aims to empower the open-source community to develop competitive VLMs with transparent processes.

In this repo, we are open-sourcing Eagle2-1B, a compact and efficient model designed for scenarios that require fast inference and minimal computational resources, without compromising essential performance.

## Model Zoo
We provide the following models:

| Model name | LLM | Vision | Max Length | HF Link |
| ---------- | --- | ------ | ---------- | ------- |
| Eagle2-1B | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | Siglip | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-1B) |
| Eagle2-2B | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | Siglip | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-2B) |
| Eagle2-9B | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Siglip+ConvNext | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-9B) |

## Benchmark Results
| Benchmark | LLaVa-One-Vision-0.5B | InternVL2-1B | InternVL2.5-1B | Qwen2-VL-2B | Eagle2-1B |
| :--------------------------: | :------------------: | :----------------: | :----------: | :----------: | :----------: |
| DocVQA<sub>test</sub> | 70.0 | 81.7 | 84.8 | 90.1 | 81.8 |
| ChartQA<sub>test</sub> | 61.4 | 72.9 | 75.9 | 73.0 | 77.0 |
| InfoVQA<sub>test</sub> | 41.8 | 50.9 | 56.0 | 65.5 | 54.8 |
| TextVQA<sub>val</sub> | - | 70.0 | 72.0 | 79.7 | 76.6 |
| OCRBench | 565 | 754 | 785 | 809 | 767 |
| MME<sub>sum</sub> | 1438.0 | 1794.4 | 1950.5 | 1872.0 | 1790.2 |
| RealWorldQA | 55.6 | 50.3 | 57.5 | 62.6 | 55.4 |
| AI2D<sub>test</sub> | 57.1 | 64.1 | 69.3 | 74.7 | 70.9 |
| MMMU<sub>val</sub> | 31.4 | 36.7 | 40.9 | 41.1 | 38.8 |
| MMVet<sub>GPT-4-Turbo</sub> | 32.2 | 32.7 | 48.8 | 49.5 | 40.9 |
| HallBench<sub>avg</sub> | 27.9 | 34.0 | 39.0 | **41.7** | 35.3 |
| MathVista<sub>testmini</sub> | 33.8 | 37.7 | 43.2 | 43.0 | 45.3 |
| MMstar | 37.7 | 45.7 | 50.1 | 48.0 | 48.5 |

## Quick Start

We provide an [inference script](./demo.py) to help you quickly start using the model. We support different input types:
- pure text input
- single image input
- multiple image input
- video input

### 0. Install the dependencies

```bash
pip install transformers==4.37.2
pip install flash-attn
```
**Note**: The latest version of transformers is not compatible with this model; please install the pinned version above.
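
Before wiring up the full model worker below, a minimal load sketch can be useful as a quick smoke test. This is only an assumption-laden example (it presumes the `NVIDIA/Eagle2-1B` checkpoint from the Model Zoo table and a CUDA device); the worker in the next step wraps the same `AutoModel`/`AutoTokenizer` calls with image tiling and chat handling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "NVIDIA/Eagle2-1B"  # assumed checkpoint id, see the Model Zoo table
# trust_remote_code is required because the model ships custom modeling code
model = AutoModel.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True, use_fast=False)
```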

### 1. Prepare the Model worker

<details>
<summary>Click to expand</summary>

```python
"""
A model worker executes the model.
Copied and modified from https://github.com/OpenGVLab/InternVL/blob/main/streamlit_demo/model_worker.py
"""
# Importing torch before transformers can cause `segmentation fault`
from transformers import AutoModel, AutoTokenizer, TextIteratorStreamer, AutoConfig

import argparse
import base64
import json
import os
import decord
import threading
import time
from io import BytesIO
from threading import Thread
import math
import requests
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
import numpy as np


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

SIGLIP_MEAN = (0.5, 0.5, 0.5)
SIGLIP_STD = (0.5, 0.5, 0.5)


def get_seq_frames(total_num_frames, desired_num_frames=-1, stride=-1):
    """
    Calculate the indices of frames to extract from a video.

    Parameters:
        total_num_frames (int): Total number of frames in the video.
        desired_num_frames (int): Desired number of frames to extract.

    Returns:
        list: List of indices of frames to extract.
    """

    # Exactly one of desired_num_frames and stride must be set
    assert (desired_num_frames > 0 or stride > 0) and not (desired_num_frames > 0 and stride > 0)

    if stride > 0:
        return list(range(0, total_num_frames, stride))

    # Calculate the size of each segment from which a frame will be extracted
    seg_size = float(total_num_frames - 1) / desired_num_frames

    seq = []
    for i in range(desired_num_frames):
        # Calculate the start and end indices of each segment
        start = int(np.round(seg_size * i))
        end = int(np.round(seg_size * (i + 1)))

        # Append the middle index of the segment to the list
        seq.append((start + end) // 2)

    return seq

def build_video_prompt(meta_list, num_frames, time_position=False):
    # if time_position is True, the frame_timestamp is used.
    # 1. pass time_position, 2. use env TIME_POSITION
    time_position = os.environ.get("TIME_POSITION", time_position)
    prefix = "This is a video:\n"
    for i in range(num_frames):
        if time_position:
            frame_txt = f"Frame {i+1} sampled at {meta_list[i]:.2f} seconds: <image>\n"
        else:
            frame_txt = f"Frame {i+1}: <image>\n"
        prefix += frame_txt
    return prefix

def load_video(video_path, num_frames=64, frame_cache_root=None):
    if isinstance(video_path, str):
        video = decord.VideoReader(video_path)
    elif isinstance(video_path, dict):
        assert False, 'dict input for "video_path" is not supported'
    fps = video.get_avg_fps()
    sampled_frames = get_seq_frames(len(video), num_frames)
    sampled_timestamps = [i / fps for i in sampled_frames]
    frames = video.get_batch(sampled_frames).asnumpy()
    images = [Image.fromarray(frame) for frame in frames]

    return images, build_video_prompt(sampled_timestamps, len(images), time_position=True)

def load_image(image):
    if isinstance(image, str) and os.path.exists(image):
        return Image.open(image)
    elif isinstance(image, dict):
        if 'disk_path' in image:
            return Image.open(image['disk_path'])
        elif 'base64' in image:
            return Image.open(BytesIO(base64.b64decode(image['base64'])))
        elif 'url' in image:
            response = requests.get(image['url'])
            return Image.open(BytesIO(response.content))
        elif 'bytes' in image:
            return Image.open(BytesIO(image['bytes']))
        else:
            raise ValueError(f'Invalid image: {image}')
    else:
        raise ValueError(f'Invalid image: {image}')

def build_transform(input_size, norm_type='imagenet'):
    if norm_type == 'imagenet':
        MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    elif norm_type == 'siglip':
        MEAN, STD = SIGLIP_MEAN, SIGLIP_STD

    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """
    The previous version mainly focused on the aspect ratio.
    We also consider the area ratio here.
    """
    best_factor = float('-inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        area_ratio = (ratio[0] * ratio[1] * image_size * image_size) / area
        # A new area > 60% of the original image area is enough.
        factor_based_on_area_n_ratio = min((ratio[0] * ratio[1] * image_size * image_size) / area, 0.6) * \
            min(target_aspect_ratio / aspect_ratio, aspect_ratio / target_aspect_ratio)

        if factor_based_on_area_n_ratio > best_factor:
            best_factor = factor_based_on_area_n_ratio
            best_ratio = ratio

    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def split_model(model_path, device):
    # Distribute the LLM layers across all visible GPUs, keeping the vision tower
    # and embeddings on `device`.
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers

    print('world_size', world_size)
    num_layers_per_gpu_ = math.floor(num_layers / (world_size - 1))
    num_layers_per_gpu = [num_layers_per_gpu_] * world_size
    num_layers_per_gpu[device] = num_layers - num_layers_per_gpu_ * (world_size - 1)
    print(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = device
    device_map['mlp1'] = device
    device_map['language_model.model.tok_embeddings'] = device
    device_map['language_model.model.embed_tokens'] = device
    device_map['language_model.output'] = device
    device_map['language_model.model.norm'] = device
    device_map['language_model.lm_head'] = device
    device_map['language_model.model.rotary_emb'] = device
    device_map[f'language_model.model.layers.{num_layers - 1}'] = device
    return device_map

class ModelWorker:
    def __init__(self, model_path, model_name,
                 load_8bit, device):

        if model_path.endswith('/'):
            model_path = model_path[:-1]
        if model_name is None:
            model_paths = model_path.split('/')
            if model_paths[-1].startswith('checkpoint-'):
                self.model_name = model_paths[-2] + '_' + model_paths[-1]
            else:
                self.model_name = model_paths[-1]
        else:
            self.model_name = model_name

        print(f'Loading the model {self.model_name}')

        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
        tokens_to_keep = ['<box>', '</box>', '<ref>', '</ref>']
        tokenizer.additional_special_tokens = [item for item in tokenizer.additional_special_tokens if item not in tokens_to_keep]
        self.tokenizer = tokenizer
        config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
        model_type = config.vision_config.model_type
        self.device = torch.cuda.current_device()
        if model_type == 'siglip_vision_model':
            self.norm_type = 'siglip'
        elif model_type == 'MOB':
            self.norm_type = 'siglip'
        else:
            self.norm_type = 'imagenet'

        if any(x in model_path.lower() for x in ['34b']):
            device_map = split_model(model_path, self.device)
        else:
            device_map = None

        if device_map is not None:
            self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
                                                   low_cpu_mem_usage=True,
                                                   device_map=device_map,
                                                   trust_remote_code=True,
                                                   load_in_8bit=load_8bit).eval()
        else:
            self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
                                                   trust_remote_code=True,
                                                   load_in_8bit=load_8bit).eval()

        if not load_8bit and device_map is None:
            self.model = self.model.to(device)
        self.load_8bit = load_8bit

        self.model_path = model_path
        self.image_size = self.model.config.force_image_size
        self.context_len = tokenizer.model_max_length
        self.per_tile_len = 256  # approximate token budget per image tile, used to cap the number of tiles

    def reload_model(self):
        del self.model
        torch.cuda.empty_cache()
        if self.device == 'auto':
            os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
            # This can make distributed deployment work properly
            self.model = AutoModel.from_pretrained(
                self.model_path,
                load_in_8bit=self.load_8bit,
                torch_dtype=torch.bfloat16,
                device_map=self.device_map,
                trust_remote_code=True).eval()
        else:
            self.model = AutoModel.from_pretrained(
                self.model_path,
                load_in_8bit=self.load_8bit,
                torch_dtype=torch.bfloat16,
                trust_remote_code=True).eval()
        if not self.load_8bit and not self.device == 'auto':
            self.model = self.model.cuda()

    @torch.inference_mode()
    def generate(self, params):
        system_message = params['prompt'][0]['content']
        send_messages = params['prompt'][1:]
        max_input_tiles = params['max_input_tiles']
        temperature = params['temperature']
        top_p = params['top_p']
        max_new_tokens = params['max_new_tokens']
        repetition_penalty = params['repetition_penalty']
        video_frame_num = params.get('video_frame_num', 64)
        do_sample = True if temperature > 0.0 else False

        global_image_cnt = 0
        history, pil_images, max_input_tile_list = [], [], []
        for message in send_messages:
            if message['role'] == 'user':
                prefix = ''
                if 'image' in message:
                    for image_data in message['image']:
                        pil_images.append(load_image(image_data))
                        prefix = prefix + f'<image {global_image_cnt + 1}><image>\n'
                        global_image_cnt += 1
                        max_input_tile_list.append(max_input_tiles)
                if 'video' in message:
                    for video_data in message['video']:
                        video_frames, tmp_prefix = load_video(video_data, num_frames=video_frame_num)
                        pil_images.extend(video_frames)
                        prefix = prefix + tmp_prefix
                        global_image_cnt += len(video_frames)
                        max_input_tile_list.extend([1] * len(video_frames))
                content = prefix + message['content']
                history.append([content, ])
            else:
                history[-1].append(message['content'])
        question, history = history[-1][0], history[:-1]

        if global_image_cnt == 1:
            question = question.replace('<image 1><image>\n', '<image>\n')
            history = [[item[0].replace('<image 1><image>\n', '<image>\n'), item[1]] for item in history]

        try:
            assert len(max_input_tile_list) == len(pil_images), 'The number of max_input_tile_list and pil_images should be the same.'
        except Exception as e:
            print(f'Error: {e}')
            print(f'max_input_tile_list: {max_input_tile_list}, pil_images: {pil_images}')
            raise

        old_system_message = self.model.system_message
        self.model.system_message = system_message

        transform = build_transform(input_size=self.image_size, norm_type=self.norm_type)
        if len(pil_images) > 0:
            max_input_tiles_limited_by_context = params['max_input_tiles']
            while True:
                image_tiles = []
                for current_max_input_tiles, pil_image in zip(max_input_tile_list, pil_images):
                    if self.model.config.dynamic_image_size:
                        tiles = dynamic_preprocess(
                            pil_image, image_size=self.image_size, max_num=min(current_max_input_tiles, max_input_tiles_limited_by_context),
                            use_thumbnail=self.model.config.use_thumbnail)
                    else:
                        tiles = [pil_image]
                    image_tiles += tiles
                if (len(image_tiles) * self.per_tile_len < self.context_len):
                    break
                else:
                    max_input_tiles_limited_by_context -= 2

                if max_input_tiles_limited_by_context < 1:
                    break

            pixel_values = [transform(item) for item in image_tiles]
            pixel_values = torch.stack(pixel_values).to(self.model.device, dtype=torch.bfloat16)
            print(f'Split images to {pixel_values.shape}')
        else:
            pixel_values = None

        generation_config = dict(
            num_beams=1,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature,
            repetition_penalty=repetition_penalty,
            max_length=self.context_len,
            top_p=top_p,
        )

        response = self.model.chat(
            tokenizer=self.tokenizer,
            pixel_values=pixel_values,
            question=question,
            history=history,
            return_history=False,
            generation_config=generation_config,
        )
        self.model.system_message = old_system_message
        return {'text': response, 'error_code': 0}


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-path', type=str, default='NVIDIA/Eagle-2-1B')
    parser.add_argument('--model-name', type=str, default='Eagle-2-1B')
    parser.add_argument('--device', type=str, default='cuda')
    parser.add_argument('--load-8bit', action='store_true')
    args = parser.parse_args()
    print(f'args: {args}')

    worker = ModelWorker(
        args.model_path,
        args.model_name,
        args.load_8bit,
        args.device)
```
</details>
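
If you prefer to construct the worker directly from Python rather than through the CLI entry point above, a minimal sketch (the arguments follow the `ModelWorker.__init__` signature shown above; the `NVIDIA/Eagle2-1B` checkpoint id is taken from the Model Zoo table):

```python
worker = ModelWorker(
    model_path='NVIDIA/Eagle2-1B',  # checkpoint id from the Model Zoo table
    model_name='Eagle2-1B',
    load_8bit=False,                # pass True to load the weights in 8-bit
    device='cuda')
```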

### 2. Prepare the Prompt

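- Pure text input (not shown in the original examples; a minimal sketch that assumes the worker receives a user message with no `image` or `video` key, which the `generate` method above handles by setting `pixel_values=None`)
```python
prompt = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Write a short poem about GPUs.'}
]
```
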
- Single image input
```python
prompt = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Describe this image in detail.',
     'image': [
         {'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'}
     ],
    }
]
```

- Multiple image input
```python
prompt = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Describe these two images in detail.',
     'image': [
         {'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'},
         {'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'}
     ],
    }
]
```

- Video input
```python
prompt = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Describe this video in detail.',
     'video': [
         'path/to/your/video.mp4'
     ],
    }
]
```

### 3. Generate the response
```python
params = {
    'prompt': prompt,
    'max_input_tiles': 24,
    'temperature': 0.7,
    'top_p': 1.0,
    'max_new_tokens': 4096,
    'repetition_penalty': 1.0,
}
worker.generate(params)
```
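
`generate` returns a plain dict rather than a streaming response; based on the `ModelWorker.generate` method above, which returns `{'text': ..., 'error_code': 0}`, the reply can be read like this:

```python
result = worker.generate(params)
print(result['text'])
```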

## TODO
- [ ] Support vLLM Inference
- [ ] Provide AWQ Quantization Weights
- [ ] Provide fine-tuning scripts

## License/Terms of Use
- The code is released under the Apache 2.0 license, as found in the [LICENSE](https://huggingface.co/NVEagle/Eagle-X5-13B-Chat/blob/main/LICENSE) file.
- The pretrained model weights are released under the [Creative Commons Attribution-NonCommercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) license.
- The service is a research preview intended for non-commercial use only and is subject to the following licenses and terms:
  - Model License of Qwen2.5-0.5B-Instruct: [Apache-2.0](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE)
  - Model License of PaliGemma: [Gemma license](https://ai.google.dev/gemma/terms)

## Citation

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).