Tags: Image-Text-to-Text · Transformers · Safetensors · Cosmos · English · qwen2_5_vl · image-to-text · nvidia · conversational · text-generation-inference
harrim-nv committed · verified · Commit 25940bb · 1 Parent(s): 8fe96c1

Update README.md

Files changed (1): README.md (+11 -4)
README.md CHANGED
@@ -25,11 +25,18 @@ tags:
 
  ## Description:
 
- **Cosmos-Reason1 Models**: Physical AI models understand physical common sense and generate appropriate embodied decisions in natural language through long chain-of-thought reasoning processes.
+ NVIDIA Cosmos Reason, an open, customizable, 7B-parameter reasoning vision language model (VLM) for physical AI and robotics, enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. The model understands space, time, and fundamental physics, and can serve as a planning model to reason about the next steps an embodied agent might take.
 
- The Cosmos-Reason1 models are post-trained with physical common sense and embodied reasoning data with supervised fine-tuning and reinforcement learning. These are Physical AI models that can understand space, time, and fundamental physics, and can serve as planning models to reason about the next steps of an embodied agent.
+ Cosmos Reason excels at navigating the long tail of diverse physical-world scenarios with spatial-temporal understanding. It is post-trained on physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning, and it applies chain-of-thought reasoning to understand world dynamics without human annotations.
 
- The models are ready for commercial use.
+ Given a video or image and a text prompt, the model first converts the visual input into tokens using a vision encoder and a projector, which translates visual features into the language model's token space. These vision tokens are combined with the text prompt and fed into the core LLM, enabling the model to think step by step and produce detailed, logical responses.
+
+ Cosmos Reason can be used for robotics and physical AI applications including:
+ - Data curation and annotation: automate high-quality curation and annotation of massive, diverse training datasets.
+ - Robot planning and reasoning: act as the brain for deliberate, methodical decision-making in a robot vision language action (VLA) model. Robots such as humanoids and autonomous vehicles can interpret their environments and, given complex commands, break them down into tasks and execute them using common sense, even in unfamiliar environments.
+ - Video analytics AI agents: extract insights and perform root-cause analysis on massive volumes of recorded or live video streams across city and industrial operations.
+
+ The model is ready for commercial use.
 
  **Model Developer**: NVIDIA
 
@@ -344,7 +351,7 @@ We value you, the datasets, the diversity they represent, and what we have been
  | Model Type: | Transformer |
  | Intended Users: | Physical AI developers |
  | Output: | Text |
- | Describe how the model works: | Generates text answers based on input text prompt and video |
+ | Describe how the model works: | Given a video or image and a text prompt, the model converts the visual input into tokens using a vision encoder and a projector, combines these vision tokens with the text prompt, and feeds them into the core LLM, which reasons step by step to produce detailed, logical responses. |
  | Technical Limitations: | The model may not follow the video or text input accurately in challenging cases, where the input video shows complex scene composition and temporal dynamics. Examples of challenging scenes include: fast camera movements, overlapping human-object interactions, low lighting with high motion blur, and multiple people performing different actions simultaneously. |
  | Verified to have met prescribed NVIDIA quality standards: | Yes |
  | Performance Metrics: | Quantitative and Qualitative Evaluation. Cosmos-Reason1 proposes the embodied reasoning benchmark and physical common sense benchmark to evaluate accuracy with visual question answering. |
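
Since the repo is tagged `qwen2_5_vl` and Image-Text-to-Text, the checkpoint should load through the standard Transformers image-text-to-text API. The sketch below illustrates the flow the new description outlines (visual tokens plus text prompt in, step-by-step answer out); it is a minimal sketch under that assumption, and the model id and image path are illustrative placeholders, not confirmed by this commit.

```python
# Minimal inference sketch. Assumes the checkpoint works with transformers'
# image-text-to-text auto classes (suggested by the qwen2_5_vl tag); the
# model id and image path below are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nvidia/Cosmos-Reason1-7B"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single frame stands in for video here; the processor also accepts videos.
image = Image.open("warehouse_frame.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": "Is it safe for the forklift to keep moving forward? Reason step by step."},
        ],
    }
]

# Render the chat template, then batch the text and vision inputs together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# The model emits a chain-of-thought trace followed by its answer.
generated = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(
    generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```

For a roughly 7B-parameter VLM, `bfloat16` with `device_map="auto"` fits on a single modern GPU; swap in the actual repo id and real frames or videos when running against the released checkpoint.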