Update README.md
README.md
CHANGED
@@ -25,11 +25,18 @@ tags:
|
|
25 |
|
26 |
## Description:
|
27 |
|
28 |
-
|
29 |
|
30 |
-
|
31 |
|
32 |
-
|
33 |
|
34 |
**Model Developer**: NVIDIA
|
35 |
|
@@ -344,7 +351,7 @@ We value you, the datasets, the diversity they represent, and what we have been
|
|
344 |
| Model Type: | Transformer |
|
345 |
| Intended Users: | Physical AI developers |
|
346 |
| Output: | Text |
|
347 |
-
| Describe how the model works: |
|
348 |
| Technical Limitations: | The model may not follow the video or text input accurately in challenging cases, where the input video shows complex scene composition and temporal dynamics. Examples of challenging scenes include: fast camera movements, overlapping human-object interactions, low lighting with high motion blur, and multiple people performing different actions simultaneously. |
|
349 |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
|
350 |
| Performance Metrics: | Quantitative and Qualitative Evaluation. Cosmos-Reason1 proposes the embodied reasoning benchmark and physical common sense benchmark to evaluate accuracy with visual question answering. |
|
|
|
25 |
|
26 |
## Description:
|
27 |
|
28 |
+
NVIDIA Cosmos Reason, an open, customizable, 7B-parameter reasoning vision language model (VLM) for physical AI and robotics, enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. The model understands space, time, and fundamental physics, and can serve as a planning model that reasons about what steps an embodied agent might take next.
|
29 |
|
30 |
+
Cosmos Reason excels at navigating the long tail of diverse physical-world scenarios with spatial-temporal understanding. It is post-trained on physical common sense and embodied reasoning data using supervised fine-tuning and reinforcement learning, and it uses chain-of-thought reasoning to understand world dynamics without human annotations.
|
31 |
|
32 |
+
Given a video or image and a text prompt, the model first converts the visual input into tokens using a vision encoder and a projector, a module that maps visual features into the language model's input space. These visual tokens are combined with the text prompt and fed into the core model, which uses a mix of LLM modules and techniques. This enables the model to think step-by-step and provide detailed, logical responses.
|
33 |
+
|
34 |
+
Cosmos Reason can be used for robotics and physical AI applications including:
|
35 |
+
- Data curation and annotation: Enable developers to automate high-quality curation and annotation of massive, diverse training datasets.
|
36 |
+
- Robot planning and reasoning: Act as the brain for deliberate, methodical decision-making in a robot vision language action (VLA) model. Robots such as humanoids and autonomous vehicles can interpret their environments and, given complex commands, break them down into tasks and execute them using common sense, even in unfamiliar environments.
|
37 |
+
- Video analytics AI agents: Extract valuable insights and perform root-cause analysis on massive volumes of video data. These agents can analyze and understand recorded or live video streams across city and industrial operations.
|
38 |
+
|
39 |
+
The model is ready for commercial use.
|
40 |
|
41 |
**Model Developer**: NVIDIA
|
42 |
|
|
|
351 |
| Model Type: | Transformer |
|
352 |
| Intended Users: | Physical AI developers |
|
353 |
| Output: | Text |
|
354 |
+
| Describe how the model works: | Given a video or image and a text prompt, the model first converts the visual input into tokens using a vision encoder and a projector, a module that maps visual features into the language model's input space. These visual tokens are combined with the text prompt and fed into the core model, which uses a mix of LLM modules and techniques. This enables the model to think step-by-step and provide detailed, logical responses. |
|
355 |
| Technical Limitations: | The model may not follow the video or text input accurately in challenging cases, where the input video shows complex scene composition and temporal dynamics. Examples of challenging scenes include: fast camera movements, overlapping human-object interactions, low lighting with high motion blur, and multiple people performing different actions simultaneously. |
|
356 |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
|
357 |
| Performance Metrics: | Quantitative and Qualitative Evaluation. Cosmos-Reason1 proposes the embodied reasoning benchmark and physical common sense benchmark to evaluate accuracy with visual question answering. |
|
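Note on the added "Describe how the model works" text: it describes a three-stage flow (vision encoder, projector, then a core language model that receives visual and text tokens together). The toy Python sketch below only illustrates that data flow. It is not Cosmos Reason code: every function name, the dimensions, and the random weights are hypothetical stand-ins, and real encoders and projectors are learned neural networks.

```python
import random


def encode_frames(frames, vision_dim=8, seed=0):
    """Hypothetical vision encoder: returns one feature vector per frame."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(vision_dim)] for _ in frames]


def project(features, weights):
    """Hypothetical projector: a linear map from vision features to the
    language model's embedding space. `weights` holds one row per output
    dimension, each row as long as a vision feature vector."""
    return [
        [sum(f * w for f, w in zip(feat, row)) for row in weights]
        for feat in features
    ]


def embed_text(prompt, llm_dim, seed=1):
    """Hypothetical text embedding: one vector per whitespace-separated word."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(llm_dim)] for _ in prompt.split()]


def build_input_sequence(frames, prompt, vision_dim=8, llm_dim=4):
    """Combine projected visual tokens with text tokens into one sequence,
    which a core language model would then consume."""
    rng = random.Random(42)
    weights = [[rng.uniform(-1, 1) for _ in range(vision_dim)] for _ in range(llm_dim)]
    visual_tokens = project(encode_frames(frames, vision_dim), weights)
    text_tokens = embed_text(prompt, llm_dim)
    return visual_tokens + text_tokens


if __name__ == "__main__":
    sequence = build_input_sequence(["frame0", "frame1", "frame2"], "what happens next")
    print(len(sequence), "tokens of width", len(sequence[0]))
```

The sketch places visual tokens before the text tokens purely for illustration; the README text does not state how the actual model orders or interleaves them.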