AIDC-AI/Ovis2.5-9B · need a demo for the visual grounding task

Sep 2, 2025

I need a demo for the visual grounding task. The coordinates I obtained using the prompt "Please provide the bounding box coordinates. (for boxes)" cannot be correctly mapped back to the image.

runninglsy

AIDC-AI org Sep 2, 2025

Could you provide the prompt and model response?

zanepoe

Sep 2, 2025 • edited Sep 2, 2025

My prompt:
"Find the human body parts ,Include face, eye, nose, mouth,breast,hand, foot, leg in the image. Please provide the bounding box coordinates.Coordinates are normalized to [0,1) with the origin (0,0) at the top-left corner of the image.output format json example: [ { "score": 0.7, "bbox_2d": [0.401,0.526,0.430,0.557], "label": "eye" }, { "score": 0.6, "bbox_2d": [0.489,0.494, 0.516,0.526], "label": "eye" }, { "score": 0.5, "bbox_2d": [0.296,0.529, 0.324,0.576], "label": "face" } ]"
The code I use to convert to pixel coordinates is: pixel_x1 = int(x1 * image_width)
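
For reference, the full conversion for one box might look like the sketch below. It assumes the box really is normalized to [0,1) as `[x1, y1, x2, y2]`; the function name `to_pixel_box` is made up for illustration. It rounds rather than truncating with `int()`, and clamps so the result stays inside the image. If the model returns values on another scale, this mapping will be off.

```python
def to_pixel_box(bbox, image_width, image_height):
    """Map a normalized [x1, y1, x2, y2] box to integer pixel coordinates.

    Assumes every value is in [0, 1] with the origin at the top-left corner.
    """
    x1, y1, x2, y2 = bbox
    # x values scale with the image width, y values with the image height.
    px1 = round(x1 * image_width)
    py1 = round(y1 * image_height)
    px2 = round(x2 * image_width)
    py2 = round(y2 * image_height)
    # Clamp to valid pixel indices so a value of 1.0 does not fall off the edge.
    px1 = max(0, min(px1, image_width - 1))
    px2 = max(0, min(px2, image_width - 1))
    py1 = max(0, min(py1, image_height - 1))
    py2 = max(0, min(py2, image_height - 1))
    return px1, py1, px2, py2


# Example with the first box from the prompt and a 1000x800 image.
print(to_pixel_box([0.401, 0.526, 0.430, 0.557], 1000, 800))  # (401, 421, 430, 446)
```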

runninglsy

AIDC-AI org Sep 3, 2025 • edited Sep 3, 2025

Try it like this:

Find the human body parts ,Include face, eye, nose, mouth,breast,hand, foot, leg in the image. Please provide the bounding box coordinates. Respond in JSON format like [{"label": "eye", "bbox": "<box>...</box>"}, ...].
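
A response in that JSON shape could be parsed roughly as in the sketch below. The thread does not show what goes inside `<box>...</box>`, so the sketch only assumes each box holds four numbers in x1, y1, x2, y2 order. The sample string and the helper name `parse_boxes` are hypothetical.

```python
import json
import re

# Matches integers and decimals such as 401 or 0.401.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_boxes(response_text):
    """Extract labels and four-number boxes from a JSON list response.

    Assumption: each "bbox" string contains exactly four numbers in
    x1, y1, x2, y2 order. Items that do not match are skipped.
    """
    results = []
    for item in json.loads(response_text):
        numbers = [float(n) for n in NUMBER.findall(str(item.get("bbox", "")))]
        if len(numbers) != 4:
            continue  # unexpected format, leave it out rather than guess
        results.append({"label": item.get("label"), "bbox": numbers})
    return results


# Hypothetical response, made up for illustration only.
sample = '[{"label": "eye", "bbox": "<box>(0.401,0.526),(0.430,0.557)</box>"}]'
print(parse_boxes(sample))
```

Pulling out the numbers with a regular expression keeps the parser independent of the exact punctuation inside the box tag.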

liamtoran

Sep 6, 2025

@runninglsy

Try it like this:

Find the human body parts ,Include face, eye, nose, mouth,breast,hand, foot, leg in the image. Please provide the bounding box coordinates. Respond in JSON format like [{"label": "eye", "bbox": "<box>...</box>"}, ...].

How should we convert the output bbox back to image coordinates? The README guide says to use it like this:

But when I try it, the output seems to be in the (0, 1000) range.

Thank you for the clarification.

runninglsy

AIDC-AI org Sep 8, 2025

@liamtoran Could you provide the image, prompt, and generation config?

liamtoran

Sep 8, 2025 • edited Sep 8, 2025

Image:

Ovis 9B Space, prompt: Find the human hands in the image. Please provide the bounding box coordinates.

Prompt like you said:

I saw the same thing in Ovis-2B as well.

But sometimes it works. I think the output is sometimes in the [0, 1000] range and sometimes in the [0, 1] range, and it can mix both within the same answer.
I am guessing it is okay to handle this with: if x > 1 then x = x / 1000.
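
That guess can be written as a small helper, sketched below. It is only a heuristic: it assumes any value above 1 is on a 0..1000 grid, and it cannot tell apart a 0 or 1 on that grid from a normalized value. The names `normalize_coord` and `normalize_box` are made up for illustration.

```python
def normalize_coord(value, scale=1000.0):
    """Heuristic: treat values above 1 as being on a 0..scale grid."""
    return value / scale if value > 1 else value


def normalize_box(bbox, scale=1000.0):
    """Apply the heuristic to each coordinate, so mixed-range boxes are handled."""
    return [normalize_coord(v, scale) for v in bbox]


# A box that mixes both ranges in the same answer.
print(normalize_box([0.25, 500, 0.75, 900]))  # [0.25, 0.5, 0.75, 0.9]
```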

runninglsy

AIDC-AI org Sep 9, 2025

@liamtoran The thinking training data does not include grounding-related datasets. Therefore, please disable the thinking feature when using Ovis 2.5 for grounding tasks.