---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: visual-question-answering
datasets:
- DiagramAgent/DiagramGenBenchmark
---

[📑paper link](https://arxiv.org/abs/2411.11916)


## Model Card: DiagramAgent/Diagram_to_Code_Agent

### 1. Model Overview

-   **Name**: DiagramAgent/Diagram_to_Code_Agent
-   **Description**:
     This agent is tasked with converting a given diagram (visual representation) into its corresponding structured code.

### 2. Intended Use

-   **Primary Tasks**:
    -   Convert existing diagrams into structured code representations.
    -   Support diagram editing workflows by providing a reliable code basis for modifications.
    -   Capture and preserve implicit logical structures and visual details of diagrams.
-   **Application Scenarios**:
    -   Automated diagram editing: Transforming a diagram into code to enable subsequent modifications.
    -   Reverse engineering of visual diagrams for analysis and reusability.
    -   Enhancing data visualization tools by integrating code-based diagram representations.

### 3. Architecture and Training Details

-   **Base Model**: Built on Qwen2-VL-7B-Instruct, a vision-language model.
-   **Training Process**:
    -   Trained on diverse diagram samples from the DiagramGenBenchmark dataset.
    -   Trained to generate code that closely matches the reference code, so that all diagram elements are captured accurately.
    -   Uses a specialized loss function to reduce the edit distance between the generated and reference code.
-   **Module Interaction**:
     Works closely with the Check Agent, which validates the generated code and provides feedback for further refinement.
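
The training notes above mention the edit distance between generated and reference code. This card does not specify the loss function itself, so the following is only a minimal sketch of how that distance could be measured after generation, assuming a plain character-level Levenshtein distance. The function names `edit_distance` and `normalized_edit_distance` are illustrative and are not part of the DiagramAgent codebase.

```python
def edit_distance(generated: str, reference: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute all cost 1)."""
    # prev[j] holds the distance between the processed prefix of `generated`
    # and the first j characters of `reference`.
    prev = list(range(len(reference) + 1))
    for i, g_char in enumerate(generated, start=1):
        curr = [i]
        for j, r_char in enumerate(reference, start=1):
            cost = 0 if g_char == r_char else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete g_char
                    curr[j - 1] + 1,     # insert r_char
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(generated), len(reference))
    return edit_distance(generated, reference) / longest if longest else 0.0
```

A lower value means the generated code is closer to the reference. This is an evaluation-style metric, not a differentiable training objective.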

### 4. Usage Examples

```py
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DiagramAgent/Diagram_to_Code_Agent", torch_dtype="auto", device_map="auto"
)

# default processor
processor = AutoProcessor.from_pretrained("DiagramAgent/Diagram_to_Code_Agent")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "image path",  # placeholder: local path or URL of the diagram image
            },
            {"type": "text", "text": "your input"},  # placeholder: your text instruction
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was loaded on (device_map="auto")
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

```
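
`output_text` is a list with one decoded string per input. Whether the model wraps its code in a Markdown fence is not documented here, so the helper below is only a sketch under that assumption: it returns the contents of the first fenced block if one is present and otherwise the stripped raw text. `extract_code` is a hypothetical helper name, not part of this repository.

```python
import re

# Build the fence marker programmatically so this snippet can itself sit in a fenced block.
FENCE = "`" * 3
_FENCED_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(raw: str) -> str:
    """Return the first fenced block's contents, or the stripped text if no fence is found."""
    match = _FENCED_BLOCK.search(raw)
    return (match.group(1) if match else raw).strip()
```

For example, `extract_code(output_text[0])` could be passed to a renderer or written to a file for later editing.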

### 5. Citation

If you find our work helpful, please consider citing our paper.


```bibtex
@inproceedings{wei2024wordsstructuredvisualsbenchmark,
  title={From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing},
  author={Jingxuan Wei and Cheng Tan and Qi Chen and Gaowei Wu and Siyuan Li and Zhangyang Gao and Linzhuang Sun and Bihui Yu and Ruifeng Guo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```