---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: visual-question-answering
datasets:
- DiagramAgent/DiagramGenBenchmark
---
[📑paper link](https://arxiv.org/abs/2411.11916)
## Model Card: DiagramAgent/Diagram_to_Code_Agent
### 1. Model Overview
- **Name**: DiagramAgent/Diagram_to_Code_Agent
- **Description**:
This agent is tasked with converting a given diagram (visual representation) into its corresponding structured code.
### 2. Intended Use
- **Primary Tasks**:
  - Convert existing diagrams into structured code representations.
  - Support diagram editing workflows by providing a reliable code basis for modifications.
  - Capture and preserve implicit logical structures and visual details of diagrams.
- **Application Scenarios**:
  - Automated diagram editing: transforming a diagram into code to enable subsequent modifications.
  - Reverse engineering of visual diagrams for analysis and reusability.
  - Enhancing data visualization tools by integrating code-based diagram representations.
### 3. Architecture and Training Details
- **Base Model**: Built on Qwen2-VL-7B-Instruct, a vision-language model.
- **Training Process**:
  - Trained on diverse diagram samples from the DiagramGenBenchmark dataset.
  - Aims to generate code that closely matches the reference code, so that all diagram elements are accurately captured.
  - Uses a specialized loss function to reduce the edit distance between the generated and reference code.
- **Module Interaction**:
  Works closely with the Check Agent, which validates the generated code and provides feedback for further refinement.
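
The training description mentions the edit distance between generated and reference code. The card does not say which variant is used or how it enters the loss, so the following is only a minimal sketch of plain Levenshtein distance as one way to measure how far a generated snippet is from its reference. It is not the training objective itself, and the function names `edit_distance` and `normalized_edit_distance` are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    # prev[j] is the distance between the processed prefix of `a` and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb, or match
                )
            )
        prev = curr
    return prev[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(generated), len(reference))
    return 0.0 if longest == 0 else edit_distance(generated, reference) / longest


# Hypothetical example: generated code differs from the reference in one character
reference = "A -> B;"
generated = "A -> C;"
print(edit_distance(generated, reference))
print(normalized_edit_distance(generated, reference))
```

A lower value means the generated code is closer to the reference; a value of 0 means the two strings are identical.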
### 4. Usage Examples
```py
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DiagramAgent/Diagram_to_Code_Agent", torch_dtype="auto", device_map="auto"
)

# Load the default processor
processor = AutoProcessor.from_pretrained("DiagramAgent/Diagram_to_Code_Agent")

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder: path or URL of the diagram image to convert
            {"type": "image", "image": "image path"},
            # Placeholder: the text instruction for the model
            {"type": "text", "text": "your input"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device as the model
inputs = inputs.to(model.device)

# Inference: generate the output, then drop the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
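
In the example above, `output_text` is a list with one decoded string per input. The card does not state the exact output format, so the following is a minimal sketch that assumes the model may sometimes wrap its answer in a Markdown code fence. The helper `extract_code` is hypothetical and not part of this repository.

```python
import re

# Build the fence marker programmatically to keep this snippet easy to embed
FENCE = "`" * 3
# Match the first fenced block: opening fence, optional language tag, body, closing fence
_FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the whole text if none is found."""
    match = _FENCE_RE.search(text)
    return (match.group(1) if match else text).strip()


# Hypothetical model outputs, with and without a surrounding fence
wrapped = FENCE + "latex\n\\begin{tikzpicture}\n\\end{tikzpicture}\n" + FENCE
plain = "  digraph G { A -> B; }  "
print(extract_code(wrapped))
print(extract_code(plain))
```

With the usage example, this could be applied as `code = extract_code(output_text[0])` before writing the result to a file or passing it on for checking.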
### 5. Citation
If you find our work helpful, please cite it:
```bibtex
@inproceedings{wei2024wordsstructuredvisualsbenchmark,
title={From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing},
author={Jingxuan Wei and Cheng Tan and Qi Chen and Gaowei Wu and Siyuan Li and Zhangyang Gao and Linzhuang Sun and Bihui Yu and Ruifeng Guo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
```