license: apache-2.0
language:
  - en
base_model:
  - Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: visual-question-answering
datasets:
  - DiagramAgent/DiagramGenBenchmark

📑paper link

Model Card: DiagramAgent/Diagram_to_Code_Agent

1. Model Overview

  • Name: DiagramAgent/Diagram_to_Code_Agent
  • Description: This agent is tasked with converting a given diagram (visual representation) into its corresponding structured code.

2. Intended Use

  • Primary Tasks:
    • Convert existing diagrams into structured code representations.
    • Support diagram editing workflows by providing a reliable code basis for modifications.
    • Capture and preserve implicit logical structures and visual details of diagrams.
  • Application Scenarios:
    • Automated diagram editing: Transforming a diagram into code to enable subsequent modifications.
    • Reverse engineering of visual diagrams for analysis and reusability.
    • Enhancing data visualization tools by integrating code-based diagram representations.

3. Architecture and Training Details

  • Base Model: Qwen2-VL-7B-Instruct (Qwen/Qwen2-VL-7B-Instruct), a vision-language model.
  • Training Process:
    • Trained on diverse diagram samples from the DiagramGenBenchmark dataset.
    • Trained to generate code that closely matches the reference code, so that all diagram elements are captured accurately.
    • Uses a specialized loss function to reduce the edit distance between the generated and reference code.
  • Module Interaction: Works closely with the Check Agent, which validates the generated code and provides feedback for further refinement.
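
The edit-distance objective mentioned above can be illustrated with a small, standalone sketch. The card does not say which edit-distance variant is used or how it enters the loss, so the code below only shows a standard character-level Levenshtein distance, as one might use to compare generated and reference code after inference. The function names `levenshtein` and `normalized_edit_distance` are hypothetical and not part of this repository.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two rows of the DP table."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumed normalization)."""
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(generated, reference) / longest
```

A value of 0.0 means the generated code is identical to the reference; values closer to 1.0 mean most characters would have to change.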

4. Usage Examples

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DiagramAgent/Diagram_to_Code_Agent", torch_dtype="auto", device_map="auto"
)

# default processor
processor = AutoProcessor.from_pretrained("DiagramAgent/Diagram_to_Code_Agent")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "image path",  # path to the diagram image
            },
            {"type": "text", "text": "your input"},  # instruction for the model
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
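
`output_text` is a list with one decoded string per input. To edit or compile the result, that string usually needs to be written to a file. The sketch below is one way to do this. It assumes the model may wrap its answer in a Markdown code fence, which the card does not state; when no fence is found, the text is passed through unchanged. The helper names `extract_code` and `save_code` are hypothetical.

```python
import re
from pathlib import Path

# One Markdown fence with an optional language tag (assumed output shape).
_FENCE = re.compile(r"`{3}[\w+-]*[ \t]*\n(.*?)`{3}", re.DOTALL)


def extract_code(decoded: str) -> str:
    """Return the body of the first fenced block, or the whole text if none."""
    match = _FENCE.search(decoded)
    body = match.group(1) if match else decoded
    return body.strip() + "\n"


def save_code(decoded: str, path: str) -> Path:
    """Write the extracted code to `path` and return the resulting Path."""
    target = Path(path)
    target.write_text(extract_code(decoded), encoding="utf-8")
    return target
```

With the example above, `save_code(output_text[0], "diagram_code.txt")` would store the first result; the file name is only a placeholder and should match the diagram language you expect.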

5. Citation

If you find our work helpful, please consider citing it:

@inproceedings{wei2024wordsstructuredvisualsbenchmark,
  title={From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing},
  author={Jingxuan Wei and Cheng Tan and Qi Chen and Gaowei Wu and Siyuan Li and Zhangyang Gao and Linzhuang Sun and Bihui Yu and Ruifeng Guo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}