Papers
arxiv:2508.18106

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Published on Aug 25
· Submitted by wanng on Sep 1
#3 Paper of the day
Abstract

A.S.E is a benchmark for evaluating the security of code generated by large language models using real-world repositories and expert-defined rules, revealing insights into model performance and decoding strategies.

AI-generated summary

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
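
To make the rule-based scoring idea concrete, here is a minimal, hypothetical sketch of how a security check on a model-generated patch might be organized. This is not the authors' implementation: the rule patterns, weights, and function names below are illustrative assumptions, and the actual A.S.E framework applies its expert-defined rules inside a containerized harness with full repository context.

```python
# Illustrative sketch (NOT the authors' code) of rule-based security scoring
# for a model-generated patch. Rule patterns and names are hypothetical.
import re
from dataclasses import dataclass


@dataclass
class SecurityRule:
    """One expert-defined rule: a CWE label plus a regex that flags an
    insecure construct which a correct patch should no longer contain."""
    cwe: str
    description: str
    insecure_pattern: str


# Hypothetical rules; A.S.E derives its real rules from documented CVEs.
RULES = [
    SecurityRule("CWE-89", "SQL query built via string formatting",
                 r"execute\(\s*[\"'].*%s.*[\"']\s*%"),
    SecurityRule("CWE-798", "Hard-coded credential",
                 r"password\s*=\s*[\"'][^\"']+[\"']"),
]


def score_patch(patched_source: str) -> dict:
    """Return per-rule pass/fail plus an aggregate security score in [0, 1]."""
    per_rule = {
        rule.cwe: re.search(rule.insecure_pattern, patched_source) is None
        for rule in RULES
    }
    return {
        "per_rule": per_rule,
        "security_score": sum(per_rule.values()) / len(per_rule),
    }


if __name__ == "__main__":
    # Toy "model-generated" patch that still hard-codes a credential,
    # so it passes the SQL rule but fails the credential rule (score 0.5).
    candidate = 'db.connect(user="app", password="hunter2")\n'
    print(score_patch(candidate))
```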

Community

Paper author · Paper submitter · edited 1 day ago

🤖 AI is revolutionizing how we write code, with LLMs acting as tireless coding partners! But with this incredible speed comes a critical question: is the code they generate truly secure? 🛡️

Many security benchmarks only scratch the surface 🧐, testing code in isolated snippets. This approach misses the real-world vulnerabilities that can lurk in the complex interactions across an entire project.

🚀 Enter A.S.E, a pioneering repository-level benchmark that's changing the game! Instead of just looking at a single file, A.S.E evaluates security across a whole codebase, providing a much more realistic and challenging test for our AI models.

This is a huge step forward ⏩ in building a future where AI assistants are not just powerful coders, but also vigilant security partners. It's time to push for safer, more reliable AI-generated code!

#AISecurity 🔐 #SecureCoding 💻 #LLMs 🧠 #CodeGeneration ⌨️




Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 4