Papers
arxiv:2508.18106

A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

Published on Aug 25
· Submitted by wanng on Sep 1
#3 Paper of the day
Abstract

A.S.E is a benchmark for evaluating the security of code generated by large language models using real-world repositories and expert-defined rules, revealing insights into model performance and decoding strategies.

AI-generated summary

The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks are inadequate, as they focus on isolated code snippets, employ unstable evaluation methods that lack reproducibility, and fail to connect the quality of input context with the security of the output. To address these gaps, we introduce A.S.E (AI Code Generation Security Evaluation), a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context like build systems and cross-file dependencies. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability. Our evaluation of leading LLMs on A.S.E reveals three key findings: (1) Claude-3.7-Sonnet achieves the best overall performance. (2) The security gap between proprietary and open-source models is narrow; Qwen3-235B-A22B-Instruct attains the top security score. (3) Concise, "fast-thinking" decoding strategies consistently outperform complex, "slow-thinking" reasoning for security patching.
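
To make the rule-based scoring idea concrete, here is a minimal, hypothetical sketch of how a security check on a model-generated patch might be organized. This is not the authors' implementation: the rule patterns, weights, and function names below are illustrative assumptions, and the actual A.S.E framework applies its expert-defined rules inside a containerized harness with full repository context.

```python
# Illustrative sketch (NOT the authors' code) of rule-based security scoring
# for a model-generated patch. Rule patterns and names are hypothetical.
import re
from dataclasses import dataclass


@dataclass
class SecurityRule:
    """One expert-defined rule: a CWE label plus a regex that flags an
    insecure construct which a correct patch should no longer contain."""
    cwe: str
    description: str
    insecure_pattern: str


# Hypothetical rules; A.S.E derives its real rules from documented CVEs.
RULES = [
    SecurityRule("CWE-89", "SQL query built via string formatting",
                 r"execute\(\s*[\"'].*%s.*[\"']\s*%"),
    SecurityRule("CWE-798", "Hard-coded credential",
                 r"password\s*=\s*[\"'][^\"']+[\"']"),
]


def score_patch(patched_source: str) -> dict:
    """Return per-rule pass/fail plus an aggregate security score in [0, 1]."""
    per_rule = {
        rule.cwe: re.search(rule.insecure_pattern, patched_source) is None
        for rule in RULES
    }
    return {
        "per_rule": per_rule,
        "security_score": sum(per_rule.values()) / len(per_rule),
    }


if __name__ == "__main__":
    # Toy "model-generated" patch that still hard-codes a credential,
    # so it passes the SQL rule but fails the credential rule (score 0.5).
    candidate = 'db.connect(user="app", password="hunter2")\n'
    print(score_patch(candidate))
```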

Community

Paper author · Paper submitter · edited 1 day ago

🤖 AI is revolutionizing how we write code, with LLMs acting as tireless coding partners! But with this incredible speed comes a critical question: is the code they generate truly secure? 🛡️

Many security benchmarks only scratch the surface 🧐, testing code in isolated snippets. This approach misses the real-world vulnerabilities that can lurk in the complex interactions across an entire project.

🚀 Enter A.S.E, a pioneering repository-level benchmark that's changing the game! Instead of just looking at a single file, A.S.E evaluates security across a whole codebase, providing a much more realistic and challenging test for our AI models.

This is a huge step forward ⏩ in building a future where AI assistants are not just powerful coders, but also vigilant security partners. It's time to push for safer, more reliable AI-generated code!

#AISecurity 🔐 #SecureCoding 💻 #LLMs 🧠 #CodeGeneration ⌨️




Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 4