GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Community Article · Published August 14, 2025

A meticulously curated dataset to assess how well AI models generate code compatible with specific Python library versions

Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia

Correspondence: [email protected], [email protected]

📄 Preprint: arXiv:2507.12367

Abstract: The rapid evolution of software libraries presents a considerable challenge for AI-assisted code generation. Existing benchmarks either overlook execution-based validation or focus on migration rather than targeted generation for specific versions. To address this, we introduce GitChameleon 2.0, a novel dataset of 328 Python coding problems, each associated with explicit library versions. Problems are validated with executable unit tests to ensure correctness under version constraints. Our evaluation shows that state-of-the-art systems achieve only 48–51% success rates, highlighting the difficulty of generating code compatible with specific versions. GitChameleon 2.0 enables more robust analysis of AI coding tools in modern and legacy stack environments.

Introduction

Large language models (LLMs) are becoming increasingly integral to software development workflows. Their capabilities in code completion, explanation, and debugging are well-documented and growing rapidly. Despite continual improvements in inference efficiency and reasoning, many advanced LLMs struggle in a crucial real-world scenario: generating code that works with specific library versions.

Library versioning is an ever-present constraint in real software projects, especially in production environments where specific versions can’t be easily upgraded. Without version-aware generation, AI tools risk introducing bugs or producing deprecated/invalid syntax, such as:

Figure 1: Example of a version-specific compatibility error with seaborn 0.13.0.
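To make the failure mode concrete, here is a minimal sketch (our own illustration, not an example taken from the dataset) of how a single seaborn call changes across versions: the `ci` parameter of `sns.barplot` was deprecated in favor of `errorbar` in seaborn 0.12, so code written against one API warns or breaks under the other.

```python
import seaborn as sns
from packaging.version import Version

tips = sns.load_dataset("tips")  # bundled example dataset

# The `ci` parameter of sns.barplot was deprecated in favor of
# `errorbar` in seaborn 0.12; version-unaware generation tends to
# emit whichever form dominated the model's training data.
if Version(sns.__version__) >= Version("0.12.0"):
    ax = sns.barplot(data=tips, x="day", y="total_bill", errorbar=("ci", 95))
else:
    ax = sns.barplot(data=tips, x="day", y="total_bill", ci=95)
```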

Benchmark

GitChameleon 2.0 consists of 328 version-conditioned Python problems drawn from 26 libraries spanning scientific computing, data science, and web development. The samples cover library releases from 2014 to 2023.

By design, the samples were drawn from a date range that predates the knowledge cutoff of the evaluated models: we show that version-conditioned generation is difficult even for versions that were present in the models' training data.

To evaluate performance on GitChameleon 2.0, each problem is accompanied by a suite of assertion-based unit tests, enabling a thorough execution-based assessment of potential solutions.
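As a concrete illustration of this setup, the sketch below pairs a task and a pinned library version with assertion-based unit tests. The field names and test are hypothetical, not the dataset's actual schema.

```python
# Hypothetical sketch of one benchmark sample; field names are
# illustrative, not the dataset's actual schema.
sample = {
    "library": "numpy",
    "version": "1.21.0",  # the solution must run under this pinned version
    "problem": "Compute the element-wise product of two arrays.",
}

def candidate_solution(a, b):
    """A model-generated solution under evaluation."""
    import numpy as np
    return np.multiply(a, b)

def run_unit_tests(solution):
    """Execution-based check: the solution passes iff all assertions hold."""
    import numpy as np
    out = solution(np.array([1, 2, 3]), np.array([4, 5, 6]))
    assert np.array_equal(out, np.array([4, 10, 18]))

run_unit_tests(candidate_solution)  # raises AssertionError on failure
```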

Evaluation Paradigms in Code Generation

To highlight the difference between GitChameleon 2.0 and existing benchmarks, consider the two evaluation paradigms shown in Figure 2.

Figure 2: An illustration of two evaluation paradigms for code generation models. VCG (Version-Conditioned Generation) focuses on the practical ability to generate code for specific, in-distribution library versions that the model has seen before. Code Evolution evaluates models against out-of-distribution data, using library versions or new libraries not encountered during training.

Evaluation Results

Our experiments reveal that even top enterprise-grade LLMs achieve success rates of only 48–51% on GitChameleon 2.0, underscoring the difficulty of version-aware code generation.

Providing error feedback (execution traces) can improve success rates, but most models still fail on subtle version-specific constraints. This indicates that current code generation systems lack robust mechanisms for reasoning about historical API changes, and that version awareness remains an open research challenge.
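As a rough sketch of how such feedback can be wired in (our own illustration, not the paper's evaluation harness; `model.generate` is a stand-in for any LLM API), one can execute the candidate against the unit tests and feed the traceback back for a repair attempt:

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, tests: str) -> tuple[bool, str]:
    """Execute candidate code plus its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def solve_with_feedback(model, problem: str, tests: str, max_rounds: int = 2) -> str:
    """Generate, test, and optionally repair using the execution trace."""
    prompt = problem
    code = ""
    for _ in range(max_rounds):
        code = model.generate(prompt)  # `model.generate` is a stand-in API
        passed, trace = run_candidate(code, tests)
        if passed:
            return code
        prompt = f"{problem}\n\nYour previous attempt failed with:\n{trace}\nPlease fix it."
    return code
```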

Conclusion & Availability

GitChameleon 2.0 introduces a new benchmarking approach to version-specific code generation that reflects the dynamic nature of real software stacks. The dataset, along with evaluation scripts, is publicly available on GitHub:

https://github.com/mrcabbage972/GitChameleonBenchmark

© 2025 GitChameleon 2.0 Authors. This work is licensed under a Creative Commons license.
