- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 22
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 11
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10

Collections including paper arxiv:2310.11248
- Creative Robot Tool Use with Large Language Models
  Paper • 2310.13065 • Published • 9
- CodeCoT and Beyond: Learning to Program and Test like a Developer
  Paper • 2308.08784 • Published • 5
- Lemur: Harmonizing Natural Language and Code for Language Agents
  Paper • 2310.06830 • Published • 32
- CodePlan: Repository-level Coding using LLMs and Planning
  Paper • 2309.12499 • Published • 75

- KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
  Paper • 2310.15511 • Published • 5
- HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
  Paper • 2310.14566 • Published • 26
- SmartPlay: A Benchmark for LLMs as Intelligent Agents
  Paper • 2310.01557 • Published • 12
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
  Paper • 2310.03214 • Published • 18

- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
  Paper • 2310.11248 • Published • 3
- Textbooks Are All You Need II: phi-1.5 technical report
  Paper • 2309.05463 • Published • 87
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale
  Paper • 2309.04564 • Published • 15
- What's In My Big Data?
  Paper • 2310.20707 • Published • 11

- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 22
- Evaluating Large Language Models Trained on Code
  Paper • 2107.03374 • Published • 8
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
  Paper • 2102.04664 • Published • 2

- Chain-of-Verification Reduces Hallucination in Large Language Models
  Paper • 2309.11495 • Published • 37
- CodePlan: Repository-level Coding using LLMs and Planning
  Paper • 2309.12499 • Published • 75
- SCREWS: A Modular Framework for Reasoning with Revisions
  Paper • 2309.13075 • Published • 15
- Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
  Paper • 2309.08532 • Published • 53