- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 22
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 11
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10
Collections including paper arxiv:2311.09204
- Fusion-Eval: Integrating Evaluators with LLMs
  Paper • 2311.09204 • Published • 6
- Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
  Paper • 2311.06720 • Published • 8
- Safurai 001: New Qualitative Approach for Code LLM Evaluation
  Paper • 2309.11385 • Published • 2
- Assessment of Pre-Trained Models Across Languages and Grammars
  Paper • 2309.11165 • Published • 1
- Contrastive Chain-of-Thought Prompting
  Paper • 2311.09277 • Published • 35
- Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying
  Paper • 2311.09578 • Published • 15
- Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
  Paper • 2311.08877 • Published • 7
- Fusion-Eval: Integrating Evaluators with LLMs
  Paper • 2311.09204 • Published • 6
- Language Models can be Logical Solvers
  Paper • 2311.06158 • Published • 19
- Fusion-Eval: Integrating Evaluators with LLMs
  Paper • 2311.09204 • Published • 6
- Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
  Paper • 2311.08877 • Published • 7
- Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"
  Paper • 2311.07587 • Published • 4
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models
  Paper • 2310.08491 • Published • 54
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Paper • 2310.11511 • Published • 76
- Calibrating LLM-Based Evaluator
  Paper • 2309.13308 • Published • 12
- Fusion-Eval: Integrating Evaluators with LLMs
  Paper • 2311.09204 • Published • 6