
TestTime RLVR-v2 HumanEval Evaluation Fixes

Date: 2025-01-25

Overview

We carried out a comprehensive set of fixes to resolve the 0% accuracy problem on the HumanEval benchmark.

Main Problems and Solutions

1. Missing import statements

Problem: HumanEval solutions were missing import statements such as from typing import List, so execution failed.

Solution:

  • Extract the import statements from the prompt and add them automatically, the same way EvalPlus does
  • Added the _add_imports_from_prompt() method
  • Removed the cheating approach of automatically adding missing imports (_add_missing_imports())

2. IPO triple extraction

Problem:

  • Only the first item of base_input was used
  • For HumanEval, IPO triples were generated from the test cases (cheating)

Solution:

  • Changed HumanEval to use only the docstring examples
  • Added the _extract_docstring_examples() method
  • Split the input format: arguments only for evaluation, full function call for display
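As a rough illustration of the docstring-only extraction described above, here is a minimal sketch. It is not the repository's _extract_docstring_examples(); the standalone function name, its signature, and the returned dictionary layout are assumptions made for this example, and it only captures single-line expected outputs.

```python
import re
from typing import Dict, List


def extract_docstring_examples(prompt: str, func_name: str) -> List[Dict[str, str]]:
    """Collect `>>> func_name(...)` examples and the line that follows each.

    Hypothetical helper: name, signature, and output layout are assumptions.
    """
    call_re = re.compile(r"^\s*>>>\s*(" + re.escape(func_name) + r"\((.*)\))\s*$")
    lines = prompt.splitlines()
    examples = []
    for i, line in enumerate(lines[:-1]):
        match = call_re.match(line)
        if match is None:
            continue
        expected = lines[i + 1].strip()
        # Skip calls with no usable output line (blank, another prompt,
        # or the closing docstring quotes).
        if not expected or expected.startswith((">>>", '"""', "'''")):
            continue
        examples.append({
            "input": match.group(2),           # arguments only, for evaluation
            "full_input_str": match.group(1),  # full call, for display
            "output": expected,
        })
    return examples
```

Because only the prompt text is read, the hidden test cases never enter the IPO triples, which is the point of the fix.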

3. Prompt consistency

Problem:

  • The hardcoded prompt in batch_evaluate_testtime.py did not match solution_generator.py
  • Multi-function problems such as HumanEval/50 were handled poorly

Solution:

  • Updated all prompts to match solution_generator.py
  • Added special handling for multi-function cases

4. Task generation

Problem:

  • For HumanEval, the doctest examples were included in the task, which amounted to cheating
  • The induction task message used a generic message

Solution:

  • Remove the doctests with the _remove_doctest_examples() method
  • For HumanEval, extract the function description and use it as the message
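One possible way to pull the function description out of a HumanEval prompt is sketched below. This is an assumption about how such a helper could work, not the actual _extract_function_description() in task_generator.py; the function name and signature are hypothetical. It uses the standard ast module and keeps the docstring text that precedes the first >>> example.

```python
import ast
from typing import Optional


def extract_function_description(code: str, func_name: str) -> Optional[str]:
    """Return the prose part of func_name's docstring, without doctest examples.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            docstring = ast.get_docstring(node)
            if not docstring:
                return None
            # Everything before the first doctest prompt is treated as prose.
            description = docstring.split(">>>", 1)[0]
            # Collapse line breaks and repeated spaces into single spaces.
            return " ".join(description.split()) or None
    return None
```

Selecting the function by name also covers prompts that define more than one function, since each docstring is looked up separately.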

5. Evaluation failures

Problem:

  • Induction: evaluation failed because the full function call was used
  • Abduction: only the arguments were stored, so evaluation used a different format from MBPP

Solution:

  • Store input (arguments) and full_input_str (full call) separately in the IPO triple
  • Changed the abduction expected_solution to use full_input_str

Modified Files

1. /home/ubuntu/RLVR/TestTime-RLVR-v2/absolute_zero_reasoner/testtime/solution_generator.py

  • Added the _add_imports_from_prompt() method
  • Removed _add_missing_imports() (to prevent cheating)
  • Improved the HumanEval prompt
  • Added multi-function handling logic

2. /home/ubuntu/RLVR/TestTime-RLVR-v2/absolute_zero_reasoner/testtime/ipo_extractor.py

  • Added the _extract_docstring_examples() method
  • Changed HumanEval to use only docstring examples
  • Split the input format (evaluation / display)

3. /home/ubuntu/RLVR/TestTime-RLVR-v2/absolute_zero_reasoner/testtime/task_generator.py

  • Added the _remove_doctest_examples() method
  • Added the _extract_function_description() method
  • Improved the HumanEval induction message
  • Changed the abduction expected_solution to the full function call

4. /home/ubuntu/RLVR/TestTime-RLVR-v2/test/batch_evaluate_testtime.py

  • Updated the hardcoded prompts to match solution_generator.py
  • Added logging of the full LLM prompt

Technical Details

IPO triple format differences

// MBPP (existing)
{
  "input": "intersperse([], 4)",
  "full_input_str": "intersperse([], 4)"
}

// HumanEval (modified)
{
  "input": "[], 4",  // for evaluation (arguments only)
  "full_input_str": "intersperse([], 4)"  // for display (full call)
}
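To make the split concrete, the sketch below builds a HumanEval-style triple from a full call string and then evaluates a candidate function using only the argument part. The IPOTriple dataclass, make_humaneval_triple(), and run_with_args() are hypothetical names introduced for this example and are not taken from the repository. It assumes the arguments are positional Python literals.

```python
import ast
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class IPOTriple:
    """Illustrative container; field names follow the format shown above."""
    input: str           # arguments only, used when evaluating
    full_input_str: str  # full call, used for display and abduction targets
    output: str


def make_humaneval_triple(func_name: str, call: str, output: str) -> IPOTriple:
    """Split 'f(a, b)' into the argument string 'a, b' plus the full call."""
    prefix = func_name + "("
    if not (call.startswith(prefix) and call.endswith(")")):
        raise ValueError(f"not a call to {func_name}: {call!r}")
    return IPOTriple(input=call[len(prefix):-1], full_input_str=call, output=output)


def run_with_args(func: Callable[..., Any], triple: IPOTriple) -> Any:
    """Call func with the literal positional arguments stored in triple.input."""
    if not triple.input.strip():
        return func()
    # Wrapping in "( ... ,)" parses the argument list as a tuple literal.
    args = ast.literal_eval(f"({triple.input},)")
    return func(*args)
```

Under this layout an induction check can compare repr(run_with_args(candidate, triple)) with triple.output, while an abduction target can be taken from triple.full_input_str.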

Import extraction logic

def _add_imports_from_prompt(self, prompt: str, solution: str) -> str:
    # Extract the import statements from the prompt
    # Prepend them to the solution
    # Same approach as EvalPlus
    ...
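The outline above leaves the body open. A minimal standalone sketch of one way to fill it in follows; it is an assumption, not the code in solution_generator.py. It treats only unindented import or from ... import lines in the prompt as imports and skips any that the solution already contains. Whether EvalPlus does exactly this is not verified here.

```python
import re

# Matches top-level (unindented) import statements only, so prose inside an
# indented docstring that happens to start with "from" is ignored.
_IMPORT_RE = re.compile(r"^(?:import\s+\S+|from\s+\S+\s+import\s+\S+)")


def add_imports_from_prompt(prompt: str, solution: str) -> str:
    """Prepend the prompt's import lines that the solution does not contain.

    Hypothetical standalone version of the method outlined above.
    """
    existing = {line.strip() for line in solution.splitlines()}
    missing = []
    for line in prompt.splitlines():
        statement = line.strip()
        if _IMPORT_RE.match(line) and statement not in existing and statement not in missing:
            missing.append(statement)
    if not missing:
        return solution
    return "\n".join(missing) + "\n\n" + solution
```

Only imports that the benchmark prompt itself provides are restored, so nothing is added that the model was not already given.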

Doctest removal

def _remove_doctest_examples(self, code: str) -> str:
    # Remove the >>> examples inside the docstring
    # Keep the function description
    ...
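A line-based heuristic is one way to realize this outline. The sketch below is an assumption rather than the repository's implementation: it drops each >>> line together with the lines after it, up to the next blank line or the closing docstring quotes. It does not check that a >>> line actually sits inside a docstring.

```python
def remove_doctest_examples(code: str) -> str:
    """Strip doctest prompts and their expected output, keeping other text.

    Hypothetical standalone version of the method outlined above.
    """
    kept = []
    in_example = False
    for line in code.splitlines():
        stripped = line.strip()
        if stripped.startswith(">>>"):
            in_example = True
            continue
        if in_example:
            ends_docstring = '"""' in stripped or "'''" in stripped
            if stripped and not ends_docstring:
                continue  # continuation or expected-output line
            in_example = False
        kept.append(line)
    return "\n".join(kept)
```

The signature and the descriptive part of the docstring are kept, so the task still states what the function should do without showing input/output pairs.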

Results

  • HumanEval evaluation now works correctly
  • Evaluation is fair, with no cheating
  • The evaluation method stays consistent with MBPP
  • Import handling is compatible with EvalPlus

Future Improvements

  • Testing on more HumanEval problems is needed
  • Better handling of various edge cases
  • Performance optimization