
TTRLVR Unified Architecture - Detailed Operation

Table of Contents

  1. Overview
  2. Overall Architecture
  3. Execution Flow
  4. Core Components
  5. Phase-by-Phase Operation
  6. Synchronization Mechanism
  7. Data Flow
  8. Implementation Details
  9. Performance Optimization
  10. Debugging and Monitoring
  11. Troubleshooting Guide
  12. Conclusion

1. Overview

1.1 Purpose

TTRLVR Unified restructures the original TTRLVR's separated pipeline into a single unified VeRL session, which resolves the weight-synchronization problem and improves performance.

1.2 Key Improvements

  • Single vLLM instance: one vLLM is used for the entire training run
  • Synchronization fixed: dummy_dtensor can be used safely
  • Performance gain: 30-40% faster by removing the vLLM re-creation overhead
  • Memory efficiency: no repeated allocation/deallocation cycles

1.3 Main Files

  • train_ttrlvr_azr_unified.py: main launcher script
  • test/trainer/unified_ttrlvr_trainer.py: unified Trainer class
  • test/configs/ttrlvr_azr_unified_4gpu.yaml: VeRL configuration file

2. Overall Architecture

2.1 Original vs Unified Structure

Original TTRLVR (separated)

Round 1:
β”œβ”€β”€ Phase 1-4: RemoteTestTimePipeline (dedicated vLLM #1)
β”‚   └── ray.kill(pipeline)  # vLLM destroyed
└── Phase 5: VeRL Training (new vLLM #2)
    └── trainer.init_workers()  # called every round

Round 2: (yet more new vLLM instances...)

Unified TTRLVR (unified)

Initialization:
└── trainer.init_workers()  # once!

Round 1-N:
β”œβ”€β”€ Phase 1-4: data generation (same vLLM)
└── Phase 5: PPO training (same vLLM)

2.2 μ»΄ν¬λ„ŒνŠΈ 관계도

train_ttrlvr_azr_unified.py
    β”‚
    β”œβ”€β”€ ν™˜κ²½ μ„€μ • & 인자 νŒŒμ‹±
    β”‚
    β”œβ”€β”€ VeRL generate_main() 호좜
    β”‚   β”‚
    β”‚   └── UnifiedTTRLVRTrainer 생성
    β”‚       β”‚
    β”‚       β”œβ”€β”€ CompleteTestTimePipeline (Phase 1-4)
    β”‚       β”‚   β”œβ”€β”€ 벀치마크 문제 λ‘œλ”©
    β”‚       β”‚   β”œβ”€β”€ ν”„λ‘œκ·Έλž¨ 생성 (diverse_programs)
    β”‚       β”‚   β”œβ”€β”€ IPO μΆ”μΆœ (IPOTripleExtractor)
    β”‚       β”‚   β”œβ”€β”€ Task 생성 (TestTimeTaskGenerator)
    β”‚       β”‚   └── 검증 및 필터링
    β”‚       β”‚
    β”‚       └── VeRL PPO Training (Phase 5)
    β”‚           β”œβ”€β”€ 데이터 ν˜•μ‹ λ³€ν™˜
    β”‚           β”œβ”€β”€ Response 생성
    β”‚           β”œβ”€β”€ Reward 계산
    β”‚           └── Policy μ—…λ°μ΄νŠΈ

3. Execution Flow

3.1 Running the Script

python train_ttrlvr_azr_unified.py --benchmark mbpp --problems 10 --rounds 30 --gpu 0,1,2,3

3.2 Initialization Steps

Step 1: Argument Parsing

def main():
    # Parse command-line arguments
    args = parse_arguments()

    # Set up the environment (GPUs, paths, etc.)
    setup_environment(args.gpu)
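
The argument parser itself is not reproduced in this document; a minimal sketch consistent with the command line in 3.1 might look like this (flag names beyond those shown above are assumptions):

import argparse

def parse_arguments():
    """Hypothetical parser matching the CLI used in Section 3.1."""
    parser = argparse.ArgumentParser(description="Unified TTRLVR training")
    parser.add_argument('--benchmark', default='mbpp')             # mbpp / humaneval
    parser.add_argument('--problems', type=int, default=10)        # number of problems per run
    parser.add_argument('--rounds', type=int, default=30)          # training rounds
    parser.add_argument('--gpu', default='0,1,2,3')                # visible GPU IDs
    parser.add_argument('--problem-id', dest='problem_id', default=None)  # optional single problem
    return parser.parse_args()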

Step 2: Building the Problem List

# Extract problem IDs from the benchmark
problem_ids = create_problem_list(args.benchmark, args.problems, args.problem_id)
# e.g. ['Mbpp/1', 'Mbpp/2', 'Mbpp/3', ...]
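
create_problem_list is not shown here; a minimal sketch that produces IDs in the format of the example above (the real implementation may instead read the benchmark data):

from typing import List, Optional

def create_problem_list(benchmark: str, num_problems: int,
                        problem_id: Optional[str] = None) -> List[str]:
    """Return problem IDs for one run, e.g. ['Mbpp/1', 'Mbpp/2', ...]."""
    if problem_id is not None:
        return [problem_id]                      # single-problem run
    prefix = benchmark.capitalize()              # 'mbpp' -> 'Mbpp'
    return [f"{prefix}/{i}" for i in range(1, num_problems + 1)]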

Step 3: Setting Environment Variables

# Settings passed through to UnifiedTTRLVRTrainer inside VeRL
os.environ['TTRLVR_PROBLEM_IDS'] = json.dumps(problem_ids)
os.environ['TTRLVR_TOTAL_ROUNDS'] = str(args.rounds)
os.environ['TTRLVR_OUTPUT_DIR'] = output_dir
os.environ['TTRLVR_CONFIG'] = json.dumps(ttrlvr_config)

Step 4: Launching VeRL

# Invoke VeRL's main_generation
verl_args = [
    'train_ttrlvr_azr_unified.py',
    f'--config-path={config_path}',
    '--config-name=ttrlvr_azr_unified_4gpu',
    f'trainer.project_name=ttrlvr_unified_{args.benchmark}',
    f'trainer.total_epochs={args.rounds}',  # map each round to one epoch
]

sys.argv = verl_args
generate_main()  # run VeRL's main entry point

3.3 VeRL Initialization

When VeRL's generate_main() runs, it performs the following steps:

  1. Config loading: parse ttrlvr_azr_unified_4gpu.yaml
  2. Ray cluster initialization: set up the distributed environment
  3. UnifiedTTRLVRTrainer creation: load the trainer class named in the config
  4. Worker initialization: call trainer.init_workers() (once only!); see the sketch below
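
A condensed sketch of this startup sequence (function and argument names are illustrative, not VeRL's exact API):

import ray
from omegaconf import OmegaConf

def init_unified_trainer(config_path: str, trainer_cls):
    config = OmegaConf.load(config_path)        # 1. parse ttrlvr_azr_unified_4gpu.yaml
    if not ray.is_initialized():
        ray.init()                              # 2. bring up the Ray cluster
    trainer = trainer_cls(config=config)        # 3. e.g. UnifiedTTRLVRTrainer
    trainer.init_workers()                      # 4. create FSDP/vLLM workers exactly once
    return trainer                              # fit() is then called for the round loop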

4. Core Components

4.1 UnifiedTTRLVRTrainer

class UnifiedTTRLVRTrainer(ReasonRLRayPPOTrainer):
    """
    Unified Trainer that runs every TTRLVR phase inside a single VeRL session.
    """
    
    def __init__(self, ttrlvr_config, problem_ids, total_rounds, ...):
        super().__init__(...)
        
        # TTRLVR-specific settings
        self.ttrlvr_config = ttrlvr_config
        self.problem_ids = problem_ids
        self.total_rounds = total_rounds
        self.current_round = 0
        
        # CompleteTestTimePipeline is initialized lazily (see 4.2)
        self.ttrlvr_pipeline = None

4.2 CompleteTestTimePipeline Integration

def _init_ttrlvr_pipeline(self):
    """Initialize CompleteTestTimePipeline on top of VeRL's vLLM"""
    
    # Reuse VeRL's model
    self.ttrlvr_pipeline = CompleteTestTimePipeline(
        model=None,  # accessed through the VeRL wrapper
        tokenizer=self.tokenizer,
        config=self.testtime_config,
        logger=self.ttrlvr_logger
    )
    
    # Route the pipeline's generation calls through VeRL's vLLM
    self.ttrlvr_pipeline.generate_with_verl = self._generate_with_vllm

5. Phase-by-Phase Operation

5.1 fit() Method - Main Training Loop

def fit(self):
    """Drive the full training loop"""
    
    # Initialize the logger
    logger = ReasonRLTracking(...)
    
    # Load a checkpoint if one exists
    self._load_checkpoint()
    
    # Iterate over rounds
    for round_num in range(1, self.total_rounds + 1):
        self.current_round = round_num
        
        # ====== Phase 1-4: data generation ======
        round_data = self._generate_round_data()
        
        # ====== Phase 5: PPO training ======
        metrics = self._train_one_round(round_data, logger)
        
        # Save a checkpoint every 5 rounds
        if round_num % 5 == 0:
            self._save_checkpoint()

5.2 Phase 1-4: Data Generation

5.2.1 Structure of _generate_round_data()

def _generate_round_data(self) -> List[Dict[str, Any]]:
    """Run Phase 1-4"""
    
    # Initialize the pipeline (first round only)
    if self.ttrlvr_pipeline is None:
        self._init_ttrlvr_pipeline()
    
    all_tasks = []
    
    for problem_id in self.problem_ids:
        # Run CompleteTestTimePipeline
        # (benchmark_config and session_timestamp are prepared earlier in the full method)
        result = self.ttrlvr_pipeline.run_complete_pipeline(
            benchmark_config=benchmark_config,
            problem_id=problem_id,
            round_num=self.current_round,
            session_timestamp=session_timestamp
        )
        
        if result['success']:
            tasks = result['final_tasks']
            all_tasks.extend(tasks)
    
    return all_tasks

5.2.2 Inside CompleteTestTimePipeline

Phase 1: Diverse Program Generation

# 1. Load the benchmark problem
problem = benchmark_loader.load_problem(benchmark_config, problem_id)

# 2. Evaluate baseline performance
baseline_results = self._evaluate_baseline_performance(problem)

# 3. Generate diverse programs
diverse_programs = self._generate_diverse_programs_and_ipo(problem)
# Internally this:
# - uses carefully designed prompt templates
# - varies temperature to encourage diversity
# - validates syntax

Phase 2: I/O Pair Extraction

# Use IPOTripleExtractor
ipo_extractor = IPOTripleExtractor(config, logger, model, tokenizer)

for program in diverse_programs:
    # Generate candidate inputs
    inputs = ipo_extractor.generate_inputs(program)
    
    # Compute the corresponding outputs
    for inp in inputs:
        output = executor.execute(program, inp)
        ipo_buffer.add_triple(inp, program, output)

Phase 3: Task Generation

# Use TestTimeTaskGenerator
task_generator = TestTimeTaskGenerator(config, logger)

# Induction: I/O β†’ Program
induction_tasks = task_generator.create_induction_tasks(ipo_triples)

# Deduction: Program + Input β†’ Output
deduction_tasks = task_generator.create_deduction_tasks(ipo_triples)

# Abduction: Program + Output β†’ Input
abduction_tasks = task_generator.create_abduction_tasks(ipo_triples)

Phase 4: Validation and Filtering

# Validate each task
valid_tasks = []
for task in all_tasks:
    if validator.is_valid(task):
        valid_tasks.append(task)

5.3 Phase 5: PPO Training

5.3.1 Structure of _train_one_round()

def _train_one_round(self, round_data: List[Dict], logger) -> Dict[str, float]:
    """Phase 5: PPO training"""
    
    # 1. Convert the data
    train_dataset = self._convert_to_verl_dataset(round_data)
    
    # 2. Build the DataLoader
    self.train_dataloader = self._create_dataloader(
        train_dataset,
        batch_size=self.config.data.train_batch_size
    )
    
    # 3. Train for one epoch
    epoch_metrics = defaultdict(list)  # from collections import defaultdict
    for step, batch in enumerate(self.train_dataloader):
        # PPO Step 1: generate responses
        gen_batch_output = self.actor_rollout_wg.generate_sequences(batch)
        
        # PPO Step 2: compute rewards
        reward_tensor = self.reward_fn(batch.union(gen_batch_output))
        
        # PPO Step 3: update the policy
        update_metrics = self._ppo_update(batch, reward_tensor)
        
        # Collect metrics
        for k, v in update_metrics.items():
            epoch_metrics[k].append(v)
    
    return {k: np.mean(v) for k, v in epoch_metrics.items()}

5.3.2 Data Conversion

def _convert_to_verl_dataset(self, round_data: List[Dict]) -> Any:
    """Convert TTRLVR format β†’ VeRL format"""
    
    converted_data = []
    for task in round_data:
        # Tokenize the prompt
        prompt_ids = self.tokenizer(
            task['prompt'],
            max_length=self.config.data.max_prompt_length
        ).input_ids
        
        # VeRL DataProto format
        verl_item = {
            'input_ids': prompt_ids,
            'prompt': task['prompt'],
            'target': task['target'],
            'task_type': task['task_type'],
            'problem_id': task['problem_id']
        }
        converted_data.append(verl_item)
    
    return converted_data

6. Synchronization Mechanism

6.1 The Core Problem

The original TTRLVR created a fresh vLLM instance every round, so weight synchronization failed whenever dummy_dtensor was used as the load format.

6.2 Solution

6.2.1 Single vLLM Instance

# Initialization (once)
trainer.init_workers()
β”œβ”€β”€ create FSDP workers
β”œβ”€β”€ create vLLM workers
└── initial synchronization (sync_model_weights)

# Every subsequent round reuses the same instance
Round 1: Phase 1-4 β†’ Phase 5 (same vLLM)
Round 2: Phase 1-4 β†’ Phase 5 (same vLLM)
...

6.2.2 Synchronization Flow

# Behavior of FSDPVLLMShardingManager
class FSDPVLLMShardingManager:
    def __enter__(self):
        if not self.base_sync_done:
            # First call: FSDP β†’ vLLM weight synchronization
            sync_model_weights(actor_weights, load_format='dummy_dtensor')
            self.base_sync_done = True
        # Later calls: weights stay in sync through shared memory references

6.3 Memory Reference Mechanism

FSDP model (GPU 0-3)          vLLM model (GPU 0-1)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Parameter A β”‚ ─────────→   β”‚ Parameter A β”‚ (same memory reference)
β”‚ Parameter B β”‚ ─────────→   β”‚ Parameter B β”‚
β”‚ Parameter C β”‚ ─────────→   β”‚ Parameter C β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

PPO update β†’ FSDP parameters change β†’ vLLM automatically sees the new values

7. Data Flow

7.1 Round 1 in Detail

1. Problem: Mbpp/2 (e.g. "write a function that adds two numbers")
   β”‚
   β”œβ”€β”€ Phase 1: program generation
   β”‚   β”œβ”€β”€ Prompt: "Generate 4 different solutions..."
   β”‚   β”œβ”€β”€ vLLM generation (synchronization happens here)
   β”‚   └── Output: [prog1, prog2, prog3, prog4]
   β”‚
   β”œβ”€β”€ Phase 2: I/O extraction
   β”‚   β”œβ”€β”€ generate inputs for each program
   β”‚   β”œβ”€β”€ vLLM generation (synchronization skipped)
   β”‚   └── Output: [(input1, output1), (input2, output2), ...]
   β”‚
   β”œβ”€β”€ Phase 3: task generation
   β”‚   β”œβ”€β”€ Induction: (1, 3) β†’ "def add(a,b): return a+b"
   β”‚   β”œβ”€β”€ Deduction: (prog, 5) β†’ 8
   β”‚   └── Abduction: (prog, 10) β†’ (4, 6)
   β”‚
   β”œβ”€β”€ Phase 4: validation
   β”‚   └── keep only valid tasks
   β”‚
   └── Phase 5: PPO training
       β”œβ”€β”€ build batches
       β”œβ”€β”€ generate responses (same vLLM)
       β”œβ”€β”€ compute rewards
       └── update the FSDP model

7.2 Data Format Conversion

# TTRLVR task format
{
    'problem_id': 'Mbpp/2',
    'task_type': 'induction',
    'input': 5,
    'output': 10,
    'target': 'def multiply_by_two(x): return x * 2',
    'prompt': 'Given input 5 produces output 10, write the function:'
}

# ↓ conversion

# VeRL DataProto format
{
    'input_ids': tensor([1, 234, 567, ...]),  # tokenized prompt
    'attention_mask': tensor([1, 1, 1, ...]),
    'prompt': 'Given input 5 produces output 10...',
    'target': 'def multiply_by_two(x): return x * 2',
    'meta_info': {
        'task_type': 'induction',
        'problem_id': 'Mbpp/2'
    }
}

8. Implementation Details

8.1 Integration with VeRL

8.1.1 The _generate_with_vllm Method

def _generate_with_vllm(self, prompt: str, temperature: float = 0.7):
    """Generate text through VeRL's vLLM"""
    
    # 1. Tokenize
    input_ids = self.tokenizer(prompt, ...).input_ids
    
    # 2. Build a DataProto
    prompts_proto = DataProto.from_dict({
        "input_ids": input_ids.cuda(),
        "attention_mask": torch.ones_like(input_ids).cuda(),
    })
    
    # 3. Set meta information
    prompts_proto.meta_info = {
        "eos_token_id": self.tokenizer.eos_token_id,
        "temperature": temperature,
        "do_sample": True,
        "response_length": 256
    }
    
    # 4. Generate with VeRL's vLLM
    outputs = self.actor_rollout_wg.generate_sequences(prompts_proto)
    
    # 5. Decode and return
    return self.tokenizer.decode(outputs.batch["input_ids"][0])

8.1.2 CompleteTestTimePipeline Modification

# Make CompleteTestTimePipeline use VeRL's vLLM
self.ttrlvr_pipeline.generate_with_verl = self._generate_with_vllm

# Inside the pipeline, generation now looks like:
# response = self.generate_with_verl(prompt)  # uses VeRL's vLLM

8.2 Memory Management

8.2.1 Cleanup Between Rounds

def _manage_memory_between_rounds(self):
    """Clean up memory between rounds (the worker instances are kept)"""
    
    # Clear only the GPU cache
    torch.cuda.empty_cache()
    
    # Clear the vLLM KV cache (optional)
    if hasattr(self.actor_rollout_wg, 'clear_kv_cache'):
        self.actor_rollout_wg.clear_kv_cache()
    
    # Garbage collection
    import gc
    gc.collect()

8.2.2 Memory Monitoring

def _monitor_memory(self):
    """Monitor GPU memory usage"""
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        print(f"GPU {i}: Allocated={allocated:.2f}GB, Reserved={reserved:.2f}GB")

8.3 μ—λŸ¬ 처리 및 볡ꡬ

def _safe_generate(self, prompt: str, max_retries: int = 3):
    """μ•ˆμ „ν•œ 생성 with μž¬μ‹œλ„"""
    for attempt in range(max_retries):
        try:
            return self._generate_with_vllm(prompt)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            torch.cuda.empty_cache()
            time.sleep(1)

8.4 Checkpoint Management

def _save_checkpoint(self):
    """Save a checkpoint"""
    checkpoint = {
        'round': self.current_round,
        'model_state_dict': self.actor_rollout_wg.state_dict(),
        'optimizer_state_dict': self.optimizer.state_dict(),
        'metrics': self.accumulated_metrics,
        'timestamp': datetime.now().isoformat()
    }
    
    path = f"{self.checkpoint_dir}/round_{self.current_round}.pt"
    torch.save(checkpoint, path)

9. Performance Optimization

9.1 Batch Processing

  • Process Phase 1-4 generation requests in batches wherever possible (sketched below)
  • Exploit vLLM's continuous batching
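
A hedged sketch of how such batching could look, written in the same style as _generate_with_vllm in Section 8.1.1 (a hypothetical _generate_batch_with_vllm helper; padding behavior and meta_info keys mirror the single-prompt version and may differ in the real code):

def _generate_batch_with_vllm(self, prompts, temperature=0.7):
    # Tokenize all prompts together so they form one padded batch
    enc = self.tokenizer(prompts, return_tensors="pt", padding=True)
    proto = DataProto.from_dict({
        "input_ids": enc.input_ids.cuda(),
        "attention_mask": enc.attention_mask.cuda(),
    })
    proto.meta_info = {
        "eos_token_id": self.tokenizer.eos_token_id,
        "temperature": temperature,
        "do_sample": True,
        "response_length": 256,
    }
    # One call per batch amortizes scheduling overhead; vLLM's continuous
    # batching interleaves the sequences internally
    outputs = self.actor_rollout_wg.generate_sequences(proto)
    return [self.tokenizer.decode(ids) for ids in outputs.batch["input_ids"]]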

9.2 GPU Utilization

  • vLLM: GPU 0-1 (tensor parallel)
  • FSDP: GPU 0-3 (data parallel)
  • Efficient use of GPU memory across both

9.3 I/O Optimization

  • Store intermediate data in Parquet format (see the sketch after this list)
  • Asynchronous I/O where possible
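
A minimal sketch of the Parquet step (assumes pandas with a Parquet engine such as pyarrow is installed; the helper name and path pattern are illustrative):

import pandas as pd

def save_round_tasks(tasks, round_num, output_dir):
    """Persist one round's generated tasks as a Parquet file."""
    df = pd.DataFrame(tasks)  # one row per task dict (prompt, target, task_type, ...)
    path = f"{output_dir}/round_{round_num}_tasks.parquet"
    df.to_parquet(path, index=False)
    return path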

10. Debugging and Monitoring

10.1 Logging Layout

/home/ubuntu/RLVR/TestTime-RLVR-v2/logs/
β”œβ”€β”€ ttrlvr_unified_20241107_120000.log  # main log
β”œβ”€β”€ round_1/
β”‚   β”œβ”€β”€ phase_1_4.log  # data generation log
β”‚   └── phase_5.log    # training log
└── metrics/
    └── tensorboard/   # training metrics

10.2 Key Monitoring Metrics

The trainer tracks the following per-round indicators (a collection sketch follows the list):

  • Wall-clock time per round
  • Number of generated tasks
  • Average reward
  • GPU memory usage
  • Number of weight synchronizations
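
A hedged sketch of how these per-round numbers could be assembled and passed to the tracking logger (helper and metric names are illustrative, not the trainer's actual logging code):

import time
import torch

def log_round_metrics(logger, round_num, tasks, reward_mean, round_start_time):
    metrics = {
        "round/duration_sec": time.time() - round_start_time,        # time per round
        "round/num_tasks": len(tasks),                                # generated tasks
        "round/reward_mean": reward_mean,                             # average reward
        "round/gpu_mem_gb": torch.cuda.memory_allocated() / 1024**3,  # GPU memory usage
    }
    logger.log(data=metrics, step=round_num)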

11. Troubleshooting Guide

11.1 OOM (Out of Memory)

  • Lower gpu_memory_utilization (default: 0.35)
  • Reduce max_num_seqs
  • Reduce the batch size (see the override sketch at the end of this section)

11.2 Synchronization Issues

  • Check that load_format is dummy_dtensor
  • Check that the vLLM instance is not being recreated

11.3 Slow Performance

  • Check GPU utilization
  • Increase the batch size
  • Check that enforce_eager=False (enables CUDA graphs; overrides sketched below)
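
These knobs can be adjusted with the same Hydra-style overrides that Step 4 of Section 3.2 uses for trainer.total_epochs. The key paths below are assumptions about the config layout; verify them against ttrlvr_azr_unified_4gpu.yaml before use:

verl_args += [
    'actor_rollout_ref.rollout.gpu_memory_utilization=0.3',  # OOM: shrink vLLM's memory share
    'actor_rollout_ref.rollout.max_num_seqs=32',             # OOM: fewer concurrent sequences
    'data.train_batch_size=64',                               # OOM: smaller PPO batch
    'actor_rollout_ref.rollout.enforce_eager=False',          # speed: allow CUDA graph capture
]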

12. Conclusion

TTRLVR Unified keeps all of the original TTRLVR's functionality while achieving the following:

  1. Structural improvement: the previously separate phases run in a single VeRL session
  2. Performance gain: 30-40% faster by removing the vLLM re-creation overhead
  3. Stability: the weight-synchronization problem is resolved
  4. Scalability: larger models and longer runs (more rounds) are supported

This architecture combines TTRLVR's fine-grained data generation with VeRL's efficient PPO training.