Overview

This reward model is trained to predict which of two candidate responses to a prompt a human would prefer. It is intended to be used as the reward signal in a Reinforcement Learning from Human Feedback (RLHF) pipeline.
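
Reward models like this one are typically trained on preference pairs with a pairwise (Bradley-Terry) ranking loss that pushes the score of the chosen response above the score of the rejected one. The snippet below is a minimal sketch of that objective, not the exact training code for this checkpoint; the reward values are placeholders.

import torch
import torch.nn.functional as F

# Placeholder scalar rewards for one preference pair
reward_chosen = torch.tensor([1.7])    # score of the human-preferred response
reward_rejected = torch.tensor([0.3])  # score of the rejected response

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())  # decreases as the margin between chosen and rejected grows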

Model Architecture

  • Base Model: Llama-3-8B (after SFT and DPO)
  • Output: Single scalar reward value
  • Parameters: 8B
  • Training Framework: DeepSpeed + TRL (see the training sketch below)
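
As a rough illustration of the training setup, the sketch below shows how a reward model of this shape could be trained with TRL's RewardTrainer. It assumes a recent TRL release (where the trainer accepts a preference dataset with "chosen"/"rejected" columns and a processing_class argument); the base model id, dataset, and hyperparameters are illustrative and not necessarily those used for this checkpoint. DeepSpeed would be enabled through the usual accelerate/DeepSpeed configuration and is omitted here.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "meta-llama/Meta-Llama-3-8B"  # illustrative base checkpoint

# Single-label classification head -> one scalar reward per sequence
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Llama 3 has no pad token by default; reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Any preference dataset with "chosen"/"rejected" columns works here
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = RewardConfig(output_dir="llama3-8b-rm", per_device_train_batch_size=1)
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()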

Example Usage

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)
import torch

device = 'cuda:0'
model_name = "Nagi-ovo/Llama-3-8B-RM"

# Load the reward model with 4-bit NF4 quantization to reduce memory usage
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
    ),
    device_map=device,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

SYSTEM_PROMPT = "You are a helpful assistant"

def format_prompt_answer(prompt, answer):
    """Format the input for reward model evaluation"""
    return f"###System: {SYSTEM_PROMPT}\n###Question: {prompt}\n###Answer: {answer}<|end_of_text|>"

def get_reward_score(prompt, answer):
    """Get reward score for a given prompt-answer pair"""
    formatted_input = format_prompt_answer(prompt, answer)
    inputs = tokenizer(formatted_input, return_tensors='pt').to(device)

    # Single-label classification head produces one logit: the scalar reward
    with torch.no_grad():
        output = model(**inputs).logits

    return output.item()

prompt = "How are you?"
answer = "I'm doing great! Thank you for asking. How can I help you today?"
    
score = get_reward_score(prompt, answer)
print(f"Prompt: {prompt}")
print(f"Answer: {answer}")
print(f"Reward Score: {score}")