metadata
language:
  - en
license: llama3.2
tags:
  - text-generation-inference
  - transformers
  - llama
  - trl
  - sft
  - reasoning
  - llama-3
base_model: CreitinGameplays/Llama-3.2-3b-Instruct-uncensored-refinetune
datasets:
  - KingNish/reasoning-base-20k
pipeline_tag: text-generation
library_name: transformers

Model Description

An uncensored reasoning Llama 3.2 3B model trained on reasoning data.

It was trained with improved training code, which yields better performance than the original Thea model.

This is a Thea 3B Update 1 model. The new features are:

  • Trained on more examples than the original Thea model.
  • Based on a different base model, which (hopefully) restores some of the previously lost accuracy.

This model has not yet been tested as a GGUF conversion. You can try that yourself by converting it with the GGUF My Repo space.
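
If you do convert it, a minimal llama-cpp-python sketch could look like the one below. This is an illustration only: the GGUF file name and quantization level are placeholders for a conversion you produce yourself, and the two-stage reasoning flow shown further down depends on the model's custom chat template, which may or may not survive GGUF conversion.

# Minimal GGUF inference sketch (pip install llama-cpp-python).
# NOTE: the .gguf file name is a placeholder for your own conversion.
from llama_cpp import Llama

llm = Llama(model_path="thea-3b-50r-u1-q4_k_m.gguf", n_ctx=4096)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Which is greater, 9.9 or 9.11?"}],
    max_tokens=512,
)
print(result["choices"][0]["message"]["content"])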

Here is the inference code you should use:

from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_REASONING_TOKENS = 1024
MAX_RESPONSE_TOKENS = 512

model_name = "lunahr/thea-3b-50r-u1"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Which is greater 9.9 or 9.11 ??"
messages = [
    {"role": "user", "content": prompt}
]

# Generate reasoning
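# NOTE: add_reasoning_prompt is not a standard transformers argument; extra keyword
# arguments to apply_chat_template are forwarded to the chat template, so this flag
# is assumed to be handled by this model's custom template.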
reasoning_template = tokenizer.apply_chat_template(messages, tokenize=False, add_reasoning_prompt=True)
reasoning_inputs = tokenizer(reasoning_template, return_tensors="pt").to(model.device)
reasoning_ids = model.generate(**reasoning_inputs, max_new_tokens=MAX_REASONING_TOKENS)
reasoning_output = tokenizer.decode(reasoning_ids[0, reasoning_inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("REASONING: " + reasoning_output)

# Generate answer
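# NOTE: "reasoning" is not a standard chat role; it is assumed to be defined by
# this model's custom chat template so the reasoning can be fed back as its own turn.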
messages.append({"role": "reasoning", "content": reasoning_output})
response_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response_inputs = tokenizer(response_template, return_tensors="pt").to(model.device)
response_ids = model.generate(**response_inputs, max_new_tokens=MAX_RESPONSE_TOKENS)
response_output = tokenizer.decode(response_ids[0, response_inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("ANSWER: " + response_output)

Intended Use

This model is intended as a replacement for OpenAI o1 on weaker hardware, mimicking o1's response formatting.

Limitations

This Llama model was trained with custom training code that is faster than Unsloth.

Visit https://www.kaggle.com/code/piotr25691/distributed-llama-training-with-2xt4 to learn how to fine-tune your models on both of the Kaggle-provided GPUs.
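
For orientation only, here is a minimal, hypothetical sketch of what a single-process fine-tune on the KingNish/reasoning-base-20k dataset could look like with the plain transformers Trainer (the linked notebook uses its own custom training code and TRL-style SFT, so expect it to differ). The dataset column names and hyperparameters are assumptions, and memory-saving measures that 16 GB T4s would realistically require (LoRA, quantized optimizers, etc.) are omitted for brevity.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "CreitinGameplays/Llama-3.2-3b-Instruct-uncensored-refinetune"

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Assumed column names; check the dataset card for the real ones.
dataset = load_dataset("KingNish/reasoning-base-20k", split="train")

def tokenize(example):
    text = f"{example['user']}\n{example['reasoning']}\n{example['assistant']}"
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="thea-3b-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    num_train_epochs=1,
    fp16=True,  # Kaggle T4s do not support bf16
    logging_steps=10,
)

# With both Kaggle GPUs visible, Trainer wraps the model in torch.nn.DataParallel
# automatically; no torchrun/accelerate launch command is needed inside a notebook.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()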