Evaluation of Absolute Zero Reasoner (AZR) on Math Benchmarks
Requirements
You can install the required packages with the following commands:
# Create a new conda environment
conda create -n azr_eval python=3.10.16 -y
conda activate azr_eval
# Install latex2sympy
cd evaluation/math_eval
tar -xzvf latex2sympy.tar.gz
cd eval/latex2sympy
pip install -e .
cd ../..
# Install other packages. Note that `requirements.txt` does not pin package versions. You can use `freezed_requirements.txt` to install the frozen versions instead, but it may include some unused packages.
pip install -r requirements.txt
# Install flash-attn
pip install flash_attn==2.7.4.post1
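Before running the full evaluation, a quick import check can confirm the environment is usable. The snippet below only covers `torch` and `flash_attn` (the packages installed explicitly above); the remaining dependencies come from `requirements.txt`.
# Optional sanity check (assumes the azr_eval environment is active)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"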
Evaluation
First, log into Hugging Face and download the models to be evaluated (if you have not downloaded them already):
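If you are not already authenticated, you can log in from the command line; the following opens an interactive prompt that asks for your Hugging Face access token:
huggingface-cli login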
# Download 3B Coder model
huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-3b --local-dir-use-symlinks False
# Download 7B Coder model
huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-7b --local-dir-use-symlinks False
# Download 7B Base model
huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-7b --local-dir-use-symlinks False
# Download 14B Coder model
huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-14b --local-dir-use-symlinks False
# Download 14B Base model
huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-14b --local-dir-use-symlinks False
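The commands above download into the default Hugging Face cache (`~/.cache/huggingface/hub`), which is the layout the snapshot path in the evaluation command below assumes. If you prefer an explicit directory, you can pass `--local-dir` instead and point `--init_model` at that directory; the path below is only illustrative:
# Example: download the 7B Base model into a local folder of your choice
huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-7b --local-dir ./models/AZR-Base-7b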
Use the following command to evaluate the AZR Base 7B model on 6 benchmarks with greedy decoding. There is also a `run.sh` script to evaluate all models on all benchmarks.
bash eval_math_nodes.sh \
--run_name azr_base_7b_seed2 \
--init_model $(ls -d ~/.cache/huggingface/hub/models--andrewzh2--Absolute_Zero_Reasoner-Base-7b/snapshots/*) \
--template azr \
--tp_size 1 \
--add_step_0 true \
--temperature 0 \
--top_p 0.95 \
--max_tokens 16000 \
--benchmarks aime24,aime25,amc23,math500,olympiadbench,minerva_math \
--n_sampling 1 \
--just_wandb false \
--seed 2
Notes:
- The `--init_model` argument must be the absolute path to your model directory. If you downloaded the models to a different directory, change it accordingly (be careful with "andrewzh" vs. "andrewzh2" in the path).
- Change `--template` if you are testing other models; it controls the prompt template used for the evaluation.
- Full list of benchmarks tested: `aime24,aime25,amc23,math500,olympiadbench,minerva_math`. See the datasets under `data/` for other possible benchmarks, and change `--benchmarks` to test them.
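To evaluate more than one checkpoint with the same settings, a simple loop over the downloaded snapshots is enough as a sketch. The run names below are illustrative and assume the default Hugging Face cache layout; `run.sh` in this directory remains the maintained way to evaluate all models.
# Sketch: evaluate the 7B and 14B Base models with the same greedy-decoding settings
for size in 7b 14b; do
  bash eval_math_nodes.sh \
    --run_name azr_base_${size}_seed2 \
    --init_model $(ls -d ~/.cache/huggingface/hub/models--andrewzh2--Absolute_Zero_Reasoner-Base-${size}/snapshots/*) \
    --template azr \
    --tp_size 1 \
    --add_step_0 true \
    --temperature 0 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks aime24,aime25,amc23,math500,olympiadbench,minerva_math \
    --n_sampling 1 \
    --just_wandb false \
    --seed 2
done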
Acknowledgement
The codebase is adapted from simpleRL-reason, which was based on math-evaluation-harness.