|
# Evaluation of Absolute Zero Reasoner (AZR) on Math Benchmarks |
|
|
|
### Requirements |
|
You can install the required packages with the following command: |
|
```bash |
|
# Create a new conda environment |
|
conda create -n azr_eval python=3.10.16 -y |
|
conda activate azr_eval |
|
|
|
# Install latex2sympy |
|
cd evaluation/math_eval |
|
tar -xzvf latex2sympy.tar.gz |
|
cd eval/latex2sympy |
|
pip install -e . |
|
cd ../.. |
|
# Install other packages. Note the `requirements.txt` doesn't limit packages versions. You can use `freezed_requirements.txt` to install all freezed versions but might include some unused packages. |
|
pip install -r requirements.txt |
|
|
|
# Install flash-attn |
|
pip install flash_attn==2.7.4.post1 |
|
``` |
|
|
|
|
|
### Evaluation |
|
|
|
First log into huggingface and download the models to be evaluated (if you have not downloaded them yet): |
|
|
|
```bash |
|
# Download 3B Coder model |
|
huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-3b --local-dir-use-symlinks False |
|
|
|
# Download 7B Coder model |
|
huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-7b --local-dir-use-symlinks False |
|
|
|
# Download 7B Base model |
|
huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-7b --local-dir-use-symlinks False |
|
|
|
# Download 14B Coder model |
|
huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-14b --local-dir-use-symlinks False |
|
|
|
# Download 14B Base model |
|
huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-14b --local-dir-use-symlinks False |
|
|
|
``` |
|
|
|
Use the following script to evaluate AZR 7b on 6 benchmark with greedy decoding. There is also a `run.sh` script to evaluate all models on all benchmarks. |
|
|
|
```bash |
|
bash eval_math_nodes.sh \ |
|
--run_name azr_base_7b_seed2 \ |
|
--init_model $(ls -d ~/.cache/huggingface/hub/models--andrewzh2--Absolute_Zero_Reasoner-Base-7b/snapshots/*) \ |
|
--template azr \ |
|
--tp_size 1 \ |
|
--add_step_0 true \ |
|
--temperature 0 \ |
|
--top_p 0.95 \ |
|
--max_tokens 16000 \ |
|
--benchmarks aime24,aime25,amc23,math500,olympiadbench,minerva_math \ |
|
--n_sampling 1 \ |
|
--just_wandb false \ |
|
--seed 2 |
|
``` |
|
|
|
|
|
**Notes:** |
|
- The `--init_model` must be the **absolute path** to your model directory. If you have downloaded them in a different directory, you should change it (be careful wiht "andrewzh" and "andrewzh2" in the path). |
|
- You should change `--template` if you are testing other models. It controls the prompt template used for the evaluation. |
|
- Full list of benchmarks tested: `aime24,aime25,amc23,math500,olympiadbench,minerva_math`. See dataset under `data/` for other possible benchmarks. |
|
- You can change `--benchmarks` to test other benchmarks. |
|
|
|
|
|
## Acknowledgement |
|
The codebase is adapted from [simpleRL-reason](https://github.com/hkust-nlp/simpleRL-reason), which was based on [math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness). |