	# Evaluation of Absolute Zero Reasoner (AZR) on Math Benchmarks

	### Requirements
	You can install the required packages with the following command:
	```bash
	# Create a new conda environment
	conda create -n azr_eval python=3.10.16 -y
	conda activate azr_eval

	# Install latex2sympy
	cd evaluation/math_eval
	tar -xzvf latex2sympy.tar.gz
	cd eval/latex2sympy
	pip install -e .
	cd ../..
	# Install other packages. Note that `requirements.txt` does not pin package versions. To install pinned versions, use `freezed_requirements.txt`, which may include some unused packages.
	pip install -r requirements.txt

	# Install flash-attn
	pip install flash_attn==2.7.4.post1
	```


	### Evaluation

	First, log in to Hugging Face and download the models you want to evaluate (if you have not downloaded them yet):

	```bash
	# Download 3B Coder model
	huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-3b --local-dir-use-symlinks False

	# Download 7B Coder model
	huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-7b --local-dir-use-symlinks False

	# Download 7B Base model
	huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-7b --local-dir-use-symlinks False

	# Download 14B Coder model
	huggingface-cli download andrewzh/Absolute_Zero_Reasoner-Coder-14b --local-dir-use-symlinks False

	# Download 14B Base model
	huggingface-cli download andrewzh2/Absolute_Zero_Reasoner-Base-14b --local-dir-use-symlinks False

	```
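
The evaluation command below passes a path inside the Hugging Face cache, so it can help to see how a repo ID maps to its cache directory. The sketch below assumes the default cache root `~/.cache/huggingface/hub` and the `models--<org>--<name>` naming that appears in that path. `hf_cache_dir` is a hypothetical helper, not part of this repository.

```bash
# Hypothetical helper (not part of this repo): print the cache directory
# that a downloaded model repo is expected to occupy.
# Assumes the default cache root; pass a second argument if yours differs.
hf_cache_dir() {
  repo_id=$1
  cache_root=${2:-"$HOME/.cache/huggingface/hub"}
  # "org/name" becomes "models--org--name"
  printf '%s/models--%s\n' "$cache_root" "$(printf '%s' "$repo_id" | sed 's|/|--|g')"
}

# Note the different owners: Base models use "andrewzh2", Coder models "andrewzh".
hf_cache_dir andrewzh2/Absolute_Zero_Reasoner-Base-7b
hf_cache_dir andrewzh/Absolute_Zero_Reasoner-Coder-7b
```

If you downloaded the models with `--local-dir` or a custom cache location, this mapping does not apply and you should point `--init_model` at your own directory.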

	Use the following script to evaluate the AZR 7B Base model on six benchmarks with greedy decoding (`--temperature 0`, `--n_sampling 1`). There is also a `run.sh` script that evaluates all models on all benchmarks.

	```bash
	bash eval_math_nodes.sh \
	--run_name azr_base_7b_seed2 \
	--init_model $(ls -d ~/.cache/huggingface/hub/models--andrewzh2--Absolute_Zero_Reasoner-Base-7b/snapshots/*) \
	--template azr \
	--tp_size 1 \
	--add_step_0 true \
	--temperature 0 \
	--top_p 0.95 \
	--max_tokens 16000 \
	--benchmarks aime24,aime25,amc23,math500,olympiadbench,minerva_math \
	--n_sampling 1 \
	--just_wandb false \
	--seed 2
	```


	Notes:
	- `--init_model` must be the absolute path to your model directory. If you downloaded the models to a different directory, change it accordingly. Be careful with "andrewzh" and "andrewzh2" in the path: the Coder models are under `andrewzh`, and the Base models are under `andrewzh2`.
	- You should change `--template` if you are testing other models. It controls the prompt template used for the evaluation.
	- Full list of benchmarks tested: `aime24,aime25,amc23,math500,olympiadbench,minerva_math`. See the datasets under `data/` for other possible benchmarks.
	- You can change `--benchmarks` to test other benchmarks.
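
The `ls -d` substitution used for `--init_model` expands to several paths if the cache holds more than one snapshot of a model, and to nothing useful if the model was never downloaded. A minimal sketch of a stricter lookup follows. `resolve_snapshot` is a hypothetical helper, not part of this repository, and it assumes the `snapshots/` layout shown in the command above.

```bash
# Hypothetical helper (not part of this repo): print the single snapshot
# directory of a cached model, or fail if there is none or more than one.
resolve_snapshot() {
  repo_dir=$1
  count=0
  found=
  for d in "$repo_dir"/snapshots/*/; do
    # An unmatched glob stays literal, so check that it is a real directory.
    if [ -d "$d" ]; then
      count=$((count + 1))
      found=${d%/}
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one snapshot under $repo_dir, found $count" >&2
    return 1
  fi
  printf '%s\n' "$found"
}

# Possible usage in place of the `ls -d` substitution:
#   --init_model "$(resolve_snapshot "$HOME/.cache/huggingface/hub/models--andrewzh2--Absolute_Zero_Reasoner-Base-7b")" \
```

Failing early here is a design choice: a clear error about the snapshot count is easier to act on than an evaluation run that starts with a malformed model path.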


	## Acknowledgement
	The codebase is adapted from [simpleRL-reason](https://github.com/hkust-nlp/simpleRL-reason), which was based on [math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness).