---
title: VARCO Arena
emoji: 🔥
colorFrom: pink
colorTo: yellow
sdk: streamlit
sdk_version: 1.40.2
app_file: app.py
pinned: false
license: cc-by-4.0
short_description: VARCO Arena is a reference-free LLM benchmarking approach
---
# Varco Arena
Varco Arena runs a tournament among the models being compared for each instruction in the test set, ranking the models accurately at an affordable cost. This is more accurate and cost-effective than estimating win rates against reference outputs.
For more information, the following resources may help you understand how it works.
* [Paper](https://huggingface.co/papers/2411.01281)
* [Blog Post (KR)](https://ncsoft.github.io/ncresearch/12cc62c1ea0d981971a8923401e8fe6a0f18563d)
## Quickstart
### Running the Web Demo Locally (Streamlit, recommended!)
```bash
git clone [THIS_REPO]
# install the requirements listed below; we recommend miniforge for managing the environment
cd streamlit_app_local
bash run.sh
```
For more details, see `[THIS_REPO]/streamlit_app_local/README.md`
### CLI use
* The CLI tool lives at `varco_arena/`
* Debug configurations for VS Code are at `varco_arena/.vscode`
```bash
## gpt-4o-mini as the judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## a vLLM-served (OpenAI-compatible) LLM as the judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# debugging lines
## debug with an OpenAI API judge
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other test inputs
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge (checks for errors without making API requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug
```
## Requirements
We tested this in a `python = 3.11.9` environment. Dependencies are listed in `requirements.txt`:
```
openai>=1.17.0
munch
pandas
numpy
tqdm>=4.48.0
plotly
scikit-learn
kaleido
tiktoken>=0.7.0
pyyaml
transformers
streamlit>=1.40.2
openpyxl
fire==0.6.0
git+https://github.com/shobrook/openlimit.git#egg=openlimit # do not install this from PyPI
# Linux
uvloop
# Windows
winloop
```
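If you want to reproduce the tested environment, a minimal setup could look like the sketch below (assuming miniforge/conda, as recommended in the Quickstart; the environment name `varco-arena` is arbitrary, and `requirements.txt` is assumed to sit at the repository root):
```bash
# create and activate a python 3.11.9 environment (the name is arbitrary)
conda create -n varco-arena python=3.11.9 -y
conda activate varco-arena

# install the pinned dependencies
pip install -r requirements.txt
```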
#### Arguments
- `-i, --input` : directory containing the input jsonlines files (LLM outputs)
- `-o, --output_dir` : directory where the results will be written
- `-e, --evaluation` : judge model specification (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", \[vllm-served-model-name\])
- `-k, --openai_api_key` : OpenAI API key
- `-u, --openai_url` : URL of an OpenAI-compatible LLM server (queried via the openai SDK)
#### Advanced
- `-j, --n_jobs` : number of concurrent jobs, passed to `asyncio.Semaphore(n=)`
- `-p, --evalprompt` : evaluation prompt to use ([see the `*.yaml` files in the prompts directory](./varco_arena/prompts/))
- `-lr, --limit_requests` : request rate limit for the vLLM OpenAI-compatible server (default: 7,680)
- `-lt, --limit_tokens` : token rate limit for the vLLM OpenAI-compatible server (default: 15,728,640)
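Putting the basic and advanced options together, an invocation against a vLLM-served judge might look like the sketch below. The placeholders follow the same convention as the examples above, and the concrete values for `-j`, `-lr`, and `-lt` are purely illustrative (the last two just restate the defaults):
```bash
python main.py \
    -i "./some/dirpath/to/jsonl/files" \
    -o SOME_REL_PATH_TO_CREATE \
    -e SOME_MODEL_NAME_SERVED \
    -m tournament \
    -u "http://url_to/your/vllm_openai_server:someport" \
    -j 8 \
    -lr 7680 \
    -lt 15728640
```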
#### Input Data Format
See the [input jsonl guide](./streamlit_app_local/guide_mds/input_jsonls_en.md).
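As a rough illustration (not the authoritative schema; follow the guide linked above), each jsonl file is assumed here to hold one model's outputs, with every line carrying at least the `instruction`, `source`, and `generated` fields mentioned in the FAQ below. The file name is hypothetical:
```bash
# hypothetical example file; consult the linked guide for the exact schema
cat > model_a_outputs.jsonl << 'EOF'
{"instruction": "Summarize the following article in one sentence.", "source": "(article text)", "generated": "(model A's one-sentence summary)"}
{"instruction": "Write a haiku about autumn.", "source": "(optional grounding text)", "generated": "(model A's haiku)"}
EOF
```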
## Contributing & Customizing
#### Do this after cloning and installing
```bash
pip install pre-commit
pre-commit install
```
#### Before committing
```bash
bash precommit.sh # the black formatter will reformat the code
```
## FAQ
* I want to use my own judge prompt with Varco Arena.
  * [`./varco_arena/prompts/`](./varco_arena/prompts/__init__.py) defines the prompts with `yaml` files and the corresponding class objects. Edit them as needed.
* I want a different judge prompt for each row of the test set (e.g. rows up to 100 use `prompt1`, rows from 101 on use `prompt2`).
  * `load_prompt` at the link above takes `promptname` + `task` as parameters to load a prompt. The function is called in [`./varco_arena/manager.py:async_run`](./varco_arena/manager.py).
* I want extra fields in my LLM output jsonl files, i.e. fields beyond `instruction`, `source`, and `generated`.
  * This gets a bit tricky, but here is a brief guide.
  * You will likely need to edit `async_eval_w_prompt` in `varco_arena/eval_utils.py` (the part that calls `PROMPT_OBJ.complete_prompt()`).
  * All related code will then require revision.
## Special Thanks to (contributors)
- Minho Lee (@Dialogue Model Team, NCSOFT) [github](https://github.com/minolee/)
- query wrapper
- RAG prompt
- Jumin Oh (@Generation Model Team, NCSOFT)
- rapid prototyping of the overall system
## Citation
If you found our work helpful, consider citing our paper!
[arxiv](https://arxiv.org/abs/2411.19103v1)
```
@misc{son2024varcoarenatournamentapproach,
title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
year={2024},
eprint={2411.01281},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.01281},
}
```