Create README.md
README.md
ADDED
@@ -0,0 +1,221 @@
---
license: apache-2.0
---
# lmarena-ai/p2l-360m-rk-01132025

Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance.
To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt.
The core idea is to train an LLM that takes a natural-language prompt as input and outputs a vector of coefficients, which are then used to predict the human preference vote.
The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses.
Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard.

**Paper**: [Prompt-to-Leaderboard](https://arxiv.org/abs/2502.14855)

**Code**: [lmarena/p2l](https://github.com/lmarena/p2l)

This particular P2L model has a *Rao-Kupper* regression head, which we define below:

$$
\begin{equation}
g_{\theta^*(z)}(y ; x) =
\begin{cases}
    \sigma((x,-1)^\top \theta^*(z)) & y = \mathsf{B}, \\
    \sigma((-x,-1)^\top \theta^*(z)) & y = \mathsf{A}, \\
    1 - \sigma((-x,-1)^\top \theta^*(z)) - \sigma((x,-1)^\top \theta^*(z)) & y = \mathsf{tie}.
\end{cases}
\end{equation}
$$

More simply, given a prompt, P2L outputs a vector of coefficients $\vec{\beta}$ and a tie coefficient $\hat{\eta}$. The probability that model $i$ beats model $j$ is $P(i \succ j) = \sigma(\vec{\beta}_i - \vec{\beta}_j - \eta)$; likewise $P(j \succ i) = \sigma(\vec{\beta}_j - \vec{\beta}_i - \eta)$, and the tie probability is $P(i = j) = 1 - P(i \succ j) - P(j \succ i)$, where $\eta = \log(1 + e^{(\hat{\eta} - 22.5)/\beta})$.
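
As a minimal sketch of the formulas above, the following pure-Python function computes the three outcome probabilities from two coefficients and an already-transformed $\eta$. The name `rk_probabilities` and the example values are illustrative assumptions, not part of the P2L codebase. The sketch also assumes $\eta \ge 0$, which keeps the tie probability non-negative.

```python
import math


def sigmoid(t: float) -> float:
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def rk_probabilities(beta_i: float, beta_j: float, eta: float):
    """Return (P(i beats j), P(j beats i), P(tie)) for the Rao-Kupper form above.

    Hypothetical helper for illustration. `eta` is assumed to be the
    transformed, non-negative tie parameter, not the raw eta-hat.
    """
    p_i = sigmoid(beta_i - beta_j - eta)
    p_j = sigmoid(beta_j - beta_i - eta)
    return p_i, p_j, 1.0 - p_i - p_j


# Made-up coefficients for two models on a single prompt.
p_i, p_j, p_tie = rk_probabilities(beta_i=1.2, beta_j=0.4, eta=0.5)
print(f"P(i wins)={p_i:.3f}  P(j wins)={p_j:.3f}  P(tie)={p_tie:.3f}")
```

With $\eta = 0$ the tie probability vanishes, since $\sigma(t) + \sigma(-t) = 1$, and the expression reduces to the Bradley-Terry form.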

See Section 2.2 of our paper for more details on the various regression heads.

## Serving
To serve a P2L model, please see our documentation on GitHub: [Serving P2L](https://github.com/lmarena/p2l?tab=readme-ov-file#serving-p2l).

Note: the P2L model returns outputs with the following structure:

```python
class P2LOutputs(ModelOutput):
    coefs: torch.FloatTensor = None  # "betas" as described above
    eta: Optional[torch.FloatTensor] = None  # tie coefficient (not used for BT head)
    last_hidden_state: torch.FloatTensor = None  # last hidden state from the transformer
```

To see which coefficient index corresponds to which model, refer to the [`model_list.json`](./model_list.json) found in the repo of each P2L model. As a general rule, the models are listed in sorted order.

The easiest way to get this list from inside code is with the following:

```python
import json
from huggingface_hub import hf_hub_download

fname = hf_hub_download(
    repo_id="lmarena-ai/p2l-360m-rk-01132025", filename="model_list.json", repo_type="model"
)

with open(fname) as fin:
    model_list = json.load(fin)
```
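
Once `model_list` is loaded, a prompt-specific leaderboard can be read off by pairing each name with its coefficient and sorting. The sketch below uses made-up model names and coefficient values as stand-ins for `model_list` and one row of `coefs`; `build_leaderboard` is a hypothetical helper, not part of the P2L codebase.

```python
from typing import List, Sequence, Tuple


def build_leaderboard(
    model_names: Sequence[str], coefs: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each model name with its coefficient and sort best-first."""
    if len(model_names) != len(coefs):
        raise ValueError("expected one coefficient per model")
    return sorted(zip(model_names, coefs), key=lambda pair: pair[1], reverse=True)


# Placeholder values; in practice these would come from model_list.json
# and from one row of the model's `coefs` output (e.g. via `.tolist()`).
example_models = ["model-a", "model-b", "model-c"]
example_coefs = [0.1, 1.3, -0.4]

leaderboard = build_leaderboard(example_models, example_coefs)
for rank, (name, coef) in enumerate(leaderboard, start=1):
    print(rank, name, coef)
```

Under these assumptions, one simple routing rule is to send the prompt to the top-ranked entry.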

### Loading from Pretrained

To define and load the model:

```python
import torch
from transformers import (
    Qwen2Model,
    Qwen2PreTrainedModel,
    LlamaModel,
    LlamaPreTrainedModel,
    PreTrainedModel,
    AutoTokenizer,
)
from transformers.utils import ModelOutput
from dataclasses import dataclass
import torch.nn as nn
import torch.nn.functional as F
from typing import Dict, Tuple, Callable, Optional
from huggingface_hub import hf_hub_download
import json


@dataclass
class HeadOutputs(ModelOutput):
    coefs: torch.FloatTensor = None
    eta: Optional[torch.FloatTensor] = None
    gamma: Optional[torch.FloatTensor] = None


@dataclass
class P2LOutputs(ModelOutput):
    coefs: torch.FloatTensor = None
    eta: Optional[torch.FloatTensor] = None
    gamma: Optional[torch.FloatTensor] = None
    loss: Optional[torch.FloatTensor] = None
    last_hidden_state: torch.FloatTensor = None


class BTHead(nn.Module):
    def __init__(
        self, input_dim, output_dim, linear_head_downsize_factor=None, **kwargs
    ) -> None:
        super().__init__()

        if linear_head_downsize_factor:
            # Optional two-layer head with a narrower inner dimension.
            inner_dim = int(output_dim // linear_head_downsize_factor)
            self.head = nn.Sequential(
                nn.Linear(in_features=input_dim, out_features=inner_dim, bias=True),
                nn.Linear(in_features=inner_dim, out_features=output_dim, bias=True),
            )
        else:
            self.head = nn.Linear(
                in_features=input_dim, out_features=output_dim, bias=True
            )

    def forward(self, last_hidden_dim: torch.Tensor):
        # As written, this head returns only `coefs`; `eta` and `gamma` stay None.
        coefs = self.head(last_hidden_dim)
        return HeadOutputs(coefs=coefs)


class P2LModel(LlamaPreTrainedModel):
    def __init__(
        self,
        config,
        CLS_id,
        num_models,
        head_kwargs={},
        **kwargs,
    ):
        super().__init__(config)

        self.num_models = num_models
        self.cls_token_id = CLS_id

        self.model = LlamaModel(config)

        self.head = BTHead(
            input_dim=config.hidden_size,
            output_dim=self.num_models,
            **head_kwargs,
        )

        self.post_init()

    def freeze_transformer(self):
        for param in self.model.parameters():
            param.requires_grad = False

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def forward(self, input_ids, attention_mask, labels=None, weights=None):
        batch_size = input_ids.shape[0]

        hidden_outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=False,
        ).last_hidden_state  # (bs, num_token, embed_dim)

        cls_mask = input_ids == self.cls_token_id

        # double check this is getting the current CLS token
        cls_hidden_dim = hidden_outputs[cls_mask]

        assert (
            cls_hidden_dim.shape[0] == batch_size
        ), f"input ids {input_ids.shape}, cls_mask {cls_mask.shape}, cls_logit {cls_hidden_dim.shape}"

        head_output = self.head(cls_hidden_dim)

        outputs = P2LOutputs(
            coefs=head_output.coefs,
            last_hidden_state=cls_hidden_dim,
            eta=head_output.eta,
            gamma=head_output.gamma,
        )

        return outputs


fname = hf_hub_download(
    repo_id="lmarena-ai/p2l-360m-rk-01132025", filename="model_list.json", repo_type="model"
)

with open(fname) as fin:
    model_list = json.load(fin)

tokenizer = AutoTokenizer.from_pretrained("lmarena-ai/p2l-360m-rk-01132025")
model = P2LModel.from_pretrained(
    "lmarena-ai/p2l-360m-rk-01132025",
    CLS_id=tokenizer.cls_token_id,
    num_models=len(model_list),
    torch_dtype=torch.bfloat16,
)
```
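
The `forward` pass above gathers hidden states with a boolean mask, so its assertion only checks that the total number of CLS tokens equals the batch size. A stricter per-sequence check can give a clearer error before calling the model. The sketch below works on plain lists of token IDs; `check_one_cls_per_row` and the token IDs are illustrative assumptions, not part of the P2L codebase.

```python
from typing import List, Sequence


def check_one_cls_per_row(batch_ids: Sequence[Sequence[int]], cls_id: int) -> List[int]:
    """Return the CLS position in each row, raising if a row has zero or several."""
    positions = []
    for row_index, row in enumerate(batch_ids):
        hits = [pos for pos, token in enumerate(row) if token == cls_id]
        if len(hits) != 1:
            raise ValueError(
                f"row {row_index} has {len(hits)} CLS tokens, expected exactly 1"
            )
        positions.append(hits[0])
    return positions


# Made-up token IDs; 99 plays the role of the CLS token and 0 is padding.
toy_batch = [[5, 6, 7, 99], [8, 99, 0, 0]]
cls_positions = check_one_cls_per_row(toy_batch, cls_id=99)
print(cls_positions)
```

With tensors, `input_ids.tolist()` produces the nested lists this helper expects.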

## Citation

```
@misc{frick2025prompttoleaderboard,
      title={Prompt-to-Leaderboard},
      author={Evan Frick and Connor Chen and Joseph Tennyson and Tianle Li and Wei-Lin Chiang and Anastasios N. Angelopoulos and Ion Stoica},
      year={2025},
      eprint={2502.14855},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.14855},
}
```