Commit 55aeee4 by Jae-Won Chung
1 Parent(s): 8595b18

More explanations, default plot, compute average

Files changed:
- LEADERBOARD.md +24 -11
- app.py +30 -7
- data/2023-06-17/score.csv +21 -21
LEADERBOARD.md
CHANGED

@@ -6,14 +6,16 @@ The energy consumption of running inference on a model will depend on factors s
However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** is different because different models have **different verbosity**.
That is, when asked the same thing, different models answer in different lengths.

## Columns

- `gpu`: NVIDIA GPU model name. Note that NLP evaluation was only run once on our A40 GPUs, so this column only changes system-level measurements like latency and energy.
- `task`: Name of the task. See *Tasks* below for details.
- `energy_efficiency`: The average NLP evaluation metric attained per Joule of energy (see the sketch after this list).
- `energy` (J): The average energy consumed by the model to generate a response.
- `nlp_average`: The arithmetic average of the NLP evaluation metrics we obtained. See *NLP evaluation metrics* below for details.
- `throughput` (token/s): The average number of tokens generated per second.
- `latency` (s): The average time it took for the model to generate a response.
- `response_length` (token): The average number of tokens in the model's response.
- `parameters`: The number of parameters the model has, in units of billion.
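
A minimal sketch of how the two derived columns are computed, mirroring the `app.py` change in this commit. The three scores are `lmsys/vicuna-7B`'s row in `score.csv`; the energy value is hypothetical:

```python
# nlp_average: arithmetic mean of the NLP metrics; energy_efficiency: nlp_average per Joule.
arc, hellaswag, truthfulqa = 53.5, 77.5, 49.0      # lmsys/vicuna-7B row in score.csv
nlp_average = (arc + hellaswag + truthfulqa) / 3   # 60.0

energy = 1500.0                                    # hypothetical average Joules per response
energy_efficiency = nlp_average / energy           # 0.04 NLP score points per Joule
```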

## Tasks

@@ -27,6 +29,9 @@ For each task, every model uses the same system prompt. We still account for dif
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |

You can see that response length is shorter on average for the `-concise` variants of the tasks.
This affects the number of decoding iterations the model has to run in order to finish responding, thus affecting latency and energy consumption per prompt.
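
As a rough intuition (a simplified sketch, not the leaderboard's measurement code): autoregressive decoding runs one forward pass per generated token, so per-prompt latency and energy grow roughly linearly with response length.

```python
# Back-of-the-envelope only; the leaderboard's real numbers come from GPU-side measurement.
def rough_energy_per_prompt(energy_per_token_joules: float, response_length_tokens: int) -> float:
    """Estimate per-prompt energy as energy-per-token times number of generated tokens."""
    return energy_per_token_joules * response_length_tokens

# 0.5 J/token is a hypothetical figure, not a measured one.
print(rough_energy_per_prompt(0.5, 200))  # 100.0 J for a verbose answer
print(rough_energy_per_prompt(0.5, 50))   # 25.0 J for a concise answer
```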

## Setup

Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).

@@ -34,12 +39,14 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
### Software

- PyTorch 2.0.1
- [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement (see the sketch after this list)
- [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
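
A minimal sketch of measuring per-response GPU time and energy with Zeus, assuming its `ZeusMonitor` measurement-window API; the leaderboard's actual measurement lives in the benchmark script linked above:

```python
from zeus.monitor import ZeusMonitor

# Assumes the ZeusMonitor window API; exact attribute names may differ across Zeus versions.
monitor = ZeusMonitor(gpu_indices=[0])

monitor.begin_window("generate")
# ... run one generation request here, e.g. model.generate(...) ...
measurement = monitor.end_window("generate")

print(measurement.time)          # seconds elapsed inside the window
print(measurement.total_energy)  # GPU energy consumed inside the window, in Joules
```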

### Hardware

- NVIDIA A40 GPU
- NVIDIA A100 GPU

### Parameters

@@ -61,18 +68,24 @@ We used identical system prompts for all models (while respecting their own *rol
```
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
```

## NLP evaluation metrics

- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
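
A hedged sketch of collecting these three metrics with the pinned lm-evaluation-harness commit. The `simple_evaluate` call, the `hf-causal` model type, the task names, and the model identifier below are assumptions about that harness version, not something taken from this repository:

```python
from lm_eval import evaluator

# Task names and few-shot counts assumed to correspond to the list above.
tasks_and_shots = {"arc_challenge": 25, "hellaswag": 10, "truthfulqa_mc": 0}

for task, shots in tasks_and_shots.items():
    results = evaluator.simple_evaluate(
        model="hf-causal",                             # assumed model type name
        model_args="pretrained=lmsys/vicuna-7b-v1.3",  # placeholder model identifier
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```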

## Upcoming

- More optimized inference runtimes, like TensorRT.
- More GPU models, like V100.
- More models, like RWKV.

# License

This leaderboard is a research preview intended for non-commercial use only.
The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries/reports of potential violation to Jae-Won Chung.

# Acknowledgements

We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 80GB GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
app.py
CHANGED

@@ -19,9 +19,9 @@ class TableManager:
        """Load leaderboard data from CSV files in data_dir."""
        # Load and merge CSV files.
        df = self._read_tables(data_dir)

        # Add the #params column.
        models = json.load(open(f"{data_dir}/models.json"))
        df["parameters"] = df["model"].apply(lambda x: models[x]["params"])

        # Make the first column (model) an HTML anchor to the model's website.

@@ -34,8 +34,8 @@ class TableManager:
        )
        df["model"] = df["model"].apply(format_model_link)

        # Sort by our 'energy efficiency' score.
        df = df.sort_values(by="energy_efficiency", ascending=True)

        # The full table where all the data are.
        self.full_df = df

@@ -48,6 +48,11 @@ class TableManager:
        """Read tables."""
        df_score = pd.read_csv(f"{data_dir}/score.csv")

        # Compute average NLP metrics
        columns = df_score.columns.to_list()
        columns.remove("model")
        df_score["nlp_average"] = df_score[columns].mean(axis=1)

        with open(f"{data_dir}/schema.yaml") as file:
            self.schema: dict[str, list] = yaml.safe_load(file)

@@ -66,7 +71,24 @@ class TableManager:
        if res_df.empty:
            raise ValueError(f"No benchmark CSV files were read from {data_dir=}.")

        df = pd.merge(res_df, df_score, on=["model"])

        # Energy efficiency is defined as the amount of average NLP performance
        # the model gets per Joule of energy.
        df["energy_efficiency"] = df["nlp_average"] / df["energy"]

        # Order columns.
        columns = df.columns.to_list()
        cols_to_order = ["model"]
        cols_to_order.extend(self.schema.keys())
        cols_to_order.extend(["energy_efficiency", "energy", "nlp_average"])
        columns = cols_to_order + [col for col in columns if col not in cols_to_order]
        df = df[columns]

        # Delete rows with *any* NaN values.
        df = df.dropna()

        return df.round(2)

    def _format_msg(self, text: str) -> str:
        """Formats into HTML that prints in Monospace font."""

@@ -111,8 +133,8 @@ class TableManager:
    def get_dropdown(self):
        columns = self.full_df.columns.tolist()[1:]  # include gpu and task in the dropdown
        return [
            gr.Dropdown("nlp_average", choices=columns, label="X"),
            gr.Dropdown("energy_efficiency", choices=columns, label="Y"),
            gr.Dropdown(choices=columns, label="Z (optional)"),
        ]

@@ -320,7 +342,8 @@ with block:
        plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
        plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
    with gr.Row():
        # By default show a plot of average model quality vs energy consumption.
        plot = gr.Plot(global_tbm.plot_scatter("600", "600", "gpu", "nlp_average", "energy")[0])
    with gr.Row():
        plot_message = gr.HTML("")
    add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns)  # type: ignore
data/2023-06-17/score.csv
CHANGED
@@ -1,21 +1,21 @@
model,arc,hellaswag,truthfulqa
lmsys/vicuna-7B,53.5,77.5,49.0
lmsys/vicuna-13B,52.9,80.1,51.8
tatsu-lab/alpaca-7B,52.6,76.9,39.6
metaai/llama-7B,51.1,77.7,34.1
metaai/llama-13B,56.3,80.9,39.9
camel-ai/CAMEL-13B-Combined-Data,55.5,79.3,47.3
BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN
databricks/dolly-v2-12b,42.2,71.8,33.4
FreedomIntelligence/phoenix-inst-chat-7b,45.0,63.2,47.1
h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,36.9,61.6,37.9
lmsys/fastchat-t5-3b-v1.0,35.9,46.4,48.8
Neutralzz/BiLLa-7B-SFT,27.7,26.0,49.0
nomic-ai/gpt4all-13b-snoozy,56.1,78.7,48.4
openaccess-ai-collective/manticore-13b-chat-pyg,58.7,82.0,48.9
OpenAssistant/oasst-sft-1-pythia-12b,45.6,69.9,39.2
project-baize/baize-v2-7B,48.5,75.0,41.7
BAIR/koala-7b,47.1,73.7,46.0
BAIR/koala-13b,52.9,77.5,50.1
StabilityAI/stablelm-tuned-alpha-7b,31.9,53.6,40.2
togethercomputer/RedPajama-INCITE-7B-Chat,42.2,70.8,36.1
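
A small usage sketch of how this score file feeds the `nlp_average` column computed in `app.py` above; the file path is relative to the repository root:

```python
import pandas as pd

df_score = pd.read_csv("data/2023-06-17/score.csv")
metric_cols = [c for c in df_score.columns if c != "model"]

# Same computation as app.py: arithmetic mean of arc, hellaswag, and truthfulqa.
df_score["nlp_average"] = df_score[metric_cols].mean(axis=1)

# The all-NaN RWKV row stays NaN here and is later dropped by df.dropna() in app.py,
# so that model does not appear in the leaderboard table.
print(df_score.loc[0, ["model", "nlp_average"]])  # lmsys/vicuna-7B -> 60.0
```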