Amber Tanaka
committed on
Add About page (#15)
- about.py +65 -0
- app.py +4 -1
- content.py +4 -0
about.py
ADDED
@@ -0,0 +1,65 @@
+import gradio as gr
+
+
+with gr.Blocks() as demo:
+    gr.Markdown(
+        """
+        ## About AstaBench
+        AstaBench is a best-in-class AI agents evaluation framework to measure scientific research abilities. AstaBench provides a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search.
+
+
+        **Why AstaBench?**
+        Newer benchmarks may test agentic AI and isolated aspects of scientific reasoning, but none rigorously measure agentic AI or capture the full range of skills research demands. Agents can appear effective by simply retrying tasks—often at high computational cost and with inconsistent results. Scientific AI needs evaluations that reflect the real complexity of research.
+
+        AstaBench fills that gap: a suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and standard tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
+
+
+        **What Does AstaBench Include?**
+        The suite includes over 8,000 tasks across 11 benchmarks, organized into four core categories:
+        - Literature Understanding
+        - Code & Execution
+        - Data Analysis
+        - End-to-End Discovery
+
+        🔍 Learn more in the AstaBench technical blog post
+
+
+        **Understanding the Leaderboards**
+        The AstaBench Main Leaderboard provides a high-level view of overall agent performance and efficiency:
+        - Overall score: A macro-average of the four category-level averages (equal weighting)
+        - Overall cost: Average cost per task, aggregated only across benchmarks with reported cost
+
+        Each category leaderboard provides:
+        - Average score and cost for that category
+        - A breakdown by individual benchmarks
+
+
+        **Scoring & Aggregation**
+        AstaBench encourages broad, honest evaluation. Here's how we handle scoring, cost, and partial results:
+
+        _Scores_
+        - Each benchmark returns an average score based on per-task accuracy
+        - Skipped benchmarks receive a score of 0.00
+        - All scores are aggregated upward using macro-averaging
+        - Partial completions are included (even with poor performance)
+
+        _Cost_
+        - Costs are reported in USD per task, based on values at the time of submission
+        - Benchmarks without cost data are excluded from cost averages
+        - In scatter plots, agents without cost are plotted far right and clearly marked
+        Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.
+
+
+        _Coverage_
+        - Main leaderboard: category coverage (X/4)
+        - Category view: benchmark coverage (X/Y)
+        - Incomplete coverage is flagged visually
+
+        These design choices ensure fair comparison while penalizing cherry-picking and omissions.
+
+
+        **Learn More**
+        - AstaBench technical blog post
+        - FAQ and submission guide
+        """, elem_id="about-content"
+    )
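The scoring, cost, and coverage rules described in the About text can be sketched in a few lines of plain Python. This is only an illustration, not the leaderboard's actual code: the benchmark names, the `results` layout, and the helpers `category_summary` and `overall_summary` are hypothetical. It also assumes that overall cost is averaged over categories that report a cost, and that a category counts as covered once at least one of its benchmarks was attempted; the About text does not spell out either detail.

```python
from statistics import mean

# Hypothetical benchmark names; the real suite has 11 benchmarks
# spread over these four categories.
CATEGORIES = {
    "Literature Understanding": ["lit_a", "lit_b"],
    "Code & Execution": ["code_a"],
    "Data Analysis": ["data_a"],
    "End-to-End Discovery": ["e2e_a", "e2e_b"],
}

# One agent's results: benchmark -> (score, cost in USD per task or None).
# Benchmarks absent from the dict were skipped.
results = {
    "lit_a": (0.80, 0.50),
    "lit_b": (0.60, None),   # completed, but no cost reported
    "code_a": (0.40, 1.50),
    "e2e_a": (0.20, 3.00),
}


def category_summary(benchmarks, results):
    """Average score and cost for one category, plus X/Y benchmark coverage."""
    # Skipped benchmarks count as 0.00 toward the score.
    scores = [results[b][0] if b in results else 0.0 for b in benchmarks]
    # Benchmarks without cost data are left out of the cost average.
    costs = [
        results[b][1]
        for b in benchmarks
        if b in results and results[b][1] is not None
    ]
    covered = sum(b in results for b in benchmarks)
    return {
        "score": mean(scores),
        "cost": mean(costs) if costs else None,
        "coverage": (covered, len(benchmarks)),
    }


def overall_summary(categories, results):
    """Macro-average the category-level numbers with equal weighting."""
    per_cat = {
        name: category_summary(benchmarks, results)
        for name, benchmarks in categories.items()
    }
    # Assumption: overall cost averages only categories that have a cost.
    costs = [c["cost"] for c in per_cat.values() if c["cost"] is not None]
    # Assumption: a category is covered if any of its benchmarks was run.
    covered = sum(c["coverage"][0] > 0 for c in per_cat.values())
    return {
        "score": mean(c["score"] for c in per_cat.values()),
        "cost": mean(costs) if costs else None,
        "coverage": (covered, len(per_cat)),
        "categories": per_cat,
    }


summary = overall_summary(CATEGORIES, results)
print(round(summary["score"], 3), summary["cost"], summary["coverage"])
```

In this made-up example the skipped Data Analysis category contributes a score of 0.00 to the overall macro-average but nothing to the cost average, which is how skipping a category lowers an agent's score without making it look cheaper.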
app.py
CHANGED
@@ -4,7 +4,7 @@ import os
 
 from apscheduler.schedulers.background import BackgroundScheduler
 from huggingface_hub import HfApi
-import literature_understanding, main_page, c_and_e, data_analysis, e2e, submission
+import literature_understanding, main_page, c_and_e, data_analysis, e2e, submission, about
 
 from content import css
 
@@ -106,6 +106,9 @@ with demo.route("Data Analysis", "/data-analysis"):
 with demo.route("Discovery", "/discovery"):
     render_logo()
     e2e.demo.render()
+with demo.route("About", "/about"):
+    render_logo()
+    about.demo.render()
 with demo.route(" 🚀 Submit an Agent"):
     render_logo()
     submission.demo.render()
content.py
CHANGED
@@ -148,6 +148,10 @@ css = """
     font-size: 18px;
     max-width: 60%;
 }
+#about-content {
+    font-size: 18px;
+    max-width: 60%;
+}
 #category-intro {
     font-size: 18px;
     max-width: 60%;