Amber Tanaka committed
Commit dd5281a · unverified · 1 parent: b48ebfb

Add About page (#15)

Files changed (3)
  1. about.py +65 -0
  2. app.py +4 -1
  3. content.py +4 -0
about.py ADDED
@@ -0,0 +1,67 @@
+ import gradio as gr
+
+
+ with gr.Blocks() as demo:
+     gr.Markdown(
+         """
+ ## About AstaBench
+ AstaBench is a best-in-class evaluation framework for measuring the scientific research abilities of AI agents. It poses a challenging new test: the first benchmark challenge that evaluates agents’ scientific abilities across a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search.
+
+
+ **Why AstaBench?**
+ Newer benchmarks test agentic AI on isolated aspects of scientific reasoning, but none rigorously capture the full range of skills that research demands. Agents can appear effective by simply retrying tasks, often at high computational cost and with inconsistent results. Scientific AI needs evaluations that reflect the real complexity of research.
+
+ AstaBench fills that gap: a suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. Its task-relevant leaderboards help scientists identify which agents best support their needs, while its standard execution environment and tools let AI developers compare their agents’ scientific reasoning against well-known baselines from the literature, including both open and closed LLM foundation models.
+
+
+ **What Does AstaBench Include?**
+ The suite includes over 8,000 tasks across 11 benchmarks, organized into four core categories:
+ - Literature Understanding
+ - Code & Execution
+ - Data Analysis
+ - End-to-End Discovery
+
+ 🔍 Learn more in the AstaBench technical blog post
+
+
+ **Understanding the Leaderboards**
+ The AstaBench Main Leaderboard provides a high-level view of overall agent performance and efficiency:
+ - Overall score: a macro-average of the four category-level averages (equal weighting)
+ - Overall cost: average cost per task, aggregated only across benchmarks with reported cost
+
+ Each category leaderboard provides:
+ - Average score and cost for that category
+ - A breakdown by individual benchmark
+
+
+ **Scoring & Aggregation**
+ AstaBench encourages broad, honest evaluation. Here's how we handle scoring, cost, and partial results:
+
+ _Scores_
+ - Each benchmark returns an average score based on per-task accuracy
+ - Skipped benchmarks receive a score of 0.00
+ - All scores are aggregated upward using macro-averaging
+ - Partial completions are included (even with poor performance)
+
+ _Cost_
+ - Costs are reported in USD per task, based on values at the time of submission
+ - Benchmarks without cost data are excluded from cost averages
+ - In scatter plots, agents without cost are plotted far right and clearly marked
+
+ Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and we are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.
+
+
+ _Coverage_
+ - Main leaderboard: category coverage (X/4)
+ - Category view: benchmark coverage (X/Y)
+ - Incomplete coverage is flagged visually
+
+ These design choices ensure fair comparison while penalizing cherry-picking and omissions.
+
+
+ **Learn More**
+ - AstaBench technical blog post
+ - FAQ and submission guide
+         """,
+         elem_id="about-content",
+     )
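The aggregation rules the About text describes (macro-averaged scores with skipped benchmarks counted as 0.00, cost averaged only over benchmarks that report it, and X/4 category coverage) can be sketched in plain Python. This is an illustrative sketch only; the function names and the benchmark/result data below are hypothetical and not part of this repository:

```python
# Hypothetical sketch of the leaderboard aggregation rules described above;
# the data structures and names are illustrative, not from this repo.

def category_score(results, benchmarks):
    """Macro-average over a category's benchmarks; skipped ones score 0.00."""
    return sum(results.get(b, {}).get("score", 0.0) for b in benchmarks) / len(benchmarks)

def overall_score(results, categories):
    """Macro-average of the four category-level averages (equal weighting)."""
    cats = [category_score(results, bms) for bms in categories.values()]
    return sum(cats) / len(cats)

def overall_cost(results):
    """Average cost per task, over benchmarks that report a cost only."""
    costs = [r["cost"] for r in results.values() if r.get("cost") is not None]
    return sum(costs) / len(costs) if costs else None

categories = {
    "Literature Understanding": ["lit-qa", "lit-search"],
    "Code & Execution": ["code-exec"],
    "Data Analysis": ["data-analysis"],
    "End-to-End Discovery": ["e2e-discovery"],
}
results = {
    "lit-qa": {"score": 0.8, "cost": 0.10},
    "lit-search": {"score": 0.6, "cost": 0.30},
    "code-exec": {"score": 0.5, "cost": None},  # no cost reported: excluded from cost average
    "data-analysis": {"score": 0.4, "cost": 0.20},
    # "e2e-discovery" skipped entirely -> its category scores 0.00
}

# Category coverage: how many of the 4 categories have at least one result.
coverage = sum(any(b in results for b in bms) for bms in categories.values())

print(round(overall_score(results, categories), 2))  # 0.4  -> (0.7 + 0.5 + 0.4 + 0.0) / 4
print(round(overall_cost(results), 2))               # 0.2  -> (0.10 + 0.30 + 0.20) / 3
print(f"{coverage}/4 coverage")                      # 3/4 coverage
```

Note how the skipped benchmark drags the macro-average down while the missing cost simply drops out of the cost average, matching the asymmetry the About text describes.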
app.py CHANGED
@@ -4,7 +4,7 @@ import os
 
 from apscheduler.schedulers.background import BackgroundScheduler
 from huggingface_hub import HfApi
-import literature_understanding, main_page, c_and_e, data_analysis, e2e, submission
+import literature_understanding, main_page, c_and_e, data_analysis, e2e, submission, about
 
 from content import css
 
@@ -106,6 +106,9 @@ with demo.route("Data Analysis", "/data-analysis"):
 with demo.route("Discovery", "/discovery"):
     render_logo()
     e2e.demo.render()
+with demo.route("About", "/about"):
+    render_logo()
+    about.demo.render()
 with demo.route(" 🚀 Submit an Agent"):
     render_logo()
     submission.demo.render()
content.py CHANGED
@@ -148,6 +148,10 @@ css = """
     font-size: 18px;
     max-width: 60%;
 }
+#about-content {
+    font-size: 18px;
+    max-width: 60%;
+}
 #category-intro {
     font-size: 18px;
     max-width: 60%;