Amber Tanaka committed
Commit abb9f0a · unverified · parent: 1c65d8c

update About page layout (#49)

Files changed (2):
  1. about.py +77 -67
  2. content.py +14 -0
about.py CHANGED
@@ -2,72 +2,82 @@ import gradio as gr
 
 
 def build_page():
-    gr.Markdown(
-        """
-        ## About AstaBench
-        AstaBench is a novel AI agents evaluation framework, providing a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search. Asta’s set of standard tools makes it easy to build general-purpose science agents and to compare their performance in an apples-to-apples manner.
-
-
-        ## Why AstaBench?
-        Most current benchmarks test agentic AI and isolated aspects of scientific reasoning, but rarely evaluate AI agentic behavior rigorously or capture the full skill set scientific research requires. Agents can appear effective despite inconsistent results and high compute use, often outperforming others by consuming more resources. Advancing scientific AI requires evaluations that emphasize reproducibility, efficiency, and the real complexity of research.
-
-        AstaBench fills this gap: an agents evaluation framework and suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
-
-
-        ## What Does AstaBench Include?
-        AstaBench includes a rigorous agents evaluation framework and a suite of benchmarks consisting of over 2,400 problems across 11 benchmarks, organized into four core categories:
-        Literature Understanding
-        Code & Execution
-        Data Analysis
-        End-to-End Discovery
-        Plus: a large suite of integrated agents and leaderboards with results from extensive evaluation of agents and models.
-
-        🔍 Learn more in the AstaBench technical blog post
-
-
-        ## Understanding the Leaderboards
-        The AstaBench Overall Leaderboard provides a high-level view of overall agent performance and efficiency:
-        - Overall score: A macro-average of the four category-level averages (equal weighting)
-        - Overall cost: Average cost per task, aggregated only across benchmarks with reported cost
-
-        Each category leaderboard provides:
-        - Average score and cost for that category (macro-averaged across the benchmarks in the category)
-        - A breakdown by individual benchmarks
-
-
-        ## Scoring & Aggregation
-        AstaBench encourages careful, transparent evaluation. Here's how we handle scoring, cost, and partial results:
-
-        **Scores**
-        - Each benchmark returns an average score based on per-problem scores
-        - All scores are aggregated upward using macro-averaging
-        - Partial completions are included (even with poor performance)
-
-        **Cost**
-        - Costs are reported in USD per task.
-        - Benchmarks without cost data are excluded from cost averages
-        - In scatter plots, agents without cost are plotted to the far right and clearly marked.
-
-        Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.
-
-        **Coverage**
-        - Main leaderboard: category coverage (X/4)
-        - Category view: benchmark coverage (X/Y)
-        - Incomplete coverage is flagged visually
-
-        These design choices ensure fair comparison while penalizing cherry-picking and omissions.
-
-
-        ## Learn More
-        - AstaBench technical blog post
-        - FAQ and submission guide
-        """, elem_id="about-content"
-    )
-
-    # Floating feedback button
-    floating_feedback_button_html = """
-    <div>
-        <a id="feedback-button" href="https://docs.google.com/forms/d/e/1FAIpQLSfJdVkD62aPYh8XehN2FrSeHUWt488Ejc-QdtuZn5NZ3eNoxA/viewform">Have feedback?</a>
-    </div>
-    """
-    gr.HTML(floating_feedback_button_html)
+    with gr.Column(elem_id="about-page-content-wrapper"):
+        gr.Markdown(
+            """
+            ## About AstaBench
+            AstaBench is a novel AI agents evaluation framework, providing a challenging new test for AI agents: the first benchmark challenge that evaluates agents’ scientific abilities on a broad spectrum of research skills, including literature understanding, data analysis, planning, tool use, coding, and search. Asta’s set of standard tools makes it easy to build general-purpose science agents and to compare their performance in an apples-to-apples manner.
+            """)
+        gr.Markdown("---",elem_classes="divider-line")
+
+        gr.Markdown(""" ## Why AstaBench?
+        Most current benchmarks test agentic AI and isolated aspects of scientific reasoning, but rarely evaluate AI agentic behavior rigorously or capture the full skill set scientific research requires. Agents can appear effective despite inconsistent results and high compute use, often outperforming others by consuming more resources. Advancing scientific AI requires evaluations that emphasize reproducibility, efficiency, and the real complexity of research.
+
+        AstaBench fills this gap: an agents evaluation framework and suite of open benchmarks for evaluating scientific AI assistants on core scientific tasks that require novel reasoning. AstaBench helps scientists identify which agents best support their needs through task-relevant leaderboards, while giving AI developers a standard execution environment and tools to test the scientific reasoning capabilities of their agents compared to well-known baselines from the literature, including both open and closed LLM foundation models.
+        """)
+        gr.Markdown("---",elem_classes="divider-line")
+
+        gr.Markdown("""
+        ## What Does AstaBench Include?
+        AstaBench includes a rigorous agents evaluation framework and a suite of benchmarks consisting of over 2,400 problems across 11 benchmarks, organized into four core categories:
+        - Literature Understanding
+        - Code & Execution
+        - Data Analysis
+        - End-to-End Discovery
+        Plus: a large suite of integrated agents and leaderboards with results from extensive evaluation of agents and models.
+
+        🔍 Learn more in the AstaBench technical blog post
+        """)
+        gr.Markdown("---",elem_classes="divider-line")
+
+        gr.Markdown("""
+        ## Understanding the Leaderboards
+        The AstaBench Overall Leaderboard provides a high-level view of overall agent performance and efficiency:
+        - Overall score: A macro-average of the four category-level averages (equal weighting)
+        - Overall cost: Average cost per task, aggregated only across benchmarks with reported cost
+
+        Each category leaderboard provides:
+        - Average score and cost for that category (macro-averaged across the benchmarks in the category)
+        - A breakdown by individual benchmarks
+        """)
+        gr.Markdown("---",elem_classes="divider-line")
+
+        gr.Markdown("""
+        ## Scoring & Aggregation
+        AstaBench encourages careful, transparent evaluation. Here's how we handle scoring, cost, and partial results:
+
+        **Scores**
+        - Each benchmark returns an average score based on per-problem scores
+        - All scores are aggregated upward using macro-averaging
+        - Partial completions are included (even with poor performance)
+
+        **Cost**
+        - Costs are reported in USD per task.
+        - Benchmarks without cost data are excluded from cost averages
+        - In scatter plots, agents without cost are plotted to the far right and clearly marked.
+
+        Note: Cost values reflect pricing and infrastructure conditions at the time of each submission. We recognize that compute costs may change over time, and are actively working on methods to normalize cost data across submissions for fairer longitudinal comparisons.
+
+        **Coverage**
+        - Main leaderboard: category coverage (X/4)
+        - Category view: benchmark coverage (X/Y)
+        - Incomplete coverage is flagged visually
+
+        These design choices ensure fair comparison while penalizing cherry-picking and omissions.
+        """)
+        gr.Markdown("---",elem_classes="divider-line")
+
+        gr.Markdown("""
+        ## Learn More
+        - AstaBench technical blog post
+        - FAQ and submission guide
+        """
+        )
+
+    # Floating feedback button
+    floating_feedback_button_html = """
+    <div>
+        <a id="feedback-button" href="https://docs.google.com/forms/d/e/1FAIpQLSfJdVkD62aPYh8XehN2FrSeHUWt488Ejc-QdtuZn5NZ3eNoxA/viewform">Have feedback?</a>
+    </div>
+    """
+    gr.HTML(floating_feedback_button_html)
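The leaderboard copy above pins down the aggregation rules: each benchmark reports a per-problem average, each category macro-averages its benchmarks, the overall score macro-averages the four categories with equal weight, and cost is averaged only over benchmarks that report it. A minimal sketch of that logic for reference; the `BenchmarkResult` dataclass and the example numbers are illustrative, not AstaBench internals.

```python
# Illustrative sketch of the aggregation rules described in the About text.
# Data structures and example values are hypothetical, not AstaBench code.
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class BenchmarkResult:
    score: float               # average of per-problem scores on one benchmark
    cost_usd: Optional[float]  # average cost per task in USD, or None if not reported


def category_score(results: list[BenchmarkResult]) -> float:
    """Macro-average of benchmark scores within one category."""
    return mean(r.score for r in results)


def overall_score(categories: dict[str, list[BenchmarkResult]]) -> float:
    """Macro-average of the category-level averages (equal weighting)."""
    return mean(category_score(rs) for rs in categories.values())


def overall_cost(categories: dict[str, list[BenchmarkResult]]) -> Optional[float]:
    """Average cost per task, computed only over benchmarks with reported cost."""
    costs = [r.cost_usd for rs in categories.values() for r in rs if r.cost_usd is not None]
    return mean(costs) if costs else None


# Made-up numbers: two categories, one benchmark without reported cost.
results = {
    "Literature Understanding": [BenchmarkResult(0.62, 0.10), BenchmarkResult(0.48, None)],
    "Data Analysis": [BenchmarkResult(0.31, 0.25)],
}
print(overall_score(results))  # mean(0.55, 0.31) ≈ 0.43
print(overall_cost(results))   # mean(0.10, 0.25) = 0.175
```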
content.py CHANGED
@@ -445,4 +445,18 @@ span.wrap[tabindex="0"][role="button"][data-editable="false"] {
 #main-header h2 {
   color: #f0529c;
 }
+#about-page-content-wrapper {
+  margin-left: auto;
+  margin-right: auto;
+  max-width: 800px;
+  padding: 0 24px;
+  display: flex;
+  flex-direction: column;
+  gap: 40px;
+  margin-top: 40px;
+  opacity: 85%;
+}
+.divider-line {
+  opacity: 40%;
+}
 """