Pratik Bhavsar committed
Commit ae900da · 1 Parent(s): e2809a3

added key insights

Files changed (1)
  1. data_loader.py +97 -10
data_loader.py CHANGED
@@ -721,15 +721,102 @@ METHODOLOGY = """
  cases that challenge real-world applicability.
  </p>
 
- <h2 class="methodology-subtitle">Tool Selection Quality (TSQ) Metric</h2>
- <ul class="metric-list">
- <li>Correctly identify when tools are needed</li>
- <li>Select the appropriate tool for the task</li>
- <li>Handle cases where no suitable tool exists</li>
- <li>Maintain context across multiple interactions</li>
- <li>Consider cost-effectiveness of tool usage</li>
- <li>Optimize for minimal necessary tool calls</li>
- </ul>
+ <style>
+ .key-insights thead tr {
+ background: linear-gradient(90deg, #60A5FA, #818CF8);
+ }
+
+ .key-insights td:first-child {
+ color: var(--accent-blue);
+ background: var(--bg-primary);
+ }
+
+ .key-insights td:last-child {
+ background: var(--bg-primary);
+ }
+
+ .key-insights td {
+ padding: 1rem;
+ border-bottom: 1px solid rgba(31, 41, 55, 0.5);
+ }
+ </style>
+
+ <div class="methodology-section">
+ <h1 class="methodology-subtitle">Key Insights</h1>
+ <div class="table-container">
+ <table class="dataset-table key-insights">
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Finding</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Performance Champion</td>
+ <td>Gemini-2.0-flash dominates with a 0.935 score at just $0.075 per million tokens, excelling in both complex tasks (0.95) and safety features (0.98)</td>
+ </tr>
+ <tr>
+ <td>Price-Performance Paradox</td>
+ <td>The top 3 models span a 20x price difference yet only a 3% performance gap, challenging pricing assumptions</td>
+ </tr>
+ <tr>
+ <td>Open vs. Closed Source</td>
+ <td>The new Mistral-small leads open-source models and performs similarly to GPT-4o-mini at 0.83, signaling OSS maturity in tool calling</td>
+ </tr>
+ <tr>
+ <td>Reasoning Models</td>
+ <td>Despite their reasoning strength, o1 and o3-mini are far from perfect, scoring 0.87 and 0.84 respectively. DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+ </tr>
+ <tr>
+ <td>Tool Miss Detection</td>
+ <td>Dataset averages of 0.59 and 0.78 reveal fundamental challenges in handling edge cases and maintaining context, even as models excel at basic tasks</td>
+ </tr>
+ <tr>
+ <td>Architecture Trade-offs</td>
+ <td>Long context vs. parallel execution exposes architectural limits: o1 leads in context retention (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+
+ <h2 class="methodology-subtitle">Development Implications</h2>
+ <div class="table-container">
+ <table class="dataset-table key-insights">
+ <thead>
+ <tr>
+ <th>Area</th>
+ <th>Recommendation</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Task Complexity</td>
+ <td>Simple tasks work with most models. Complex workflows requiring multiple tools need models scoring 0.85+ in composite tests</td>
+ </tr>
+ <tr>
+ <td>Error Handling</td>
+ <td>Models with low tool-selection scores need guardrails. Add validation layers and structured error recovery, especially for parameter collection</td>
+ </tr>
+ <tr>
+ <td>Context Management</td>
+ <td>Long conversations require either models strong in context retention or external context storage systems</td>
+ </tr>
+ <tr>
+ <td>Reasoning Models</td>
+ <td>While o1 and o3-mini excelled in function calling, DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+ </tr>
+ <tr>
+ <td>Safety Controls</td>
+ <td>Add strict tool-access controls for models weak in irrelevance detection. Include validation layers for inconsistent performers</td>
+ </tr>
+ <tr>
+ <td>Open vs. Closed Source</td>
+ <td>Private models lead in complex tasks, but open-source options work well for basic operations. Choose based on your scaling needs</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
 
  <h2 class="methodology-subtitle">Dataset Structure</h2>
  <div class="table-container">
@@ -847,5 +934,5 @@ METHODOLOGY = """
  <li>Monthly model additions</li>
  </ul>
  </div>
- </div>
+
  """