Pratik Bhavsar committed
Commit ae900da · 1 Parent(s): e2809a3

added key insights

Files changed (1)
  1. data_loader.py +97 -10
data_loader.py CHANGED
@@ -721,15 +721,102 @@ METHODOLOGY = """
  cases that challenge real-world applicability.
  </p>
 
- <h2 class="methodology-subtitle">Tool Selection Quality (TSQ) Metric</h2>
- <ul class="metric-list">
- <li>Correctly identify when tools are needed</li>
- <li>Select the appropriate tool for the task</li>
- <li>Handle cases where no suitable tool exists</li>
- <li>Maintain context across multiple interactions</li>
- <li>Consider cost-effectiveness of tool usage</li>
- <li>Optimize for minimal necessary tool calls</li>
- </ul>
+ <style>
+ .key-insights thead tr {
+ background: linear-gradient(90deg, #60A5FA, #818CF8);
+ }
+
+ .key-insights td:first-child {
+ color: var(--accent-blue);
+ background: var(--bg-primary);
+ }
+
+ .key-insights td:last-child {
+ background: var(--bg-primary);
+ }
+
+ .key-insights td {
+ padding: 1rem;
+ border-bottom: 1px solid rgba(31, 41, 55, 0.5);
+ }
+ </style>
+
+ <div class="methodology-section">
+ <h1 class="methodology-subtitle">Key Insights</h1>
+ <div class="table-container">
+ <table class="dataset-table key-insights">
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Finding</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Performance Champion</td>
+ <td>Gemini-2.0-flash dominates with a 0.935 score at just $0.075 per million tokens, excelling in both complex tasks (0.95) and safety features (0.98)</td>
+ </tr>
+ <tr>
+ <td>Price-Performance Paradox</td>
+ <td>The top 3 models span a 20x price difference yet only a 3% performance gap, challenging pricing assumptions</td>
+ </tr>
+ <tr>
+ <td>Open vs. Closed Source</td>
+ <td>The new Mistral-small leads open-source models and performs similarly to GPT-4o-mini at 0.83, signaling OSS maturity in tool calling</td>
+ </tr>
+ <tr>
+ <td>Reasoning Models</td>
+ <td>Despite their reasoning strength, o1 and o3-mini are far from perfect, scoring 0.87 and 0.84 respectively. DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+ </tr>
+ <tr>
+ <td>Tool Miss Detection</td>
+ <td>Dataset averages of 0.59 and 0.78 reveal fundamental challenges in handling edge cases and maintaining context, even as models excel at basic tasks</td>
+ </tr>
+ <tr>
+ <td>Architecture Trade-offs</td>
+ <td>Long context vs. parallel execution exposes architectural limits: o1 leads in context retention (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
+
+ <h2 class="methodology-subtitle">Development Implications</h2>
+ <div class="table-container">
+ <table class="dataset-table key-insights">
+ <thead>
+ <tr>
+ <th>Area</th>
+ <th>Recommendation</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>Task Complexity</td>
+ <td>Simple tasks work with most models. Complex workflows requiring multiple tools need models scoring 0.85+ in composite tests</td>
+ </tr>
+ <tr>
+ <td>Error Handling</td>
+ <td>Models with low tool-selection scores need guardrails. Add validation layers and structured error recovery, especially for parameter collection</td>
+ </tr>
+ <tr>
+ <td>Context Management</td>
+ <td>Long conversations require either models strong in context retention or external context storage systems</td>
+ </tr>
+ <tr>
+ <td>Reasoning Models</td>
+ <td>While o1 and o3-mini excelled in function calling, DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+ </tr>
+ <tr>
+ <td>Safety Controls</td>
+ <td>Add strict tool-access controls for models weak in irrelevance detection. Include validation layers for inconsistent performers</td>
+ </tr>
+ <tr>
+ <td>Open vs. Closed Source</td>
+ <td>Private models lead in complex tasks, but open-source options work well for basic operations. Choose based on your scaling needs</td>
+ </tr>
+ </tbody>
+ </table>
+ </div>
 
  <h2 class="methodology-subtitle">Dataset Structure</h2>
  <div class="table-container">
@@ -847,5 +934,5 @@ METHODOLOGY = """
  <li>Monthly model additions</li>
  </ul>
  </div>
- </div>
+
  """