Commit ae900da by Pratik Bhavsar (1 parent: e2809a3)
added key insights
data_loader.py CHANGED (+97 -10)
@@ -721,15 +721,102 @@ METHODOLOGY = """
     cases that challenge real-world applicability.
 </p>
 
-
-
-
-
-
-
-
-
-
+<style>
+    .key-insights thead tr {
+        background: linear-gradient(90deg, #60A5FA, #818CF8);
+    }
+
+    .key-insights td:first-child {
+        color: var(--accent-blue);
+        background: var(--bg-primary);
+    }
+
+    .key-insights td:last-child {
+        background: var(--bg-primary);
+    }
+
+    .key-insights td {
+        padding: 1rem;
+        border-bottom: 1px solid rgba(31, 41, 55, 0.5);
+    }
+</style>
+
+<div class="methodology-section">
+    <h1 class="methodology-subtitle">Key Insights</h1>
+    <div class="table-container">
+        <table class="dataset-table key-insights">
+            <thead>
+                <tr>
+                    <th>Category</th>
+                    <th>Finding</th>
+                </tr>
+            </thead>
+            <tbody>
+                <tr>
+                    <td>Performance Champion</td>
+                    <td>Gemini-2.0-flash dominates with a 0.935 score at just $0.075 per million tokens, excelling in both complex tasks (0.95) and safety features (0.98)</td>
+                </tr>
+                <tr>
+                    <td>Price-Performance Paradox</td>
+                    <td>The top 3 models span a 20x price difference yet only a 3% performance gap, challenging pricing assumptions</td>
+                </tr>
+                <tr>
+                    <td>Open vs Closed Source</td>
+                    <td>The new Mistral-small leads among open-source models and performs similarly to GPT-4o-mini at 0.83, signaling OSS maturity in tool calling</td>
+                </tr>
+                <tr>
+                    <td>Reasoning Models</td>
+                    <td>Although strong at reasoning, o1 and o3-mini are far from perfect, scoring 0.87 and 0.84 respectively. DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+                </tr>
+                <tr>
+                    <td>Tool Miss Detection</td>
+                    <td>Dataset averages of 0.59 and 0.78 reveal fundamental challenges in handling edge cases and maintaining context, even as models excel at basic tasks</td>
+                </tr>
+                <tr>
+                    <td>Architecture Trade-offs</td>
+                    <td>Long context vs parallel execution shows architectural limits: o1 leads in long context (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern</td>
+                </tr>
+            </tbody>
+        </table>
+    </div>
+
+    <h2 class="methodology-subtitle">Development Implications</h2>
+    <div class="table-container">
+        <table class="dataset-table key-insights">
+            <thead>
+                <tr>
+                    <th>Area</th>
+                    <th>Recommendation</th>
+                </tr>
+            </thead>
+            <tbody>
+                <tr>
+                    <td>Task Complexity</td>
+                    <td>Simple tasks work with most models. Complex workflows requiring multiple tools need models with 0.85+ scores in composite tests</td>
+                </tr>
+                <tr>
+                    <td>Error Handling</td>
+                    <td>Models with low tool selection scores need guardrails. Add validation layers and structured error recovery, especially for parameter collection</td>
+                </tr>
+                <tr>
+                    <td>Context Management</td>
+                    <td>Long conversations require either models strong in context retention or external context storage systems</td>
+                </tr>
+                <tr>
+                    <td>Reasoning Models</td>
+                    <td>While o1 and o3-mini excelled in function calling, DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+                </tr>
+                <tr>
+                    <td>Safety Controls</td>
+                    <td>Add strict tool access controls for models weak in irrelevance detection. Include validation layers for inconsistent performers</td>
+                </tr>
+                <tr>
+                    <td>Open vs Closed Source</td>
+                    <td>Private models lead in complex tasks, but open-source options work well for basic operations. Choose based on your scaling needs</td>
+                </tr>
+            </tbody>
+        </table>
+    </div>
 
     <h2 class="methodology-subtitle">Dataset Structure</h2>
     <div class="table-container">
@@ -847,5 +934,5 @@ METHODOLOGY = """
         <li>Monthly model additions</li>
     </ul>
 </div>
-
+
 """
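
The "Error Handling" and "Safety Controls" recommendations in the added table (validation layers, parameter checks, strict tool access controls) can be sketched as a small pre-execution check. This is an illustration only and is not part of the commit: `ToolSpec`, `validate_tool_call`, and the `get_weather` tool are hypothetical names, and the call format (a dict with `name` and `arguments` keys) is an assumption rather than any specific provider's schema.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical description of one tool the application exposes."""
    name: str
    required: dict[str, type]  # parameter name -> expected Python type


@dataclass
class ValidationResult:
    ok: bool
    errors: list[str] = field(default_factory=list)


def validate_tool_call(call: dict, allowed: dict[str, ToolSpec]) -> ValidationResult:
    """Check a model-proposed tool call before executing it."""
    name = call.get("name")
    if name not in allowed:
        # Access control: anything outside the allowlist is rejected outright.
        return ValidationResult(False, [f"tool not allowed: {name!r}"])

    args = call.get("arguments") or {}
    errors = []
    for param, expected in allowed[name].required.items():
        if param not in args:
            errors.append(f"missing parameter: {param}")
        elif not isinstance(args[param], expected):
            errors.append(f"wrong type for {param}: expected {expected.__name__}")
    # Structured errors can be fed back to the model as a recovery prompt.
    return ValidationResult(not errors, errors)


# Assumed allowlist with a single illustrative tool.
ALLOWED = {"get_weather": ToolSpec("get_weather", {"city": str, "days": int})}

good = validate_tool_call(
    {"name": "get_weather", "arguments": {"city": "Pune", "days": 3}}, ALLOWED
)
bad = validate_tool_call(
    {"name": "get_weather", "arguments": {"city": "Pune"}}, ALLOWED
)
blocked = validate_tool_call({"name": "delete_files", "arguments": {}}, ALLOWED)
print(good.ok, bad.errors, blocked.errors)
```

Returning a list of errors instead of raising on the first one lets the caller send every problem back to the model in a single retry, which matters most for the parameter-collection failures the table calls out.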