agent-leaderboard / results_v2.csv
Pratik Bhavsar
Model,Vendor,Avg AC,Avg TSQ,Avg Total Cost,Avg Session Duration,Avg Turns,Banking AC,Healthcare AC,Insurance AC,Investment AC,Telecom AC,Banking TSQ,Healthcare TSQ,Insurance TSQ,Investment TSQ,Telecom TSQ,Avg Input Cost ($),Avg Output Cost ($),Banking Cost,Healthcare Cost,Insurance Cost,Investment Cost,Telecom Cost,Banking Duration,Healthcare Duration,Insurance Duration,Investment Duration,Telecom Duration,Banking Turns,Healthcare Turns,Insurance Turns,Investment Turns,Telecom Turns,Model Type,$/M input token,$/M output token,Output Type
gpt-4.1-2025-04-14,OpenAI,0.62,0.8,0.0684,24.32,3.1,0.6,0.62,0.66,0.64,0.58,0.81,0.83,0.68,0.88,0.82,0.0577,0.0107,0.052,0.0711,0.0629,0.0777,0.0783,18.52,24.4,25.24,27.88,25.58,2.61,3.15,2.92,3.3,3.48,Proprietary,2.0,8.0,Normal
mistral-medium-2508,Mistral,0.61,0.77,0.0199,37.45,2.98,0.57,0.6,0.7,0.57,0.59,0.74,0.75,0.73,0.87,0.76,0.0164,0.0035,0.0185,0.0196,0.0195,0.0223,0.0195,31.91,34.94,36.96,43.91,39.53,3.06,2.85,2.92,3.12,2.97,Proprietary,0.4,2.0,Normal
gpt-4.1-mini-2025-04-14,OpenAI,0.56,0.79,0.0141,26.0,3.43,0.56,0.6,0.46,0.5,0.64,0.8,0.85,0.63,0.84,0.83,0.0123,0.0018,0.0115,0.0143,0.0131,0.0164,0.0156,21.28,26.82,23.32,30.5,28.07,2.99,3.32,3.28,3.76,3.79,Proprietary,0.4,1.6,Normal
claude-sonnet-4-20250514,Anthropic,0.55,0.92,0.1537,66.6,2.89,0.58,0.62,0.53,0.49,0.53,0.9,0.95,0.93,0.92,0.9,0.1212,0.0325,0.1359,0.1542,0.1442,0.1669,0.1675,55.36,57.93,56.87,86.44,76.38,2.54,2.84,2.79,3.06,3.22,Proprietary,3.0,15.0,Normal
kimi-k2-instruct,Moonshot AI,0.53,0.9,0.0386,163.62,2.84,0.58,0.49,0.58,0.47,0.53,0.89,0.91,0.88,0.93,0.91,0.0346,0.004,0.0344,0.0401,0.0367,0.0419,0.0397,165.45,155.42,164.9,161.14,171.17,2.58,2.81,2.79,3.1,2.93,Open source,1.0,3.0,Normal
qwen3-235b-a22b-instruct-2507,Alibaba,0.53,0.85,0.0074,238.02,2.4,0.44,0.49,0.74,0.41,0.58,0.88,0.84,0.85,0.91,0.79,0.0067,0.0007,0.0059,0.0079,0.0077,0.0079,0.008,206.53,267.85,233.49,252.94,229.29,1.99,2.41,2.53,2.47,2.62,Open source,0.2,0.6,Reasoning
qwen2.5-72b-instruct,Alibaba,0.51,0.8,0.0361,34.68,2.65,0.48,0.61,0.52,0.42,0.52,0.78,0.84,0.77,0.82,0.79,0.0338,0.0023,0.0292,0.0348,0.0338,0.0417,0.0415,27.34,29.09,30.46,41.32,45.2,2.3,2.46,2.47,3.0,3.0,Open source,0.9,0.9,Normal
gemini-2.5-flash-lite,Google,0.47,0.84,0.0039,9.8,3.11,0.45,0.6,0.54,0.35,0.41,0.82,0.9,0.78,0.86,0.84,0.0034,0.0005,0.003,0.0036,0.0039,0.0049,0.0043,7.96,8.82,9.52,11.31,11.39,2.61,3.0,2.87,3.69,3.36,Proprietary,0.1,0.4,Reasoning
glm-4.5-air,Zai,0.44,0.94,0.0194,69.17,4.96,0.49,0.4,0.53,0.46,0.33,0.94,0.91,0.94,0.96,0.94,0.014,0.0054,0.0152,0.021,0.0191,0.0216,0.0199,57.56,69.95,86.47,76.51,55.35,3.99,5.2,5.02,5.41,5.2,Open source,0.2,1.1,Reasoning
gemini-2.5-pro,Google,0.43,0.86,0.1447,125.85,3.57,0.45,0.4,0.54,0.31,0.44,0.88,0.87,0.87,0.85,0.83,0.0442,0.1005,0.1253,0.1475,0.1386,0.1464,0.1656,108.83,126.91,121.99,129.62,141.92,3.1,3.57,3.49,3.7,3.97,Proprietary,1.25,10.0,Reasoning
grok-4-0709,xAI,0.42,0.88,0.2387,225.94,3.19,0.29,0.4,0.48,0.5,0.42,0.92,0.84,0.91,0.9,0.82,0.1008,0.1379,0.2295,0.2679,0.2073,0.2257,0.2632,226.62,326.28,157.1,189.18,230.5,2.98,3.42,3.07,3.01,3.46,Proprietary,3.0,15.0,Reasoning
deepseek-v3,Deepseek,0.4,0.8,0.0141,59.97,3.71,0.38,0.32,0.48,0.36,0.47,0.8,0.74,0.76,0.87,0.81,0.0119,0.0022,0.0123,0.0158,0.0139,0.0151,0.0138,44.46,68.38,48.54,70.21,68.27,3.27,4.2,3.48,3.79,3.83,Open source,0.27,1.1,Normal
gemini-2.5-flash,Google,0.38,0.94,0.0271,39.84,3.9,0.48,0.38,0.44,0.22,0.36,0.94,0.94,0.94,0.94,0.95,0.0123,0.0148,0.0248,0.0283,0.0273,0.0308,0.0241,33.03,36.81,38.24,42.78,48.34,3.53,3.96,3.78,4.28,3.98,Proprietary,0.3,2.5,Reasoning
gpt-4.1-nano-2025-04-14,OpenAI,0.38,0.63,0.0038,12.36,3.56,0.4,0.4,0.41,0.29,0.38,0.64,0.54,0.54,0.77,0.65,0.0034,0.0004,0.0029,0.0038,0.004,0.0042,0.0041,14.16,10.9,12.23,12.68,11.83,2.88,3.24,3.78,4.01,3.91,Proprietary,0.1,0.4,Normal
qwen3-235b-a22b-thinking-2507,Alibaba,0.34,0.85,0.0584,302.24,3.12,0.42,0.3,0.42,0.23,0.34,0.84,0.82,0.86,0.91,0.84,0.0275,0.0309,0.0535,0.0679,0.0573,0.0562,0.0575,309.41,310.33,316.64,266.96,307.84,2.86,3.43,3.2,3.03,3.1,Open source,0.65,3.0,Reasoning
magistral-medium-2506,Mistral,0.32,0.59,0.1182,32.96,4.4,0.3,0.35,0.38,0.26,0.3,0.59,0.67,0.56,0.63,0.51,0.108,0.0102,0.1067,0.0994,0.1077,0.1476,0.1294,24.98,35.81,33.33,39.18,31.49,4.21,3.46,3.92,5.36,5.07,Proprietary,2.0,5.0,Reasoning
nova-pro-v1,Amazon,0.29,0.65,0.0359,27.96,3.04,0.33,0.29,0.39,0.17,0.29,0.6,0.57,0.64,0.83,0.6,0.0316,0.0043,0.0304,0.0353,0.0359,0.04,0.038,23.45,27.94,27.9,32.09,28.43,2.72,2.88,2.99,3.36,3.26,Proprietary,0.8,3.2,Normal
mistral-small-2506,Mistral,0.26,0.71,0.0053,35.69,4.37,0.37,0.28,0.22,0.2,0.21,0.73,0.71,0.65,0.76,0.69,0.0049,0.0004,0.0041,0.0057,0.0054,0.0058,0.0056,30.64,36.02,30.83,41.96,39.02,3.3,4.47,4.52,4.87,4.67,Open source,0.1,0.3,Normal
llama-3.3-70b-instruct,Meta,0.2,0.62,0.0599,19.92,3.83,0.11,0.29,0.29,0.14,0.16,0.62,0.64,0.62,0.64,0.57,0.0588,0.0011,0.055,0.0544,0.0545,0.0664,0.069,17.62,19.4,18.55,23.91,20.14,3.61,3.34,3.42,4.29,4.5,Open source,0.88,0.88,Normal
caller,Arcee,0.16,0.65,0.0297,25.66,4.2,0.23,0.14,0.22,0.09,0.12,0.69,0.6,0.68,0.61,0.67,0.0282,0.0015,0.0262,0.0303,0.0305,0.0331,0.0286,22.83,25.54,26.42,29.66,23.85,3.76,4.19,4.18,4.75,4.14,Open source,0.55,0.85,Normal
nova-lite-v1,Amazon,0.16,0.55,0.0031,20.26,3.73,0.12,0.18,0.19,0.15,0.18,0.48,0.49,0.58,0.72,0.49,0.0027,0.0004,0.0026,0.0033,0.0031,0.0034,0.0029,17.53,20.61,19.67,24.28,19.2,3.31,4.13,3.57,4.04,3.62,Proprietary,0.06,0.24,Normal
magistral-small-2506,Mistral,0.16,0.53,0.0301,17.42,5.68,0.23,0.18,0.13,0.16,0.12,0.57,0.46,0.42,0.62,0.6,0.0275,0.0026,0.0245,0.0335,0.0302,0.034,0.0281,14.53,21.36,14.65,19.67,16.87,4.74,6.28,6.14,6.06,5.19,Open source,0.5,1.5,Reasoning