# Benchmarks

When evaluating which LLM to use, it's important to consider the model's "intelligence", which the benchmark results below give a sense of. Use them to determine the model size and quality your use case requires.

| Model | Params (in billions) | Function Calling | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-bench |
|---|---|---|---|---|---|---|---|
| Claude-3.5 Sonnet | - | 98.57% | 88.7 | 59.4 | - | - | - |
| GPT-4o | - | 98.57% | - | 53.6 | - | - | - |
| Rubra Llama-3 70B Instruct | 70.6 | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
| Rubra Llama-3 8B Instruct | 8.9 | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
| Rubra Qwen2-7B-Instruct | 8.55 | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
| Rubra Mistral 7B Instruct v0.3 | 8.12 | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
| Rubra Mistral 7B Instruct v0.2 | 8.11 | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
| Rubra Phi-3 Mini 128k Instruct | 4.27 | 65.71% | 66.66 | 29.24 | 74.09 | 26.84 | 7.45 |
| Nexusflow/NexusRaven-V2-13B | 13 | 53.75% ∔ | 43.23 | 28.79 | 22.67 | 7.12 | 5.36 |
| Rubra Gemma-1.1 2B Instruct | 2.84 | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |
| gorilla-llm/gorilla-openfunctions-v2 | 6.91 | 41.25% ∔ | 49.14 | 23.66 | 48.29 | 17.54 | 5.13 |
| NousResearch/Hermes-2-Pro-Llama-3-8B | 8.03 | 41.25% | 64.16 | 31.92 | 73.92 | 21.58 | 7.83 |
| Mistral 7B Instruct v0.3 | 7.25 | 22.5% | 62.10 | 30.58 | 53.07 | 12.98 | 7.50 |
| Qwen2-7B-Instruct | 7.62 | - | 70.78 | 32.14 | 78.54 | 30.10 | 8.29 |
| Phi-3 Mini 128k Instruct | 3.82 | - | 68.17 | 30.58 | 80.44 | 28.12 | 7.92 |
| Mistral 7B Instruct v0.2 | 7.24 | - | 59.27 | 27.68 | 43.21 | 10.30 | 7.50 |
| Llama-3 70B Instruct | 70.6 | - | 79.90 | 38.17 | 90.67 | 44.24 | 8.88 |
| Llama-3 8B Instruct | 8.03 | - | 65.69 | 31.47 | 77.41 | 27.58 | 8.07 |
| Gemma-1.1 2B Instruct | 2.51 | - | 37.84 | 22.99 | 6.29 | 6.14 | 5.82 |
:::info

MT-bench for all models was run in June 2024, with GPT-4 as the judge.

MMLU, GPQA, GSM-8K, and MATH were all calculated using the Language Model Evaluation Harness; a minimal invocation sketch follows this info block.

Our proprietary function calling benchmark will be open-sourced in the coming months; half of it is composed of the quickstart examples found in gptscript.

:::
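For reference, a harness run along these lines can reproduce that kind of score. This is a minimal sketch using the harness's Python entry point; the model identifier, task selection, and batch size are illustrative rather than our exact configuration, and exact arguments may differ between harness versions.

```python
# Minimal sketch: scoring one model on MMLU with the Language Model
# Evaluation Harness. The model id and settings below are illustrative,
# not the exact configuration used for the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                  # Hugging Face transformers backend
    model_args="pretrained=rubra-ai/Meta-Llama-3-8B-Instruct",   # example model id (assumption)
    tasks=["mmlu"],                                              # MMLU is reported 5-shot above
    num_fewshot=5,
    batch_size=8,
)

# results["results"] maps each task name to its metrics (e.g. accuracy).
for task, metrics in results["results"].items():
    print(task, metrics)
```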

:::note

Some of the LLMs above require custom libraries to post-process LLM-generated tool calls. We followed those models' recommendations and guidelines in our evaluation.

- mistralai/Mistral-7B-Instruct-v0.3 required the mistral-inference library to extract function calls.
- NousResearch/Hermes-2-Pro-Llama-3-8B required hermes-function-calling; a minimal parsing sketch follows this list.
- gorilla-llm/gorilla-openfunctions-v2 required the special prompting detailed in its GitHub repo.
- Nexusflow/NexusRaven-V2-13B required nexusraven-pip.
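As an illustration of what this post-processing looks like, here is a minimal sketch of extracting tool calls from Hermes-style output, assuming the format where each call is wrapped in `<tool_call>` tags around a JSON object; our evaluation used the libraries listed above rather than this snippet.

```python
import json
import re

# Assumed Hermes-style output: each call wrapped in <tool_call> ... </tool_call>
# tags containing a JSON object such as {"name": ..., "arguments": {...}}.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return the parsed tool-call objects found in a model completion."""
    calls = []
    for raw in TOOL_CALL_RE.findall(completion):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            pass  # malformed call; a real harness would count this as a failure
    return calls

# Example (hypothetical completion text):
# extract_tool_calls('<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>')
# -> [{"name": "get_weather", "arguments": {"city": "Paris"}}]
```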

Nexusflow/NexusRaven-V2-13B and gorilla-llm/gorilla-openfunctions-v2 don't accept tool observations (the result of running a tool or function once the LLM calls it), so we appended the observation to the prompt instead, as sketched below.

:::
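A minimal sketch of that workaround is shown below; the message shapes and wording are illustrative and not the exact prompts used in the benchmark.

```python
def append_tool_observation(messages: list[dict], tool_name: str, observation: str) -> list[dict]:
    """Fold a tool result into the prompt for models that have no 'tool' role.

    Instead of appending {"role": "tool", "content": observation}, the
    observation is added as a user-visible message so the model can read it
    on the next turn.
    """
    messages.append({
        "role": "user",
        "content": f"Result of calling {tool_name}: {observation}",
    })
    return messages

# Example (hypothetical conversation):
# messages = [{"role": "user", "content": "What's the weather in Paris?"},
#             {"role": "assistant", "content": 'get_weather(city="Paris")'}]
# append_tool_observation(messages, "get_weather", '{"temp_c": 21}')
```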