# Benchmarks

When evaluating which LLM to use, it's important to consider the model's "intelligence", which the benchmark results below give a sense of. Use them to determine the model size and quality your use case requires.

| Model | Params (in billions) | Function Calling | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-bench |
|---|---|---|---|---|---|---|---|
| Claude-3.5 Sonnet | - | 98.57% | 88.7 | 59.4 | - | - | - |
| GPT-4o | - | 98.57% | - | 53.6 | - | - | - |
| Rubra Llama-3 70B Instruct | 70.6 | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
| Rubra Llama-3 8B Instruct | 8.9 | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
| Rubra Qwen2-7B-Instruct | 8.55 | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
| Rubra Mistral 7B Instruct v0.3 | 8.12 | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
| Rubra Mistral 7B Instruct v0.2 | 8.11 | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
| Rubra Phi-3 Mini 128k Instruct | 4.27 | 65.71% | 66.66 | 29.24 | 74.09 | 26.84 | 7.45 |
| Nexusflow/NexusRaven-V2-13B | 13 | 53.75% ∔ | 43.23 | 28.79 | 22.67 | 7.12 | 5.36 |
| Rubra Gemma-1.1 2B Instruct | 2.84 | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |
| gorilla-llm/gorilla-openfunctions-v2 | 6.91 | 41.25% ∔ | 49.14 | 23.66 | 48.29 | 17.54 | 5.13 |
| NousResearch/Hermes-2-Pro-Llama-3-8B | 8.03 | 41.25% | 64.16 | 31.92 | 73.92 | 21.58 | 7.83 |
| Mistral 7B Instruct v0.3 | 7.25 | 22.5% | 62.10 | 30.58 | 53.07 | 12.98 | 7.50 |
| Qwen2-7B-Instruct | 7.62 | - | 70.78 | 32.14 | 78.54 | 30.10 | 8.29 |
| Phi-3 Mini 128k Instruct | 3.82 | - | 68.17 | 30.58 | 80.44 | 28.12 | 7.92 |
| Mistral 7B Instruct v0.2 | 7.24 | - | 59.27 | 27.68 | 43.21 | 10.30 | 7.50 |
| Llama-3 70B Instruct | 70.6 | - | 79.90 | 38.17 | 90.67 | 44.24 | 8.88 |
| Llama-3 8B Instruct | 8.03 | - | 65.69 | 31.47 | 77.41 | 27.58 | 8.07 |
| Gemma-1.1 2B Instruct | 2.51 | - | 37.84 | 22.99 | 6.29 | 6.14 | 5.82 |
:::info

MT-bench for all models was run in June 2024, with GPT-4 as the judge.

MMLU, GPQA, GSM-8K, and MATH were all calculated using the Language Model Evaluation Harness; a minimal invocation sketch follows this info block.

Our proprietary function calling benchmark will be open-sourced in the coming months; half of it is composed of the quickstart examples found in gptscript.

:::
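For reference, a harness run along these lines can reproduce that kind of score. This is a minimal sketch using the harness's Python entry point; the model identifier, task selection, and batch size are illustrative rather than our exact configuration, and exact arguments may differ between harness versions.

```python
# Minimal sketch: scoring one model on MMLU with the Language Model
# Evaluation Harness. The model id and settings below are illustrative,
# not the exact configuration used for the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                  # Hugging Face transformers backend
    model_args="pretrained=rubra-ai/Meta-Llama-3-8B-Instruct",   # example model id (assumption)
    tasks=["mmlu"],                                              # MMLU is reported 5-shot above
    num_fewshot=5,
    batch_size=8,
)

# results["results"] maps each task name to its metrics (e.g. accuracy).
for task, metrics in results["results"].items():
    print(task, metrics)
```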

:::note

Some of the LLMs above require custom libraries to post-process LLM-generated tool calls. We followed those models' recommendations and guidelines in our evaluation.

- mistralai/Mistral-7B-Instruct-v0.3 required the mistral-inference library to extract function calls.
- NousResearch/Hermes-2-Pro-Llama-3-8B required hermes-function-calling; a minimal parsing sketch follows this list.
- gorilla-llm/gorilla-openfunctions-v2 required the special prompting detailed in its GitHub repo.
- Nexusflow/NexusRaven-V2-13B required nexusraven-pip.
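As an illustration of what this post-processing looks like, here is a minimal sketch of extracting tool calls from Hermes-style output, assuming the format where each call is wrapped in `<tool_call>` tags around a JSON object; our evaluation used the libraries listed above rather than this snippet.

```python
import json
import re

# Assumed Hermes-style output: each call wrapped in <tool_call> ... </tool_call>
# tags containing a JSON object such as {"name": ..., "arguments": {...}}.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_tool_calls(completion: str) -> list[dict]:
    """Return the parsed tool-call objects found in a model completion."""
    calls = []
    for raw in TOOL_CALL_RE.findall(completion):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            pass  # malformed call; a real harness would count this as a failure
    return calls

# Example (hypothetical completion text):
# extract_tool_calls('<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>')
# -> [{"name": "get_weather", "arguments": {"city": "Paris"}}]
```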

Nexusflow/NexusRaven-V2-13B and gorilla-llm/gorilla-openfunctions-v2 don't accept tool observations (the result of running a tool or function once the LLM calls it), so we appended the observation to the prompt instead, as sketched below.

:::
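A minimal sketch of that workaround is shown below; the message shapes and wording are illustrative and not the exact prompts used in the benchmark.

```python
def append_tool_observation(messages: list[dict], tool_name: str, observation: str) -> list[dict]:
    """Fold a tool result into the prompt for models that have no 'tool' role.

    Instead of appending {"role": "tool", "content": observation}, the
    observation is added as a user-visible message so the model can read it
    on the next turn.
    """
    messages.append({
        "role": "user",
        "content": f"Result of calling {tool_name}: {observation}",
    })
    return messages

# Example (hypothetical conversation):
# messages = [{"role": "user", "content": "What's the weather in Paris?"},
#             {"role": "assistant", "content": 'get_weather(city="Paris")'}]
# append_tool_observation(messages, "get_weather", '{"temp_c": 21}')
```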