bellwether

The cost-and-failure-mode benchmark for LLM agents.

Headline metric: effective_TCoT is total spend per successful task completion, including the cost of failed retries. Lower is better. The methodology is the contribution; the leaderboard is its proof.

Validation is machine-checkable (no LLM-as-judge). Failures are classified into eight modes (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). One canonical prompt per task across all providers. Temperature 0; N=3 runs per task instance; cost-per-million-token pricing verified per provider.
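The spirit of machine-checkable validation can be shown with a toy triage for a JSON-returning task. The checks below (lexical refusal match, brace counting) are simplistic stand-ins for the real validator contract, and the timeout, error, and confabulation modes are assumed to be assigned by the harness rather than by this function:

```python
import json

FAILURE_MODES = ("refusal", "confabulation", "schema break", "truncation",
                 "partial", "off-task", "timeout", "error")

def classify(raw, required_keys):
    """Deterministic triage for one response -- no LLM-as-judge anywhere."""
    if "I can't" in raw or "I cannot" in raw:
        return "refusal"  # crude lexical check, illustration only
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        # More opening braces than closing ones suggests the output was cut off.
        return "truncation" if raw.count("{") > raw.count("}") else "schema break"
    if not isinstance(obj, dict):
        return "off-task"  # parsed, but not the object the task asked for
    return "partial" if required_keys - obj.keys() else "pass"

print(classify('{"city": "Oslo"', {"city"}))  # response cut off mid-object
```

Because every branch is a deterministic string or structural check, two runs of the validator on the same output always agree, which is what makes the failure-mode counts reproducible.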

See the full methodology for formulas, retry policy, validator contract, and reproducibility caveats. New to the metrics? The glossary defines each term.

3 tasks benched · 27 distinct models · 15 bench passes · $53.66 total bench spend

Cross-task ranking

Each cell is the model's rank on that task in its latest clean pass. Lower is better. Rank 1 is highlighted in purple. Rows are grouped by model_class (standard, then reasoning, then search) and sorted within each class by average observed rank. Cross-class comparisons mislead: reasoning models burn 5x to 20x more output tokens per attempt, and search models pay an unmetered per-search fee; see methodology §2.7.

| Provider / Model | Class | structured_extraction | function_call_routing | synthetic_rag | avg rank |
|---|---|---|---|---|---|
| openai/gpt-4o-mini | standard | #1 | #3 | #3 | 2.3 |
| openrouter/meta-llama/llama-4-scout | standard | #4 | #1 | #2 | 2.3 |
| openrouter/meta-llama/llama-4-maverick | standard | #2 | #4 | #4 | 3.3 |
| google/gemini-2.5-flash-lite | standard | #8 | #2 | #1 | 3.7 |
| xai/grok-3-mini | standard | #3 | #5 | #6 | 4.7 |
| openrouter/deepseek/deepseek-chat | standard | #7 | #8 | #5 | 6.7 |
| xai/grok-4-fast | standard | #6 | #6 | #8 | 6.7 |
| openrouter/meta-llama/llama-3.3-70b-instruct | standard | #5 | #7 | #9 | 7.0 |
| google/gemini-2.5-flash | standard | #18 | #9 | #7 | 11.3 |
| anthropic/claude-haiku-4-5 | standard | #21* | #22* | #13 | 13.0 |
| openai/gpt-4o | standard | #10 | #16 | #15 | 13.7 |
| openrouter/mistralai/mistral-large | standard | #14 | #14 | #14 | 14.0 |
| xai/grok-3 | standard | #15 | #17 | #11 | 14.3 |
| google/gemini-2.5-pro | standard | #23* | #21 | #12 | 16.5 |
| anthropic/claude-sonnet-4-6 | standard | #16 | #18 | #16 | 16.7 |
| xai/grok-4 | standard | #20 | #20 | #21 | 20.3 |
| anthropic/claude-opus-4-7 | standard | #22* | #23* | #23* | n/a |
| openrouter/cohere/command-r-plus | standard | #24* | #24* | #24* | n/a |
| openrouter/qwen/qwen-3-235b-a22b | standard | #25* | #25* | #25* | n/a |
| openai/o3-mini | reasoning | #11 | #12 | #19 | 14.0 |
| openai/o4-mini | reasoning | #12 | #13 | #20 | 15.0 |
| openrouter/deepseek/deepseek-r1 | reasoning | #17 | #11 | #18 | 15.3 |
| openai/o3 | reasoning | #13 | #15 | #22 | 16.7 |
| perplexity/sonar-reasoning | reasoning | #27* | #27* | #27* | n/a |
| perplexity/sonar-reasoning-pro | reasoning | #26* | #26* | #26* | n/a |
| perplexity/sonar | search | #9 | #10 | #10 | 9.7 |
| perplexity/sonar-pro | search | #19 | #19 | #17 | 18.3 |

Ranks marked * are excluded from the average; rows where every rank carries * show n/a.
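The avg-rank column can be reproduced from the per-task ranks. A sketch, assuming starred cells are encoded as excluded (the pair encoding here is illustrative):

```python
def avg_observed_rank(ranks):
    """Mean of the non-excluded ranks, or None (rendered "n/a") if all are excluded.

    `ranks` is a list of (rank, excluded) pairs; `excluded` mirrors the
    table's '*' marker -- a hypothetical encoding of the published data.
    """
    observed = [r for r, excluded in ranks if not excluded]
    if not observed:
        return None
    return round(sum(observed) / len(observed), 1)

# google/gemini-2.5-pro: #23* is excluded, so avg = (21 + 12) / 2 = 16.5
print(avg_observed_rank([(23, True), (21, False), (12, False)]))  # 16.5
```

This is also why all-starred rows sort to the bottom of their class: with no observed ranks there is nothing to average.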