bellwether

The cost-and-failure-mode benchmark for LLM agents.

Headline metric: effective_TCoT is total spend per successful task completion, including the cost of failed retries. Lower is better. The methodology is the contribution; the leaderboard is its proof.

Validation is machine-checkable (no LLM-as-judge). Failures are classified into eight modes (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). One canonical prompt per task across all providers. Temperature 0; N=3 runs per task instance; cost-per-million-token pricing verified per provider.
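The spirit of machine-checkable validation can be shown with a toy triage for a JSON-returning task. The checks below (lexical refusal match, brace counting) are simplistic stand-ins for the real validator contract, and the timeout, error, and confabulation modes are assumed to be assigned by the harness rather than by this function:

```python
import json

FAILURE_MODES = ("refusal", "confabulation", "schema break", "truncation",
                 "partial", "off-task", "timeout", "error")

def classify(raw, required_keys):
    """Deterministic triage for one response -- no LLM-as-judge anywhere."""
    if "I can't" in raw or "I cannot" in raw:
        return "refusal"  # crude lexical check, illustration only
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        # More opening braces than closing ones suggests the output was cut off.
        return "truncation" if raw.count("{") > raw.count("}") else "schema break"
    if not isinstance(obj, dict):
        return "off-task"  # parsed, but not the object the task asked for
    return "partial" if required_keys - obj.keys() else "pass"

print(classify('{"city": "Oslo"', {"city"}))  # response cut off mid-object
```

Because every branch is a deterministic string or structural check, two runs of the validator on the same output always agree, which is what makes the failure-mode counts reproducible.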

See the full methodology for formulas, retry policy, validator contract, and reproducibility caveats. New to the metrics? The glossary defines each term.

3 tasks benched · 27 distinct models · 15 bench passes · $53.66 total bench spend

Cross-task ranking

Each cell is the model's rank on that task in its latest clean pass. Lower is better. Rank 1 is highlighted in purple. Rows are grouped by model_class (standard, then reasoning, then search) and sorted within each class by average observed rank. Cross-class comparisons mislead: reasoning models burn 5x to 20x more output tokens per attempt, and search models pay an unmetered per-search fee; see methodology §2.7.

| Provider / Model | Class | structured_extraction | function_call_routing | synthetic_rag | avg rank |
|---|---|---|---|---|---|
| openai/gpt-4o-mini | standard | #1 | #3 | #3 | 2.3 |
| openrouter/meta-llama/llama-4-scout | standard | #4 | #1 | #2 | 2.3 |
| openrouter/meta-llama/llama-4-maverick | standard | #2 | #4 | #4 | 3.3 |
| google/gemini-2.5-flash-lite | standard | #8 | #2 | #1 | 3.7 |
| xai/grok-3-mini | standard | #3 | #5 | #6 | 4.7 |
| openrouter/deepseek/deepseek-chat | standard | #7 | #8 | #5 | 6.7 |
| xai/grok-4-fast | standard | #6 | #6 | #8 | 6.7 |
| openrouter/meta-llama/llama-3.3-70b-instruct | standard | #5 | #7 | #9 | 7.0 |
| google/gemini-2.5-flash | standard | #18 | #9 | #7 | 11.3 |
| anthropic/claude-haiku-4-5 | standard | #21* | #22* | #13 | 13.0 |
| openai/gpt-4o | standard | #10 | #16 | #15 | 13.7 |
| openrouter/mistralai/mistral-large | standard | #14 | #14 | #14 | 14.0 |
| xai/grok-3 | standard | #15 | #17 | #11 | 14.3 |
| google/gemini-2.5-pro | standard | #23* | #21 | #12 | 16.5 |
| anthropic/claude-sonnet-4-6 | standard | #16 | #18 | #16 | 16.7 |
| xai/grok-4 | standard | #20 | #20 | #21 | 20.3 |
| anthropic/claude-opus-4-7 | standard | #22* | #23* | #23* | n/a |
| openrouter/cohere/command-r-plus | standard | #24* | #24* | #24* | n/a |
| openrouter/qwen/qwen-3-235b-a22b | standard | #25* | #25* | #25* | n/a |
| openai/o3-mini | reasoning | #11 | #12 | #19 | 14.0 |
| openai/o4-mini | reasoning | #12 | #13 | #20 | 15.0 |
| openrouter/deepseek/deepseek-r1 | reasoning | #17 | #11 | #18 | 15.3 |
| openai/o3 | reasoning | #13 | #15 | #22 | 16.7 |
| perplexity/sonar-reasoning | reasoning | #27* | #27* | #27* | n/a |
| perplexity/sonar-reasoning-pro | reasoning | #26* | #26* | #26* | n/a |
| perplexity/sonar | search | #9 | #10 | #10 | 9.7 |
| perplexity/sonar-pro | search | #19 | #19 | #17 | 18.3 |

Ranks marked * are excluded from the average; rows where every rank carries * show n/a.
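The avg-rank column can be reproduced from the per-task ranks. A sketch, assuming starred cells are encoded as excluded (the pair encoding here is illustrative):

```python
def avg_observed_rank(ranks):
    """Mean of the non-excluded ranks, or None (rendered "n/a") if all are excluded.

    `ranks` is a list of (rank, excluded) pairs; `excluded` mirrors the
    table's '*' marker -- a hypothetical encoding of the published data.
    """
    observed = [r for r, excluded in ranks if not excluded]
    if not observed:
        return None
    return round(sum(observed) / len(observed), 1)

# google/gemini-2.5-pro: #23* is excluded, so avg = (21 + 12) / 2 = 16.5
print(avg_observed_rank([(23, True), (21, False), (12, False)]))  # 16.5
```

This is also why all-starred rows sort to the bottom of their class: with no observed ranks there is nothing to average.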