bellwether: synthetic

Critique-Pass delta (latest clean off vs latest clean on)

Per-model deltas between the critique-off baseline and the critique-on track, per METHODOLOGY s13.3. A positive cost delta means the critique pass made the model more expensive; a positive success delta means it improved accuracy. Negative deltas are reported as measured, not suppressed. Δ eff_TCoT is n/a wherever either side is infinite (0% success).

Provider / Model	Off eff_TCoT	On eff_TCoT	Δ eff_TCoT	Off success	On success	Δ success
anthropic/claude-haiku-4-5 (standard)	$0.00053	$0.00120	+127.4%	66.7%	66.7%	+0.0 pp
anthropic/claude-opus-4-7 (standard)	infinite	infinite	n/a	0.0%	0.0%	+0.0 pp
anthropic/claude-sonnet-4-6 (standard)	$0.00142	$0.00323	+128.3%	66.7%	66.7%	+0.0 pp
google/gemini-2.5-flash (standard)	$0.00012	$0.00028	+130.3%	66.7%	66.7%	+0.0 pp
google/gemini-2.5-flash-lite (standard)	$0.00004	$0.00009	+123.1%	66.7%	66.7%	+0.0 pp
google/gemini-2.5-pro (standard)	$0.00046	$0.00114	+148.2%	83.3%	66.7%	-16.7 pp
openai/gpt-4o (standard)	$0.00090	$0.00208	+131.8%	66.7%	66.7%	+0.0 pp
openai/gpt-4o-mini (standard)	$0.00006	$0.00013	+125.4%	66.7%	66.7%	+0.0 pp
openai/o3 (reasoning)	$0.00661	$0.01409	+113.1%	66.7%	66.7%	+0.0 pp
openai/o3-mini (reasoning)	$0.00289	$0.00506	+75.4%	66.7%	66.7%	+0.0 pp
openai/o4-mini (reasoning)	$0.00314	$0.00720	+128.9%	66.7%	66.7%	+0.0 pp
openrouter/cohere/command-r-plus (standard)	infinite	infinite	n/a	0.0%	0.0%	+0.0 pp
openrouter/deepseek/deepseek-chat (standard)	$0.00010	$0.00024	+128.1%	66.7%	66.7%	+0.0 pp
openrouter/deepseek/deepseek-r1 (reasoning)	$0.00266	$0.00613	+130.2%	66.7%	66.7%	+0.0 pp
openrouter/meta-llama/llama-3.3-70b-instruct (standard)	$0.00015	$0.00058	+288.8%	66.7%	50.0%	-16.7 pp
openrouter/meta-llama/llama-4-maverick (standard)	$0.00007	$0.00016	+123.6%	66.7%	66.7%	+0.0 pp
openrouter/meta-llama/llama-4-scout (standard)	$0.00004	$0.00009	+91.0%	66.7%	66.7%	+0.0 pp
openrouter/mistralai/mistral-large (standard)	$0.00058	$0.00169	+190.6%	83.3%	66.7%	-16.7 pp
openrouter/qwen/qwen-3-235b-a22b (standard)	infinite	infinite	n/a	0.0%	0.0%	+0.0 pp
perplexity/sonar (search)	$0.00033	$0.00079	+136.8%	66.7%	66.7%	+0.0 pp
perplexity/sonar-pro (search)	$0.00149	$0.00252	+69.6%	66.7%	66.7%	+0.0 pp
perplexity/sonar-reasoning (reasoning)	infinite	infinite	n/a	0.0%	0.0%	+0.0 pp
perplexity/sonar-reasoning-pro (reasoning)	infinite	$0.00168	n/a	0.0%	66.7%	+66.7 pp
xai/grok-3 (standard)	$0.00037	$0.00472	+1182.1%	100.0%	66.7%	-33.3 pp
xai/grok-3-mini (standard)	$0.00011	$0.00043	+297.4%	66.7%	66.7%	+0.0 pp
xai/grok-4 (standard)	$0.00629	$0.00454	-27.9%	66.7%	66.7%	+0.0 pp
xai/grok-4-fast (standard)	$0.00015	$0.00030	+102.4%	66.7%	66.7%	+0.0 pp
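
The delta columns above can be approximated with a small helper. This is a minimal sketch, not the benchmark's own code: the function names `parse_cost`, `cost_delta`, and `success_delta` are hypothetical, and the formulas (relative change for cost, absolute percentage-point change for success, n/a when either cost is infinite) are assumptions inferred from the table.

```python
import math


def parse_cost(cell: str) -> float:
    """Parse an eff_TCoT cell such as '$0.00053' or 'infinite' (hypothetical helper)."""
    cell = cell.strip()
    if cell == "infinite":
        return math.inf
    return float(cell.lstrip("$"))


def cost_delta(off: float, on: float) -> str:
    """Relative cost change in percent; 'n/a' if either side is infinite."""
    if math.isinf(off) or math.isinf(on):
        return "n/a"
    return f"{(on - off) / off * 100:+.1f}%"


def success_delta(off_pct: float, on_pct: float) -> str:
    """Absolute change in success rate, in percentage points."""
    return f"{on_pct - off_pct:+.1f} pp"


if __name__ == "__main__":
    # perplexity/sonar-reasoning-pro: infinite off-cost, finite on-cost
    print(cost_delta(parse_cost("infinite"), parse_cost("$0.00168")))
    # google/gemini-2.5-pro: 5/6 successes off, 4/6 successes on
    print(success_delta(5 / 6 * 100, 4 / 6 * 100))
```

Recomputing from the rounded dollar figures shown in the table will not always match the printed percentages (for example, llama-4-scout shows $0.00004 and $0.00009 but +91.0%), which suggests the published deltas are derived from unrounded costs.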

Reproducibility (latest clean critique-off pass vs previous clean pass)

Provider / Model	Latest eff_TCoT	Previous eff_TCoT	Δ eff_TCoT	Δ success rate
anthropic/claude-haiku-4-5	$0.00053	$0.00057	-7.3%	+0.0 pp
anthropic/claude-opus-4-7	n/a	n/a	n/a	+0.0 pp
anthropic/claude-sonnet-4-6	$0.00142	$0.00174	-18.5%	+6.7 pp
google/gemini-2.5-flash	$0.00012	$0.00012	+2.5%	-6.7 pp
google/gemini-2.5-flash-lite	$0.00004	$0.00004	-13.4%	+6.7 pp
google/gemini-2.5-pro	$0.00046	$0.00049	-5.8%	+10.0 pp
openai/gpt-4o	$0.00090	$0.00110	-18.1%	+6.7 pp
openai/gpt-4o-mini	$0.00006	$0.00007	-18.5%	+6.7 pp

Failure-mode totals across all passes

Counts per failure mode, summed over every attempt in every pass. Useful for spotting which providers systematically exhibit which failure patterns.

Provider / Model	Failure modes (total counts)
google/gemini-2.5-flash-lite	48 confabulation, 1 error
openrouter/meta-llama/llama-4-scout	12 confabulation, 2 schema_break
openai/gpt-4o-mini	48 confabulation
openrouter/meta-llama/llama-4-maverick	12 confabulation
openrouter/deepseek/deepseek-chat	12 confabulation
google/gemini-2.5-flash	28 confabulation
xai/grok-4-fast	12 confabulation
xai/grok-3-mini	12 confabulation
openrouter/meta-llama/llama-3.3-70b-instruct	15 confabulation
perplexity/sonar	12 confabulation
google/gemini-2.5-pro	19 confabulation, 8 schema_break
anthropic/claude-haiku-4-5	44 confabulation
perplexity/sonar-reasoning-pro	24 confabulation
openrouter/mistralai/mistral-large	11 confabulation
openai/gpt-4o	48 confabulation
perplexity/sonar-pro	12 confabulation
anthropic/claude-sonnet-4-6	48 confabulation
xai/grok-4	12 confabulation
xai/grok-3	6 confabulation
openai/o3-mini	12 confabulation
openrouter/deepseek/deepseek-r1	12 confabulation
openai/o4-mini	12 confabulation
openai/o3	12 confabulation
anthropic/claude-opus-4-7	126 error
openrouter/cohere/command-r-plus	36 error
openrouter/qwen/qwen-3-235b-a22b	36 error
perplexity/sonar-reasoning	36 error
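
The totals above can be cross-checked against the per-pass "Failure modes (this pass)" cells. The sketch below is one way to do that, assuming each cell is a run of count-plus-label pairs; `parse_failure_cell` and `total_failures` are hypothetical names, not part of the benchmark tooling. The regular expression accepts the count with or without a space before the label.

```python
import re
from collections import Counter

# A count followed by a lowercase label, e.g. "4schema_break" or "4 schema_break".
CELL_RE = re.compile(r"(\d+)\s*([a-z_]+)")


def parse_failure_cell(cell: str) -> Counter:
    """Turn one per-pass cell into {failure_mode: count}; 'none' yields an empty Counter."""
    counts = Counter()
    for number, mode in CELL_RE.findall(cell):
        counts[mode] += int(number)
    return counts


def total_failures(cells) -> Counter:
    """Sum failure-mode counts over all passes for one model."""
    total = Counter()
    for cell in cells:
        total += parse_failure_cell(cell)
    return total


if __name__ == "__main__":
    # google/gemini-2.5-pro cells from the three passes it appears in
    per_pass = [
        "4schema_break 2confabulation",
        "5confabulation",
        "12confabulation 4schema_break",
    ]
    print(dict(total_failures(per_pass)))
```

For google/gemini-2.5-pro this sums to 19 confabulation and 8 schema_break, which agrees with its row in the totals table.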

All bench passes (newest first)

Pass 2026-05-16 16:31:49 UTC 4e158ca clean critique-pass (s13)

#	Provider / Model	Success	effective_TCoT	Latency p50 / p95	Attempts	Failure modes (this pass)
1	google/gemini-2.5-flash-lite (standard)	66.7% (4/6)	$0.00009±$0.00000	0.99s / 1.23s	10 (67% 1st-try)	6 confabulation
2	openrouter/meta-llama/llama-4-scout (standard)	66.7% (4/6)	$0.00009±$0.00000	0.43s / 0.77s	10 (67% 1st-try)	6 confabulation
3	openai/gpt-4o-mini (standard)	66.7% (4/6)	$0.00013±$0.00000	1.13s / 1.66s	10 (67% 1st-try)	6 confabulation
4	openrouter/meta-llama/llama-4-maverick (standard)	66.7% (4/6)	$0.00016±$0.00000	1.07s / 2.01s	10 (67% 1st-try)	6 confabulation
5	openrouter/deepseek/deepseek-chat (standard)	66.7% (4/6)	$0.00024±$0.00000	2.54s / 8.29s	10 (67% 1st-try)	6 confabulation
6	google/gemini-2.5-flash (standard)	66.7% (4/6)	$0.00028±$0.00001	2.02s / 19.72s	10 (67% 1st-try)	6 confabulation
7	xai/grok-4-fast (standard)	66.7% (4/6)	$0.00030±$0.00000	6.05s / 20.94s	10 (67% 1st-try)	6 confabulation
8	xai/grok-3-mini (standard)	66.7% (4/6)	$0.00043±$0.00000	6.21s / 11.95s	10 (67% 1st-try)	6 confabulation
9	openrouter/meta-llama/llama-3.3-70b-instruct (standard)	50.0% (3/6)	$0.00058±$0.00001	1.17s / 8.35s	12 (50% 1st-try)	9 confabulation
10	perplexity/sonar (search)	66.7% (4/6)	$0.00079±$0.00001	3.74s / 3.96s	10 (67% 1st-try)	6 confabulation
11	google/gemini-2.5-pro (standard)	66.7% (4/6)	$0.00114±$0.00002	9.77s / 31.75s	10 (67% 1st-try)	4 schema_break, 2 confabulation
12	anthropic/claude-haiku-4-5 (standard)	66.7% (4/6)	$0.00120±$0.00000	1.25s / 1.34s	10 (67% 1st-try)	6 confabulation
13	perplexity/sonar-reasoning-pro (reasoning)	66.7% (4/6)	$0.00168±$0.00001	4.62s / 7.54s	10 (67% 1st-try)	6 confabulation
14	openrouter/mistralai/mistral-large (standard)	66.7% (4/6)	$0.00169±$0.00001	1.09s / 1.46s	10 (67% 1st-try)	6 confabulation
15	openai/gpt-4o (standard)	66.7% (4/6)	$0.00208±$0.00001	1.10s / 6.71s	10 (67% 1st-try)	6 confabulation
16	perplexity/sonar-pro (search)	66.7% (4/6)	$0.00252±$0.00002	3.92s / 5.44s	10 (67% 1st-try)	6 confabulation
17	anthropic/claude-sonnet-4-6 (standard)	66.7% (4/6)	$0.00323±$0.00001	2.12s / 2.40s	10 (67% 1st-try)	6 confabulation
18	xai/grok-4 (standard)	66.7% (4/6)	$0.00454±$0.00001	5.73s / 15.56s	10 (67% 1st-try)	6 confabulation
19	xai/grok-3 (standard)	66.7% (4/6)	$0.00472±$0.00001	7.14s / 16.23s	10 (67% 1st-try)	6 confabulation
20	openai/o3-mini (reasoning)	66.7% (4/6)	$0.00506±$0.00047	3.92s / 14.39s	10 (67% 1st-try)	6 confabulation
21	openrouter/deepseek/deepseek-r1 (reasoning)	66.7% (4/6)	$0.00613±$0.00015	25.88s / 167.43s	10 (67% 1st-try)	6 confabulation
22	openai/o4-mini (reasoning)	66.7% (4/6)	$0.00720±$0.00040	3.53s / 8.92s	10 (67% 1st-try)	6 confabulation
23	openai/o3 (reasoning)	66.7% (4/6)	$0.01409±$0.00026	2.39s / 17.25s	10 (67% 1st-try)	6 confabulation
24	anthropic/claude-opus-4-7 (standard)	0.0% (0/6)	infinite	0.11s / 0.31s	18 (0% 1st-try)	18 error
25	openrouter/cohere/command-r-plus (standard)	0.0% (0/6)	infinite	0.03s / 0.10s	18 (0% 1st-try)	18 error
26	openrouter/qwen/qwen-3-235b-a22b (standard)	0.0% (0/6)	infinite	0.04s / 0.11s	18 (0% 1st-try)	18 error
27	perplexity/sonar-reasoning (reasoning)	0.0% (0/6)	infinite	0.07s / 0.15s	18 (0% 1st-try)	18 error
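
The Success, Attempts, and Failure modes columns in the table above are mutually consistent under a simple retry model. The sketch below illustrates that model; it is an inference from the numbers, not documented behaviour. It assumes a cap of three attempts per task (suggested by the 18-attempt rows for 0/6 models), that every unsuccessful attempt is counted as one failure, and that the first-try percentage is taken over tasks. `MAX_ATTEMPTS` and `summarize` are hypothetical names.

```python
MAX_ATTEMPTS = 3  # assumption: inferred from 0/6 rows showing 18 attempts


def summarize(attempts_to_success):
    """Summarize one model's pass.

    attempts_to_success holds, per task, the 1-based attempt that succeeded,
    or None if the task never succeeded within MAX_ATTEMPTS.
    """
    tasks = len(attempts_to_success)
    successes = sum(a is not None for a in attempts_to_success)
    attempts = sum(a if a is not None else MAX_ATTEMPTS for a in attempts_to_success)
    first_try = sum(a == 1 for a in attempts_to_success)
    return {
        "success": f"{successes / tasks:.1%} ({successes}/{tasks})",
        "attempts": f"{attempts} ({first_try / tasks:.0%} 1st-try)",
        "failures": attempts - successes,  # every non-successful attempt
    }


if __name__ == "__main__":
    # Four tasks pass on the first try, two never pass.
    print(summarize([1, 1, 1, 1, None, None]))
    # Three tasks pass on the first try, three never pass.
    print(summarize([1, 1, 1, None, None, None]))
```

The first call reproduces the common "66.7% (4/6)", "10 (67% 1st-try)", 6-failure pattern; the second reproduces the llama-3.3-70b-instruct row with 12 attempts and 9 failures.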

Pass 2026-05-08 15:11:05 UTC 46ce0a5 clean

#	Provider / Model	Success	effective_TCoT	Latency p50 / p95	Attempts	Failure modes (this pass)
1	google/gemini-2.5-flash-lite (standard)	66.7% (4/6)	$0.00004±$0.00000	6.13s / 18.96s	11 (50% 1st-try)	6 confabulation, 1 error
2	openrouter/meta-llama/llama-4-scout (standard)	66.7% (4/6)	$0.00004±$0.00001	0.41s / 0.58s	12 (33% 1st-try)	6 confabulation, 2 schema_break
3	openai/gpt-4o-mini (standard)	66.7% (4/6)	$0.00006±$0.00000	0.62s / 0.76s	10 (67% 1st-try)	6 confabulation
4	openrouter/meta-llama/llama-4-maverick (standard)	66.7% (4/6)	$0.00007±$0.00001	0.85s / 3.69s	10 (67% 1st-try)	6 confabulation
5	openrouter/deepseek/deepseek-chat (standard)	66.7% (4/6)	$0.00010±$0.00000	0.93s / 1.49s	10 (67% 1st-try)	6 confabulation
6	xai/grok-3-mini (standard)	66.7% (4/6)	$0.00011±$0.00000	12.55s / 16.70s	10 (67% 1st-try)	6 confabulation
7	google/gemini-2.5-flash (standard)	66.7% (4/6)	$0.00012±$0.00000	1.03s / 10.84s	10 (67% 1st-try)	6 confabulation
8	xai/grok-4-fast (standard)	66.7% (4/6)	$0.00015±$0.00000	1.83s / 7.16s	10 (67% 1st-try)	6 confabulation
9	openrouter/meta-llama/llama-3.3-70b-instruct (standard)	66.7% (4/6)	$0.00015±$0.00000	0.57s / 4.98s	10 (67% 1st-try)	6 confabulation
10	perplexity/sonar (search)	66.7% (4/6)	$0.00033±$0.00000	1.42s / 1.75s	10 (67% 1st-try)	6 confabulation
11	xai/grok-3 (standard)	100.0% (6/6)	$0.00037±$0.00002	0.90s / 2.67s	6 (100% 1st-try)	none
12	google/gemini-2.5-pro (standard)	83.3% (5/6)	$0.00046±$0.00020	6.97s / 13.99s	10 (67% 1st-try)	5 confabulation
13	anthropic/claude-haiku-4-5 (standard)	66.7% (4/6)	$0.00053±$0.00000	0.67s / 0.79s	10 (67% 1st-try)	6 confabulation
14	openrouter/mistralai/mistral-large (standard)	83.3% (5/6)	$0.00058±$0.00031	0.62s / 2.50s	10 (67% 1st-try)	5 confabulation
15	openai/gpt-4o (standard)	66.7% (4/6)	$0.00090±$0.00001	0.77s / 1.69s	10 (67% 1st-try)	6 confabulation
16	anthropic/claude-sonnet-4-6 (standard)	66.7% (4/6)	$0.00142±$0.00001	1.02s / 1.52s	10 (67% 1st-try)	6 confabulation
17	perplexity/sonar-pro (search)	66.7% (4/6)	$0.00149±$0.00001	1.53s / 2.82s	10 (67% 1st-try)	6 confabulation
18	openrouter/deepseek/deepseek-r1 (reasoning)	66.7% (4/6)	$0.00266±$0.00007	10.87s / 68.12s	10 (67% 1st-try)	6 confabulation
19	openai/o3-mini (reasoning)	66.7% (4/6)	$0.00289±$0.00028	2.11s / 5.38s	10 (67% 1st-try)	6 confabulation
20	openai/o4-mini (reasoning)	66.7% (4/6)	$0.00314±$0.00027	2.13s / 7.69s	10 (67% 1st-try)	6 confabulation
21	xai/grok-4 (standard)	66.7% (4/6)	$0.00629±$0.00001	5.10s / 84.15s	10 (67% 1st-try)	6 confabulation
22	openai/o3 (reasoning)	66.7% (4/6)	$0.00661±$0.00000	3.80s / 7.88s	10 (67% 1st-try)	6 confabulation
23	anthropic/claude-opus-4-7 (standard)	0.0% (0/6)	infinite	0.12s / 0.51s	18 (0% 1st-try)	18 error
24	openrouter/cohere/command-r-plus (standard)	0.0% (0/6)	infinite	0.02s / 0.08s	18 (0% 1st-try)	18 error
25	openrouter/qwen/qwen-3-235b-a22b (standard)	0.0% (0/6)	infinite	0.02s / 0.10s	18 (0% 1st-try)	18 error
26	perplexity/sonar-reasoning-pro (reasoning)	0.0% (0/6)	infinite	7.37s / 11.67s	18 (0% 1st-try)	18 confabulation
27	perplexity/sonar-reasoning (reasoning)	0.0% (0/6)	infinite	0.07s / 0.12s	18 (0% 1st-try)	18 error

Pass 2026-05-08 01:46:18 UTC e6e987c clean

#	Provider / Model	Success	effective_TCoT	Latency p50 / p95	Attempts	Failure modes (this pass)
1	google/gemini-2.5-flash-lite (standard)	60.0% (9/15)	$0.00004±$0.00000	0.50s / 0.73s	27 (60% 1st-try)	18 confabulation
2	openai/gpt-4o-mini (standard)	60.0% (9/15)	$0.00007±$0.00000	0.67s / 1.34s	27 (60% 1st-try)	18 confabulation
3	google/gemini-2.5-flash (standard)	73.3% (11/15)	$0.00012±$0.00004	1.15s / 10.89s	27 (60% 1st-try)	16 confabulation
4	google/gemini-2.5-pro (standard)	73.3% (11/15)	$0.00049±$0.00018	6.45s / 18.34s	27 (60% 1st-try)	12 confabulation, 4 schema_break
5	anthropic/claude-haiku-4-5 (standard)	66.7% (10/15)	$0.00057±$0.00013	0.70s / 1.51s	27 (60% 1st-try)	17 confabulation
6	openai/gpt-4o (standard)	60.0% (9/15)	$0.00110±$0.00001	0.51s / 0.73s	27 (60% 1st-try)	18 confabulation
7	anthropic/claude-sonnet-4-6 (standard)	60.0% (9/15)	$0.00174±$0.00000	1.08s / 3.36s	27 (60% 1st-try)	18 confabulation
8	anthropic/claude-opus-4-7 (standard)	0.0% (0/15)	infinite	0.13s / 0.41s	45 (0% 1st-try)	45 error

Pass 2026-05-08 01:28:01 UTC d95fbc2 clean

#	Provider / Model	Success	effective_TCoT	Latency p50 / p95	Attempts	Failure modes (this pass)
1	google/gemini-2.5-flash-lite (standard)	60.0% (9/15)	$0.00004	0.49s / 0.80s	27 (60% 1st-try)	18 confabulation
2	openai/gpt-4o-mini (standard)	60.0% (9/15)	$0.00007	0.61s / 0.88s	27 (60% 1st-try)	18 confabulation
3	anthropic/claude-haiku-4-5 (standard)	80.0% (12/15)	$0.00044	0.61s / 1.28s	27 (60% 1st-try)	15 confabulation
4	openai/gpt-4o (standard)	60.0% (9/15)	$0.00110	0.49s / 0.81s	27 (60% 1st-try)	18 confabulation
5	anthropic/claude-sonnet-4-6 (standard)	60.0% (9/15)	$0.00174	1.11s / 1.91s	27 (60% 1st-try)	18 confabulation
6	anthropic/claude-opus-4-7 (standard)	0.0% (0/15)	infinite	0.11s / 0.27s	45 (0% 1st-try)	45 error