bellwether  ›  structured_extraction

Critique-Pass delta (latest clean off vs latest clean on)

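Per-model deltas between the critique-off baseline and the critique-on track, computed per METHODOLOGY s13.3. A positive cost delta means the critique pass made the run more expensive; a positive success delta means it improved accuracy. Negative deltas are reported as measured. Δ eff_TCoT is n/a wherever either side is infinite, which happens when a model had no successful extractions.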
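The two delta columns can be read as a relative change in cost and an absolute change in success rate. The sketch below assumes that conventional reading; METHODOLOGY s13.3 is not reproduced here, and the function names are hypothetical. The published percentages appear to be computed from unrounded costs, so feeding in the rounded table values will not reproduce every row exactly.

```python
import math

def cost_delta_pct(off_cost, on_cost):
    """Relative cost change in percent; None when it is undefined."""
    # Assumption: an infinite (or zero) cost on either side yields "n/a".
    if math.isinf(off_cost) or math.isinf(on_cost) or off_cost == 0:
        return None
    return (on_cost - off_cost) / off_cost * 100.0

def success_delta_pp(off_rate, on_rate):
    """Absolute change in success rate, in percentage points."""
    return (on_rate - off_rate) * 100.0

def format_deltas(off_cost, on_cost, off_rate, on_rate):
    """Render both deltas the way the table prints them."""
    pct = cost_delta_pct(off_cost, on_cost)
    cost_text = "n/a" if pct is None else f"{pct:+.1f}%"
    return cost_text, f"{success_delta_pp(off_rate, on_rate):+.1f} pp"

# perplexity/sonar-pro: cheaper and more accurate with the critique pass on.
print(format_deltas(0.00257, 0.00193, 5 / 6, 6 / 6))

# perplexity/sonar-reasoning-pro: no successes with critique off, so cost is n/a.
print(format_deltas(math.inf, 0.00110, 0 / 6, 6 / 6))
```

Percentage points are used for success because a relative change is undefined when the baseline success rate is 0%.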

Provider / Model | Off eff_TCoT | On eff_TCoT | Δ eff_TCoT | Off success | On success | Δ success
anthropic/claude-haiku-4-5 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
anthropic/claude-opus-4-7 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
anthropic/claude-sonnet-4-6 (standard) | $0.00108 | $0.00223 | +106.5% | 100.0% | 100.0% | +0.0 pp
google/gemini-2.5-flash (standard) | $0.00120 | $0.00154 | +27.5% | 50.0% | 83.3% | +33.3 pp
google/gemini-2.5-flash-lite (standard) | $0.00011 | $0.00034 | +225.3% | 100.0% | 66.7% | -33.3 pp
google/gemini-2.5-pro (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
openai/gpt-4o (standard) | $0.00069 | $0.00150 | +116.4% | 100.0% | 100.0% | +0.0 pp
openai/gpt-4o-mini (standard) | $0.00004 | $0.00010 | +130.2% | 100.0% | 100.0% | +0.0 pp
openai/o3 (reasoning) | $0.00097 | $0.00277 | +186.1% | 100.0% | 100.0% | +0.0 pp
openai/o3-mini (reasoning) | $0.00074 | $0.00132 | +79.5% | 100.0% | 100.0% | +0.0 pp
openai/o4-mini (reasoning) | $0.00081 | $0.00199 | +144.8% | 100.0% | 100.0% | +0.0 pp
openrouter/cohere/command-r-plus (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
openrouter/deepseek/deepseek-chat (standard) | $0.00009 | $0.00017 | +91.0% | 100.0% | 100.0% | +0.0 pp
openrouter/deepseek/deepseek-r1 (reasoning) | $0.00114 | $0.00214 | +87.5% | 100.0% | 100.0% | +0.0 pp
openrouter/meta-llama/llama-3.3-70b-instruct (standard) | $0.00007 | $0.00016 | +137.3% | 100.0% | 100.0% | +0.0 pp
openrouter/meta-llama/llama-4-maverick (standard) | $0.00005 | $0.00010 | +123.5% | 100.0% | 100.0% | +0.0 pp
openrouter/meta-llama/llama-4-scout (standard) | $0.00006 | $0.00012 | +79.7% | 83.3% | 100.0% | +16.7 pp
openrouter/mistralai/mistral-large (standard) | $0.00098 | $0.00136 | +39.4% | 100.0% | 100.0% | +0.0 pp
openrouter/qwen/qwen-3-235b-a22b (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
perplexity/sonar (search) | $0.00048 | $0.00036 | -24.9% | 100.0% | 100.0% | +0.0 pp
perplexity/sonar-pro (search) | $0.00257 | $0.00193 | -24.9% | 83.3% | 100.0% | +16.7 pp
perplexity/sonar-reasoning (reasoning) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
perplexity/sonar-reasoning-pro (reasoning) | infinite | $0.00110 | n/a | 0.0% | 100.0% | +100.0 pp
xai/grok-3 (standard) | $0.00102 | $0.00305 | +199.7% | 100.0% | 100.0% | +0.0 pp
xai/grok-3-mini (standard) | $0.00006 | $0.00021 | +280.0% | 100.0% | 100.0% | +0.0 pp
xai/grok-4 (standard) | $0.00286 | $0.00305 | +6.7% | 100.0% | 100.0% | +0.0 pp
xai/grok-4-fast (standard) | $0.00007 | $0.00016 | +130.2% | 100.0% | 100.0% | +0.0 pp

Reproducibility (latest pass vs previous)

Provider / Model | Latest eff_TCoT | Previous eff_TCoT | Δ eff_TCoT | Δ success rate
anthropic/claude-haiku-4-5 | n/a | n/a | n/a | +0.0 pp
anthropic/claude-opus-4-7 | n/a | n/a | n/a | +0.0 pp
anthropic/claude-sonnet-4-6 | $0.00108 | $0.00099 | +9.5% | +0.0 pp
google/gemini-2.5-flash | $0.00120 | $0.00076 | +59.4% | -23.3 pp
google/gemini-2.5-flash-lite | $0.00011 | $0.00011 | -3.6% | +0.0 pp
google/gemini-2.5-pro | n/a | n/a | n/a | +0.0 pp
openai/gpt-4o | $0.00069 | $0.00068 | +2.6% | +0.0 pp
openai/gpt-4o-mini | $0.00004 | $0.00004 | +0.0% | +0.0 pp

Failure-mode totals across all passes

Counts per failure mode, summed over every attempt in every pass. Useful for spotting which providers systematically exhibit which failure patterns.
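The summation described above amounts to merging per-pass counters model by model. The sketch below is one way to do that, assuming each pass is available as a mapping from model name to failure-mode counts; that data shape and the function names are hypothetical, not the benchmark's actual storage format. The sample counts are taken from the two most recent passes listed on this page.

```python
from collections import Counter, defaultdict

def total_failure_modes(passes):
    """Sum failure-mode counts per model across all passes."""
    totals = defaultdict(Counter)
    for per_model in passes:
        for model, modes in per_model.items():
            totals[model].update(modes)  # Counter.update adds counts
    return totals

def format_modes(counter):
    """Render counts as '123 schema_break 4 error', largest first."""
    if not counter:
        return "none"
    return " ".join(f"{n} {mode}" for mode, n in counter.most_common())

# Two passes only (2026-05-16 critique pass and 2026-05-08 15:11:05 UTC).
passes = [
    {
        "openrouter/meta-llama/llama-4-scout": {"schema_break": 4},
        "google/gemini-2.5-flash-lite": {"schema_break": 9, "error": 2},
    },
    {
        "openrouter/meta-llama/llama-4-scout": {"schema_break": 6},
        "google/gemini-2.5-flash-lite": {"schema_break": 8},
    },
]

totals = total_failure_modes(passes)
for model, counter in totals.items():
    print(model, format_modes(counter))
```

With all seven passes as input, the same merge would give the totals in the table that follows; the two-pass sample already accounts for the full llama-4-scout count.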

Provider / Model | Failure modes (total counts)
openrouter/meta-llama/llama-4-scout | 10 schema_break
google/gemini-2.5-flash-lite | 123 schema_break, 4 error
google/gemini-2.5-flash | 76 schema_break
anthropic/claude-haiku-4-5 | 126 schema_break
anthropic/claude-opus-4-7 | 126 error
google/gemini-2.5-pro | 126 schema_break
openrouter/cohere/command-r-plus | 36 error
openrouter/qwen/qwen-3-235b-a22b | 36 error
perplexity/sonar-reasoning | 36 error
openrouter/deepseek/deepseek-chat | 1 schema_break
perplexity/sonar | 7 schema_break
openrouter/mistralai/mistral-large | 3 schema_break
anthropic/claude-sonnet-4-6 | 2 error
perplexity/sonar-pro | 6 schema_break
perplexity/sonar-reasoning-pro | 18 schema_break

All bench passes (chronological, newest first)

Pass 2026-05-16 16:31:49 UTC 4e158ca clean critique-pass (s13)
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | openai/gpt-4o-mini (standard) | 100.0% (6/6) | $0.00010 | 3.21s / 7.87s | 6 (100% 1st-try) | none
2 | openrouter/meta-llama/llama-4-maverick (standard) | 100.0% (6/6) | $0.00010 ± $0.00000 | 4.77s / 5.84s | 6 (100% 1st-try) | none
3 | openrouter/meta-llama/llama-4-scout (standard) | 100.0% (6/6) | $0.00012 ± $0.00007 | 1.92s / 2.27s | 10 (50% 1st-try) | 4 schema_break
4 | xai/grok-4-fast (standard) | 100.0% (6/6) | $0.00016 | 7.33s / 10.07s | 6 (100% 1st-try) | none
5 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 100.0% (6/6) | $0.00016 ± $0.00001 | 3.07s / 7.90s | 6 (100% 1st-try) | none
6 | openrouter/deepseek/deepseek-chat (standard) | 100.0% (6/6) | $0.00017 ± $0.00001 | 4.52s / 6.01s | 6 (100% 1st-try) | none
7 | xai/grok-3-mini (standard) | 100.0% (6/6) | $0.00021 | 7.14s / 178.48s | 6 (100% 1st-try) | none
8 | google/gemini-2.5-flash-lite (standard) | 66.7% (4/6) | $0.00034 ± $0.00007 | 7.81s / 22.92s | 15 (0% 1st-try) | 9 schema_break, 2 error
9 | perplexity/sonar (search) | 100.0% (6/6) | $0.00036 ± $0.00002 | 4.25s / 4.91s | 6 (100% 1st-try) | none
10 | perplexity/sonar-reasoning-pro (reasoning) | 100.0% (6/6) | $0.00110 ± $0.00001 | 5.38s / 6.08s | 6 (100% 1st-try) | none
11 | openai/o3-mini (reasoning) | 100.0% (6/6) | $0.00132 ± $0.00026 | 3.15s / 3.60s | 6 (100% 1st-try) | none
12 | openrouter/mistralai/mistral-large (standard) | 100.0% (6/6) | $0.00136 ± $0.00004 | 2.47s / 3.18s | 6 (100% 1st-try) | none
13 | openai/gpt-4o (standard) | 100.0% (6/6) | $0.00150 ± $0.00015 | 2.41s / 5.04s | 6 (100% 1st-try) | none
14 | google/gemini-2.5-flash (standard) | 83.3% (5/6) | $0.00154 ± $0.00031 | 5.39s / 7.06s | 16 (0% 1st-try) | 11 schema_break
15 | perplexity/sonar-pro (search) | 100.0% (6/6) | $0.00193 ± $0.00019 | 4.30s / 4.77s | 6 (100% 1st-try) | none
16 | openai/o4-mini (reasoning) | 100.0% (6/6) | $0.00199 ± $0.00015 | 3.14s / 3.42s | 6 (100% 1st-try) | none
17 | openrouter/deepseek/deepseek-r1 (reasoning) | 100.0% (6/6) | $0.00214 ± $0.00154 | 26.19s / 193.73s | 6 (100% 1st-try) | none
18 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (6/6) | $0.00223 ± $0.00002 | 2.47s / 2.55s | 6 (100% 1st-try) | none
19 | openai/o3 (reasoning) | 100.0% (6/6) | $0.00277 ± $0.00053 | 4.36s / 5.01s | 6 (100% 1st-try) | none
20 | xai/grok-3 (standard) | 100.0% (6/6) | $0.00305 | 5.82s / 145.38s | 6 (100% 1st-try) | none
21 | xai/grok-4 (standard) | 100.0% (6/6) | $0.00305 | 11.97s / 70.09s | 6 (100% 1st-try) | none
22 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/6) | infinite | 1.41s / 2.13s | 18 (0% 1st-try) | 18 schema_break
23 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | infinite | 0.12s / 0.17s | 18 (0% 1st-try) | 18 error
24 | google/gemini-2.5-pro (standard) | 0.0% (0/6) | infinite | 14.60s / 17.72s | 18 (0% 1st-try) | 18 schema_break
25 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | infinite | 0.03s / 0.09s | 18 (0% 1st-try) | 18 error
26 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | infinite | 0.02s / 0.10s | 18 (0% 1st-try) | 18 error
27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | infinite | 0.06s / 0.13s | 18 (0% 1st-try) | 18 error
Pass 2026-05-08 15:11:05 UTC 46ce0a5 clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | openai/gpt-4o-mini (standard) | 100.0% (6/6) | $0.00004 | 1.45s / 1.58s | 6 (100% 1st-try) | none
2 | openrouter/meta-llama/llama-4-maverick (standard) | 100.0% (6/6) | $0.00005 ± $0.00000 | 2.38s / 5.55s | 6 (100% 1st-try) | none
3 | xai/grok-3-mini (standard) | 100.0% (6/6) | $0.00006 | 7.67s / 8.35s | 6 (100% 1st-try) | none
4 | openrouter/meta-llama/llama-4-scout (standard) | 83.3% (5/6) | $0.00006 ± $0.00002 | 1.17s / 2.18s | 11 (33% 1st-try) | 6 schema_break
5 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 100.0% (6/6) | $0.00007 ± $0.00001 | 1.25s / 1.86s | 6 (100% 1st-try) | none
6 | xai/grok-4-fast (standard) | 100.0% (6/6) | $0.00007 | 2.52s / 3.28s | 6 (100% 1st-try) | none
7 | openrouter/deepseek/deepseek-chat (standard) | 100.0% (6/6) | $0.00009 ± $0.00005 | 2.62s / 10.68s | 7 (83% 1st-try) | 1 schema_break
8 | google/gemini-2.5-flash-lite (standard) | 100.0% (6/6) | $0.00011 ± $0.00003 | 6.92s / 17.68s | 14 (0% 1st-try) | 8 schema_break
9 | perplexity/sonar (search) | 100.0% (6/6) | $0.00048 ± $0.00029 | 1.92s / 2.73s | 13 (33% 1st-try) | 7 schema_break
10 | openai/gpt-4o (standard) | 100.0% (6/6) | $0.00069 ± $0.00007 | 1.07s / 2.00s | 6 (100% 1st-try) | none
11 | openai/o3-mini (reasoning) | 100.0% (6/6) | $0.00074 ± $0.00014 | 4.15s / 14.22s | 6 (100% 1st-try) | none
12 | openai/o4-mini (reasoning) | 100.0% (6/6) | $0.00081 ± $0.00015 | 1.73s / 11.97s | 6 (100% 1st-try) | none
13 | openai/o3 (reasoning) | 100.0% (6/6) | $0.00097 ± $0.00042 | 2.04s / 3.10s | 6 (100% 1st-try) | none
14 | openrouter/mistralai/mistral-large (standard) | 100.0% (6/6) | $0.00098 ± $0.00044 | 1.19s / 1.40s | 9 (50% 1st-try) | 3 schema_break
15 | xai/grok-3 (standard) | 100.0% (6/6) | $0.00102 | 15.30s / 46.68s | 6 (100% 1st-try) | none
16 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (6/6) | $0.00108 ± $0.00023 | 8.80s / 26.46s | 8 (83% 1st-try) | 2 error
17 | openrouter/deepseek/deepseek-r1 (reasoning) | 100.0% (6/6) | $0.00114 ± $0.00048 | 17.26s / 33.41s | 6 (100% 1st-try) | none
18 | google/gemini-2.5-flash (standard) | 50.0% (3/6) | $0.00120 ± $0.00015 | 2.31s / 3.21s | 16 (0% 1st-try) | 13 schema_break
19 | perplexity/sonar-pro (search) | 83.3% (5/6) | $0.00257 ± $0.00139 | 1.98s / 2.50s | 11 (50% 1st-try) | 6 schema_break
20 | xai/grok-4 (standard) | 100.0% (6/6) | $0.00286 | 6.06s / 6.69s | 6 (100% 1st-try) | none
21 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/6) | infinite | 0.80s / 1.31s | 18 (0% 1st-try) | 18 schema_break
22 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | infinite | 0.14s / 0.26s | 18 (0% 1st-try) | 18 error
23 | google/gemini-2.5-pro (standard) | 0.0% (0/6) | infinite | 7.03s / 9.84s | 18 (0% 1st-try) | 18 schema_break
24 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | infinite | 0.02s / 0.08s | 18 (0% 1st-try) | 18 error
25 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | infinite | 0.03s / 0.08s | 18 (0% 1st-try) | 18 error
26 | perplexity/sonar-reasoning-pro (reasoning) | 0.0% (0/6) | infinite | 4.42s / 6.46s | 18 (0% 1st-try) | 18 schema_break
27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | infinite | 0.07s / 0.16s | 18 (0% 1st-try) | 18 error
Pass 2026-05-08 01:46:18 UTC e6e987c clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | openai/gpt-4o-mini (standard) | 100.0% (15/15) | $0.00004 | 1.07s / 1.53s | 15 (100% 1st-try) | none
2 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) | $0.00011 ± $0.00003 | 0.70s / 1.62s | 36 (0% 1st-try) | 21 schema_break
3 | openai/gpt-4o (standard) | 100.0% (15/15) | $0.00068 ± $0.00007 | 0.66s / 1.18s | 15 (100% 1st-try) | none
4 | google/gemini-2.5-flash (standard) | 73.3% (11/15) | $0.00076 ± $0.00014 | 1.98s / 2.89s | 38 (0% 1st-try) | 27 schema_break
5 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | $0.00099 ± $0.00001 | 1.40s / 1.75s | 15 (100% 1st-try) | none
6 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/15) | infinite | 0.82s / 2.27s | 45 (0% 1st-try) | 45 schema_break
7 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | infinite | 0.11s / 0.41s | 45 (0% 1st-try) | 45 error
8 | google/gemini-2.5-pro (standard) | 0.0% (0/15) | infinite | 7.94s / 9.93s | 45 (0% 1st-try) | 45 schema_break
Pass 2026-05-08 01:28:01 UTC d95fbc2 clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | openai/gpt-4o-mini (standard) | 100.0% (15/15) | $0.00004 | 1.36s / 1.61s | 15 (100% 1st-try) | none
2 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) | $0.00011 | 1.23s / 4.54s | 36 (0% 1st-try) | 21 schema_break
3 | openai/gpt-4o (standard) | 100.0% (15/15) | $0.00068 | 0.89s / 1.89s | 15 (100% 1st-try) | none
4 | google/gemini-2.5-flash (standard) | 73.3% (11/15) | $0.00071 | 1.92s / 2.98s | 36 (0% 1st-try) | 25 schema_break
5 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | $0.00099 | 1.66s / 4.86s | 15 (100% 1st-try) | none
6 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/15) | infinite | 0.73s / 1.66s | 45 (0% 1st-try) | 45 schema_break
7 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | infinite | 0.14s / 0.43s | 45 (0% 1st-try) | 45 error
8 | google/gemini-2.5-pro (standard) | 0.0% (0/15) | infinite | 7.73s / 10.11s | 45 (0% 1st-try) | 45 schema_break
Pass 2026-05-06 02:14:42 UTC 2eb9d9a clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) | $0.00011 | 0.62s / 0.95s | 36 (0% 1st-try) | 21 schema_break
2 | openai/gpt-4o (standard) | 100.0% (15/15) | $0.00069 | 0.84s / 1.97s | 15 (100% 1st-try) | none
3 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | $0.00099 | 1.37s / 3.65s | 15 (100% 1st-try) | none
Pass 2026-05-06 01:25:00 UTC 423a52c dirty tree, s9 non-headline
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 86.7% (13/15) | $0.00012 | 0.66s / 1.47s | 37 (0% 1st-try) | 22 schema_break, 2 error
2 | openai/gpt-4o (standard) | 100.0% (15/15) | $0.00070 | 0.72s / 1.40s | 15 (100% 1st-try) | none
3 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | $0.00099 | 1.32s / 3.03s | 15 (100% 1st-try) | none
Pass 2026-05-06 01:22:44 UTC 423a52c clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) | $0.00011 | 0.71s / 1.91s | 36 (0% 1st-try) | 21 schema_break
2 | openai/gpt-4o (standard) | 100.0% (15/15) | $0.00071 | 1.17s / 2.27s | 15 (100% 1st-try) | none
3 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | $0.00099 | 1.40s / 2.66s | 15 (100% 1st-try) | none