Critique-Pass delta (latest clean off vs latest clean on)
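Per-model deltas between the critique-off baseline and the critique-on track, per METHODOLOGY s13.3. A positive cost delta means the critique pass raised eff_TCoT (the model became more expensive); a positive success delta means it raised the success rate. Negative deltas are reported as measured. Where eff_TCoT is infinite on either side (0.0% success), the cost delta is shown as n/a.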
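The arithmetic behind the two delta columns can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names `cost_delta_pct` and `success_delta_pp` are hypothetical, and it assumes that an "infinite" eff_TCoT is represented as `math.inf` and that the cost delta is a relative change against the critique-off value. Published percentages appear to be computed from unrounded costs, so recomputing them from the rounded dollar figures in the table can differ in the last digit.

```python
import math


def cost_delta_pct(off_cost, on_cost):
    """Relative change in eff_TCoT from critique-off to critique-on, in percent.

    Returns None (shown as "n/a" in the table) when either side is infinite
    or when the baseline is zero, since the ratio is undefined there.
    """
    if math.isinf(off_cost) or math.isinf(on_cost) or off_cost == 0:
        return None
    return (on_cost - off_cost) / off_cost * 100.0


def success_delta_pp(off_rate, on_rate):
    """Absolute change in success rate, in percentage points.

    Rates are fractions in [0, 1], e.g. 5/6 for five of six tasks.
    """
    return (on_rate - off_rate) * 100.0


# Example values taken from the table rows for claude-sonnet-4-6 and gemini-2.5-flash.
sonnet_cost = cost_delta_pct(0.00108, 0.00223)
flash_success = success_delta_pp(3 / 6, 5 / 6)
print(f"{sonnet_cost:+.1f}%")       # +106.5%
print(f"{flash_success:+.1f} pp")   # +33.3 pp
```

Keeping the cost delta relative and the success delta absolute (percentage points) matches how the two columns are labelled: a model whose success rate moves from 50.0% to 83.3% gains 33.3 pp, not 66.6%.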
| Provider / Model | Off eff_TCoT | On eff_TCoT | Δ eff_TCoT | Off success | On success | Δ success |
| --- | --- | --- | --- | --- | --- | --- |
| anthropic/claude-haiku-4-5 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| anthropic/claude-opus-4-7 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| anthropic/claude-sonnet-4-6 (standard) | $0.00108 | $0.00223 | +106.5% | 100.0% | 100.0% | +0.0 pp |
| google/gemini-2.5-flash (standard) | $0.00120 | $0.00154 | +27.5% | 50.0% | 83.3% | +33.3 pp |
| google/gemini-2.5-flash-lite (standard) | $0.00011 | $0.00034 | +225.3% | 100.0% | 66.7% | -33.3 pp |
| google/gemini-2.5-pro (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| openai/gpt-4o (standard) | $0.00069 | $0.00150 | +116.4% | 100.0% | 100.0% | +0.0 pp |
| openai/gpt-4o-mini (standard) | $0.00004 | $0.00010 | +130.2% | 100.0% | 100.0% | +0.0 pp |
| openai/o3 (reasoning) | $0.00097 | $0.00277 | +186.1% | 100.0% | 100.0% | +0.0 pp |
| openai/o3-mini (reasoning) | $0.00074 | $0.00132 | +79.5% | 100.0% | 100.0% | +0.0 pp |
| openai/o4-mini (reasoning) | $0.00081 | $0.00199 | +144.8% | 100.0% | 100.0% | +0.0 pp |
| openrouter/cohere/command-r-plus (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| openrouter/deepseek/deepseek-chat (standard) | $0.00009 | $0.00017 | +91.0% | 100.0% | 100.0% | +0.0 pp |
| openrouter/deepseek/deepseek-r1 (reasoning) | $0.00114 | $0.00214 | +87.5% | 100.0% | 100.0% | +0.0 pp |
| openrouter/meta-llama/llama-3.3-70b-instruct (standard) | $0.00007 | $0.00016 | +137.3% | 100.0% | 100.0% | +0.0 pp |
| openrouter/meta-llama/llama-4-maverick (standard) | $0.00005 | $0.00010 | +123.5% | 100.0% | 100.0% | +0.0 pp |
| openrouter/meta-llama/llama-4-scout (standard) | $0.00006 | $0.00012 | +79.7% | 83.3% | 100.0% | +16.7 pp |
| openrouter/mistralai/mistral-large (standard) | $0.00098 | $0.00136 | +39.4% | 100.0% | 100.0% | +0.0 pp |
| openrouter/qwen/qwen-3-235b-a22b (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| perplexity/sonar (search) | $0.00048 | $0.00036 | -24.9% | 100.0% | 100.0% | +0.0 pp |
| perplexity/sonar-pro (search) | $0.00257 | $0.00193 | -24.9% | 83.3% | 100.0% | +16.7 pp |
| perplexity/sonar-reasoning (reasoning) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| perplexity/sonar-reasoning-pro (reasoning) | infinite | $0.00110 | n/a | 0.0% | 100.0% | +100.0 pp |
| xai/grok-3 (standard) | $0.00102 | $0.00305 | +199.7% | 100.0% | 100.0% | +0.0 pp |
| xai/grok-3-mini (standard) | $0.00006 | $0.00021 | +280.0% | 100.0% | 100.0% | +0.0 pp |
| xai/grok-4 (standard) | $0.00286 | $0.00305 | +6.7% | 100.0% | 100.0% | +0.0 pp |
| xai/grok-4-fast (standard) | $0.00007 | $0.00016 | +130.2% | 100.0% | 100.0% | +0.0 pp |
Reproducibility (latest pass vs previous)
| Provider / Model | Latest eff_TCoT | Previous eff_TCoT | Δ eff_TCoT | Δ success rate |
| --- | --- | --- | --- | --- |
| anthropic/claude-haiku-4-5 | n/a | n/a | n/a | +0.0 pp |
| anthropic/claude-opus-4-7 | n/a | n/a | n/a | +0.0 pp |
| anthropic/claude-sonnet-4-6 | $0.00108 | $0.00099 | +9.5% | +0.0 pp |
| google/gemini-2.5-flash | $0.00120 | $0.00076 | +59.4% | -23.3 pp |
| google/gemini-2.5-flash-lite | $0.00011 | $0.00011 | -3.6% | +0.0 pp |
| google/gemini-2.5-pro | n/a | n/a | n/a | +0.0 pp |
| openai/gpt-4o | $0.00069 | $0.00068 | +2.6% | +0.0 pp |
| openai/gpt-4o-mini | $0.00004 | $0.00004 | +0.0% | +0.0 pp |
Failure-mode totals across all passes
Counts per failure mode, summed over every attempt in every pass. Useful for spotting which providers systematically exhibit which failure patterns.
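The summation described above can be sketched with a per-model counter. This is an illustrative sketch only: the function name `total_failure_modes` and the input shape (one dict per pass, mapping model name to that pass's failure-mode counts) are assumptions, not the benchmark's actual data model. The sample values are the per-pass counts listed in this report for two models that appear in only the two most recent passes.

```python
from collections import Counter, defaultdict


def total_failure_modes(passes):
    """Sum failure-mode counts per model across all passes.

    `passes` is an iterable of dicts shaped like
    {model_name: {failure_mode: count_in_this_pass}} (hypothetical shape).
    """
    totals = defaultdict(Counter)
    for pass_rows in passes:
        for model, modes in pass_rows.items():
            # Counter.update adds counts rather than replacing them.
            totals[model].update(modes)
    return totals


# Per-pass counts copied from the two passes in which these models appear.
passes = [
    {  # Pass 2026-05-16 16:31:49 UTC
        "openrouter/meta-llama/llama-4-scout": {"schema_break": 4},
        "openrouter/deepseek/deepseek-chat": {},
    },
    {  # Pass 2026-05-08 15:11:05 UTC
        "openrouter/meta-llama/llama-4-scout": {"schema_break": 6},
        "openrouter/deepseek/deepseek-chat": {"schema_break": 1},
    },
]

totals = total_failure_modes(passes)
for model, modes in sorted(totals.items()):
    print(model, dict(modes))
```

With these inputs the totals come out to 10 `schema_break` for llama-4-scout and 1 for deepseek-chat, matching the corresponding rows of the table.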
| Provider / Model | Failure modes (total counts) |
| --- | --- |
| openrouter/meta-llama/llama-4-scout | 10 schema_break |
| google/gemini-2.5-flash-lite | 123 schema_break, 4 error |
| google/gemini-2.5-flash | 76 schema_break |
| anthropic/claude-haiku-4-5 | 126 schema_break |
| anthropic/claude-opus-4-7 | 126 error |
| google/gemini-2.5-pro | 126 schema_break |
| openrouter/cohere/command-r-plus | 36 error |
| openrouter/qwen/qwen-3-235b-a22b | 36 error |
| perplexity/sonar-reasoning | 36 error |
| openrouter/deepseek/deepseek-chat | 1 schema_break |
| perplexity/sonar | 7 schema_break |
| openrouter/mistralai/mistral-large | 3 schema_break |
| anthropic/claude-sonnet-4-6 | 2 error |
| perplexity/sonar-pro | 6 schema_break |
| perplexity/sonar-reasoning-pro | 18 schema_break |
All bench passes (chronological, newest first)
Pass 2026-05-16 16:31:49 UTC · 4e158ca · clean · critique-pass (s13)
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | openai/gpt-4o-mini (standard) | 100.0% (6/6) |  | 3.21s / 7.87s | 6 (100% 1st-try) | none |
| 2 | openrouter/meta-llama/llama-4-maverick (standard) | 100.0% (6/6) |  | 4.77s / 5.84s | 6 (100% 1st-try) | none |
| 3 | openrouter/meta-llama/llama-4-scout (standard) | 100.0% (6/6) |  | 1.92s / 2.27s | 10 (50% 1st-try) | 4 schema_break |
| 4 | xai/grok-4-fast (standard) | 100.0% (6/6) |  | 7.33s / 10.07s | 6 (100% 1st-try) | none |
| 5 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 100.0% (6/6) |  | 3.07s / 7.90s | 6 (100% 1st-try) | none |
| 6 | openrouter/deepseek/deepseek-chat (standard) | 100.0% (6/6) |  | 4.52s / 6.01s | 6 (100% 1st-try) | none |
| 7 | xai/grok-3-mini (standard) | 100.0% (6/6) |  | 7.14s / 178.48s | 6 (100% 1st-try) | none |
| 8 | google/gemini-2.5-flash-lite (standard) | 66.7% (4/6) |  | 7.81s / 22.92s | 15 (0% 1st-try) | 9 schema_break, 2 error |
| 9 | perplexity/sonar (search) | 100.0% (6/6) |  | 4.25s / 4.91s | 6 (100% 1st-try) | none |
| 10 | perplexity/sonar-reasoning-pro (reasoning) | 100.0% (6/6) |  | 5.38s / 6.08s | 6 (100% 1st-try) | none |
| 11 | openai/o3-mini (reasoning) | 100.0% (6/6) |  | 3.15s / 3.60s | 6 (100% 1st-try) | none |
| 12 | openrouter/mistralai/mistral-large (standard) | 100.0% (6/6) |  | 2.47s / 3.18s | 6 (100% 1st-try) | none |
| 13 | openai/gpt-4o (standard) | 100.0% (6/6) |  | 2.41s / 5.04s | 6 (100% 1st-try) | none |
| 14 | google/gemini-2.5-flash (standard) | 83.3% (5/6) |  | 5.39s / 7.06s | 16 (0% 1st-try) | 11 schema_break |
| 15 | perplexity/sonar-pro (search) | 100.0% (6/6) |  | 4.30s / 4.77s | 6 (100% 1st-try) | none |
| 16 | openai/o4-mini (reasoning) | 100.0% (6/6) |  | 3.14s / 3.42s | 6 (100% 1st-try) | none |
| 17 | openrouter/deepseek/deepseek-r1 (reasoning) | 100.0% (6/6) |  | 26.19s / 193.73s | 6 (100% 1st-try) | none |
| 18 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (6/6) |  | 2.47s / 2.55s | 6 (100% 1st-try) | none |
| 19 | openai/o3 (reasoning) | 100.0% (6/6) |  | 4.36s / 5.01s | 6 (100% 1st-try) | none |
| 20 | xai/grok-3 (standard) | 100.0% (6/6) |  | 5.82s / 145.38s | 6 (100% 1st-try) | none |
| 21 | xai/grok-4 (standard) | 100.0% (6/6) |  | 11.97s / 70.09s | 6 (100% 1st-try) | none |
| 22 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/6) |  | 1.41s / 2.13s | 18 (0% 1st-try) | 18 schema_break |
| 23 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) |  | 0.12s / 0.17s | 18 (0% 1st-try) | 18 error |
| 24 | google/gemini-2.5-pro (standard) | 0.0% (0/6) |  | 14.60s / 17.72s | 18 (0% 1st-try) | 18 schema_break |
| 25 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) |  | 0.03s / 0.09s | 18 (0% 1st-try) | 18 error |
| 26 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) |  | 0.02s / 0.10s | 18 (0% 1st-try) | 18 error |
| 27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) |  | 0.06s / 0.13s | 18 (0% 1st-try) | 18 error |
Pass 2026-05-08 15:11:05 UTC · 46ce0a5 · clean
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | openai/gpt-4o-mini (standard) | 100.0% (6/6) |  | 1.45s / 1.58s | 6 (100% 1st-try) | none |
| 2 | openrouter/meta-llama/llama-4-maverick (standard) | 100.0% (6/6) |  | 2.38s / 5.55s | 6 (100% 1st-try) | none |
| 3 | xai/grok-3-mini (standard) | 100.0% (6/6) |  | 7.67s / 8.35s | 6 (100% 1st-try) | none |
| 4 | openrouter/meta-llama/llama-4-scout (standard) | 83.3% (5/6) |  | 1.17s / 2.18s | 11 (33% 1st-try) | 6 schema_break |
| 5 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 100.0% (6/6) |  | 1.25s / 1.86s | 6 (100% 1st-try) | none |
| 6 | xai/grok-4-fast (standard) | 100.0% (6/6) |  | 2.52s / 3.28s | 6 (100% 1st-try) | none |
| 7 | openrouter/deepseek/deepseek-chat (standard) | 100.0% (6/6) |  | 2.62s / 10.68s | 7 (83% 1st-try) | 1 schema_break |
| 8 | google/gemini-2.5-flash-lite (standard) | 100.0% (6/6) |  | 6.92s / 17.68s | 14 (0% 1st-try) | 8 schema_break |
| 9 | perplexity/sonar (search) | 100.0% (6/6) |  | 1.92s / 2.73s | 13 (33% 1st-try) | 7 schema_break |
| 10 | openai/gpt-4o (standard) | 100.0% (6/6) |  | 1.07s / 2.00s | 6 (100% 1st-try) | none |
| 11 | openai/o3-mini (reasoning) | 100.0% (6/6) |  | 4.15s / 14.22s | 6 (100% 1st-try) | none |
| 12 | openai/o4-mini (reasoning) | 100.0% (6/6) |  | 1.73s / 11.97s | 6 (100% 1st-try) | none |
| 13 | openai/o3 (reasoning) | 100.0% (6/6) |  | 2.04s / 3.10s | 6 (100% 1st-try) | none |
| 14 | openrouter/mistralai/mistral-large (standard) | 100.0% (6/6) |  | 1.19s / 1.40s | 9 (50% 1st-try) | 3 schema_break |
| 15 | xai/grok-3 (standard) | 100.0% (6/6) |  | 15.30s / 46.68s | 6 (100% 1st-try) | none |
| 16 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (6/6) |  | 8.80s / 26.46s | 8 (83% 1st-try) | 2 error |
| 17 | openrouter/deepseek/deepseek-r1 (reasoning) | 100.0% (6/6) |  | 17.26s / 33.41s | 6 (100% 1st-try) | none |
| 18 | google/gemini-2.5-flash (standard) | 50.0% (3/6) |  | 2.31s / 3.21s | 16 (0% 1st-try) | 13 schema_break |
| 19 | perplexity/sonar-pro (search) | 83.3% (5/6) |  | 1.98s / 2.50s | 11 (50% 1st-try) | 6 schema_break |
| 20 | xai/grok-4 (standard) | 100.0% (6/6) |  | 6.06s / 6.69s | 6 (100% 1st-try) | none |
| 21 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/6) |  | 0.80s / 1.31s | 18 (0% 1st-try) | 18 schema_break |
| 22 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) |  | 0.14s / 0.26s | 18 (0% 1st-try) | 18 error |
| 23 | google/gemini-2.5-pro (standard) | 0.0% (0/6) |  | 7.03s / 9.84s | 18 (0% 1st-try) | 18 schema_break |
| 24 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) |  | 0.02s / 0.08s | 18 (0% 1st-try) | 18 error |
| 25 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) |  | 0.03s / 0.08s | 18 (0% 1st-try) | 18 error |
| 26 | perplexity/sonar-reasoning-pro (reasoning) | 0.0% (0/6) |  | 4.42s / 6.46s | 18 (0% 1st-try) | 18 schema_break |
| 27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) |  | 0.07s / 0.16s | 18 (0% 1st-try) | 18 error |
Pass 2026-05-08 01:46:18 UTC · e6e987c · clean
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | openai/gpt-4o-mini (standard) | 100.0% (15/15) |  | 1.07s / 1.53s | 15 (100% 1st-try) | none |
| 2 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) |  | 0.70s / 1.62s | 36 (0% 1st-try) | 21 schema_break |
| 3 | openai/gpt-4o (standard) | 100.0% (15/15) |  | 0.66s / 1.18s | 15 (100% 1st-try) | none |
| 4 | google/gemini-2.5-flash (standard) | 73.3% (11/15) |  | 1.98s / 2.89s | 38 (0% 1st-try) | 27 schema_break |
| 5 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) |  | 1.40s / 1.75s | 15 (100% 1st-try) | none |
| 6 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/15) |  | 0.82s / 2.27s | 45 (0% 1st-try) | 45 schema_break |
| 7 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) |  | 0.11s / 0.41s | 45 (0% 1st-try) | 45 error |
| 8 | google/gemini-2.5-pro (standard) | 0.0% (0/15) |  | 7.94s / 9.93s | 45 (0% 1st-try) | 45 schema_break |
Pass 2026-05-08 01:28:01 UTC · d95fbc2 · clean
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | openai/gpt-4o-mini (standard) | 100.0% (15/15) |  | 1.36s / 1.61s | 15 (100% 1st-try) | none |
| 2 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) |  | 1.23s / 4.54s | 36 (0% 1st-try) | 21 schema_break |
| 3 | openai/gpt-4o (standard) | 100.0% (15/15) |  | 0.89s / 1.89s | 15 (100% 1st-try) | none |
| 4 | google/gemini-2.5-flash (standard) | 73.3% (11/15) |  | 1.92s / 2.98s | 36 (0% 1st-try) | 25 schema_break |
| 5 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) |  | 1.66s / 4.86s | 15 (100% 1st-try) | none |
| 6 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/15) |  | 0.73s / 1.66s | 45 (0% 1st-try) | 45 schema_break |
| 7 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) |  | 0.14s / 0.43s | 45 (0% 1st-try) | 45 error |
| 8 | google/gemini-2.5-pro (standard) | 0.0% (0/15) |  | 7.73s / 10.11s | 45 (0% 1st-try) | 45 schema_break |
Pass 2026-05-06 02:14:42 UTC · 2eb9d9a · clean
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) |  | 0.62s / 0.95s | 36 (0% 1st-try) | 21 schema_break |
| 2 | openai/gpt-4o (standard) | 100.0% (15/15) |  | 0.84s / 1.97s | 15 (100% 1st-try) | none |
| 3 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) |  | 1.37s / 3.65s | 15 (100% 1st-try) | none |
Pass 2026-05-06 01:25:00 UTC · 423a52c · dirty tree, s9 non-headline
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | google/gemini-2.5-flash-lite (standard) | 86.7% (13/15) |  | 0.66s / 1.47s | 37 (0% 1st-try) | 22 schema_break, 2 error |
| 2 | openai/gpt-4o (standard) | 100.0% (15/15) |  | 0.72s / 1.40s | 15 (100% 1st-try) | none |
| 3 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) |  | 1.32s / 3.03s | 15 (100% 1st-try) | none |
Pass 2026-05-06 01:22:44 UTC · 423a52c · clean
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) |  | 0.71s / 1.91s | 36 (0% 1st-try) | 21 schema_break |
| 2 | openai/gpt-4o (standard) | 100.0% (15/15) |  | 1.17s / 2.27s | 15 (100% 1st-try) | none |
| 3 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) |  | 1.40s / 2.66s | 15 (100% 1st-try) | none |