bellwether  ›  synthetic_rag

Critique-Pass delta (latest clean off vs latest clean on)

Per-model deltas between the critique-off baseline and the critique-on track, per METHODOLOGY s13.3. A positive cost delta means the critique pass made the model more expensive; a positive success delta means it raised the success rate. Negative deltas are reported as measured.

Provider / Model | Off eff_TCoT | On eff_TCoT | Δ eff_TCoT | Off success | On success | Δ success
anthropic/claude-haiku-4-5 (standard) | $0.00053 | $0.00120 | +127.4% | 66.7% | 66.7% | +0.0 pp
anthropic/claude-opus-4-7 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
anthropic/claude-sonnet-4-6 (standard) | $0.00142 | $0.00323 | +128.3% | 66.7% | 66.7% | +0.0 pp
google/gemini-2.5-flash (standard) | $0.00012 | $0.00028 | +130.3% | 66.7% | 66.7% | +0.0 pp
google/gemini-2.5-flash-lite (standard) | $0.00004 | $0.00009 | +123.1% | 66.7% | 66.7% | +0.0 pp
google/gemini-2.5-pro (standard) | $0.00046 | $0.00114 | +148.2% | 83.3% | 66.7% | -16.7 pp
openai/gpt-4o (standard) | $0.00090 | $0.00208 | +131.8% | 66.7% | 66.7% | +0.0 pp
openai/gpt-4o-mini (standard) | $0.00006 | $0.00013 | +125.4% | 66.7% | 66.7% | +0.0 pp
openai/o3 (reasoning) | $0.00661 | $0.01409 | +113.1% | 66.7% | 66.7% | +0.0 pp
openai/o3-mini (reasoning) | $0.00289 | $0.00506 | +75.4% | 66.7% | 66.7% | +0.0 pp
openai/o4-mini (reasoning) | $0.00314 | $0.00720 | +128.9% | 66.7% | 66.7% | +0.0 pp
openrouter/cohere/command-r-plus (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
openrouter/deepseek/deepseek-chat (standard) | $0.00010 | $0.00024 | +128.1% | 66.7% | 66.7% | +0.0 pp
openrouter/deepseek/deepseek-r1 (reasoning) | $0.00266 | $0.00613 | +130.2% | 66.7% | 66.7% | +0.0 pp
openrouter/meta-llama/llama-3.3-70b-instruct (standard) | $0.00015 | $0.00058 | +288.8% | 66.7% | 50.0% | -16.7 pp
openrouter/meta-llama/llama-4-maverick (standard) | $0.00007 | $0.00016 | +123.6% | 66.7% | 66.7% | +0.0 pp
openrouter/meta-llama/llama-4-scout (standard) | $0.00004 | $0.00009 | +91.0% | 66.7% | 66.7% | +0.0 pp
openrouter/mistralai/mistral-large (standard) | $0.00058 | $0.00169 | +190.6% | 83.3% | 66.7% | -16.7 pp
openrouter/qwen/qwen-3-235b-a22b (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
perplexity/sonar (search) | $0.00033 | $0.00079 | +136.8% | 66.7% | 66.7% | +0.0 pp
perplexity/sonar-pro (search) | $0.00149 | $0.00252 | +69.6% | 66.7% | 66.7% | +0.0 pp
perplexity/sonar-reasoning (reasoning) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
perplexity/sonar-reasoning-pro (reasoning) | infinite | $0.00168 | n/a | 0.0% | 66.7% | +66.7 pp
xai/grok-3 (standard) | $0.00037 | $0.00472 | +1182.1% | 100.0% | 66.7% | -33.3 pp
xai/grok-3-mini (standard) | $0.00011 | $0.00043 | +297.4% | 66.7% | 66.7% | +0.0 pp
xai/grok-4 (standard) | $0.00629 | $0.00454 | -27.9% | 66.7% | 66.7% | +0.0 pp
xai/grok-4-fast (standard) | $0.00015 | $0.00030 | +102.4% | 66.7% | 66.7% | +0.0 pp
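
The delta columns above can be reproduced with a short sketch like the one below. It assumes the cost delta is the relative change from the critique-off cost, and the success delta is the difference in success rate in percentage points; both readings are consistent with the table, but the function names are hypothetical and not taken from the bench code. An infinite eff_TCoT on either side is shown as n/a, matching the rows above.

```python
import math


def cost_delta_pct(off_cost, on_cost):
    """Relative cost change in percent, or None when it is undefined."""
    # An infinite (or zero) baseline has no meaningful relative change;
    # the table renders these rows as "n/a".
    if math.isinf(off_cost) or math.isinf(on_cost) or off_cost == 0:
        return None
    return (on_cost - off_cost) / off_cost * 100.0


def success_delta_pp(off_successes, on_successes, tasks):
    """Change in success rate, in percentage points."""
    return (on_successes - off_successes) / tasks * 100.0


def format_delta(value, unit):
    """Render a delta with an explicit sign, or "n/a" when undefined."""
    return "n/a" if value is None else f"{value:+.1f}{unit}"


# xai/grok-4: $0.00629 (off) vs $0.00454 (on), 4/6 successes in both tracks.
grok4_cost = format_delta(cost_delta_pct(0.00629, 0.00454), "%")
grok4_success = format_delta(success_delta_pp(4, 4, 6), " pp")

# perplexity/sonar-reasoning-pro: infinite (off) vs $0.00168 (on), 0/6 vs 4/6.
sonar_cost = format_delta(cost_delta_pct(math.inf, 0.00168), "%")
sonar_success = format_delta(success_delta_pp(0, 4, 6), " pp")
```

Because the inputs here are the rounded dollar values shown on the page, a recomputed percentage can differ from the table by a tenth of a point or so (about -27.8% for xai/grok-4 against the listed -27.9%); the table is presumably computed from unrounded costs.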

Reproducibility (latest pass vs previous)

Provider / Model | Latest eff_TCoT | Previous eff_TCoT | Δ eff_TCoT | Δ success rate
anthropic/claude-haiku-4-5 | $0.00053 | $0.00057 | -7.3% | +0.0 pp
anthropic/claude-opus-4-7 | n/a | n/a | n/a | +0.0 pp
anthropic/claude-sonnet-4-6 | $0.00142 | $0.00174 | -18.5% | +6.7 pp
google/gemini-2.5-flash | $0.00012 | $0.00012 | +2.5% | -6.7 pp
google/gemini-2.5-flash-lite | $0.00004 | $0.00004 | -13.4% | +6.7 pp
google/gemini-2.5-pro | $0.00046 | $0.00049 | -5.8% | +10.0 pp
openai/gpt-4o | $0.00090 | $0.00110 | -18.1% | +6.7 pp
openai/gpt-4o-mini | $0.00006 | $0.00007 | -18.5% | +6.7 pp

Failure-mode totals across all passes

Counts per failure mode, summed over every attempt in every pass. Useful for spotting which providers and models show the same failure pattern systematically.

Provider / Model | Failure modes (total counts)
google/gemini-2.5-flash-lite | 48 confabulation, 1 error
openrouter/meta-llama/llama-4-scout | 12 confabulation, 2 schema_break
openai/gpt-4o-mini | 48 confabulation
openrouter/meta-llama/llama-4-maverick | 12 confabulation
openrouter/deepseek/deepseek-chat | 12 confabulation
google/gemini-2.5-flash | 28 confabulation
xai/grok-4-fast | 12 confabulation
xai/grok-3-mini | 12 confabulation
openrouter/meta-llama/llama-3.3-70b-instruct | 15 confabulation
perplexity/sonar | 12 confabulation
google/gemini-2.5-pro | 19 confabulation, 8 schema_break
anthropic/claude-haiku-4-5 | 44 confabulation
perplexity/sonar-reasoning-pro | 24 confabulation
openrouter/mistralai/mistral-large | 11 confabulation
openai/gpt-4o | 48 confabulation
perplexity/sonar-pro | 12 confabulation
anthropic/claude-sonnet-4-6 | 48 confabulation
xai/grok-4 | 12 confabulation
xai/grok-3 | 6 confabulation
openai/o3-mini | 12 confabulation
openrouter/deepseek/deepseek-r1 | 12 confabulation
openai/o4-mini | 12 confabulation
openai/o3 | 12 confabulation
anthropic/claude-opus-4-7 | 126 error
openrouter/cohere/command-r-plus | 36 error
openrouter/qwen/qwen-3-235b-a22b | 36 error
perplexity/sonar-reasoning | 36 error
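
As a minimal sketch of how these totals relate to the per-pass tables, the snippet below sums the per-pass failure counts for one model. The counts are copied from the google/gemini-2.5-pro rows of the three passes that include it; the data layout and the function name `total_failure_modes` are illustrative assumptions, not the bench's actual storage format.

```python
from collections import Counter

# Failure counts for google/gemini-2.5-pro, one dict per pass (newest first).
gemini_pro_passes = [
    {"schema_break": 4, "confabulation": 2},   # Pass 2026-05-16 16:31:49 UTC
    {"confabulation": 5},                      # Pass 2026-05-08 15:11:05 UTC
    {"confabulation": 12, "schema_break": 4},  # Pass 2026-05-08 01:46:18 UTC
]


def total_failure_modes(per_pass_counts):
    """Sum failure-mode counts across passes into one Counter."""
    totals = Counter()
    for counts in per_pass_counts:
        totals.update(counts)  # Counter.update adds counts, it does not replace
    return totals


totals = total_failure_modes(gemini_pro_passes)
```

For this model the sum comes to 19 confabulation and 8 schema_break, which matches the totals row above.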

All bench passes (newest first)

Pass 2026-05-16 16:31:49 UTC | 4e158ca | clean | critique-pass (s13)
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 66.7% (4/6) | $0.00009±$0.00000 | 0.99s / 1.23s | 10 (67% 1st-try) | 6 confabulation
2 | openrouter/meta-llama/llama-4-scout (standard) | 66.7% (4/6) | $0.00009±$0.00000 | 0.43s / 0.77s | 10 (67% 1st-try) | 6 confabulation
3 | openai/gpt-4o-mini (standard) | 66.7% (4/6) | $0.00013±$0.00000 | 1.13s / 1.66s | 10 (67% 1st-try) | 6 confabulation
4 | openrouter/meta-llama/llama-4-maverick (standard) | 66.7% (4/6) | $0.00016±$0.00000 | 1.07s / 2.01s | 10 (67% 1st-try) | 6 confabulation
5 | openrouter/deepseek/deepseek-chat (standard) | 66.7% (4/6) | $0.00024±$0.00000 | 2.54s / 8.29s | 10 (67% 1st-try) | 6 confabulation
6 | google/gemini-2.5-flash (standard) | 66.7% (4/6) | $0.00028±$0.00001 | 2.02s / 19.72s | 10 (67% 1st-try) | 6 confabulation
7 | xai/grok-4-fast (standard) | 66.7% (4/6) | $0.00030±$0.00000 | 6.05s / 20.94s | 10 (67% 1st-try) | 6 confabulation
8 | xai/grok-3-mini (standard) | 66.7% (4/6) | $0.00043±$0.00000 | 6.21s / 11.95s | 10 (67% 1st-try) | 6 confabulation
9 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 50.0% (3/6) | $0.00058±$0.00001 | 1.17s / 8.35s | 12 (50% 1st-try) | 9 confabulation
10 | perplexity/sonar (search) | 66.7% (4/6) | $0.00079±$0.00001 | 3.74s / 3.96s | 10 (67% 1st-try) | 6 confabulation
11 | google/gemini-2.5-pro (standard) | 66.7% (4/6) | $0.00114±$0.00002 | 9.77s / 31.75s | 10 (67% 1st-try) | 4 schema_break, 2 confabulation
12 | anthropic/claude-haiku-4-5 (standard) | 66.7% (4/6) | $0.00120±$0.00000 | 1.25s / 1.34s | 10 (67% 1st-try) | 6 confabulation
13 | perplexity/sonar-reasoning-pro (reasoning) | 66.7% (4/6) | $0.00168±$0.00001 | 4.62s / 7.54s | 10 (67% 1st-try) | 6 confabulation
14 | openrouter/mistralai/mistral-large (standard) | 66.7% (4/6) | $0.00169±$0.00001 | 1.09s / 1.46s | 10 (67% 1st-try) | 6 confabulation
15 | openai/gpt-4o (standard) | 66.7% (4/6) | $0.00208±$0.00001 | 1.10s / 6.71s | 10 (67% 1st-try) | 6 confabulation
16 | perplexity/sonar-pro (search) | 66.7% (4/6) | $0.00252±$0.00002 | 3.92s / 5.44s | 10 (67% 1st-try) | 6 confabulation
17 | anthropic/claude-sonnet-4-6 (standard) | 66.7% (4/6) | $0.00323±$0.00001 | 2.12s / 2.40s | 10 (67% 1st-try) | 6 confabulation
18 | xai/grok-4 (standard) | 66.7% (4/6) | $0.00454±$0.00001 | 5.73s / 15.56s | 10 (67% 1st-try) | 6 confabulation
19 | xai/grok-3 (standard) | 66.7% (4/6) | $0.00472±$0.00001 | 7.14s / 16.23s | 10 (67% 1st-try) | 6 confabulation
20 | openai/o3-mini (reasoning) | 66.7% (4/6) | $0.00506±$0.00047 | 3.92s / 14.39s | 10 (67% 1st-try) | 6 confabulation
21 | openrouter/deepseek/deepseek-r1 (reasoning) | 66.7% (4/6) | $0.00613±$0.00015 | 25.88s / 167.43s | 10 (67% 1st-try) | 6 confabulation
22 | openai/o4-mini (reasoning) | 66.7% (4/6) | $0.00720±$0.00040 | 3.53s / 8.92s | 10 (67% 1st-try) | 6 confabulation
23 | openai/o3 (reasoning) | 66.7% (4/6) | $0.01409±$0.00026 | 2.39s / 17.25s | 10 (67% 1st-try) | 6 confabulation
24 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | infinite | 0.11s / 0.31s | 18 (0% 1st-try) | 18 error
25 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | infinite | 0.03s / 0.10s | 18 (0% 1st-try) | 18 error
26 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | infinite | 0.04s / 0.11s | 18 (0% 1st-try) | 18 error
27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | infinite | 0.07s / 0.15s | 18 (0% 1st-try) | 18 error

Pass 2026-05-08 15:11:05 UTC | 46ce0a5 | clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 66.7% (4/6) | $0.00004±$0.00000 | 6.13s / 18.96s | 11 (50% 1st-try) | 6 confabulation, 1 error
2 | openrouter/meta-llama/llama-4-scout (standard) | 66.7% (4/6) | $0.00004±$0.00001 | 0.41s / 0.58s | 12 (33% 1st-try) | 6 confabulation, 2 schema_break
3 | openai/gpt-4o-mini (standard) | 66.7% (4/6) | $0.00006±$0.00000 | 0.62s / 0.76s | 10 (67% 1st-try) | 6 confabulation
4 | openrouter/meta-llama/llama-4-maverick (standard) | 66.7% (4/6) | $0.00007±$0.00001 | 0.85s / 3.69s | 10 (67% 1st-try) | 6 confabulation
5 | openrouter/deepseek/deepseek-chat (standard) | 66.7% (4/6) | $0.00010±$0.00000 | 0.93s / 1.49s | 10 (67% 1st-try) | 6 confabulation
6 | xai/grok-3-mini (standard) | 66.7% (4/6) | $0.00011±$0.00000 | 12.55s / 16.70s | 10 (67% 1st-try) | 6 confabulation
7 | google/gemini-2.5-flash (standard) | 66.7% (4/6) | $0.00012±$0.00000 | 1.03s / 10.84s | 10 (67% 1st-try) | 6 confabulation
8 | xai/grok-4-fast (standard) | 66.7% (4/6) | $0.00015±$0.00000 | 1.83s / 7.16s | 10 (67% 1st-try) | 6 confabulation
9 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 66.7% (4/6) | $0.00015±$0.00000 | 0.57s / 4.98s | 10 (67% 1st-try) | 6 confabulation
10 | perplexity/sonar (search) | 66.7% (4/6) | $0.00033±$0.00000 | 1.42s / 1.75s | 10 (67% 1st-try) | 6 confabulation
11 | xai/grok-3 (standard) | 100.0% (6/6) | $0.00037±$0.00002 | 0.90s / 2.67s | 6 (100% 1st-try) | none
12 | google/gemini-2.5-pro (standard) | 83.3% (5/6) | $0.00046±$0.00020 | 6.97s / 13.99s | 10 (67% 1st-try) | 5 confabulation
13 | anthropic/claude-haiku-4-5 (standard) | 66.7% (4/6) | $0.00053±$0.00000 | 0.67s / 0.79s | 10 (67% 1st-try) | 6 confabulation
14 | openrouter/mistralai/mistral-large (standard) | 83.3% (5/6) | $0.00058±$0.00031 | 0.62s / 2.50s | 10 (67% 1st-try) | 5 confabulation
15 | openai/gpt-4o (standard) | 66.7% (4/6) | $0.00090±$0.00001 | 0.77s / 1.69s | 10 (67% 1st-try) | 6 confabulation
16 | anthropic/claude-sonnet-4-6 (standard) | 66.7% (4/6) | $0.00142±$0.00001 | 1.02s / 1.52s | 10 (67% 1st-try) | 6 confabulation
17 | perplexity/sonar-pro (search) | 66.7% (4/6) | $0.00149±$0.00001 | 1.53s / 2.82s | 10 (67% 1st-try) | 6 confabulation
18 | openrouter/deepseek/deepseek-r1 (reasoning) | 66.7% (4/6) | $0.00266±$0.00007 | 10.87s / 68.12s | 10 (67% 1st-try) | 6 confabulation
19 | openai/o3-mini (reasoning) | 66.7% (4/6) | $0.00289±$0.00028 | 2.11s / 5.38s | 10 (67% 1st-try) | 6 confabulation
20 | openai/o4-mini (reasoning) | 66.7% (4/6) | $0.00314±$0.00027 | 2.13s / 7.69s | 10 (67% 1st-try) | 6 confabulation
21 | xai/grok-4 (standard) | 66.7% (4/6) | $0.00629±$0.00001 | 5.10s / 84.15s | 10 (67% 1st-try) | 6 confabulation
22 | openai/o3 (reasoning) | 66.7% (4/6) | $0.00661±$0.00000 | 3.80s / 7.88s | 10 (67% 1st-try) | 6 confabulation
23 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | infinite | 0.12s / 0.51s | 18 (0% 1st-try) | 18 error
24 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | infinite | 0.02s / 0.08s | 18 (0% 1st-try) | 18 error
25 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | infinite | 0.02s / 0.10s | 18 (0% 1st-try) | 18 error
26 | perplexity/sonar-reasoning-pro (reasoning) | 0.0% (0/6) | infinite | 7.37s / 11.67s | 18 (0% 1st-try) | 18 confabulation
27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | infinite | 0.07s / 0.12s | 18 (0% 1st-try) | 18 error

Pass 2026-05-08 01:46:18 UTC | e6e987c | clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 60.0% (9/15) | $0.00004±$0.00000 | 0.50s / 0.73s | 27 (60% 1st-try) | 18 confabulation
2 | openai/gpt-4o-mini (standard) | 60.0% (9/15) | $0.00007±$0.00000 | 0.67s / 1.34s | 27 (60% 1st-try) | 18 confabulation
3 | google/gemini-2.5-flash (standard) | 73.3% (11/15) | $0.00012±$0.00004 | 1.15s / 10.89s | 27 (60% 1st-try) | 16 confabulation
4 | google/gemini-2.5-pro (standard) | 73.3% (11/15) | $0.00049±$0.00018 | 6.45s / 18.34s | 27 (60% 1st-try) | 12 confabulation, 4 schema_break
5 | anthropic/claude-haiku-4-5 (standard) | 66.7% (10/15) | $0.00057±$0.00013 | 0.70s / 1.51s | 27 (60% 1st-try) | 17 confabulation
6 | openai/gpt-4o (standard) | 60.0% (9/15) | $0.00110±$0.00001 | 0.51s / 0.73s | 27 (60% 1st-try) | 18 confabulation
7 | anthropic/claude-sonnet-4-6 (standard) | 60.0% (9/15) | $0.00174±$0.00000 | 1.08s / 3.36s | 27 (60% 1st-try) | 18 confabulation
8 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | infinite | 0.13s / 0.41s | 45 (0% 1st-try) | 45 error

Pass 2026-05-08 01:28:01 UTC | d95fbc2 | clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite (standard) | 60.0% (9/15) | $0.00004 | 0.49s / 0.80s | 27 (60% 1st-try) | 18 confabulation
2 | openai/gpt-4o-mini (standard) | 60.0% (9/15) | $0.00007 | 0.61s / 0.88s | 27 (60% 1st-try) | 18 confabulation
3 | anthropic/claude-haiku-4-5 (standard) | 80.0% (12/15) | $0.00044 | 0.61s / 1.28s | 27 (60% 1st-try) | 15 confabulation
4 | openai/gpt-4o (standard) | 60.0% (9/15) | $0.00110 | 0.49s / 0.81s | 27 (60% 1st-try) | 18 confabulation
5 | anthropic/claude-sonnet-4-6 (standard) | 60.0% (9/15) | $0.00174 | 1.11s / 1.91s | 27 (60% 1st-try) | 18 confabulation
6 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | infinite | 0.11s / 0.27s | 45 (0% 1st-try) | 45 error