bellwether  ›  function_call_routing

Critique-Pass delta (latest clean off vs latest clean on)

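Per-model deltas between the critique-off baseline and the critique-on track, per METHODOLOGY s13.3. A positive cost delta means the critique pass made the model more expensive; a positive success delta (in percentage points, pp) means it improved accuracy. Negative deltas are reported as measured. Where either side has no successful runs, eff_TCoT is shown as infinite and the cost delta as n/a.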

Provider / Model | Off eff_TCoT | On eff_TCoT | Δ eff_TCoT | Off success | On success | Δ success
anthropic/claude-haiku-4-5 standard | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
anthropic/claude-opus-4-7 standard | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
anthropic/claude-sonnet-4-6 standard | $0.00119 | $0.00260 | +118.2% | 100.0% | 100.0% | +0.0 pp
google/gemini-2.5-flash standard | $0.00013 | $0.00028 | +115.4% | 100.0% | 100.0% | +0.0 pp
google/gemini-2.5-flash-lite standard | $0.00003 | $0.00007 | +113.0% | 100.0% | 100.0% | +0.0 pp
google/gemini-2.5-pro standard | $0.01093 | $0.01114 | +1.9% | 16.7% | 33.3% | +16.7 pp
openai/gpt-4o standard | $0.00077 | $0.00170 | +120.6% | 100.0% | 100.0% | +0.0 pp
openai/gpt-4o-mini standard | $0.00005 | $0.00010 | +120.6% | 100.0% | 100.0% | +0.0 pp
openai/o3 reasoning | $0.00076 | $0.00198 | +160.1% | 100.0% | 100.0% | +0.0 pp
openai/o3-mini reasoning | $0.00049 | $0.00129 | +163.8% | 100.0% | 100.0% | +0.0 pp
openai/o4-mini reasoning | $0.00055 | $0.00154 | +177.8% | 100.0% | 100.0% | +0.0 pp
openrouter/cohere/command-r-plus standard | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
openrouter/deepseek/deepseek-chat standard | $0.00011 | $0.00020 | +85.4% | 100.0% | 100.0% | +0.0 pp
openrouter/deepseek/deepseek-r1 reasoning | $0.00048 | $0.00129 | +170.4% | 100.0% | 100.0% | +0.0 pp
openrouter/meta-llama/llama-3.3-70b-instruct standard | $0.00010 | $0.00023 | +127.0% | 100.0% | 100.0% | +0.0 pp
openrouter/meta-llama/llama-4-maverick standard | $0.00005 | $0.00012 | +122.1% | 100.0% | 100.0% | +0.0 pp
openrouter/meta-llama/llama-4-scout standard | $0.00003 | $0.00007 | +122.4% | 100.0% | 100.0% | +0.0 pp
openrouter/mistralai/mistral-large standard | $0.00060 | $0.00132 | +121.8% | 100.0% | 100.0% | +0.0 pp
openrouter/qwen/qwen-3-235b-a22b standard | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
perplexity/sonar search | $0.00034 | $0.00054 | +55.7% | 100.0% | 100.0% | +0.0 pp
perplexity/sonar-pro search | $0.00167 | $0.00209 | +25.1% | 83.3% | 100.0% | +16.7 pp
perplexity/sonar-reasoning reasoning | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp
perplexity/sonar-reasoning-pro reasoning | infinite | $0.00134 | n/a | 0.0% | 100.0% | +100.0 pp
xai/grok-3 standard | $0.00098 | $0.00290 | +195.0% | 100.0% | 100.0% | +0.0 pp
xai/grok-3-mini standard | $0.00008 | $0.00025 | +218.0% | 100.0% | 100.0% | +0.0 pp
xai/grok-4 standard | $0.00302 | $0.00290 | -3.8% | 100.0% | 100.0% | +0.0 pp
xai/grok-4-fast standard | $0.00009 | $0.00017 | +102.7% | 100.0% | 100.0% | +0.0 pp
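
The two delta columns in the table above can be recomputed from the off and on values roughly as follows. This is a minimal sketch, not the bench's own code: the function names `cost_delta_pct` and `success_delta_pp` are hypothetical, and the rule of returning no value when either side is infinite is an assumption inferred from the n/a cells in the table.

```python
import math


def cost_delta_pct(off_cost: float, on_cost: float):
    """Relative change in effective cost, in percent.

    Returns None (shown as "n/a" in the table) when either side is infinite
    or the baseline is zero, since a relative change is undefined there.
    This handling is an assumption, not taken from the methodology.
    """
    if math.isinf(off_cost) or math.isinf(on_cost) or off_cost == 0:
        return None
    return (on_cost - off_cost) / off_cost * 100.0


def success_delta_pp(off_success: float, on_success: float) -> float:
    """Absolute change in success rate, in percentage points.

    Both inputs are fractions in [0, 1].
    """
    return (on_success - off_success) * 100.0


# google/gemini-2.5-pro row: $0.01093 -> $0.01114, 1/6 -> 2/6 successes
print(round(cost_delta_pct(0.01093, 0.01114), 1))  # 1.9
print(round(success_delta_pp(1 / 6, 2 / 6), 1))    # 16.7
# perplexity/sonar-reasoning-pro row: no successes with critique off
print(cost_delta_pct(math.inf, 0.00134))           # None
```

Because the table shows costs rounded to five decimals, recomputing from the displayed values can differ from the listed delta by a few tenths of a percent (for example, grok-4 comes out near -4.0% rather than the listed -3.8%).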

Reproducibility (latest pass vs previous)

Provider / Model | Latest eff_TCoT | Previous eff_TCoT | Δ eff_TCoT | Δ success rate
anthropic/claude-haiku-4-5 | n/a | n/a | n/a | +0.0 pp
anthropic/claude-opus-4-7 | n/a | n/a | n/a | +0.0 pp
anthropic/claude-sonnet-4-6 | $0.00119 | $0.00120 | -0.7% | +0.0 pp
google/gemini-2.5-flash | $0.00013 | $0.00013 | -1.8% | +0.0 pp
google/gemini-2.5-flash-lite | $0.00003 | $0.00003 | +1.8% | +0.0 pp
google/gemini-2.5-pro | $0.01093 | $0.00418 | +161.3% | -23.3 pp
openai/gpt-4o | $0.00077 | $0.00078 | -0.7% | +0.0 pp
openai/gpt-4o-mini | $0.00005 | $0.00005 | -0.7% | +0.0 pp

Failure-mode totals across all passes

Counts per failure mode, summed over every attempt in every pass. Useful for spotting which providers systematically exhibit which failure patterns.

Provider / Model | Failure modes (total counts)
google/gemini-2.5-pro | 63 schema_break
anthropic/claude-haiku-4-5 | 126 schema_break
anthropic/claude-opus-4-7 | 126 error
openrouter/cohere/command-r-plus | 36 error
openrouter/qwen/qwen-3-235b-a22b | 36 error
perplexity/sonar-reasoning | 36 error
google/gemini-2.5-flash-lite | 1 error
openrouter/deepseek/deepseek-chat | 1 schema_break
perplexity/sonar | 2 schema_break
perplexity/sonar-pro | 3 schema_break
perplexity/sonar-reasoning-pro | 18 schema_break
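
The totals above are the per-pass failure counts added up per model and failure mode. A minimal sketch of that aggregation follows, assuming the per-pass counts are available as simple tuples. The names `PER_PASS_FAILURES` and `total_failure_modes` are hypothetical, and only a few rows from the per-pass tables are included for illustration.

```python
from collections import Counter, defaultdict

# (model, failure_mode, count) rows, one per pass, copied from the per-pass tables.
PER_PASS_FAILURES = [
    ("google/gemini-2.5-pro", "schema_break", 14),  # pass 4e158ca
    ("google/gemini-2.5-pro", "schema_break", 16),  # pass 46ce0a5
    ("google/gemini-2.5-pro", "schema_break", 33),  # pass e6e987c
    ("perplexity/sonar-pro", "schema_break", 3),    # pass 46ce0a5
    ("google/gemini-2.5-flash-lite", "error", 1),   # pass 46ce0a5
]


def total_failure_modes(rows):
    """Sum failure counts per model and failure mode across all passes."""
    totals = defaultdict(Counter)
    for model, mode, count in rows:
        totals[model][mode] += count
    return totals


totals = total_failure_modes(PER_PASS_FAILURES)
for model, modes in sorted(totals.items()):
    print(model, dict(modes))
```

For google/gemini-2.5-pro this gives 14 + 16 + 33 = 63 schema_break failures, matching the first row of the totals table.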

All bench passes (newest first)

Pass 2026-05-16 16:31:49 UTC 4e158ca clean critique-pass (s13)
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | openrouter/meta-llama/llama-4-scout standard | 100.0% (6/6) | $0.00007±$0.00000 | 0.84s / 1.01s | 6 (100% 1st-try) | none
2 | google/gemini-2.5-flash-lite standard | 100.0% (6/6) | $0.00007±$0.00000 | 1.47s / 1.92s | 6 (100% 1st-try) | none
3 | openai/gpt-4o-mini standard | 100.0% (6/6) | $0.00010±$0.00001 | 1.37s / 1.97s | 6 (100% 1st-try) | none
4 | openrouter/meta-llama/llama-4-maverick standard | 100.0% (6/6) | $0.00012±$0.00001 | 1.50s / 1.75s | 6 (100% 1st-try) | none
5 | xai/grok-4-fast standard | 100.0% (6/6) | $0.00017±$0.00001 | 4.63s / 6.42s | 6 (100% 1st-try) | none
6 | openrouter/deepseek/deepseek-chat standard | 100.0% (6/6) | $0.00020±$0.00001 | 3.81s / 4.94s | 6 (100% 1st-try) | none
7 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 100.0% (6/6) | $0.00023±$0.00000 | 1.82s / 2.36s | 6 (100% 1st-try) | none
8 | xai/grok-3-mini standard | 100.0% (6/6) | $0.00025±$0.00001 | 5.48s / 10.55s | 6 (100% 1st-try) | none
9 | google/gemini-2.5-flash standard | 100.0% (6/6) | $0.00028±$0.00002 | 1.98s / 2.40s | 6 (100% 1st-try) | none
10 | perplexity/sonar search | 100.0% (6/6) | $0.00054±$0.00001 | 3.36s / 5.11s | 6 (100% 1st-try) | none
11 | openrouter/deepseek/deepseek-r1 reasoning | 100.0% (6/6) | $0.00129±$0.00021 | 130.54s / 264.24s | 6 (100% 1st-try) | none
12 | openai/o3-mini reasoning | 100.0% (6/6) | $0.00129±$0.00020 | 2.39s / 3.09s | 6 (100% 1st-try) | none
13 | openrouter/mistralai/mistral-large standard | 100.0% (6/6) | $0.00132±$0.00006 | 1.46s / 1.76s | 6 (100% 1st-try) | none
14 | perplexity/sonar-reasoning-pro reasoning | 100.0% (6/6) | $0.00134±$0.00008 | 4.42s / 6.06s | 6 (100% 1st-try) | none
15 | openai/o4-mini reasoning | 100.0% (6/6) | $0.00154±$0.00016 | 3.39s / 5.52s | 6 (100% 1st-try) | none
16 | openai/gpt-4o standard | 100.0% (6/6) | $0.00170±$0.00009 | 1.64s / 2.54s | 6 (100% 1st-try) | none
17 | openai/o3 reasoning | 100.0% (6/6) | $0.00198±$0.00031 | 2.79s / 3.19s | 6 (100% 1st-try) | none
18 | perplexity/sonar-pro search | 100.0% (6/6) | $0.00209±$0.00015 | 3.47s / 3.79s | 6 (100% 1st-try) | none
19 | anthropic/claude-sonnet-4-6 standard | 100.0% (6/6) | $0.00260±$0.00015 | 2.15s / 2.66s | 6 (100% 1st-try) | none
20 | xai/grok-3 standard | 100.0% (6/6) | $0.00290±$0.00014 | 5.31s / 6.40s | 6 (100% 1st-try) | none
21 | xai/grok-4 standard | 100.0% (6/6) | $0.00290±$0.00014 | 5.12s / 6.54s | 6 (100% 1st-try) | none
22 | google/gemini-2.5-pro standard | 33.3% (2/6) | $0.01114±$0.00004 | 12.43s / 22.54s | 16 (0% 1st-try) | 14 schema_break
23 | anthropic/claude-haiku-4-5 standard | 0.0% (0/6) | infinite | 1.37s / 2.10s | 18 (0% 1st-try) | 18 schema_break
24 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | infinite | 0.12s / 0.24s | 18 (0% 1st-try) | 18 error
25 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | infinite | 0.03s / 0.10s | 18 (0% 1st-try) | 18 error
26 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | infinite | 0.03s / 0.14s | 18 (0% 1st-try) | 18 error
27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | infinite | 0.07s / 0.14s | 18 (0% 1st-try) | 18 error

Pass 2026-05-08 15:11:05 UTC 46ce0a5 clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | openrouter/meta-llama/llama-4-scout standard | 100.0% (6/6) | $0.00003±$0.00000 | 0.88s / 1.17s | 6 (100% 1st-try) | none
2 | google/gemini-2.5-flash-lite standard | 100.0% (6/6) | $0.00003±$0.00000 | 6.06s / 17.35s | 7 (83% 1st-try) | 1 error
3 | openai/gpt-4o-mini standard | 100.0% (6/6) | $0.00005±$0.00000 | 0.86s / 1.03s | 6 (100% 1st-try) | none
4 | openrouter/meta-llama/llama-4-maverick standard | 100.0% (6/6) | $0.00005±$0.00000 | 2.67s / 6.05s | 6 (100% 1st-try) | none
5 | xai/grok-3-mini standard | 100.0% (6/6) | $0.00008±$0.00000 | 5.98s / 6.12s | 6 (100% 1st-try) | none
6 | xai/grok-4-fast standard | 100.0% (6/6) | $0.00009±$0.00000 | 2.20s / 4.47s | 6 (100% 1st-try) | none
7 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 100.0% (6/6) | $0.00010±$0.00001 | 0.78s / 4.34s | 6 (100% 1st-try) | none
8 | openrouter/deepseek/deepseek-chat standard | 100.0% (6/6) | $0.00011±$0.00004 | 1.31s / 1.92s | 7 (83% 1st-try) | 1 schema_break
9 | google/gemini-2.5-flash standard | 100.0% (6/6) | $0.00013±$0.00001 | 0.99s / 1.29s | 6 (100% 1st-try) | none
10 | perplexity/sonar search | 100.0% (6/6) | $0.00034±$0.00015 | 1.66s / 3.36s | 8 (67% 1st-try) | 2 schema_break
11 | openrouter/deepseek/deepseek-r1 reasoning | 100.0% (6/6) | $0.00048±$0.00024 | 6.30s / 12.02s | 6 (100% 1st-try) | none
12 | openai/o3-mini reasoning | 100.0% (6/6) | $0.00049±$0.00015 | 1.73s / 3.20s | 6 (100% 1st-try) | none
13 | openai/o4-mini reasoning | 100.0% (6/6) | $0.00055±$0.00016 | 2.01s / 2.24s | 6 (100% 1st-try) | none
14 | openrouter/mistralai/mistral-large standard | 100.0% (6/6) | $0.00060±$0.00003 | 1.37s / 5.16s | 6 (100% 1st-try) | none
15 | openai/o3 reasoning | 100.0% (6/6) | $0.00076±$0.00003 | 1.76s / 2.89s | 6 (100% 1st-try) | none
16 | openai/gpt-4o standard | 100.0% (6/6) | $0.00077±$0.00004 | 1.24s / 2.45s | 6 (100% 1st-try) | none
17 | xai/grok-3 standard | 100.0% (6/6) | $0.00098±$0.00006 | 1.29s / 2.08s | 6 (100% 1st-try) | none
18 | anthropic/claude-sonnet-4-6 standard | 100.0% (6/6) | $0.00119±$0.00007 | 1.19s / 1.73s | 6 (100% 1st-try) | none
19 | perplexity/sonar-pro search | 83.3% (5/6) | $0.00167±$0.00007 | 1.60s / 3.30s | 8 (83% 1st-try) | 3 schema_break
20 | xai/grok-4 standard | 100.0% (6/6) | $0.00302±$0.00006 | 3.85s / 5.50s | 6 (100% 1st-try) | none
21 | google/gemini-2.5-pro standard | 16.7% (1/6) | $0.01093 | 5.62s / 12.08s | 17 (0% 1st-try) | 16 schema_break
22 | anthropic/claude-haiku-4-5 standard | 0.0% (0/6) | infinite | 0.71s / 1.81s | 18 (0% 1st-try) | 18 schema_break
23 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | infinite | 0.12s / 0.17s | 18 (0% 1st-try) | 18 error
24 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | infinite | 0.03s / 0.09s | 18 (0% 1st-try) | 18 error
25 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | infinite | 0.02s / 0.08s | 18 (0% 1st-try) | 18 error
26 | perplexity/sonar-reasoning-pro reasoning | 0.0% (0/6) | infinite | 4.00s / 7.46s | 18 (0% 1st-try) | 18 schema_break
27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | infinite | 0.08s / 0.14s | 18 (0% 1st-try) | 18 error

Pass 2026-05-08 01:46:18 UTC e6e987c clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite standard | 100.0% (15/15) | $0.00003±$0.00000 | 0.48s / 0.62s | 15 (100% 1st-try) | none
2 | openai/gpt-4o-mini standard | 100.0% (15/15) | $0.00005±$0.00000 | 0.82s / 1.06s | 15 (100% 1st-try) | none
3 | google/gemini-2.5-flash standard | 100.0% (15/15) | $0.00013±$0.00001 | 0.92s / 0.97s | 15 (100% 1st-try) | none
4 | openai/gpt-4o standard | 100.0% (15/15) | $0.00078±$0.00004 | 0.52s / 0.82s | 15 (100% 1st-try) | none
5 | anthropic/claude-sonnet-4-6 standard | 100.0% (15/15) | $0.00120±$0.00007 | 1.17s / 1.72s | 15 (100% 1st-try) | none
6 | google/gemini-2.5-pro standard | 40.0% (6/15) | $0.00418±$0.00012 | 6.06s / 13.01s | 39 (0% 1st-try) | 33 schema_break
7 | anthropic/claude-haiku-4-5 standard | 0.0% (0/15) | infinite | 0.80s / 1.85s | 45 (0% 1st-try) | 45 schema_break
8 | anthropic/claude-opus-4-7 standard | 0.0% (0/15) | infinite | 0.11s / 0.22s | 45 (0% 1st-try) | 45 error

Pass 2026-05-08 01:28:01 UTC d95fbc2 clean
# | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass)
1 | google/gemini-2.5-flash-lite standard | 100.0% (15/15) | $0.00003 | 0.55s / 0.90s | 15 (100% 1st-try) | none
2 | openai/gpt-4o-mini standard | 100.0% (15/15) | $0.00005 | 0.77s / 1.30s | 15 (100% 1st-try) | none
3 | google/gemini-2.5-flash standard | 100.0% (15/15) | $0.00013 | 0.95s / 1.37s | 15 (100% 1st-try) | none
4 | openai/gpt-4o standard | 100.0% (15/15) | $0.00078 | 0.52s / 1.00s | 15 (100% 1st-try) | none
5 | anthropic/claude-sonnet-4-6 standard | 100.0% (15/15) | $0.00120 | 1.06s / 1.37s | 15 (100% 1st-try) | none
6 | anthropic/claude-haiku-4-5 standard | 0.0% (0/15) | infinite | 0.73s / 4.25s | 45 (0% 1st-try) | 45 schema_break
7 | anthropic/claude-opus-4-7 standard | 0.0% (0/15) | infinite | 0.12s / 0.22s | 45 (0% 1st-try) | 45 error