Critique-Pass delta (latest clean off vs latest clean on)
Per-model deltas between the critique-off baseline and the critique-on track, per METHODOLOGY s13.3. A positive Δ eff_TCoT means the critique pass made the model more expensive; a positive Δ success (in percentage points, pp) means it improved accuracy. Negative deltas are reported as measured. Δ eff_TCoT is n/a wherever either side is infinite.
| Provider / Model | Off eff_TCoT | On eff_TCoT | Δ eff_TCoT | Off success | On success | Δ success |
|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 (standard) | $0.00053 | $0.00120 | +127.4% | 66.7% | 66.7% | +0.0 pp |
| anthropic/claude-opus-4-7 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| anthropic/claude-sonnet-4-6 (standard) | $0.00142 | $0.00323 | +128.3% | 66.7% | 66.7% | +0.0 pp |
| google/gemini-2.5-flash (standard) | $0.00012 | $0.00028 | +130.3% | 66.7% | 66.7% | +0.0 pp |
| google/gemini-2.5-flash-lite (standard) | $0.00004 | $0.00009 | +123.1% | 66.7% | 66.7% | +0.0 pp |
| google/gemini-2.5-pro (standard) | $0.00046 | $0.00114 | +148.2% | 83.3% | 66.7% | -16.7 pp |
| openai/gpt-4o (standard) | $0.00090 | $0.00208 | +131.8% | 66.7% | 66.7% | +0.0 pp |
| openai/gpt-4o-mini (standard) | $0.00006 | $0.00013 | +125.4% | 66.7% | 66.7% | +0.0 pp |
| openai/o3 (reasoning) | $0.00661 | $0.01409 | +113.1% | 66.7% | 66.7% | +0.0 pp |
| openai/o3-mini (reasoning) | $0.00289 | $0.00506 | +75.4% | 66.7% | 66.7% | +0.0 pp |
| openai/o4-mini (reasoning) | $0.00314 | $0.00720 | +128.9% | 66.7% | 66.7% | +0.0 pp |
| openrouter/cohere/command-r-plus (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| openrouter/deepseek/deepseek-chat (standard) | $0.00010 | $0.00024 | +128.1% | 66.7% | 66.7% | +0.0 pp |
| openrouter/deepseek/deepseek-r1 (reasoning) | $0.00266 | $0.00613 | +130.2% | 66.7% | 66.7% | +0.0 pp |
| openrouter/meta-llama/llama-3.3-70b-instruct (standard) | $0.00015 | $0.00058 | +288.8% | 66.7% | 50.0% | -16.7 pp |
| openrouter/meta-llama/llama-4-maverick (standard) | $0.00007 | $0.00016 | +123.6% | 66.7% | 66.7% | +0.0 pp |
| openrouter/meta-llama/llama-4-scout (standard) | $0.00004 | $0.00009 | +91.0% | 66.7% | 66.7% | +0.0 pp |
| openrouter/mistralai/mistral-large (standard) | $0.00058 | $0.00169 | +190.6% | 83.3% | 66.7% | -16.7 pp |
| openrouter/qwen/qwen-3-235b-a22b (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| perplexity/sonar (search) | $0.00033 | $0.00079 | +136.8% | 66.7% | 66.7% | +0.0 pp |
| perplexity/sonar-pro (search) | $0.00149 | $0.00252 | +69.6% | 66.7% | 66.7% | +0.0 pp |
| perplexity/sonar-reasoning (reasoning) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| perplexity/sonar-reasoning-pro (reasoning) | infinite | $0.00168 | n/a | 0.0% | 66.7% | +66.7 pp |
| xai/grok-3 (standard) | $0.00037 | $0.00472 | +1182.1% | 100.0% | 66.7% | -33.3 pp |
| xai/grok-3-mini (standard) | $0.00011 | $0.00043 | +297.4% | 66.7% | 66.7% | +0.0 pp |
| xai/grok-4 (standard) | $0.00629 | $0.00454 | -27.9% | 66.7% | 66.7% | +0.0 pp |
| xai/grok-4-fast (standard) | $0.00015 | $0.00030 | +102.4% | 66.7% | 66.7% | +0.0 pp |
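The two delta columns can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes Δ eff_TCoT is a relative change in percent, Δ success is a difference in percentage points, and an infinite cost on either side yields n/a (as the table shows). Because the table displays rounded costs, recomputing Δ eff_TCoT from the printed values can differ from the reported figure in the last digit.

```python
import math
from typing import Optional


def cost_delta_pct(off_cost: float, on_cost: float) -> Optional[float]:
    """Relative change in eff_TCoT (percent); None when the delta is undefined."""
    if math.isinf(off_cost) or math.isinf(on_cost) or off_cost == 0:
        return None  # rendered as "n/a" in the table
    return (on_cost - off_cost) / off_cost * 100.0


def success_delta_pp(off_ok: int, on_ok: int, tasks: int) -> float:
    """Change in success rate, in percentage points."""
    return (on_ok - off_ok) / tasks * 100.0


def fmt_cost_delta(delta: Optional[float]) -> str:
    return "n/a" if delta is None else f"{delta:+.1f}%"


def fmt_success_delta(delta: float) -> str:
    return f"{delta:+.1f} pp"


if __name__ == "__main__":
    # google/gemini-2.5-pro: 5/6 successes with critique off, 4/6 with it on.
    print(fmt_success_delta(success_delta_pp(5, 4, 6)))
    # perplexity/sonar-reasoning-pro: infinite cost on the off side.
    print(fmt_cost_delta(cost_delta_pct(float("inf"), 0.00168)))
    # xai/grok-4, from the rounded costs shown in the table.
    print(fmt_cost_delta(cost_delta_pct(0.00629, 0.00454)))
```

Returning `None` rather than a sentinel number keeps undefined deltas from being sorted or averaged by accident.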
Reproducibility (latest pass vs previous)
| Provider / Model | Latest eff_TCoT | Previous eff_TCoT | Δ eff_TCoT | Δ success rate |
|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | $0.00053 | $0.00057 | -7.3% | +0.0 pp |
| anthropic/claude-opus-4-7 | n/a | n/a | n/a | +0.0 pp |
| anthropic/claude-sonnet-4-6 | $0.00142 | $0.00174 | -18.5% | +6.7 pp |
| google/gemini-2.5-flash | $0.00012 | $0.00012 | +2.5% | -6.7 pp |
| google/gemini-2.5-flash-lite | $0.00004 | $0.00004 | -13.4% | +6.7 pp |
| google/gemini-2.5-pro | $0.00046 | $0.00049 | -5.8% | +10.0 pp |
| openai/gpt-4o | $0.00090 | $0.00110 | -18.1% | +6.7 pp |
| openai/gpt-4o-mini | $0.00006 | $0.00007 | -18.5% | +6.7 pp |
Failure-mode totals across all passes
Counts per failure mode, summed over every attempt in every pass; useful for spotting which providers systematically exhibit which failure patterns.
| Provider / Model | Failure modes (total counts) |
|---|---|
| google/gemini-2.5-flash-lite | 48 confabulation, 1 error |
| openrouter/meta-llama/llama-4-scout | 12 confabulation, 2 schema_break |
| openai/gpt-4o-mini | 48 confabulation |
| openrouter/meta-llama/llama-4-maverick | 12 confabulation |
| openrouter/deepseek/deepseek-chat | 12 confabulation |
| google/gemini-2.5-flash | 28 confabulation |
| xai/grok-4-fast | 12 confabulation |
| xai/grok-3-mini | 12 confabulation |
| openrouter/meta-llama/llama-3.3-70b-instruct | 15 confabulation |
| perplexity/sonar | 12 confabulation |
| google/gemini-2.5-pro | 19 confabulation, 8 schema_break |
| anthropic/claude-haiku-4-5 | 44 confabulation |
| perplexity/sonar-reasoning-pro | 24 confabulation |
| openrouter/mistralai/mistral-large | 11 confabulation |
| openai/gpt-4o | 48 confabulation |
| perplexity/sonar-pro | 12 confabulation |
| anthropic/claude-sonnet-4-6 | 48 confabulation |
| xai/grok-4 | 12 confabulation |
| xai/grok-3 | 6 confabulation |
| openai/o3-mini | 12 confabulation |
| openrouter/deepseek/deepseek-r1 | 12 confabulation |
| openai/o4-mini | 12 confabulation |
| openai/o3 | 12 confabulation |
| anthropic/claude-opus-4-7 | 126 error |
| openrouter/cohere/command-r-plus | 36 error |
| openrouter/qwen/qwen-3-235b-a22b | 36 error |
| perplexity/sonar-reasoning | 36 error |
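The totals above are plain sums of the per-pass failure-mode counts. A minimal sketch of that aggregation, using the google/gemini-2.5-pro counts from the three passes in which it appears, might look like the following; the function name and the dictionary layout are assumptions made for illustration, not the benchmark's actual data model.

```python
from collections import Counter
from typing import Dict, Iterable


def total_failure_modes(passes: Iterable[Dict[str, int]]) -> Counter:
    """Sum failure-mode counts over every pass for one model."""
    total: Counter = Counter()
    for counts in passes:
        total.update(counts)  # Counter.update adds counts, it does not overwrite
    return total


# Per-pass counts for google/gemini-2.5-pro, newest pass first.
gemini_pro_passes = [
    {"schema_break": 4, "confabulation": 2},
    {"confabulation": 5},
    {"confabulation": 12, "schema_break": 4},
]

totals = total_failure_modes(gemini_pro_passes)

if __name__ == "__main__":
    for mode, count in sorted(totals.items()):
        print(f"{count} {mode}")
```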
All bench passes (chronological, newest first)
Pass 1 (newest)
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite (standard) | 66.7% (4/6) | | 0.99s / 1.23s | 10 (67% 1st-try) | 6 confabulation |
| 2 | openrouter/meta-llama/llama-4-scout (standard) | 66.7% (4/6) | | 0.43s / 0.77s | 10 (67% 1st-try) | 6 confabulation |
| 3 | openai/gpt-4o-mini (standard) | 66.7% (4/6) | | 1.13s / 1.66s | 10 (67% 1st-try) | 6 confabulation |
| 4 | openrouter/meta-llama/llama-4-maverick (standard) | 66.7% (4/6) | | 1.07s / 2.01s | 10 (67% 1st-try) | 6 confabulation |
| 5 | openrouter/deepseek/deepseek-chat (standard) | 66.7% (4/6) | | 2.54s / 8.29s | 10 (67% 1st-try) | 6 confabulation |
| 6 | google/gemini-2.5-flash (standard) | 66.7% (4/6) | | 2.02s / 19.72s | 10 (67% 1st-try) | 6 confabulation |
| 7 | xai/grok-4-fast (standard) | 66.7% (4/6) | | 6.05s / 20.94s | 10 (67% 1st-try) | 6 confabulation |
| 8 | xai/grok-3-mini (standard) | 66.7% (4/6) | | 6.21s / 11.95s | 10 (67% 1st-try) | 6 confabulation |
| 9 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 50.0% (3/6) | | 1.17s / 8.35s | 12 (50% 1st-try) | 9 confabulation |
| 10 | perplexity/sonar (search) | 66.7% (4/6) | | 3.74s / 3.96s | 10 (67% 1st-try) | 6 confabulation |
| 11 | google/gemini-2.5-pro (standard) | 66.7% (4/6) | | 9.77s / 31.75s | 10 (67% 1st-try) | 4 schema_break, 2 confabulation |
| 12 | anthropic/claude-haiku-4-5 (standard) | 66.7% (4/6) | | 1.25s / 1.34s | 10 (67% 1st-try) | 6 confabulation |
| 13 | perplexity/sonar-reasoning-pro (reasoning) | 66.7% (4/6) | | 4.62s / 7.54s | 10 (67% 1st-try) | 6 confabulation |
| 14 | openrouter/mistralai/mistral-large (standard) | 66.7% (4/6) | | 1.09s / 1.46s | 10 (67% 1st-try) | 6 confabulation |
| 15 | openai/gpt-4o (standard) | 66.7% (4/6) | | 1.10s / 6.71s | 10 (67% 1st-try) | 6 confabulation |
| 16 | perplexity/sonar-pro (search) | 66.7% (4/6) | | 3.92s / 5.44s | 10 (67% 1st-try) | 6 confabulation |
| 17 | anthropic/claude-sonnet-4-6 (standard) | 66.7% (4/6) | | 2.12s / 2.40s | 10 (67% 1st-try) | 6 confabulation |
| 18 | xai/grok-4 (standard) | 66.7% (4/6) | | 5.73s / 15.56s | 10 (67% 1st-try) | 6 confabulation |
| 19 | xai/grok-3 (standard) | 66.7% (4/6) | | 7.14s / 16.23s | 10 (67% 1st-try) | 6 confabulation |
| 20 | openai/o3-mini (reasoning) | 66.7% (4/6) | | 3.92s / 14.39s | 10 (67% 1st-try) | 6 confabulation |
| 21 | openrouter/deepseek/deepseek-r1 (reasoning) | 66.7% (4/6) | | 25.88s / 167.43s | 10 (67% 1st-try) | 6 confabulation |
| 22 | openai/o4-mini (reasoning) | 66.7% (4/6) | | 3.53s / 8.92s | 10 (67% 1st-try) | 6 confabulation |
| 23 | openai/o3 (reasoning) | 66.7% (4/6) | | 2.39s / 17.25s | 10 (67% 1st-try) | 6 confabulation |
| 24 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | | 0.11s / 0.31s | 18 (0% 1st-try) | 18 error |
| 25 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | | 0.03s / 0.10s | 18 (0% 1st-try) | 18 error |
| 26 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | | 0.04s / 0.11s | 18 (0% 1st-try) | 18 error |
| 27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | | 0.07s / 0.15s | 18 (0% 1st-try) | 18 error |
Pass 2
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite (standard) | 66.7% (4/6) | | 6.13s / 18.96s | 11 (50% 1st-try) | 6 confabulation, 1 error |
| 2 | openrouter/meta-llama/llama-4-scout (standard) | 66.7% (4/6) | | 0.41s / 0.58s | 12 (33% 1st-try) | 6 confabulation, 2 schema_break |
| 3 | openai/gpt-4o-mini (standard) | 66.7% (4/6) | | 0.62s / 0.76s | 10 (67% 1st-try) | 6 confabulation |
| 4 | openrouter/meta-llama/llama-4-maverick (standard) | 66.7% (4/6) | | 0.85s / 3.69s | 10 (67% 1st-try) | 6 confabulation |
| 5 | openrouter/deepseek/deepseek-chat (standard) | 66.7% (4/6) | | 0.93s / 1.49s | 10 (67% 1st-try) | 6 confabulation |
| 6 | xai/grok-3-mini (standard) | 66.7% (4/6) | | 12.55s / 16.70s | 10 (67% 1st-try) | 6 confabulation |
| 7 | google/gemini-2.5-flash (standard) | 66.7% (4/6) | | 1.03s / 10.84s | 10 (67% 1st-try) | 6 confabulation |
| 8 | xai/grok-4-fast (standard) | 66.7% (4/6) | | 1.83s / 7.16s | 10 (67% 1st-try) | 6 confabulation |
| 9 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 66.7% (4/6) | | 0.57s / 4.98s | 10 (67% 1st-try) | 6 confabulation |
| 10 | perplexity/sonar (search) | 66.7% (4/6) | | 1.42s / 1.75s | 10 (67% 1st-try) | 6 confabulation |
| 11 | xai/grok-3 (standard) | 100.0% (6/6) | | 0.90s / 2.67s | 6 (100% 1st-try) | none |
| 12 | google/gemini-2.5-pro (standard) | 83.3% (5/6) | | 6.97s / 13.99s | 10 (67% 1st-try) | 5 confabulation |
| 13 | anthropic/claude-haiku-4-5 (standard) | 66.7% (4/6) | | 0.67s / 0.79s | 10 (67% 1st-try) | 6 confabulation |
| 14 | openrouter/mistralai/mistral-large (standard) | 83.3% (5/6) | | 0.62s / 2.50s | 10 (67% 1st-try) | 5 confabulation |
| 15 | openai/gpt-4o (standard) | 66.7% (4/6) | | 0.77s / 1.69s | 10 (67% 1st-try) | 6 confabulation |
| 16 | anthropic/claude-sonnet-4-6 (standard) | 66.7% (4/6) | | 1.02s / 1.52s | 10 (67% 1st-try) | 6 confabulation |
| 17 | perplexity/sonar-pro (search) | 66.7% (4/6) | | 1.53s / 2.82s | 10 (67% 1st-try) | 6 confabulation |
| 18 | openrouter/deepseek/deepseek-r1 (reasoning) | 66.7% (4/6) | | 10.87s / 68.12s | 10 (67% 1st-try) | 6 confabulation |
| 19 | openai/o3-mini (reasoning) | 66.7% (4/6) | | 2.11s / 5.38s | 10 (67% 1st-try) | 6 confabulation |
| 20 | openai/o4-mini (reasoning) | 66.7% (4/6) | | 2.13s / 7.69s | 10 (67% 1st-try) | 6 confabulation |
| 21 | xai/grok-4 (standard) | 66.7% (4/6) | | 5.10s / 84.15s | 10 (67% 1st-try) | 6 confabulation |
| 22 | openai/o3 (reasoning) | 66.7% (4/6) | | 3.80s / 7.88s | 10 (67% 1st-try) | 6 confabulation |
| 23 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | | 0.12s / 0.51s | 18 (0% 1st-try) | 18 error |
| 24 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | | 0.02s / 0.08s | 18 (0% 1st-try) | 18 error |
| 25 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | | 0.02s / 0.10s | 18 (0% 1st-try) | 18 error |
| 26 | perplexity/sonar-reasoning-pro (reasoning) | 0.0% (0/6) | | 7.37s / 11.67s | 18 (0% 1st-try) | 18 confabulation |
| 27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | | 0.07s / 0.12s | 18 (0% 1st-try) | 18 error |
Pass 3
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite (standard) | 60.0% (9/15) | | 0.50s / 0.73s | 27 (60% 1st-try) | 18 confabulation |
| 2 | openai/gpt-4o-mini (standard) | 60.0% (9/15) | | 0.67s / 1.34s | 27 (60% 1st-try) | 18 confabulation |
| 3 | google/gemini-2.5-flash (standard) | 73.3% (11/15) | | 1.15s / 10.89s | 27 (60% 1st-try) | 16 confabulation |
| 4 | google/gemini-2.5-pro (standard) | 73.3% (11/15) | | 6.45s / 18.34s | 27 (60% 1st-try) | 12 confabulation, 4 schema_break |
| 5 | anthropic/claude-haiku-4-5 (standard) | 66.7% (10/15) | | 0.70s / 1.51s | 27 (60% 1st-try) | 17 confabulation |
| 6 | openai/gpt-4o (standard) | 60.0% (9/15) | | 0.51s / 0.73s | 27 (60% 1st-try) | 18 confabulation |
| 7 | anthropic/claude-sonnet-4-6 (standard) | 60.0% (9/15) | | 1.08s / 3.36s | 27 (60% 1st-try) | 18 confabulation |
| 8 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | | 0.13s / 0.41s | 45 (0% 1st-try) | 45 error |
Pass 4
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite (standard) | 60.0% (9/15) | | 0.49s / 0.80s | 27 (60% 1st-try) | 18 confabulation |
| 2 | openai/gpt-4o-mini (standard) | 60.0% (9/15) | | 0.61s / 0.88s | 27 (60% 1st-try) | 18 confabulation |
| 3 | anthropic/claude-haiku-4-5 (standard) | 80.0% (12/15) | | 0.61s / 1.28s | 27 (60% 1st-try) | 15 confabulation |
| 4 | openai/gpt-4o (standard) | 60.0% (9/15) | | 0.49s / 0.81s | 27 (60% 1st-try) | 18 confabulation |
| 5 | anthropic/claude-sonnet-4-6 (standard) | 60.0% (9/15) | | 1.11s / 1.91s | 27 (60% 1st-try) | 18 confabulation |
| 6 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | | 0.11s / 0.27s | 45 (0% 1st-try) | 45 error |
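In every row of the pass tables, the failure-mode counts add up to the number of attempts minus the number of successes, which suggests (as an observation from this data, not a documented rule) that each failed attempt is tagged with exactly one failure mode. The sketch below checks that relationship for a single row; the helper names are hypothetical, and the parser accepts counts with or without a space between the number and the mode name.

```python
import re
from typing import Dict


def parse_failure_modes(cell: str) -> Dict[str, int]:
    """Parse a cell such as '6 confabulation, 2 schema_break' or 'none'."""
    if cell.strip().lower() in ("", "none"):
        return {}
    pairs = re.findall(r"(\d+)\s*([A-Za-z_]+)", cell)
    return {name: int(count) for count, name in pairs}


def failures_consistent(success_cell: str, attempts_cell: str, failure_cell: str) -> bool:
    """True when attempts minus successes equals the summed failure-mode counts."""
    success_match = re.search(r"\((\d+)/(\d+)\)", success_cell)
    attempts_match = re.match(r"\s*(\d+)", attempts_cell)
    if success_match is None or attempts_match is None:
        raise ValueError("unrecognized cell format")
    successes = int(success_match.group(1))
    attempts = int(attempts_match.group(1))
    failed = sum(parse_failure_modes(failure_cell).values())
    return attempts - successes == failed


if __name__ == "__main__":
    # openrouter/meta-llama/llama-4-scout, second-newest pass.
    print(failures_consistent("66.7% (4/6)", "12 (33% 1st-try)",
                              "6 confabulation, 2 schema_break"))
```

A row that fails this check would point to either a transcription problem or an attempt that was logged with zero or multiple failure modes.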