Critique-Pass delta (latest clean off vs latest clean on)
Per-model deltas between the critique-off baseline and the critique-on track, per METHODOLOGY §13.3. A positive Δ eff_TCoT means the critique pass raised the effective cost; a positive Δ success means it raised the success rate. Negative deltas are reported as measured. Where eff_TCoT is infinite in either track (0% success), the cost delta is shown as n/a.
| Provider / Model | Off eff_TCoT | On eff_TCoT | Δ eff_TCoT | Off success | On success | Δ success |
|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| anthropic/claude-opus-4-7 (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| anthropic/claude-sonnet-4-6 (standard) | $0.00119 | $0.00260 | +118.2% | 100.0% | 100.0% | +0.0 pp |
| google/gemini-2.5-flash (standard) | $0.00013 | $0.00028 | +115.4% | 100.0% | 100.0% | +0.0 pp |
| google/gemini-2.5-flash-lite (standard) | $0.00003 | $0.00007 | +113.0% | 100.0% | 100.0% | +0.0 pp |
| google/gemini-2.5-pro (standard) | $0.01093 | $0.01114 | +1.9% | 16.7% | 33.3% | +16.7 pp |
| openai/gpt-4o (standard) | $0.00077 | $0.00170 | +120.6% | 100.0% | 100.0% | +0.0 pp |
| openai/gpt-4o-mini (standard) | $0.00005 | $0.00010 | +120.6% | 100.0% | 100.0% | +0.0 pp |
| openai/o3 (reasoning) | $0.00076 | $0.00198 | +160.1% | 100.0% | 100.0% | +0.0 pp |
| openai/o3-mini (reasoning) | $0.00049 | $0.00129 | +163.8% | 100.0% | 100.0% | +0.0 pp |
| openai/o4-mini (reasoning) | $0.00055 | $0.00154 | +177.8% | 100.0% | 100.0% | +0.0 pp |
| openrouter/cohere/command-r-plus (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| openrouter/deepseek/deepseek-chat (standard) | $0.00011 | $0.00020 | +85.4% | 100.0% | 100.0% | +0.0 pp |
| openrouter/deepseek/deepseek-r1 (reasoning) | $0.00048 | $0.00129 | +170.4% | 100.0% | 100.0% | +0.0 pp |
| openrouter/meta-llama/llama-3.3-70b-instruct (standard) | $0.00010 | $0.00023 | +127.0% | 100.0% | 100.0% | +0.0 pp |
| openrouter/meta-llama/llama-4-maverick (standard) | $0.00005 | $0.00012 | +122.1% | 100.0% | 100.0% | +0.0 pp |
| openrouter/meta-llama/llama-4-scout (standard) | $0.00003 | $0.00007 | +122.4% | 100.0% | 100.0% | +0.0 pp |
| openrouter/mistralai/mistral-large (standard) | $0.00060 | $0.00132 | +121.8% | 100.0% | 100.0% | +0.0 pp |
| openrouter/qwen/qwen-3-235b-a22b (standard) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| perplexity/sonar (search) | $0.00034 | $0.00054 | +55.7% | 100.0% | 100.0% | +0.0 pp |
| perplexity/sonar-pro (search) | $0.00167 | $0.00209 | +25.1% | 83.3% | 100.0% | +16.7 pp |
| perplexity/sonar-reasoning (reasoning) | infinite | infinite | n/a | 0.0% | 0.0% | +0.0 pp |
| perplexity/sonar-reasoning-pro (reasoning) | infinite | $0.00134 | n/a | 0.0% | 100.0% | +100.0 pp |
| xai/grok-3 (standard) | $0.00098 | $0.00290 | +195.0% | 100.0% | 100.0% | +0.0 pp |
| xai/grok-3-mini (standard) | $0.00008 | $0.00025 | +218.0% | 100.0% | 100.0% | +0.0 pp |
| xai/grok-4 (standard) | $0.00302 | $0.00290 | -3.8% | 100.0% | 100.0% | +0.0 pp |
| xai/grok-4-fast (standard) | $0.00009 | $0.00017 | +102.7% | 100.0% | 100.0% | +0.0 pp |
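
The two delta columns can be reproduced with a few lines of Python. This is a minimal sketch, not the bench's own code: the function names are hypothetical, the cost delta is assumed to be a relative change against the critique-off value, and the success delta is assumed to be a plain difference in percentage points. The published percentages were presumably computed from unrounded costs, so recomputing them from the five-decimal values shown in the table can differ by a few tenths of a percent.

```python
import math


def cost_delta_pct(off_cost, on_cost):
    """Relative change in effective cost, in percent.

    Returns None when either side is infinite (0% success), which the
    table renders as "n/a". Assumes off_cost is non-zero.
    """
    if math.isinf(off_cost) or math.isinf(on_cost):
        return None
    return (on_cost - off_cost) / off_cost * 100.0


def success_delta_pp(off_rate, on_rate):
    """Difference in success rate, in percentage points (rates are 0..1)."""
    return (on_rate - off_rate) * 100.0


def format_cost_delta(delta):
    """Render a cost delta the way the table does: signed, one decimal."""
    return "n/a" if delta is None else f"{delta:+.1f}%"


def format_success_delta(delta):
    """Render a success delta in signed percentage points."""
    return f"{delta:+.1f} pp"


if __name__ == "__main__":
    # A model that succeeds 1/6 with critique off and 2/6 with critique on.
    print(format_success_delta(success_delta_pp(1 / 6, 2 / 6)))
    # A model with 0% success in the off track has infinite effective cost.
    print(format_cost_delta(cost_delta_pct(math.inf, 0.00134)))
```

Returning `None` for the infinite case keeps the "no meaningful ratio" state distinct from a real 0% change, which a numeric sentinel would blur.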
Reproducibility (latest pass vs previous)
| Provider / Model | Latest eff_TCoT | Previous eff_TCoT | Δ eff_TCoT | Δ success rate |
|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | n/a | n/a | n/a | +0.0 pp |
| anthropic/claude-opus-4-7 | n/a | n/a | n/a | +0.0 pp |
| anthropic/claude-sonnet-4-6 | $0.00119 | $0.00120 | -0.7% | +0.0 pp |
| google/gemini-2.5-flash | $0.00013 | $0.00013 | -1.8% | +0.0 pp |
| google/gemini-2.5-flash-lite | $0.00003 | $0.00003 | +1.8% | +0.0 pp |
| google/gemini-2.5-pro | $0.01093 | $0.00418 | +161.3% | -23.3 pp |
| openai/gpt-4o | $0.00077 | $0.00078 | -0.7% | +0.0 pp |
| openai/gpt-4o-mini | $0.00005 | $0.00005 | -0.7% | +0.0 pp |
Failure-mode totals across all passes
Counts per failure mode, summed over every attempt in every pass. Useful for spotting which models fail systematically, and in which way.
| Provider / Model | Failure modes (total counts) |
|---|---|
| google/gemini-2.5-pro | 63 schema_break |
| anthropic/claude-haiku-4-5 | 126 schema_break |
| anthropic/claude-opus-4-7 | 126 error |
| openrouter/cohere/command-r-plus | 36 error |
| openrouter/qwen/qwen-3-235b-a22b | 36 error |
| perplexity/sonar-reasoning | 36 error |
| google/gemini-2.5-flash-lite | 1 error |
| openrouter/deepseek/deepseek-chat | 1 schema_break |
| perplexity/sonar | 2 schema_break |
| perplexity/sonar-pro | 3 schema_break |
| perplexity/sonar-reasoning-pro | 18 schema_break |
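
The totals are a straightforward sum of the per-pass failure counts. A minimal sketch of that aggregation follows, assuming each pass is available as a mapping from model name to a mode-count dictionary; that data shape and the function name `total_failure_modes` are illustrative, not the bench's actual storage format. The sample counts are the per-pass figures for two models taken from the pass tables in this report.

```python
from collections import Counter, defaultdict


def total_failure_modes(passes):
    """Sum failure-mode counts per model across all passes.

    `passes` is a list of dicts: {model_name: {failure_mode: count}}.
    Models absent from a pass simply contribute nothing for that pass.
    """
    totals = defaultdict(Counter)
    for bench_pass in passes:
        for model, modes in bench_pass.items():
            totals[model].update(modes)  # Counter.update adds counts
    return totals


# Per-pass counts for two models, newest pass first.
passes = [
    {"google/gemini-2.5-pro": {"schema_break": 14},
     "anthropic/claude-opus-4-7": {"error": 18}},
    {"google/gemini-2.5-pro": {"schema_break": 16},
     "anthropic/claude-opus-4-7": {"error": 18}},
    {"google/gemini-2.5-pro": {"schema_break": 33},
     "anthropic/claude-opus-4-7": {"error": 45}},
    {"anthropic/claude-opus-4-7": {"error": 45}},
]

if __name__ == "__main__":
    for model, modes in total_failure_modes(passes).items():
        summary = ", ".join(f"{n} {mode}" for mode, n in modes.items())
        print(f"{model}: {summary}")
```

For these inputs the sums come to 63 `schema_break` for google/gemini-2.5-pro and 126 `error` for anthropic/claude-opus-4-7, matching the table above.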
All bench passes (chronological, newest first)
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | openrouter/meta-llama/llama-4-scout (standard) | 100.0% (6/6) | | 0.84s / 1.01s | 6 (100% 1st-try) | none |
| 2 | google/gemini-2.5-flash-lite (standard) | 100.0% (6/6) | | 1.47s / 1.92s | 6 (100% 1st-try) | none |
| 3 | openai/gpt-4o-mini (standard) | 100.0% (6/6) | | 1.37s / 1.97s | 6 (100% 1st-try) | none |
| 4 | openrouter/meta-llama/llama-4-maverick (standard) | 100.0% (6/6) | | 1.50s / 1.75s | 6 (100% 1st-try) | none |
| 5 | xai/grok-4-fast (standard) | 100.0% (6/6) | | 4.63s / 6.42s | 6 (100% 1st-try) | none |
| 6 | openrouter/deepseek/deepseek-chat (standard) | 100.0% (6/6) | | 3.81s / 4.94s | 6 (100% 1st-try) | none |
| 7 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 100.0% (6/6) | | 1.82s / 2.36s | 6 (100% 1st-try) | none |
| 8 | xai/grok-3-mini (standard) | 100.0% (6/6) | | 5.48s / 10.55s | 6 (100% 1st-try) | none |
| 9 | google/gemini-2.5-flash (standard) | 100.0% (6/6) | | 1.98s / 2.40s | 6 (100% 1st-try) | none |
| 10 | perplexity/sonar (search) | 100.0% (6/6) | | 3.36s / 5.11s | 6 (100% 1st-try) | none |
| 11 | openrouter/deepseek/deepseek-r1 (reasoning) | 100.0% (6/6) | | 130.54s / 264.24s | 6 (100% 1st-try) | none |
| 12 | openai/o3-mini (reasoning) | 100.0% (6/6) | | 2.39s / 3.09s | 6 (100% 1st-try) | none |
| 13 | openrouter/mistralai/mistral-large (standard) | 100.0% (6/6) | | 1.46s / 1.76s | 6 (100% 1st-try) | none |
| 14 | perplexity/sonar-reasoning-pro (reasoning) | 100.0% (6/6) | | 4.42s / 6.06s | 6 (100% 1st-try) | none |
| 15 | openai/o4-mini (reasoning) | 100.0% (6/6) | | 3.39s / 5.52s | 6 (100% 1st-try) | none |
| 16 | openai/gpt-4o (standard) | 100.0% (6/6) | | 1.64s / 2.54s | 6 (100% 1st-try) | none |
| 17 | openai/o3 (reasoning) | 100.0% (6/6) | | 2.79s / 3.19s | 6 (100% 1st-try) | none |
| 18 | perplexity/sonar-pro (search) | 100.0% (6/6) | | 3.47s / 3.79s | 6 (100% 1st-try) | none |
| 19 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (6/6) | | 2.15s / 2.66s | 6 (100% 1st-try) | none |
| 20 | xai/grok-3 (standard) | 100.0% (6/6) | | 5.31s / 6.40s | 6 (100% 1st-try) | none |
| 21 | xai/grok-4 (standard) | 100.0% (6/6) | | 5.12s / 6.54s | 6 (100% 1st-try) | none |
| 22 | google/gemini-2.5-pro (standard) | 33.3% (2/6) | | 12.43s / 22.54s | 16 (0% 1st-try) | 14 schema_break |
| 23 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/6) | | 1.37s / 2.10s | 18 (0% 1st-try) | 18 schema_break |
| 24 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | | 0.12s / 0.24s | 18 (0% 1st-try) | 18 error |
| 25 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | | 0.03s / 0.10s | 18 (0% 1st-try) | 18 error |
| 26 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | | 0.03s / 0.14s | 18 (0% 1st-try) | 18 error |
| 27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | | 0.07s / 0.14s | 18 (0% 1st-try) | 18 error |
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | openrouter/meta-llama/llama-4-scout (standard) | 100.0% (6/6) | | 0.88s / 1.17s | 6 (100% 1st-try) | none |
| 2 | google/gemini-2.5-flash-lite (standard) | 100.0% (6/6) | | 6.06s / 17.35s | 7 (83% 1st-try) | 1 error |
| 3 | openai/gpt-4o-mini (standard) | 100.0% (6/6) | | 0.86s / 1.03s | 6 (100% 1st-try) | none |
| 4 | openrouter/meta-llama/llama-4-maverick (standard) | 100.0% (6/6) | | 2.67s / 6.05s | 6 (100% 1st-try) | none |
| 5 | xai/grok-3-mini (standard) | 100.0% (6/6) | | 5.98s / 6.12s | 6 (100% 1st-try) | none |
| 6 | xai/grok-4-fast (standard) | 100.0% (6/6) | | 2.20s / 4.47s | 6 (100% 1st-try) | none |
| 7 | openrouter/meta-llama/llama-3.3-70b-instruct (standard) | 100.0% (6/6) | | 0.78s / 4.34s | 6 (100% 1st-try) | none |
| 8 | openrouter/deepseek/deepseek-chat (standard) | 100.0% (6/6) | | 1.31s / 1.92s | 7 (83% 1st-try) | 1 schema_break |
| 9 | google/gemini-2.5-flash (standard) | 100.0% (6/6) | | 0.99s / 1.29s | 6 (100% 1st-try) | none |
| 10 | perplexity/sonar (search) | 100.0% (6/6) | | 1.66s / 3.36s | 8 (67% 1st-try) | 2 schema_break |
| 11 | openrouter/deepseek/deepseek-r1 (reasoning) | 100.0% (6/6) | | 6.30s / 12.02s | 6 (100% 1st-try) | none |
| 12 | openai/o3-mini (reasoning) | 100.0% (6/6) | | 1.73s / 3.20s | 6 (100% 1st-try) | none |
| 13 | openai/o4-mini (reasoning) | 100.0% (6/6) | | 2.01s / 2.24s | 6 (100% 1st-try) | none |
| 14 | openrouter/mistralai/mistral-large (standard) | 100.0% (6/6) | | 1.37s / 5.16s | 6 (100% 1st-try) | none |
| 15 | openai/o3 (reasoning) | 100.0% (6/6) | | 1.76s / 2.89s | 6 (100% 1st-try) | none |
| 16 | openai/gpt-4o (standard) | 100.0% (6/6) | | 1.24s / 2.45s | 6 (100% 1st-try) | none |
| 17 | xai/grok-3 (standard) | 100.0% (6/6) | | 1.29s / 2.08s | 6 (100% 1st-try) | none |
| 18 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (6/6) | | 1.19s / 1.73s | 6 (100% 1st-try) | none |
| 19 | perplexity/sonar-pro (search) | 83.3% (5/6) | | 1.60s / 3.30s | 8 (83% 1st-try) | 3 schema_break |
| 20 | xai/grok-4 (standard) | 100.0% (6/6) | | 3.85s / 5.50s | 6 (100% 1st-try) | none |
| 21 | google/gemini-2.5-pro (standard) | 16.7% (1/6) | | 5.62s / 12.08s | 17 (0% 1st-try) | 16 schema_break |
| 22 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/6) | | 0.71s / 1.81s | 18 (0% 1st-try) | 18 schema_break |
| 23 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/6) | | 0.12s / 0.17s | 18 (0% 1st-try) | 18 error |
| 24 | openrouter/cohere/command-r-plus (standard) | 0.0% (0/6) | | 0.03s / 0.09s | 18 (0% 1st-try) | 18 error |
| 25 | openrouter/qwen/qwen-3-235b-a22b (standard) | 0.0% (0/6) | | 0.02s / 0.08s | 18 (0% 1st-try) | 18 error |
| 26 | perplexity/sonar-reasoning-pro (reasoning) | 0.0% (0/6) | | 4.00s / 7.46s | 18 (0% 1st-try) | 18 schema_break |
| 27 | perplexity/sonar-reasoning (reasoning) | 0.0% (0/6) | | 0.08s / 0.14s | 18 (0% 1st-try) | 18 error |
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) | | 0.48s / 0.62s | 15 (100% 1st-try) | none |
| 2 | openai/gpt-4o-mini (standard) | 100.0% (15/15) | | 0.82s / 1.06s | 15 (100% 1st-try) | none |
| 3 | google/gemini-2.5-flash (standard) | 100.0% (15/15) | | 0.92s / 0.97s | 15 (100% 1st-try) | none |
| 4 | openai/gpt-4o (standard) | 100.0% (15/15) | | 0.52s / 0.82s | 15 (100% 1st-try) | none |
| 5 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | | 1.17s / 1.72s | 15 (100% 1st-try) | none |
| 6 | google/gemini-2.5-pro (standard) | 40.0% (6/15) | | 6.06s / 13.01s | 39 (0% 1st-try) | 33 schema_break |
| 7 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/15) | | 0.80s / 1.85s | 45 (0% 1st-try) | 45 schema_break |
| 8 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | | 0.11s / 0.22s | 45 (0% 1st-try) | 45 error |
| # | Provider / Model | Success | effective_TCoT | Latency p50 / p95 | Attempts | Failure modes (this pass) |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite (standard) | 100.0% (15/15) | | 0.55s / 0.90s | 15 (100% 1st-try) | none |
| 2 | openai/gpt-4o-mini (standard) | 100.0% (15/15) | | 0.77s / 1.30s | 15 (100% 1st-try) | none |
| 3 | google/gemini-2.5-flash (standard) | 100.0% (15/15) | | 0.95s / 1.37s | 15 (100% 1st-try) | none |
| 4 | openai/gpt-4o (standard) | 100.0% (15/15) | | 0.52s / 1.00s | 15 (100% 1st-try) | none |
| 5 | anthropic/claude-sonnet-4-6 (standard) | 100.0% (15/15) | | 1.06s / 1.37s | 15 (100% 1st-try) | none |
| 6 | anthropic/claude-haiku-4-5 (standard) | 0.0% (0/15) | | 0.73s / 4.25s | 45 (0% 1st-try) | 45 schema_break |
| 7 | anthropic/claude-opus-4-7 (standard) | 0.0% (0/15) | | 0.12s / 0.22s | 45 (0% 1st-try) | 45 error |