Headline metric: effective_TCoT is total spend per successful task completion, including the cost of failed retries. Lower is better. The methodology is the contribution; the leaderboard is its proof.
Validation is machine-checkable (no LLM-as-judge). Failures are classified into eight modes (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). One canonical prompt per task across all providers. Temperature 0; N=3 runs per task instance; cost-per-million-token pricing verified per provider.
See the full methodology for formulas, retry policy, validator contract, and reproducibility caveats. New to the metrics? The glossary defines each term.
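A minimal sketch of the metric, assuming per-attempt USD costs and machine-validated successes (the function name is illustrative, not the bench's API):

```python
def effective_tcot(attempt_costs_usd, n_successes):
    """Total spend per successful completion, counting failed retries.

    Every attempt's cost is summed, then divided by the number of
    validated successes. Zero successes yields infinity, matching the
    leaderboard's "infinite" rows.
    """
    total = sum(attempt_costs_usd)
    if n_successes == 0:
        return float("inf")
    return total / n_successes
```

Three attempts at $0.01 each with two successes cost roughly $0.015 per success, not $0.01: the failed retry is not free.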
Cross-task ranking
Each cell is the model's rank on that task in its latest clean pass. Lower is better. Ranks marked with * come from rows with a 0% success rate on that task; they are excluded from the average, and models with no unstarred rank show avg rank n/a. Rows are grouped by model_class (standard, then reasoning, then search) and sorted within each class by average observed rank. Cross-class comparisons mislead: reasoning models burn 5x to 20x more output tokens per attempt, and search models pay an unmetered per-search fee; see methodology s2.7.
| Provider / Model | Class | structured_extraction | function_call_routing | synthetic_rag | avg rank |
|---|---|---|---|---|---|
| openai/gpt-4o-mini | standard | #1 | #3 | #3 | 2.3 |
| openrouter/meta-llama/llama-4-scout | standard | #4 | #1 | #2 | 2.3 |
| openrouter/meta-llama/llama-4-maverick | standard | #2 | #4 | #4 | 3.3 |
| google/gemini-2.5-flash-lite | standard | #8 | #2 | #1 | 3.7 |
| xai/grok-3-mini | standard | #3 | #5 | #6 | 4.7 |
| openrouter/deepseek/deepseek-chat | standard | #7 | #8 | #5 | 6.7 |
| xai/grok-4-fast | standard | #6 | #6 | #8 | 6.7 |
| openrouter/meta-llama/llama-3.3-70b-instruct | standard | #5 | #7 | #9 | 7.0 |
| google/gemini-2.5-flash | standard | #18 | #9 | #7 | 11.3 |
| anthropic/claude-haiku-4-5 | standard | #21* | #22* | #13 | 13.0 |
| openai/gpt-4o | standard | #10 | #16 | #15 | 13.7 |
| openrouter/mistralai/mistral-large | standard | #14 | #14 | #14 | 14.0 |
| xai/grok-3 | standard | #15 | #17 | #11 | 14.3 |
| google/gemini-2.5-pro | standard | #23* | #21 | #12 | 16.5 |
| anthropic/claude-sonnet-4-6 | standard | #16 | #18 | #16 | 16.7 |
| xai/grok-4 | standard | #20 | #20 | #21 | 20.3 |
| anthropic/claude-opus-4-7 | standard | #22* | #23* | #23* | n/a |
| openrouter/cohere/command-r-plus | standard | #24* | #24* | #24* | n/a |
| openrouter/qwen/qwen-3-235b-a22b | standard | #25* | #25* | #25* | n/a |
| openai/o3-mini | reasoning | #11 | #12 | #19 | 14.0 |
| openai/o4-mini | reasoning | #12 | #13 | #20 | 15.0 |
| openrouter/deepseek/deepseek-r1 | reasoning | #17 | #11 | #18 | 15.3 |
| openai/o3 | reasoning | #13 | #15 | #22 | 16.7 |
| perplexity/sonar-reasoning | reasoning | #27* | #27* | #27* | n/a |
| perplexity/sonar-reasoning-pro | reasoning | #26* | #26* | #26* | n/a |
| perplexity/sonar | search | #9 | #10 | #10 | 9.7 |
| perplexity/sonar-pro | search | #19 | #19 | #17 | 18.3 |
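The avg-rank column can be reproduced with a small helper; this is a sketch, and the starred-rank exclusion is our reading of the table rather than a documented rule:

```python
def avg_observed_rank(ranks):
    """Mean of a model's per-task ranks, excluding starred placements.

    ranks: list of (rank, starred) pairs. Starred ranks come from 0%
    success rows and do not count toward the average; all-starred rows
    render as "n/a" on the leaderboard.
    """
    observed = [rank for rank, starred in ranks if not starred]
    if not observed:
        return "n/a"
    return round(sum(observed) / len(observed), 1)
```

For google/gemini-2.5-pro, `[(23, True), (21, False), (12, False)]` averages the two unstarred ranks to 16.5, matching the table.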
Task: structured_extraction
Latest clean pass · 2026-05-08 15:11:05 UTC · 46ce0a5
| # | Provider / Model | Success | Latency p50 / p95 | 1st-try pass | Failure modes (this pass) |
|---|---|---|---|---|---|
| 1 | openai/gpt-4o-mini standard | 100.0% (6/6) | 1.45s / 1.58s | 100% (6/6) | none |
| 2 | openrouter/meta-llama/llama-4-maverick standard | 100.0% (6/6) | 2.38s / 5.55s | 100% (6/6) | none |
| 3 | xai/grok-3-mini standard | 100.0% (6/6) | 7.67s / 8.35s | 100% (6/6) | none |
| 4 | openrouter/meta-llama/llama-4-scout standard | 83.3% (5/6) | 1.17s / 2.18s | 33% (2/6) | 6× schema_break |
| 5 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 100.0% (6/6) | 1.25s / 1.86s | 100% (6/6) | none |
| 6 | xai/grok-4-fast standard | 100.0% (6/6) | 2.52s / 3.28s | 100% (6/6) | none |
| 7 | openrouter/deepseek/deepseek-chat standard | 100.0% (6/6) | 2.62s / 10.68s | 83% (5/6) | 1× schema_break |
| 8 | google/gemini-2.5-flash-lite standard | 100.0% (6/6) | 6.92s / 17.68s | 0% (0/6) | 8× schema_break |
| 9 | perplexity/sonar search | 100.0% (6/6) | 1.92s / 2.73s | 33% (2/6) | 7× schema_break |
| 10 | openai/gpt-4o standard | 100.0% (6/6) | 1.07s / 2.00s | 100% (6/6) | none |
| 11 | openai/o3-mini reasoning | 100.0% (6/6) | 4.15s / 14.22s | 100% (6/6) | none |
| 12 | openai/o4-mini reasoning | 100.0% (6/6) | 1.73s / 11.97s | 100% (6/6) | none |
| 13 | openai/o3 reasoning | 100.0% (6/6) | 2.04s / 3.10s | 100% (6/6) | none |
| 14 | openrouter/mistralai/mistral-large standard | 100.0% (6/6) | 1.19s / 1.40s | 50% (3/6) | 3× schema_break |
| 15 | xai/grok-3 standard | 100.0% (6/6) | 15.30s / 46.68s | 100% (6/6) | none |
| 16 | anthropic/claude-sonnet-4-6 standard | 100.0% (6/6) | 8.80s / 26.46s | 83% (5/6) | 2× error |
| 17 | openrouter/deepseek/deepseek-r1 reasoning | 100.0% (6/6) | 17.26s / 33.41s | 100% (6/6) | none |
| 18 | google/gemini-2.5-flash standard | 50.0% (3/6) | 2.31s / 3.21s | 0% (0/6) | 13× schema_break |
| 19 | perplexity/sonar-pro search | 83.3% (5/6) | 1.98s / 2.50s | 50% (3/6) | 6× schema_break |
| 20 | xai/grok-4 standard | 100.0% (6/6) | 6.06s / 6.69s | 100% (6/6) | none |
| 21 | anthropic/claude-haiku-4-5 standard | 0.0% (0/6) | 0.80s / 1.31s | 0% (0/6) | 18× schema_break |
| 22 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | 0.14s / 0.26s | 0% (0/6) | 18× error |
| 23 | google/gemini-2.5-pro standard | 0.0% (0/6) | 7.03s / 9.84s | 0% (0/6) | 18× schema_break |
| 24 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | 0.02s / 0.08s | 0% (0/6) | 18× error |
| 25 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | 0.03s / 0.08s | 0% (0/6) | 18× error |
| 26 | perplexity/sonar-reasoning-pro reasoning | 0.0% (0/6) | 4.42s / 6.46s | 0% (0/6) | 18× schema_break |
| 27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | 0.07s / 0.16s | 0% (0/6) | 18× error |
Rank rows marked with “≈” are within 5% of the rank above on effective_TCoT — treat as tied within bench noise. See methodology s7.
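A sketch of that tied-rank heuristic, assuming each row is compared against the one directly above it (our reading of methodology s7):

```python
def tie_markers(tcots_sorted_ascending):
    """Flag rows whose effective_TCoT is within 5% of the row above.

    Input must already be sorted ascending (leaderboard order). The
    first row can never be tied; a flagged row should be read as
    indistinguishable from its neighbor at bench noise levels.
    """
    if not tcots_sorted_ascending:
        return []
    marks = [""]
    for above, current in zip(tcots_sorted_ascending, tcots_sorted_ascending[1:]):
        within_5pct = above > 0 and (current - above) / above <= 0.05
        marks.append("≈" if within_5pct else "")
    return marks
```

For example, `[1.00, 1.04, 1.20]` flags only the middle row: a 4% gap is noise, a 15% gap is not.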
Cost calculator
| Provider / Model | Monthly cost (USD) |
|---|---|
| openai/gpt-4o-mini standard | $0.44 |
| openrouter/meta-llama/llama-4-maverick standard | $0.47 |
| xai/grok-3-mini standard | $0.56 |
| openrouter/meta-llama/llama-4-scout standard | $0.64 |
| openrouter/meta-llama/llama-3.3-70b-instruct standard | $0.67 |
| xai/grok-4-fast standard | $0.68 |
| openrouter/deepseek/deepseek-chat standard | $0.90 |
| google/gemini-2.5-flash-lite standard | $1.06 |
| perplexity/sonar search | $4.78 |
| openai/gpt-4o standard | $6.94 |
| openai/o3-mini reasoning | $7.36 |
| openai/o4-mini reasoning | $8.15 |
| openai/o3 reasoning | $9.69 |
| openrouter/mistralai/mistral-large standard | $9.79 |
| xai/grok-3 standard | $10.17 |
| anthropic/claude-sonnet-4-6 standard | $10.79 |
| openrouter/deepseek/deepseek-r1 reasoning | $11.43 |
| google/gemini-2.5-flash standard | $12.05 |
| perplexity/sonar-pro search | $25.67 |
| xai/grok-4 standard | $28.56 |
| anthropic/claude-haiku-4-5 standard | n/a |
| anthropic/claude-opus-4-7 standard | n/a |
| google/gemini-2.5-pro standard | n/a |
| openrouter/cohere/command-r-plus standard | n/a |
| openrouter/qwen/qwen-3-235b-a22b standard | n/a |
| perplexity/sonar-reasoning-pro reasoning | n/a |
| perplexity/sonar-reasoning reasoning | n/a |
Computed from this pass's effective_TCoT. Real production cost will vary with prompt length, model snapshot drift, and retry distribution on your data.
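The widget presumably multiplies the pass's effective_TCoT by a monthly volume of successful completions; that volume is not shown on this page, so the default below is purely illustrative:

```python
def monthly_cost_usd(effective_tcot_usd, completions_per_month=100_000):
    """Projected monthly spend at a fixed successful-completion volume.

    Hypothetical reconstruction of the cost-calculator widget. Real
    production cost also shifts with prompt length, snapshot drift,
    and the retry distribution on your own data.
    """
    return effective_tcot_usd * completions_per_month
```

At an effective_TCoT of $0.0001, 50,000 successful completions per month project to roughly $5.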
Critique-Pass track (METHODOLOGY s13)
| # | Provider / Model | Success | effective_TCoT | Δ TCoT | Δ success | Latency p50 / p95 |
|---|---|---|---|---|---|---|
| 1 | openai/gpt-4o-mini standard | 100.0% (6/6) | $0.00010 | +130.2% | +0.0 pp | 3.21s / 7.87s |
| 2 | openrouter/meta-llama/llama-4-maverick standard | 100.0% (6/6) | $0.00010 | +123.5% | +0.0 pp | 4.77s / 5.84s |
| 3 | openrouter/meta-llama/llama-4-scout standard | 100.0% (6/6) | $0.00012 | +79.7% | +16.7 pp | 1.92s / 2.27s |
| 4 | xai/grok-4-fast standard | 100.0% (6/6) | $0.00016 | +130.2% | +0.0 pp | 7.33s / 10.07s |
| 5 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 100.0% (6/6) | $0.00016 | +137.3% | +0.0 pp | 3.07s / 7.90s |
| 6 | openrouter/deepseek/deepseek-chat standard | 100.0% (6/6) | $0.00017 | +91.0% | +0.0 pp | 4.52s / 6.01s |
| 7 | xai/grok-3-mini standard | 100.0% (6/6) | $0.00021 | +280.0% | +0.0 pp | 7.14s / 178.48s |
| 8 | google/gemini-2.5-flash-lite standard | 66.7% (4/6) | $0.00034 | +225.3% | -33.3 pp | 7.81s / 22.92s |
| 9 | perplexity/sonar search | 100.0% (6/6) | $0.00036 | -24.9% | +0.0 pp | 4.25s / 4.91s |
| 10 | perplexity/sonar-reasoning-pro reasoning | 100.0% (6/6) | $0.00110 | n/a | +100.0 pp | 5.38s / 6.08s |
| 11 | openai/o3-mini reasoning | 100.0% (6/6) | $0.00132 | +79.5% | +0.0 pp | 3.15s / 3.60s |
| 12 | openrouter/mistralai/mistral-large standard | 100.0% (6/6) | $0.00136 | +39.4% | +0.0 pp | 2.47s / 3.18s |
| 13 | openai/gpt-4o standard | 100.0% (6/6) | $0.00150 | +116.4% | +0.0 pp | 2.41s / 5.04s |
| 14 | google/gemini-2.5-flash standard | 83.3% (5/6) | $0.00154 | +27.5% | +33.3 pp | 5.39s / 7.06s |
| 15 | perplexity/sonar-pro search | 100.0% (6/6) | $0.00193 | -24.9% | +16.7 pp | 4.30s / 4.77s |
| 16 | openai/o4-mini reasoning | 100.0% (6/6) | $0.00199 | +144.8% | +0.0 pp | 3.14s / 3.42s |
| 17 | openrouter/deepseek/deepseek-r1 reasoning | 100.0% (6/6) | $0.00214 | +87.5% | +0.0 pp | 26.19s / 193.73s |
| 18 | anthropic/claude-sonnet-4-6 standard | 100.0% (6/6) | $0.00223 | +106.5% | +0.0 pp | 2.47s / 2.55s |
| 19 | openai/o3 reasoning | 100.0% (6/6) | $0.00277 | +186.1% | +0.0 pp | 4.36s / 5.01s |
| 20 | xai/grok-3 standard | 100.0% (6/6) | $0.00305 | +199.7% | +0.0 pp | 5.82s / 145.38s |
| 21 | xai/grok-4 standard | 100.0% (6/6) | $0.00305 | +6.7% | +0.0 pp | 11.97s / 70.09s |
| 22 | anthropic/claude-haiku-4-5 standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 1.41s / 2.13s |
| 23 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.12s / 0.17s |
| 24 | google/gemini-2.5-pro standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 14.60s / 17.72s |
| 25 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.03s / 0.09s |
| 26 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.02s / 0.10s |
| 27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.06s / 0.13s |
7 bench passes recorded for this task. Full pass-by-pass breakdown, reproducibility deltas, and per-provider failure-mode aggregates are on the detailed task page.
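The Δ columns in the Critique-Pass table compare each critique run against the base run for the same task; a sketch of that computation (argument names are ours, not the bench's schema):

```python
import math

def critique_deltas(base_tcot, base_success_pct, crit_tcot, crit_success_pct):
    """Δ TCoT (percent) and Δ success (percentage points) vs. the base run.

    When the base effective_TCoT is infinite (0% base success), the cost
    delta is undefined and renders as "n/a" on the leaderboard.
    """
    if math.isinf(base_tcot) or base_tcot == 0:
        d_tcot = "n/a"
    else:
        d_tcot = round(100.0 * (crit_tcot - base_tcot) / base_tcot, 1)
    d_success = round(crit_success_pct - base_success_pct, 1)
    return d_tcot, d_success
```

This reproduces rows like sonar-reasoning-pro: an infinite base TCoT gives "n/a" for Δ TCoT alongside a +100.0 pp success delta.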
Task: function_call_routing
Latest clean pass · 2026-05-08 15:11:05 UTC · 46ce0a5
| # | Provider / Model | Success | Latency p50 / p95 | 1st-try pass | Failure modes (this pass) |
|---|---|---|---|---|---|
| 1 | openrouter/meta-llama/llama-4-scout standard | 100.0% (6/6) | 0.88s / 1.17s | 100% (6/6) | none |
| 2 | google/gemini-2.5-flash-lite standard | 100.0% (6/6) | 6.06s / 17.35s | 83% (5/6) | 1× error |
| 3 | openai/gpt-4o-mini standard | 100.0% (6/6) | 0.86s / 1.03s | 100% (6/6) | none |
| 4 | openrouter/meta-llama/llama-4-maverick standard | 100.0% (6/6) | 2.67s / 6.05s | 100% (6/6) | none |
| 5 | xai/grok-3-mini standard | 100.0% (6/6) | 5.98s / 6.12s | 100% (6/6) | none |
| 6 | xai/grok-4-fast standard | 100.0% (6/6) | 2.20s / 4.47s | 100% (6/6) | none |
| 7 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 100.0% (6/6) | 0.78s / 4.34s | 100% (6/6) | none |
| 8 | openrouter/deepseek/deepseek-chat standard | 100.0% (6/6) | 1.31s / 1.92s | 83% (5/6) | 1× schema_break |
| 9 | google/gemini-2.5-flash standard | 100.0% (6/6) | 0.99s / 1.29s | 100% (6/6) | none |
| 10 | perplexity/sonar search | 100.0% (6/6) | 1.66s / 3.36s | 67% (4/6) | 2× schema_break |
| 11 | openrouter/deepseek/deepseek-r1 reasoning | 100.0% (6/6) | 6.30s / 12.02s | 100% (6/6) | none |
| 12 | openai/o3-mini reasoning | 100.0% (6/6) | 1.73s / 3.20s | 100% (6/6) | none |
| 13 | openai/o4-mini reasoning | 100.0% (6/6) | 2.01s / 2.24s | 100% (6/6) | none |
| 14 | openrouter/mistralai/mistral-large standard | 100.0% (6/6) | 1.37s / 5.16s | 100% (6/6) | none |
| 15 | openai/o3 reasoning | 100.0% (6/6) | 1.76s / 2.89s | 100% (6/6) | none |
| 16 | openai/gpt-4o standard | 100.0% (6/6) | 1.24s / 2.45s | 100% (6/6) | none |
| 17 | xai/grok-3 standard | 100.0% (6/6) | 1.29s / 2.08s | 100% (6/6) | none |
| 18 | anthropic/claude-sonnet-4-6 standard | 100.0% (6/6) | 1.19s / 1.73s | 100% (6/6) | none |
| 19 | perplexity/sonar-pro search | 83.3% (5/6) | 1.60s / 3.30s | 83% (5/6) | 3× schema_break |
| 20 | xai/grok-4 standard | 100.0% (6/6) | 3.85s / 5.50s | 100% (6/6) | none |
| 21 | google/gemini-2.5-pro standard | 16.7% (1/6) | 5.62s / 12.08s | 0% (0/6) | 16× schema_break |
| 22 | anthropic/claude-haiku-4-5 standard | 0.0% (0/6) | 0.71s / 1.81s | 0% (0/6) | 18× schema_break |
| 23 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | 0.12s / 0.17s | 0% (0/6) | 18× error |
| 24 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | 0.03s / 0.09s | 0% (0/6) | 18× error |
| 25 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | 0.02s / 0.08s | 0% (0/6) | 18× error |
| 26 | perplexity/sonar-reasoning-pro reasoning | 0.0% (0/6) | 4.00s / 7.46s | 0% (0/6) | 18× schema_break |
| 27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | 0.08s / 0.14s | 0% (0/6) | 18× error |
Rank rows marked with “≈” are within 5% of the rank above on effective_TCoT — treat as tied within bench noise. See methodology s7.
Cost calculator
| Provider / Model | Monthly cost (USD) |
|---|---|
| openrouter/meta-llama/llama-4-scout standard | $0.30 |
| google/gemini-2.5-flash-lite standard | $0.35 |
| openai/gpt-4o-mini standard | $0.46 |
| openrouter/meta-llama/llama-4-maverick standard | $0.54 |
| xai/grok-3-mini standard | $0.79 |
| xai/grok-4-fast standard | $0.86 |
| openrouter/meta-llama/llama-3.3-70b-instruct standard | $1.00 |
| openrouter/deepseek/deepseek-chat standard | $1.08 |
| google/gemini-2.5-flash standard | $1.29 |
| perplexity/sonar search | $3.44 |
| openrouter/deepseek/deepseek-r1 reasoning | $4.75 |
| openai/o3-mini reasoning | $4.90 |
| openai/o4-mini reasoning | $5.54 |
| openrouter/mistralai/mistral-large standard | $5.95 |
| openai/o3 reasoning | $7.60 |
| openai/gpt-4o standard | $7.72 |
| xai/grok-3 standard | $9.84 |
| anthropic/claude-sonnet-4-6 standard | $11.91 |
| perplexity/sonar-pro search | $16.72 |
| xai/grok-4 standard | $30.18 |
| google/gemini-2.5-pro standard | $109.30 |
| anthropic/claude-haiku-4-5 standard | n/a |
| anthropic/claude-opus-4-7 standard | n/a |
| openrouter/cohere/command-r-plus standard | n/a |
| openrouter/qwen/qwen-3-235b-a22b standard | n/a |
| perplexity/sonar-reasoning-pro reasoning | n/a |
| perplexity/sonar-reasoning reasoning | n/a |
Computed from this pass's effective_TCoT. Real production cost will vary with prompt length, model snapshot drift, and retry distribution on your data.
Critique-Pass track (METHODOLOGY s13)
| # | Provider / Model | Success | effective_TCoT | Δ TCoT | Δ success | Latency p50 / p95 |
|---|---|---|---|---|---|---|
| 1 | openrouter/meta-llama/llama-4-scout standard | 100.0% (6/6) | $0.00007 | +122.4% | +0.0 pp | 0.84s / 1.01s |
| 2 | google/gemini-2.5-flash-lite standard | 100.0% (6/6) | $0.00007 | +113.0% | +0.0 pp | 1.47s / 1.92s |
| 3 | openai/gpt-4o-mini standard | 100.0% (6/6) | $0.00010 | +120.6% | +0.0 pp | 1.37s / 1.97s |
| 4 | openrouter/meta-llama/llama-4-maverick standard | 100.0% (6/6) | $0.00012 | +122.1% | +0.0 pp | 1.50s / 1.75s |
| 5 | xai/grok-4-fast standard | 100.0% (6/6) | $0.00017 | +102.7% | +0.0 pp | 4.63s / 6.42s |
| 6 | openrouter/deepseek/deepseek-chat standard | 100.0% (6/6) | $0.00020 | +85.4% | +0.0 pp | 3.81s / 4.94s |
| 7 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 100.0% (6/6) | $0.00023 | +127.0% | +0.0 pp | 1.82s / 2.36s |
| 8 | xai/grok-3-mini standard | 100.0% (6/6) | $0.00025 | +218.0% | +0.0 pp | 5.48s / 10.55s |
| 9 | google/gemini-2.5-flash standard | 100.0% (6/6) | $0.00028 | +115.4% | +0.0 pp | 1.98s / 2.40s |
| 10 | perplexity/sonar search | 100.0% (6/6) | $0.00054 | +55.7% | +0.0 pp | 3.36s / 5.11s |
| 11 | openrouter/deepseek/deepseek-r1 reasoning | 100.0% (6/6) | $0.00129 | +170.4% | +0.0 pp | 130.54s / 264.24s |
| 12 | openai/o3-mini reasoning | 100.0% (6/6) | $0.00129 | +163.8% | +0.0 pp | 2.39s / 3.09s |
| 13 | openrouter/mistralai/mistral-large standard | 100.0% (6/6) | $0.00132 | +121.8% | +0.0 pp | 1.46s / 1.76s |
| 14 | perplexity/sonar-reasoning-pro reasoning | 100.0% (6/6) | $0.00134 | n/a | +100.0 pp | 4.42s / 6.06s |
| 15 | openai/o4-mini reasoning | 100.0% (6/6) | $0.00154 | +177.8% | +0.0 pp | 3.39s / 5.52s |
| 16 | openai/gpt-4o standard | 100.0% (6/6) | $0.00170 | +120.6% | +0.0 pp | 1.64s / 2.54s |
| 17 | openai/o3 reasoning | 100.0% (6/6) | $0.00198 | +160.1% | +0.0 pp | 2.79s / 3.19s |
| 18 | perplexity/sonar-pro search | 100.0% (6/6) | $0.00209 | +25.1% | +16.7 pp | 3.47s / 3.79s |
| 19 | anthropic/claude-sonnet-4-6 standard | 100.0% (6/6) | $0.00260 | +118.2% | +0.0 pp | 2.15s / 2.66s |
| 20 | xai/grok-3 standard | 100.0% (6/6) | $0.00290 | +195.0% | +0.0 pp | 5.31s / 6.40s |
| 21 | xai/grok-4 standard | 100.0% (6/6) | $0.00290 | -3.8% | +0.0 pp | 5.12s / 6.54s |
| 22 | google/gemini-2.5-pro standard | 33.3% (2/6) | $0.01114 | +1.9% | +16.7 pp | 12.43s / 22.54s |
| 23 | anthropic/claude-haiku-4-5 standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 1.37s / 2.10s |
| 24 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.12s / 0.24s |
| 25 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.03s / 0.10s |
| 26 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.03s / 0.14s |
| 27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.07s / 0.14s |
4 bench passes recorded for this task. Full pass-by-pass breakdown, reproducibility deltas, and per-provider failure-mode aggregates are on the detailed task page.
Task: synthetic_rag
Latest clean pass · 2026-05-08 15:11:05 UTC · 46ce0a5
| # | Provider / Model | Success | Latency p50 / p95 | 1st-try pass | Failure modes (this pass) |
|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite standard | 66.7% (4/6) | 6.13s / 18.96s | 50% (3/6) | 6× confabulation, 1× error |
| 2 | openrouter/meta-llama/llama-4-scout standard | 66.7% (4/6) | 0.41s / 0.58s | 33% (2/6) | 6× confabulation, 2× schema_break |
| 3 | openai/gpt-4o-mini standard | 66.7% (4/6) | 0.62s / 0.76s | 67% (4/6) | 6× confabulation |
| 4 | openrouter/meta-llama/llama-4-maverick standard | 66.7% (4/6) | 0.85s / 3.69s | 67% (4/6) | 6× confabulation |
| 5 | openrouter/deepseek/deepseek-chat standard | 66.7% (4/6) | 0.93s / 1.49s | 67% (4/6) | 6× confabulation |
| 6 | xai/grok-3-mini standard | 66.7% (4/6) | 12.55s / 16.70s | 67% (4/6) | 6× confabulation |
| 7 | google/gemini-2.5-flash standard | 66.7% (4/6) | 1.03s / 10.84s | 67% (4/6) | 6× confabulation |
| 8 | xai/grok-4-fast standard | 66.7% (4/6) | 1.83s / 7.16s | 67% (4/6) | 6× confabulation |
| 9 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 66.7% (4/6) | 0.57s / 4.98s | 67% (4/6) | 6× confabulation |
| 10 | perplexity/sonar search | 66.7% (4/6) | 1.42s / 1.75s | 67% (4/6) | 6× confabulation |
| 11 | xai/grok-3 standard | 100.0% (6/6) | 0.90s / 2.67s | 100% (6/6) | none |
| 12 | google/gemini-2.5-pro standard | 83.3% (5/6) | 6.97s / 13.99s | 67% (4/6) | 5× confabulation |
| 13 | anthropic/claude-haiku-4-5 standard | 66.7% (4/6) | 0.67s / 0.79s | 67% (4/6) | 6× confabulation |
| 14 | openrouter/mistralai/mistral-large standard | 83.3% (5/6) | 0.62s / 2.50s | 67% (4/6) | 5× confabulation |
| 15 | openai/gpt-4o standard | 66.7% (4/6) | 0.77s / 1.69s | 67% (4/6) | 6× confabulation |
| 16 | anthropic/claude-sonnet-4-6 standard | 66.7% (4/6) | 1.02s / 1.52s | 67% (4/6) | 6× confabulation |
| 17 | perplexity/sonar-pro search | 66.7% (4/6) | 1.53s / 2.82s | 67% (4/6) | 6× confabulation |
| 18 | openrouter/deepseek/deepseek-r1 reasoning | 66.7% (4/6) | 10.87s / 68.12s | 67% (4/6) | 6× confabulation |
| 19 | openai/o3-mini reasoning | 66.7% (4/6) | 2.11s / 5.38s | 67% (4/6) | 6× confabulation |
| 20 | openai/o4-mini reasoning | 66.7% (4/6) | 2.13s / 7.69s | 67% (4/6) | 6× confabulation |
| 21 | xai/grok-4 standard | 66.7% (4/6) | 5.10s / 84.15s | 67% (4/6) | 6× confabulation |
| 22 | openai/o3 reasoning | 66.7% (4/6) | 3.80s / 7.88s | 67% (4/6) | 6× confabulation |
| 23 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | 0.12s / 0.51s | 0% (0/6) | 18× error |
| 24 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | 0.02s / 0.08s | 0% (0/6) | 18× error |
| 25 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | 0.02s / 0.10s | 0% (0/6) | 18× error |
| 26 | perplexity/sonar-reasoning-pro reasoning | 0.0% (0/6) | 7.37s / 11.67s | 0% (0/6) | 18× confabulation |
| 27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | 0.07s / 0.12s | 0% (0/6) | 18× error |
Rank rows marked with “≈” are within 5% of the rank above on effective_TCoT — treat as tied within bench noise. See methodology s7.
Cost calculator
| Provider / Model | Monthly cost (USD) |
|---|---|
| google/gemini-2.5-flash-lite standard | $0.38 |
| openrouter/meta-llama/llama-4-scout standard | $0.45 |
| openai/gpt-4o-mini standard | $0.59 |
| openrouter/meta-llama/llama-4-maverick standard | $0.74 |
| openrouter/deepseek/deepseek-chat standard | $1.05 |
| xai/grok-3-mini standard | $1.07 |
| google/gemini-2.5-flash standard | $1.20 |
| xai/grok-4-fast standard | $1.47 |
| openrouter/meta-llama/llama-3.3-70b-instruct standard | $1.48 |
| perplexity/sonar search | $3.33 |
| xai/grok-3 standard | $3.68 |
| google/gemini-2.5-pro standard | $4.58 |
| anthropic/claude-haiku-4-5 standard | $5.27 |
| openrouter/mistralai/mistral-large standard | $5.81 |
| openai/gpt-4o standard | $8.98 |
| anthropic/claude-sonnet-4-6 standard | $14.16 |
| perplexity/sonar-pro search | $14.87 |
| openrouter/deepseek/deepseek-r1 reasoning | $26.63 |
| openai/o3-mini reasoning | $28.86 |
| openai/o4-mini reasoning | $31.43 |
| xai/grok-4 standard | $62.89 |
| openai/o3 reasoning | $66.10 |
| anthropic/claude-opus-4-7 standard | n/a |
| openrouter/cohere/command-r-plus standard | n/a |
| openrouter/qwen/qwen-3-235b-a22b standard | n/a |
| perplexity/sonar-reasoning-pro reasoning | n/a |
| perplexity/sonar-reasoning reasoning | n/a |
Computed from this pass's effective_TCoT. Real production cost will vary with prompt length, model snapshot drift, and retry distribution on your data.
Critique-Pass track (METHODOLOGY s13)
| # | Provider / Model | Success | effective_TCoT | Δ TCoT | Δ success | Latency p50 / p95 |
|---|---|---|---|---|---|---|
| 1 | google/gemini-2.5-flash-lite standard | 66.7% (4/6) | $0.00009 | +123.1% | +0.0 pp | 0.99s / 1.23s |
| 2 | openrouter/meta-llama/llama-4-scout standard | 66.7% (4/6) | $0.00009 | +91.0% | +0.0 pp | 0.43s / 0.77s |
| 3 | openai/gpt-4o-mini standard | 66.7% (4/6) | $0.00013 | +125.4% | +0.0 pp | 1.13s / 1.66s |
| 4 | openrouter/meta-llama/llama-4-maverick standard | 66.7% (4/6) | $0.00016 | +123.6% | +0.0 pp | 1.07s / 2.01s |
| 5 | openrouter/deepseek/deepseek-chat standard | 66.7% (4/6) | $0.00024 | +128.1% | +0.0 pp | 2.54s / 8.29s |
| 6 | google/gemini-2.5-flash standard | 66.7% (4/6) | $0.00028 | +130.3% | +0.0 pp | 2.02s / 19.72s |
| 7 | xai/grok-4-fast standard | 66.7% (4/6) | $0.00030 | +102.4% | +0.0 pp | 6.05s / 20.94s |
| 8 | xai/grok-3-mini standard | 66.7% (4/6) | $0.00043 | +297.4% | +0.0 pp | 6.21s / 11.95s |
| 9 | openrouter/meta-llama/llama-3.3-70b-instruct standard | 50.0% (3/6) | $0.00058 | +288.8% | -16.7 pp | 1.17s / 8.35s |
| 10 | perplexity/sonar search | 66.7% (4/6) | $0.00079 | +136.8% | +0.0 pp | 3.74s / 3.96s |
| 11 | google/gemini-2.5-pro standard | 66.7% (4/6) | $0.00114 | +148.2% | -16.7 pp | 9.77s / 31.75s |
| 12 | anthropic/claude-haiku-4-5 standard | 66.7% (4/6) | $0.00120 | +127.4% | +0.0 pp | 1.25s / 1.34s |
| 13 | perplexity/sonar-reasoning-pro reasoning | 66.7% (4/6) | $0.00168 | n/a | +66.7 pp | 4.62s / 7.54s |
| 14 | openrouter/mistralai/mistral-large standard | 66.7% (4/6) | $0.00169 | +190.6% | -16.7 pp | 1.09s / 1.46s |
| 15 | openai/gpt-4o standard | 66.7% (4/6) | $0.00208 | +131.8% | +0.0 pp | 1.10s / 6.71s |
| 16 | perplexity/sonar-pro search | 66.7% (4/6) | $0.00252 | +69.6% | +0.0 pp | 3.92s / 5.44s |
| 17 | anthropic/claude-sonnet-4-6 standard | 66.7% (4/6) | $0.00323 | +128.3% | +0.0 pp | 2.12s / 2.40s |
| 18 | xai/grok-4 standard | 66.7% (4/6) | $0.00454 | -27.9% | +0.0 pp | 5.73s / 15.56s |
| 19 | xai/grok-3 standard | 66.7% (4/6) | $0.00472 | +1182.1% | -33.3 pp | 7.14s / 16.23s |
| 20 | openai/o3-mini reasoning | 66.7% (4/6) | $0.00506 | +75.4% | +0.0 pp | 3.92s / 14.39s |
| 21 | openrouter/deepseek/deepseek-r1 reasoning | 66.7% (4/6) | $0.00613 | +130.2% | +0.0 pp | 25.88s / 167.43s |
| 22 | openai/o4-mini reasoning | 66.7% (4/6) | $0.00720 | +128.9% | +0.0 pp | 3.53s / 8.92s |
| 23 | openai/o3 reasoning | 66.7% (4/6) | $0.01409 | +113.1% | +0.0 pp | 2.39s / 17.25s |
| 24 | anthropic/claude-opus-4-7 standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.11s / 0.31s |
| 25 | openrouter/cohere/command-r-plus standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.03s / 0.10s |
| 26 | openrouter/qwen/qwen-3-235b-a22b standard | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.04s / 0.11s |
| 27 | perplexity/sonar-reasoning reasoning | 0.0% (0/6) | infinite | n/a | +0.0 pp | 0.07s / 0.15s |
4 bench passes recorded for this task. Full pass-by-pass breakdown, reproducibility deltas, and per-provider failure-mode aggregates are on the detailed task page.
Coming in v1: document_extraction_ocr
Multimodal document OCR task on a redistribution-compatible open corpus (FUNSD, SROIE, CORD candidates, gated on license verification). Tests the procurement-question stressors that synthetic-render demos miss:
- Handwriting tolerance
- Multi-column and table layouts
- Scan artifacts: skew, noise, JPEG compression, low resolution
- Mixed text + graphics; multi-page documents
Same exact-field-match validator as structured_extraction, so the delta between text-input and image-input runs isolates the OCR layer's per-provider contribution. The methodology bump from 0.2 to 0.3 lands alongside it: image-input pricing in `pricing.py`, a `modality` field on the `Task` protocol, and attempt-and-classify vision detection (no hardcoded `supports_vision` list). See METHODOLOGY s10 for why the synthetic-render shortcut was rejected.
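As a point of reference, an exact-field-match validator can be sketched in a few lines (assumed semantics; the bench's actual validator contract and normalization rules live in the methodology):

```python
def exact_field_match(expected_fields, extracted):
    """Machine-checkable pass/fail: every expected field must be present
    and equal after trivial whitespace normalization. No LLM-as-judge;
    extra extracted fields are ignored.
    """
    def norm(value):
        return " ".join(str(value).split())
    return all(
        key in extracted and norm(extracted[key]) == norm(expected)
        for key, expected in expected_fields.items()
    )
```

Because the check is binary and deterministic, the same function can score both text-input and image-input runs of the same documents.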
Coming in v0.5
- Async runner. Parallelize adapter calls across providers within a task. Cuts wall-clock time roughly in proportion to provider count.
- BYO task config. TOML schema plus a `ConfigDrivenTask` that loads `--task-config PATH`. Validator types: `json_field_match`, `exact_match`, `regex_match`, `schema_match`. Goal: bench your own prompts without writing Python.
- Plugin loader. `--plugins-dir DIR` loads tasks and providers from arbitrary paths. Prerequisite for the v1 hosted-run service.
- Bootstrap confidence intervals on `effective_TCoT` to replace the v0.3 5%-ratio tied-rank heuristic.
- N=5 default once the bootstrap CI infrastructure is in place.
- Search-cost accounting in TCoT for Perplexity Sonar entries. Per-search fees (about $5 per 1k searches) are currently excluded; v0.5 closes the gap.
- Tuned-prompt track formalization. A real contract for who tunes, against which split, to what convergence criterion. Renders as a "tuned delta" column.
- Historical leaderboard with trend lines. Per-provider `effective_TCoT` and `success_rate` across model snapshots, with inline SVG sparklines.
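The four validator types planned for the BYO task config could dispatch like this (a sketch under assumed semantics; `schema_match` is stubbed to required-key presence, and the real TOML-driven contract may differ):

```python
import json
import re

def run_validator(vtype, expected, output):
    """Dispatch for the planned BYO-config validator types."""
    if vtype == "exact_match":
        return output.strip() == str(expected).strip()
    if vtype == "regex_match":
        return re.fullmatch(expected, output.strip()) is not None
    data = json.loads(output)  # remaining types expect JSON output
    if vtype == "json_field_match":
        return all(data.get(k) == v for k, v in expected.items())
    if vtype == "schema_match":
        return all(k in data for k in expected)
    raise ValueError(f"unknown validator type: {vtype}")
```

Keeping each validator a pure function of (expected, output) is what makes the results machine-checkable with no LLM-as-judge step.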
Recently shipped
- v0.2 (this release): Critique-Pass evaluation track per METHODOLOGY s13. Optional `--critique-pass` wraps each attempt with a locked single-shot self-revision step; twin-leaderboard reporting with per-model delta columns surfaces the cost-vs-accuracy tradeoff.
- v0.4: 27-model coverage across 6 providers, a `model_class` field separating reasoning and search models from standard chat, and OpenAI-compatible adapter generalization for xAI / Perplexity / OpenRouter.
- v0.3: 3 tasks (added `function_call_routing` and `synthetic_rag`), mean ± std reporting, tied-rank marker, per-task drill-down pages, cost calculator widget.