CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks (read it online as HTML with rendered math, or download the Word version).
Alexander Apartsin (Holon Institute of Technology) · Yehudit Aperstein (Afeka Tel Aviv Academic College of Engineering)
CoEval ranks models for a custom task or domain in the hardest setting: when no task-specific labeled data exists and public benchmarks cannot be trusted because their items have likely leaked into pretraining. From only a task description, a teacher model synthesizes a fresh, contamination-free benchmark and a cross-family judge ensemble ranks the candidates, with no human labels or raters.
| Result | Evidence |
|---|---|
| Recovers the true model ranking with no labeled data | Spearman ρ = 0.86 vs ground-truth correctness, 95% CI [0.77, 0.94] |
| Doubly-robust ranking recovers the true ordering and resists rogue judges | reliability and discrimination weighting lift rank-recovery to Spearman 0.95 across 13 models; an injected random judge receives weight 0.00 |
| Rankings are domain-specific, so a generic leaderboard misleads | three different models top four CoEval-generated domains; the pooled-best model is domain-best in only 1 of 4 |
| Cancels a verbosity bias no single judge avoids | ensemble r = +0.010 (CI spans zero), a 93% reduction |
| Composition over size: panel diversity, not panel size, drives reliability | ICC(3,k) peaks at two well-chosen judges, falls as low-agreement judges are added |
| Structurally precludes same-family self-preference | vendor-disjoint panel; aggregation shifts every score by ≤ 0.015 |
| Contamination-free generated items | 0.0000 verbatim 13-gram overlap with five major public benchmarks |
| Inexpensive enough to re-run per model release | 7,978 evaluations for USD 5.89, fully automated |
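The rank-recovery figures in the table are Spearman correlations between a label-free ranking and a ground-truth ranking. As a minimal sketch of how such a number is computed (the scores below are made-up illustrative values, not CoEval results, and the helper names are hypothetical):

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Illustrative numbers only: ensemble scores vs. true accuracy for five models.
ensemble_scores = [0.81, 0.64, 0.72, 0.55, 0.90]
true_accuracy = [0.78, 0.60, 0.80, 0.58, 0.88]
print(spearman(ensemble_scores, true_accuracy))
```

In this toy data only two adjacent models swap places, so the correlation stays high (0.9) without being perfect.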
Evaluating and selecting off-the-shelf or fine-tuned models for a specific use case is difficult.
Choosing the right LLM means navigating a minefield of hidden pitfalls:
| Challenge | Why It Hurts |
|---|---|
| Generic benchmarks don't transfer | Public data and metrics often miss the nuances of your real-world requirements. |
| Custom benchmarks are hard to design | Defining representative tasks, building rubrics, and choosing robustness variations is non-trivial. |
| Multi-model multi-task benchmarks are expensive to execute | Running every candidate model across every task and rubric quickly multiplies cost and compute. |
| Leakage biases results | Public and private benchmark items (or near-duplicates) may lurk in training data, inflating scores via memorization. |
| Ops and cost are complex | Running evaluations across providers, inference modes, and scoring criteria demands careful orchestration. |
Bottom line: You can't trust a leaderboard number, and building your own eval is a project in itself.
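The leakage concern in the table above can be probed with a verbatim n-gram overlap check between generated items and a reference corpus. The sketch below is one simple way to do that; whitespace tokenization, lowercasing, and the function names are assumptions made for illustration, not a description of CoEval's own procedure.

```python
def ngrams(text, n=13):
    """Set of word n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(generated_items, reference_corpus, n=13):
    """Fraction of the generated items' n-grams that also occur in the reference."""
    reference = set()
    for document in reference_corpus:
        reference |= ngrams(document, n)
    grams = [g for item in generated_items for g in ngrams(item, n)]
    if not grams:
        return 0.0  # items shorter than n words contribute no n-grams
    return sum(g in reference for g in grams) / len(grams)


# Tiny demonstration with n=3 so the short strings produce n-grams at all.
generated = ["the cat sat on the mat"]
public = ["a cat sat on the sofa"]
print(overlap_rate(generated, public, n=3))
```

A rate of zero at n=13 means no 13-word span of any generated item appears verbatim in the reference text; it says nothing about paraphrased overlap.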
Ensemble-based synthetic self-evaluation benchmarking: let the models evaluate each other.
CoEval generates a synthetic evaluation suite spanning multiple domain-specific tasks and scoring rubrics, then assembles an ensemble of models that rotate through three roles:
┌─────────────────────────────────────────────────────────────┐
│                       MODEL ENSEMBLE                        │
│                                                             │
│   ┌───────────┐   ┌───────────┐   ┌───────────┐             │
│   │  Model A  │   │  Model B  │   │  Model C  │   ...       │
│   └─────┬─────┘   └─────┬─────┘   └─────┬─────┘             │
│         │               │               │                   │
│         ▼               ▼               ▼                   │
│   ┌─────────────────────────────────────────────────┐       │
│   │            ROTATING ROLE ASSIGNMENT             │       │
│   └───────┬───────────────────┬────────────────┬────┘       │
│           ▼                   ▼                ▼            │
│        TEACHER             STUDENT           JUDGE          │
│   Generate synthetic   Models under      Score outputs      │
│   challenges &         evaluation take   against the        │
│   reference answers    the challenges    rubric             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
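The rotation in the diagram can be pictured as enumerating every (teacher, student, judge) combination allowed by each model's declared roles. The sketch below is illustrative only: the model names are placeholders, and skipping cases where a model would judge its own response is an assumption made here, not documented CoEval behavior.

```python
from itertools import product

# Hypothetical ensemble: which roles each model is allowed to play.
models = {
    "model_a": {"teacher", "student", "judge"},
    "model_b": {"student", "judge"},
    "model_c": {"judge"},
}


def with_role(models, role):
    """Names of the models that may play `role`, in a stable order."""
    return sorted(name for name, roles in models.items() if role in roles)


def evaluation_triples(models, allow_self_judging=False):
    """List (teacher, student, judge) triples, optionally skipping self-judging."""
    teachers = with_role(models, "teacher")
    students = with_role(models, "student")
    judges = with_role(models, "judge")
    triples = []
    for teacher, student, judge in product(teachers, students, judges):
        if not allow_self_judging and student == judge:
            continue  # assumption: a model does not score its own answers
        triples.append((teacher, student, judge))
    return triples


for triple in evaluation_triples(models):
    print(triple)
```

With these three placeholder models, six combinations exist and four remain once self-judging is excluded.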
Not all teachers and judges are created equal. CoEval improves signal quality by identifying:
| Role | Selection Criterion | Intuition |
|---|---|---|
| Teacher | Differentiating: produces challenges that separate student performance | A good exam question reveals who studied. |
| Judge | Consensus: high agreement with the ensemble majority | A reliable judge aligns with peer consensus. |
These two label-free signals (item discrimination and judge agreement) combine into CoEval's doubly-robust ranking, which recovers the true model ordering at Spearman 0.95 across 13 candidate models and assigns an injected rogue judge zero weight. Because the best model is domain-specific, CoEval ranks on your domain rather than trusting a pooled leaderboard.
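To make the judge-agreement signal concrete, here is a minimal sketch of one possible weighting rule: each judge is weighted by its correlation with the mean of the other judges, clipped at zero and normalized. This is an illustrative stand-in, not CoEval's actual doubly-robust estimator, and it covers only the judge half (item discrimination is not shown). The judge names and scores are invented, and the rogue judge is deliberately anti-correlated so the example is deterministic.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def judge_weights(scores):
    """Weight = max(0, correlation with the other judges' mean), normalized to sum to 1."""
    raw = {}
    for judge, own in scores.items():
        others = [s for name, s in scores.items() if name != judge]
        consensus = [mean(column) for column in zip(*others)]
        raw[judge] = max(0.0, pearson(own, consensus))
    total = sum(raw.values())
    return {judge: (w / total if total else 0.0) for judge, w in raw.items()}


# Invented scores for six items on a 1-5 scale.
scores = {
    "judge_a": [4, 2, 5, 1, 3, 4],
    "judge_b": [5, 2, 4, 1, 3, 5],
    "judge_c": [4, 3, 5, 2, 3, 4],
    "rogue": [2, 5, 1, 4, 3, 2],
}
for judge, weight in judge_weights(scores).items():
    print(judge, round(weight, 3))
```

Because the rogue judge disagrees with the leave-one-out consensus, its clipped weight is exactly zero, while the three agreeing judges share the full weight.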
 Fully Automatic        Semi-Automatic             Manual
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│ Tasks         │      │ Tasks       ✎ │      │ Tasks       ✎ │
│ Rubrics       │ ──►  │ Rubrics       │ ──►  │ Rubrics     ✎ │
│ Attr. Space   │      │ Attr. Space ✎ │      │ Attr. Space ✎ │
└───────────────┘      └───────────────┘      └───────────────┘
  AI-generated           Human-guided           Human-defined
Tasks, rubrics, and diversity/attribute spaces can be provisioned fully automatically, semi-automatically (human-in-the-loop), or manually; choose the level of control that fits your workflow.
CoEval is an end-to-end system, from benchmark design to interactive reporting.
╔════════════════════════════════════════════════════════════╗
║                        C o E v a l                         ║
╠════════════════════════════════════════════════════════════╣
║                                                            ║
║  Multi-Vendor Support                                      ║
║  ├── Multiple LLM providers & interfaces out of the box    ║
║  └── Plug in proprietary / self-hosted models              ║
║                                                            ║
║  Benchmark Design & Planning                               ║
║  ├── Automated task & rubric provisioning                  ║
║  └── Run orchestration with cost optimization              ║
║                                                            ║
║  Interactive Visual Reports                                ║
║  ├── Side-by-side model comparison                         ║
║  └── Drill-down into tasks, rubrics & scores               ║
║                                                            ║
║  Experiment Tracking                                       ║
║  ├── Easy reruns & parameter sweeps                        ║
║  └── Repair & resume after interruptions                   ║
║                                                            ║
║  Complete Documentation                                    ║
║  ├── User guides & tutorials                               ║
║  └── Developer API reference                               ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝
| Feature | Description |
|---|---|
| Multi-vendor | Swap providers without changing your eval pipeline. |
| Auto-provisioning | Generate tasks, rubrics, and attribute spaces from a domain description. |
| Orchestration | Schedule and parallelize runs; optimize for cost and latency. |
| Visual reports | Interactive dashboards for deep-dive analysis. |
| Resilient tracking | Resume interrupted experiments; repair partial results. |
| Docs-first | Comprehensive guides for users and contributors alike. |
OpenAI, Anthropic, Google Gemini, Azure OpenAI, Azure AI Inference, AWS Bedrock, Google Vertex AI, OpenRouter, Groq, DeepSeek, Mistral, DeepInfra, Cerebras, Cohere, HuggingFace API, HuggingFace (local), Ollama
→ Providers & Pricing: auth setup, batch discounts, and pricing tables for all 18 interfaces.
# 1. Install
pip install coeval
# 2. Add your API keys (see: docs/tutorial.md § 2)
cp keys.yaml.template keys.yaml # then fill in your provider keys
# 3. Probe all models; no tokens consumed (runnable example included in the repo)
coeval probe --config examples/quickstart.yaml
# 4. Estimate cost before spending anything
coeval plan --config examples/quickstart.yaml
# 5. Run the experiment (phases 1-5: infer attributes + rubric, generate, respond, judge)
coeval run --config examples/quickstart.yaml
# 6. Generate analysis reports
coeval analyze all --run ./Runs/quickstart --out ./Runs/quickstart/reports

CoEval accepts your intent at whichever level of detail you have, from a single sentence to a fully hand-written config:
| Level | You provide | CoEval infers | How |
|---|---|---|---|
| Objective | one-line goal | everything: tasks, attributes, rubric, model roles | coeval wizard --objective "..." |
| Most-automatic | task description + models | target attributes + rubric (Phases 1-2) | hand-write a minimal YAML (below) |
| Semi-automatic | description + some attributes/rubric | the rest | partial YAML, human-in-the-loop wizard |
| Manual | full config | nothing | complete YAML |
Generate a complete, runnable config from a single high-level objective (no questions asked) and run it:
# One sentence in, a validated config out
coeval wizard \
--objective "rank LLMs on classifying customer-support tickets into urgency levels" \
--models "gpt-4o-mini, claude-3-5-haiku" \
--items 8 \
--out ticket_urgency.yaml
coeval run --config ticket_urgency.yaml

The LLM proposes the tasks, target attributes, scoring rubric, and a cross-family judge panel; the config is auto-validated (and auto-repaired on any validation error) before it is written. Omit --objective for the interactive, question-by-question wizard instead.
You give only a task description and the models; CoEval infers the target attributes
and the scoring rubric. See the complete runnable file at examples/quickstart.yaml.
models:
  - name: gpt-4o-mini
    interface: openrouter
    parameters: { model: openai/gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [teacher, student, judge]
  - name: claude-haiku
    interface: openrouter
    parameters: { model: anthropic/claude-3.5-haiku, temperature: 0.0, max_tokens: 128 }
    roles: [judge] # cross-family judge
tasks:
  - name: regex_explanation
    description: Explain in plain English what a given regular expression matches.
    output_description: A clear one-to-three sentence plain-English explanation.
    sampling: { total: 6 } # target_attributes + rubric are inferred (Phases 1-2)
    evaluation_mode: single
experiment:
  id: quickstart
  storage_folder: ./eval_runs

Interactive HTML examples (click to open rendered in a browser):
| Example | Description |
|---|---|
| Education Benchmark: Planning View | Full experiment plan: 3 real-dataset tasks + 10 synthetic tasks, 6 models, per-phase call budget, cost table, and attribute maps |
| Mixed Benchmark: Planning View | Mixed benchmark plan: real benchmark datasets + OpenAI models |
| Paper Dual-Track: Planning View | Paper evaluation: dual-track design with benchmark + generative teachers |
Generate your own planning view:
coeval describe --config my_experiment.yaml --out my_experiment_plan.html
| Report | Description |
|---|---|
| Dashboard | Overview dashboard: all reports in one place with top-line rankings and navigation |
| Student Performance Report | Per-student score breakdowns, task rankings, rubric factor heatmaps |
| Judge Consistency Report | Inter-judge ICC agreement, calibration drift, flagged uncertain items |
| Robust Summary Report | Final model rankings with confidence intervals and robust ensemble weights |
| Score Distribution Report | High / Medium / Low histograms filterable by task, teacher, student, and judge |
| Teacher Report | Per-teacher source quality, attribute stratum coverage, data consistency |
| Interaction Matrix | Teacher × Student pair quality heatmap; spot which combinations succeed or fail |
| Coverage Summary | Attribute Coverage Ratio (ACR) and rare-attribute recall per task |
| Judge Report | Judge-level bias rates, score calibration, inter-rater reliability |
| Annotated Report Guide | Detailed annotated screenshots of every CoEval report with explanations of every visualization and metric |
Generate all reports from a completed run:
coeval analyze all --run ./Runs/my-experiment-v1 --out ./reports
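The Judge Consistency Report is described in terms of inter-judge ICC agreement. For readers unfamiliar with the statistic, the sketch below computes ICC(3,k) in the usual Shrout and Fleiss form, (MS_rows - MS_error) / MS_rows, from a complete items-by-judges score matrix. It is a textbook illustration on invented data; how CoEval computes the figure internally may differ.

```python
def icc3k(ratings):
    """ICC(3,k) for ratings[i][j] = score judge j gave item i (no missing values)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between judges
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Two judges that order the three items identically (one is simply stricter).
two_judges = [[1, 2], [2, 3], [3, 4]]
# The same panel plus a third judge whose ordering disagrees.
three_judges = [[1, 2, 3], [2, 3, 1], [3, 4, 2]]

print(icc3k(two_judges))
print(icc3k(three_judges))
```

On this toy data the two consistent judges reach an ICC of 1, and adding the disagreeing third judge pulls it down to about 0, which mirrors the point that panel composition matters more than panel size.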
| Guide | What it covers |
|---|---|
| Concepts Glossary | Every first-class concept explained: teacher, student, judge, attributes, rubric, datapoint, slot, phases, wizard, probing, planning, resume, repair, auto interface, batch API, and more |
| Evaluation Experiment Planning and Preparation Guide | End-to-end walkthrough: installation, config design, probing, running, analysis, and benchmark export |
| Command Line Option Reference | Every coeval subcommand, flag, and exit code: run, probe, plan, generate, status, models, analyze, describe, wizard, ingest, repair |
| Running Experiments | Phase modes, --continue, batch API, quota control, cost estimation, fault recovery, use-case examples |
| Providers & Pricing | All 18 interfaces with auth, batch support, code examples, and pricing tables |
| Analytics & Reports | 11 interactive HTML dashboards, paper-quality result tables, programmatic API, Excel workbook export |
| Configuration Guide | YAML config schema: models, tasks, attributes, rubric, sampling, prompt overrides, experiment settings |
| Benchmark Datasets | Pre-ingested datasets, coeval ingest, interface: benchmark virtual teacher, reproducing published results |
| Testing Guide | All 20 test files, how to run each suite, interpreting failures, CI/CD setup |
| System Feature Wishlist | 35-item prioritized roadmap: 10 benchmark additions, 12 system features, 13 new report types |
YAML Config → Phase 1: Attribute Mapping (teachers infer task dimensions)
            → Phase 2: Rubric Mapping (teachers build evaluation criteria)
            → Phase 3: Data Generation (teachers produce benchmark items)
            → Phase 4: Response Collection (students answer benchmark prompts)
            → Phase 5: Evaluation (judges score student responses)
            → coeval analyze all (8 HTML reports + Excel workbook)
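The five phases multiply quickly as models are added, which is why a plan step precedes the run. The sketch below estimates per-phase call counts under simple assumptions that are not taken from CoEval's planner: one call per teacher per task in Phases 1 and 2, one call per generated item in Phase 3, one call per student per item in Phase 4, and one call per judge per response in Phase 5. The function and parameter names are hypothetical.

```python
def call_budget(n_tasks, n_teachers, n_students, n_judges, items_per_teacher):
    """Rough per-phase call counts under the assumptions stated above."""
    items = n_tasks * n_teachers * items_per_teacher
    responses = items * n_students
    return {
        "phase1_attributes": n_tasks * n_teachers,
        "phase2_rubric": n_tasks * n_teachers,
        "phase3_generation": items,
        "phase4_responses": responses,
        "phase5_evaluation": responses * n_judges,
    }


# One task, one teacher, one student, two judges, six items.
budget = call_budget(n_tasks=1, n_teachers=1, n_students=1, n_judges=2,
                     items_per_teacher=6)
for phase, calls in budget.items():
    print(f"{phase}: {calls}")
print("total:", sum(budget.values()))
```

Even in this small case the judging phase dominates; each extra judge adds one call per student response.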
| Cloud (Async Batch) | Cloud (Real-time) | OpenAI-Compatible | Local / Virtual |
|---|---|---|---|
| openai | azure_openai¹ | groq | huggingface |
| anthropic | azure_ai | deepseek | ollama |
| gemini² | bedrock | mistral | benchmark |
| | vertex | deepinfra | |
| | openrouter | cerebras | |

¹ azure_openai supports the Azure Global Batch API (50% discount); enable it via batch: azure_openai: in the config.
² gemini uses concurrent requests (pseudo-batch), so there is no async discount.
| Capability | Detail |
|---|---|
| Cost estimation | Itemised call budget and cost table before any phases run; Batch API discounts modelled |
| Batch API | 50% async discount for OpenAI, Anthropic, and Azure OpenAI; Gemini uses concurrent mode (no discount) |
| Resume | --continue resumes at exact JSONL record; no duplicate API calls |
| Auto attributes | Teachers infer task dimensions from a description; no hand-labelling required |
| Auto rubric | Teachers propose rubric factors; merge-and-deduplicate across N teachers |
| Multi-judge ensemble | N judges → bias-resistant aggregate scores; outlier judges down-weighted |
| 8 HTML reports | Interactive charts, filterable tables, CSV export, fully self-contained (no CDN) |
| Model probe | Verify all 16 interfaces are reachable before spending a dollar |
| Virtual teachers | Pre-ingested public datasets supply zero-cost Phase 3 ground truth |
| Label accuracy | Judge-free exact-match for classification tasks (label_attributes) |
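The Resume row above describes continuing at the exact JSONL record with no duplicate API calls. A minimal sketch of that idea follows; the record layout (an "id" field per line), the function names, and the stand-in model call are assumptions for illustration and do not reflect CoEval's storage format. It also assumes every line already in the file is a complete record.

```python
import json
import os
import tempfile


def completed_ids(path):
    """Collect the ids of records already written to the JSONL file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                done.add(json.loads(line)["id"])
    return done


def run_with_resume(path, work_items, call_model):
    """Process only items not yet in the file; return how many calls were made."""
    done = completed_ids(path)
    calls = 0
    with open(path, "a", encoding="utf-8") as handle:
        for item_id, prompt in work_items:
            if item_id in done:
                continue  # already answered in a previous run
            record = {"id": item_id, "output": call_model(prompt)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # keep the file current in case of interruption
            calls += 1
    return calls


# Demonstration: an "interrupted" first run, then a resumed second run.
items = [("a", "first"), ("b", "second"), ("c", "third")]
with tempfile.TemporaryDirectory() as folder:
    results = os.path.join(folder, "responses.jsonl")
    print(run_with_resume(results, items[:2], str.upper))  # only two items reached
    print(run_with_resume(results, items, str.upper))      # resumes with the rest
```

Appending one flushed line per call means the file itself is the checkpoint, so a rerun needs no separate state to know where to pick up.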
| Component | Files | LoC |
|---|---|---|
| Code/runner: pipeline engine | 59 .py | 15,087 |
| Code/analyzer: analysis & reports | 21 .py | 9,554 |
| Public/benchmark: dataset utilities | 34 .py | 5,211 |
| Tests: test suites | 41 .py | 16,845 |
| docs: documentation | 35 .md | 12,521 |
CoEval · Multi-Model LLM Evaluation Framework
Designed for LLM developers, integrators, and evaluation practitioners who require robust model evaluation and ranking using custom use-case data and metrics.
Copyright (c) 2026 Alexander Apartsin. All rights reserved.
