
CoEval: Ensemble-Based Self-Evaluation for LLMs

📄 Read the CoEval paper online · Download the Word version

Paper · Status: WIP · Python ≥3.10 · Version 0.3.0 · Tests: 622 passing · © 2026 Alexander Apartsin

CoEval: Teacher · Student · Judge evaluation ensemble


📄 Published Paper

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks. Read it online (HTML with rendered math) or download the Word version.

Alexander Apartsin (Holon Institute of Technology) · Yehudit Aperstein (Afeka Tel Aviv Academic College of Engineering)

CoEval ranks models for a custom task or domain in the hardest setting: when no task-specific labeled data exists and public benchmarks cannot be trusted because their items have likely leaked into pretraining. From only a task description, a teacher model synthesizes a fresh, contamination-free benchmark and a cross-family judge ensemble ranks the candidates, with no human labels or raters.

| Result | Evidence |
|---|---|
| Recovers the true model ranking with no labeled data | Spearman ρ = 0.86 vs ground-truth correctness, 95% CI [0.77, 0.94] |
| Doubly-robust ranking recovers the true ordering and resists rogue judges | Reliability and discrimination weighting lift rank-recovery to Spearman 0.95 across 13 models; an injected random judge receives weight 0.00 |
| Rankings are domain-specific, so a generic leaderboard misleads | Three different models top four CoEval-generated domains; the pooled-best model is domain-best in only 1 of 4 |
| Cancels a verbosity bias no single judge avoids | Ensemble r = +0.010 (CI spans zero), a 93% reduction |
| Composition over size: panel diversity, not panel size, drives reliability | ICC(3,k) peaks at two well-chosen judges, falls as low-agreement judges are added |
| Structurally precludes same-family self-preference | Vendor-disjoint panel; aggregation shifts every score ≤ 0.015 |
| Contamination-free generated items | 0.0000 verbatim 13-gram overlap with five major public benchmarks |
| Inexpensive enough to re-run per model release | 7,978 evaluations for USD 5.89, fully automated |
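The rank-recovery figures above are Spearman correlations between the ranking CoEval produces and a ground-truth ranking. As a reminder of what that statistic measures, here is a minimal pure-Python sketch. The score lists are made-up illustration values, not numbers from the paper, and the helper names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical per-model scores, for illustration only.
ensemble_score = [0.91, 0.74, 0.80, 0.55, 0.62]
true_accuracy = [0.88, 0.79, 0.70, 0.52, 0.60]
rho = spearman(ensemble_score, true_accuracy)
print(round(rho, 2))
```

A value near 1 means the label-free ranking orders the models almost exactly as the ground truth does; only the ordering matters, not the score scale.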

🚨 The Challenge

Evaluating and selecting off-the-shelf or fine-tuned models for a specific use case is difficult.

Choosing the right LLM means navigating a minefield of hidden pitfalls:

| Challenge | Why It Hurts |
|---|---|
| 🎯 Generic benchmarks don't transfer | Public data and metrics often miss the nuances of your real-world requirements. |
| 🧩 Custom benchmarks are hard to design | Defining representative tasks, building rubrics, and choosing robustness variations is non-trivial. |
| 💸 Multi-model multi-task benchmarks are expensive to execute | Running every candidate model across every task and rubric quickly multiplies cost and compute. |
| 🕳️ Leakage biases results | Public and private benchmark items (or near-duplicates) may lurk in training data, inflating scores via memorization. |
| ⚙️ Ops and cost are complex | Running evaluations across providers, inference modes, and scoring criteria demands careful orchestration. |

Bottom line: You can't trust a leaderboard number, and building your own eval is a project in itself.
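The leakage concern is what the "verbatim 13-gram overlap" figure in the results table addresses. The README does not define that metric precisely, so the following is only one plausible reading: the fraction of generated items that share at least one 13-token window with a reference benchmark. Whitespace tokenization, lowercasing, and the function names are assumptions made for this sketch.

```python
def ngrams(text, n=13):
    """Set of n-token windows, using naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(generated_items, benchmark_texts, n=13):
    """Fraction of generated items sharing at least one n-gram with the benchmark."""
    reference = set()
    for text in benchmark_texts:
        reference |= ngrams(text, n)
    if not generated_items:
        return 0.0
    hits = sum(1 for item in generated_items if ngrams(item, n) & reference)
    return hits / len(generated_items)


# Toy data: the first item copies a 13-token run from the benchmark text.
benchmark = ["a b c d e f g h i j k l m n o"]
items = [
    "x a b c d e f g h i j k l m y",
    "one two three four five six seven eight nine ten eleven twelve thirteen fourteen",
]
print(overlap_rate(items, benchmark))
```

Items shorter than 13 tokens produce no windows and therefore never count as overlapping under this definition.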


💡 The Concept

Ensemble-based synthetic self-evaluation benchmarking: let the models evaluate each other.

CoEval generates a synthetic evaluation suite spanning multiple domain-specific tasks and scoring rubrics, then assembles an ensemble of models that rotate through three roles:

┌─────────────────────────────────────────────────────────────┐
│                     MODEL  ENSEMBLE                         │
│                                                             │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐           │
│   │  Model A  │    │  Model B  │    │  Model C  │    ...    │
│   └─────┬─────┘    └─────┬─────┘    └─────┬─────┘           │
│         │                │                │                 │
│         ▼                ▼                ▼                 │
│   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓       │
│   ┃           ROTATING  ROLE  ASSIGNMENT            ┃       │
│   ┗━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━┛       │
│              ▼                ▼                 ▼           │
│      🎓 TEACHER        📝 STUDENT          ⚖️ JUDGE         │
│   Generate synthetic   Models under       Score outputs     │
│   challenges &         evaluation take    against the       │
│   reference answers    the challenges     rubric            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
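The rotation in the diagram can be pictured as enumerating every (teacher, student, judge) combination allowed by each model's roles. The sketch below is not CoEval's scheduler. It assumes a hypothetical `family` field per model and skips any pairing where the judge comes from the same vendor as the student, in the spirit of the vendor-disjoint panel mentioned in the results table.

```python
from itertools import product

# Hypothetical ensemble; names, families, and roles are illustration values.
MODELS = [
    {"name": "model-a", "family": "vendor-x", "roles": {"teacher", "student", "judge"}},
    {"name": "model-b", "family": "vendor-y", "roles": {"student", "judge"}},
    {"name": "model-c", "family": "vendor-z", "roles": {"judge"}},
]


def with_role(models, role):
    """Models that are allowed to play the given role."""
    return [m for m in models if role in m["roles"]]


def assignments(models):
    """Yield (teacher, student, judge) names, skipping same-family judging."""
    combos = product(
        with_role(models, "teacher"),
        with_role(models, "student"),
        with_role(models, "judge"),
    )
    for teacher, student, judge in combos:
        if judge["family"] != student["family"]:
            yield teacher["name"], student["name"], judge["name"]


triples = list(assignments(MODELS))
for t in triples:
    print(t)
```

With one teacher, two students, and three judges, the family filter leaves four of the six possible triples, and no model ever scores its own output.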

Reliability through selection

Not all teachers and judges are created equal. CoEval improves signal quality by identifying:

| Role | Selection Criterion | Intuition |
|---|---|---|
| 🎓 Teacher | Differentiating: produces challenges that separate student performance | A good exam question reveals who studied. |
| ⚖️ Judge | Consensus: high agreement with ensemble majority | A reliable judge aligns with peer consensus. |

These two label-free signals (item discrimination and judge agreement) combine into CoEval's doubly-robust ranking, which recovers the true model ordering at Spearman 0.95 across 13 candidate models and assigns an injected rogue judge zero weight. Because the best model is domain-specific, CoEval ranks on your domain rather than trusting a pooled leaderboard.
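The README does not give the weighting formula, so the following is a rough sketch of how two such label-free signals could be combined, not the paper's estimator. It assumes a judge's weight is its (non-negative) Pearson correlation with the mean of the other judges, and an item's weight is the variance of consensus scores across students. All names and scores are hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def doubly_weighted_scores(cells):
    """cells maps (judge, item, student) -> score; returns (jw, iw, final)."""
    judges = sorted({j for j, _, _ in cells})
    items = sorted({i for _, i, _ in cells})
    students = sorted({s for _, _, s in cells})
    keys = [(i, s) for i in items for s in students]

    # 1. Judge reliability: agreement with the mean of the other judges.
    jw = {}
    for j in judges:
        own = [cells[(j, i, s)] for i, s in keys]
        peers = [mean(cells[(k, i, s)] for k in judges if k != j) for i, s in keys]
        jw[j] = max(0.0, pearson(own, peers))

    # 2. Reliability-weighted consensus per (item, student).
    total_jw = sum(jw.values()) or 1.0
    consensus = {
        (i, s): sum(jw[j] * cells[(j, i, s)] for j in judges) / total_jw
        for i, s in keys
    }

    # 3. Item discrimination: spread of consensus scores across students.
    iw = {i: pvariance([consensus[(i, s)] for s in students]) for i in items}

    # 4. Discrimination-weighted mean score per student.
    total_iw = sum(iw.values()) or 1.0
    final = {
        s: sum(iw[i] * consensus[(i, s)] for i in items) / total_iw
        for s in students
    }
    return jw, iw, final


# Hypothetical scores: three broadly consistent judges and one rogue judge.
raw = {
    "j1": [5, 3, 1, 3, 3, 3],
    "j2": [4, 3, 2, 3, 3, 3],
    "j3": [5, 4, 2, 3, 3, 4],
    "rogue": [1, 3, 5, 5, 1, 3],
}
order = [(i, s) for i in ("q1", "q2") for s in ("s1", "s2", "s3")]
cells = {(j, i, s): v for j, vals in raw.items() for (i, s), v in zip(order, vals)}
jw, iw, final = doubly_weighted_scores(cells)
print(jw["rogue"], sorted(final, key=final.get, reverse=True))
```

In this toy data the rogue judge correlates negatively with its peers and is clipped to zero weight, while the item that barely separates the students contributes almost nothing to the final ranking. With very small panels a single adversarial judge can still distort the peer mean, so treat the sketch as an intuition aid only.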

Flexible provisioning

  Fully Automatic          Semi-Automatic               Manual
  ┌─────────────┐          ┌────────────────┐        ┌───────────────┐
  │ Tasks       │          │ Tasks ✏️       │        │ Tasks ✏️      │
  │ Rubrics     │  ──►     │ Rubrics        │  ──►   │ Rubrics ✏️    │
  │ Attr. Space │          │ Attr. Space ✏️ │        │ Attr. Space ✏️│
  └─────────────┘          └────────────────┘        └───────────────┘
   AI-generated            Human-guided               Human-defined

Tasks, rubrics, and diversity/attribute spaces can be provisioned fully automatically, semi-automatically (human-in-the-loop), or manually. Choose the level of control that fits your workflow.


🏗️ The Framework

CoEval is an end-to-end system, from benchmark design to interactive reporting.

  ╔══════════════════════════════════════════════════════════════╗
  ║                        C o E v a l                           ║
  ╠══════════════════════════════════════════════════════════════╣
  ║                                                              ║
  ║   📦 Multi-Vendor Support                                    ║
  ║   ├── Multiple LLM providers & interfaces out of the box     ║
  ║   └── Plug in proprietary / self-hosted models               ║
  ║                                                              ║
  ║   🗺️ Benchmark Design & Planning                             ║
  ║   ├── Automated task & rubric provisioning                   ║
  ║   └── Run orchestration with cost optimization               ║
  ║                                                              ║
  ║   📊 Interactive Visual Reports                              ║
  ║   ├── Side-by-side model comparison                          ║
  ║   └── Drill-down into tasks, rubrics & scores                ║
  ║                                                              ║
  ║   🔄 Experiment Tracking                                     ║
  ║   ├── Easy reruns & parameter sweeps                         ║
  ║   └── Repair & resume after interruptions                    ║
  ║                                                              ║
  ║   📚 Complete Documentation                                  ║
  ║   ├── User guides & tutorials                                ║
  ║   └── Developer API reference                                ║
  ║                                                              ║
  ╚══════════════════════════════════════════════════════════════╝

At a glance

| Feature | Description |
|---|---|
| Multi-vendor | Swap providers without changing your eval pipeline. |
| Auto-provisioning | Generate tasks, rubrics, and attribute spaces from a domain description. |
| Orchestration | Schedule and parallelize runs; optimize for cost and latency. |
| Visual reports | Interactive dashboards for deep-dive analysis. |
| Resilient tracking | Resume interrupted experiments; repair partial results. |
| Docs-first | Comprehensive guides for users and contributors alike. |

Supported Model APIs

OpenAI, Anthropic, Google Gemini, Azure OpenAI, Azure AI Inference, AWS Bedrock, Google Vertex AI, OpenRouter, Groq, DeepSeek, Mistral, DeepInfra, Cerebras, Cohere, HuggingFace API, HuggingFace (local), Ollama

→ Providers & Pricing: auth setup, batch discounts, pricing tables for all 18 interfaces.


Quick Start

# 1. Install
pip install coeval

# 2. Add your API keys  (see: docs/tutorial.md § 2)
cp keys.yaml.template keys.yaml   # then fill in your provider keys

# 3. Probe all models; no tokens consumed  (runnable example included in the repo)
coeval probe --config examples/quickstart.yaml

# 4. Estimate cost before spending anything
coeval plan --config examples/quickstart.yaml

# 5. Run the experiment (phases 1-5: infer attributes + rubric, generate, respond, judge)
coeval run --config examples/quickstart.yaml

# 6. Generate analysis reports
coeval analyze all --run ./Runs/quickstart --out ./Runs/quickstart/reports

Levels of specification

CoEval accepts your intent at whichever level of detail you have, from a single sentence to a fully hand-written config:

| Level | You provide | CoEval infers | How |
|---|---|---|---|
| Objective | one-line goal | everything: tasks, attributes, rubric, model roles | coeval wizard --objective "..." |
| Most-automatic | task description + models | target attributes + rubric (Phases 1-2) | hand-write a minimal YAML (below) |
| Semi-automatic | description + some attributes/rubric | the rest | partial YAML, human-in-the-loop wizard |
| Manual | full config | nothing | complete YAML |

Generate a complete, runnable config from a single high-level objective (no questions asked) and run it:

# One sentence in, a validated config out
coeval wizard \
  --objective "rank LLMs on classifying customer-support tickets into urgency levels" \
  --models "gpt-4o-mini, claude-3-5-haiku" \
  --items 8 \
  --out ticket_urgency.yaml

coeval run --config ticket_urgency.yaml

The LLM proposes the tasks, target attributes, scoring rubric, and a cross-family judge panel; the config is auto-validated (and auto-repaired on any validation error) before it is written. Omit --objective for the interactive, question-by-question wizard instead.

Minimal experiment config (most-automatic level)

You give only a task description and the models; CoEval infers the target attributes and the scoring rubric. See the complete runnable file at examples/quickstart.yaml.

models:
  - name: gpt-4o-mini
    interface: openrouter
    parameters: { model: openai/gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [teacher, student, judge]
  - name: claude-haiku
    interface: openrouter
    parameters: { model: anthropic/claude-3.5-haiku, temperature: 0.0, max_tokens: 128 }
    roles: [judge]            # cross-family judge

tasks:
  - name: regex_explanation
    description: Explain in plain English what a given regular expression matches.
    output_description: A clear one-to-three sentence plain-English explanation.
    sampling: { total: 6 }    # target_attributes + rubric are inferred (Phases 1-2)
    evaluation_mode: single

experiment:
  id: quickstart
  storage_folder: ./eval_runs
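Before running `coeval probe` or `coeval plan`, it can help to sanity-check a hand-written config for the basics: every role is covered, each task asks for at least one item, and the experiment has an id. The sketch below is not CoEval's validator. It mirrors the YAML above as a plain Python dict (so it needs no YAML library) and the specific checks are assumptions about what a minimal config should satisfy.

```python
REQUIRED_ROLES = ("teacher", "student", "judge")


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    models = cfg.get("models", [])
    for role in REQUIRED_ROLES:
        if not any(role in m.get("roles", []) for m in models):
            problems.append(f"no model has role '{role}'")
    names = [m.get("name") for m in models]
    if len(names) != len(set(names)):
        problems.append("model names are not unique")
    tasks = cfg.get("tasks", [])
    if not tasks:
        problems.append("no tasks defined")
    for task in tasks:
        if task.get("sampling", {}).get("total", 0) <= 0:
            problems.append(f"task '{task.get('name')}' requests no items")
    if not cfg.get("experiment", {}).get("id"):
        problems.append("experiment.id is missing")
    return problems


# Dict form of the minimal config shown above (parameters omitted for brevity).
config = {
    "models": [
        {"name": "gpt-4o-mini", "interface": "openrouter",
         "roles": ["teacher", "student", "judge"]},
        {"name": "claude-haiku", "interface": "openrouter", "roles": ["judge"]},
    ],
    "tasks": [
        {"name": "regex_explanation", "sampling": {"total": 6},
         "evaluation_mode": "single"},
    ],
    "experiment": {"id": "quickstart", "storage_folder": "./eval_runs"},
}
print(check_config(config))
```

The same function works on the result of loading the YAML file with any YAML parser, since it only reads nested dicts and lists.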

Examples

Interactive HTML examples (click to open them rendered in the browser):

Experiment Planning

| Example | Description |
|---|---|
| Education Benchmark: Planning View | Full experiment plan: 3 real-dataset tasks + 10 synthetic tasks, 6 models, per-phase call budget, cost table, and attribute maps |
| Mixed Benchmark: Planning View | Mixed benchmark plan: real benchmark datasets + OpenAI models |
| Paper Dual-Track: Planning View | Paper evaluation: dual-track design with benchmark + generative teachers |

Generate your own planning view:

coeval describe --config my_experiment.yaml --out my_experiment_plan.html

Example Reports

| Report | Description |
|---|---|
| Dashboard | Overview dashboard: all reports in one place with top-line rankings and navigation |
| Student Performance Report | Per-student score breakdowns, task rankings, rubric factor heatmaps |
| Judge Consistency Report | Inter-judge ICC agreement, calibration drift, flagged uncertain items |
| Robust Summary Report | Final model rankings with confidence intervals and robust ensemble weights |
| Score Distribution Report | High / Medium / Low histograms filterable by task, teacher, student, and judge |
| Teacher Report | Per-teacher source quality, attribute stratum coverage, data consistency |
| Interaction Matrix | Teacher × Student pair quality heatmap; spot which combinations succeed or fail |
| Coverage Summary | Attribute Coverage Ratio (ACR) and rare-attribute recall per task |
| Judge Report | Judge-level bias rates, score calibration, inter-rater reliability |
| Annotated Report Guide | Annotated screenshots of every CoEval report, explaining each visualization and metric |

Generate all reports from a completed run:

coeval analyze all --run ./Runs/my-experiment-v1 --out ./reports

Related documents

| Guide | What it covers |
|---|---|
| Concepts Glossary | Every first-class concept explained: teacher, student, judge, attributes, rubric, datapoint, slot, phases, wizard, probing, planning, resume, repair, auto interface, batch API, and more |
| Evaluation Experiment Planning and Preparation Guide | End-to-end walkthrough: installation, config design, probing, running, analysis, and benchmark export |
| Command Line Option Reference | Every coeval subcommand, flag, and exit code: run, probe, plan, generate, status, models, analyze, describe, wizard, ingest, repair |
| Running Experiments | Phase modes, --continue, batch API, quota control, cost estimation, fault recovery, use-case examples |
| Providers & Pricing | All 18 interfaces with auth, batch support, code examples, and pricing tables |
| Analytics & Reports | 11 interactive HTML dashboards, paper-quality result tables, programmatic API, Excel workbook export |
| Configuration Guide | YAML config schema: models, tasks, attributes, rubric, sampling, prompt overrides, experiment settings |
| Benchmark Datasets | Pre-ingested datasets, coeval ingest, interface: benchmark virtual teacher, reproducing published results |
| Testing Guide | All 20 test files, how to run each suite, interpreting failures, CI/CD setup |
| System Feature Wishlist | 35-item prioritized roadmap: 10 benchmark additions, 12 system features, 13 new report types |

Pipeline at a Glance

YAML Config  →  Phase 1: Attribute Mapping   (teachers infer task dimensions)
             →  Phase 2: Rubric Mapping       (teachers build evaluation criteria)
             →  Phase 3: Data Generation      (teachers produce benchmark items)
             →  Phase 4: Response Collection  (students answer benchmark prompts)
             →  Phase 5: Evaluation           (judges score student responses)
             →  coeval analyze all            (8 HTML reports + Excel workbook)
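The five phases multiply: more teachers, students, or judges each scale the number of model calls. The sketch below gives a back-of-the-envelope call budget for the pipeline. It is not the logic behind `coeval plan`. It assumes one call per teacher per task in Phases 1 and 2, that `items_per_task` is generated by every teacher, and a flat hypothetical price per call. The 50% batch discount is the figure this README quotes for the providers that support it.

```python
def call_budget(teachers, students, judges, tasks, items_per_task):
    """Rough per-phase call counts under a one-call-per-unit assumption."""
    items = teachers * tasks * items_per_task   # Phase 3: each teacher writes items
    responses = students * items                # Phase 4: each student answers each item
    return {
        "phase1_attributes": teachers * tasks,
        "phase2_rubric": teachers * tasks,
        "phase3_generation": items,
        "phase4_responses": responses,
        "phase5_evaluation": judges * responses,  # each judge scores each response
    }


def estimated_cost(budget, usd_per_call, batch_discount=0.0):
    """Total cost for a flat, hypothetical per-call price and optional discount."""
    return sum(budget.values()) * usd_per_call * (1.0 - batch_discount)


# Shape of the quickstart config: 1 teacher, 1 student, 2 judges, 1 task, 6 items.
budget = call_budget(teachers=1, students=1, judges=2, tasks=1, items_per_task=6)
print(budget)
print(estimated_cost(budget, usd_per_call=0.001))
print(estimated_cost(budget, usd_per_call=0.001, batch_discount=0.5))
```

Phase 5 dominates as soon as the judge panel grows, which is why estimating before running is worthwhile; the real planner prices calls per model and per token rather than at a flat rate.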

16 Model Interfaces

| Cloud: Async Batch ✅ | Cloud: Real-time | OpenAI-Compatible | Local / Virtual |
|---|---|---|---|
| openai | azure_openai¹ | groq | huggingface |
| anthropic | azure_ai | deepseek | ollama |
| gemini² | bedrock | mistral | benchmark |
| | vertex | deepinfra | |
| | openrouter | cerebras | |

¹ azure_openai supports the Azure Global Batch API (50% discount); enable via batch: azure_openai: in config.
² gemini uses concurrent requests (pseudo-batch); no async discount.

Key Capabilities

| Capability | Detail |
|---|---|
| Cost estimation | Itemised call budget and cost table before any phases run; Batch API discounts modelled |
| Batch API | 50% async discount for OpenAI, Anthropic, and Azure OpenAI; Gemini uses concurrent mode (no discount) |
| Resume | --continue resumes at exact JSONL record; no duplicate API calls |
| Auto attributes | Teachers infer task dimensions from a description; no hand-labelling required |
| Auto rubric | Teachers propose rubric factors; merge-and-deduplicate across N teachers |
| Multi-judge ensemble | N judges → bias-resistant aggregate scores; outlier judges down-weighted |
| 8 HTML reports | Interactive charts, filterable tables, CSV export, fully self-contained (no CDN) |
| Model probe | Verify all 16 interfaces are reachable before spending a dollar |
| Virtual teachers | Pre-ingested public datasets supply zero-cost Phase 3 ground truth |
| Label accuracy | Judge-free exact-match for classification tasks (label_attributes) |
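The resume capability in the table (continue at the exact JSONL record, no duplicate API calls) follows a common append-only pattern. The sketch below illustrates that pattern and is not CoEval's implementation. The record layout (`id`, `result`) and the `call_model` callback are hypothetical.

```python
import json
import os


def completed_ids(path):
    """Collect the ids of records already on disk, skipping unparsable lines."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # e.g. a blank or partially written line
    return done


def run_with_resume(path, work_items, call_model):
    """Append one JSON line per finished item; return how many calls were made."""
    done = completed_ids(path)
    calls = 0
    with open(path, "a", encoding="utf-8") as fh:
        for item in work_items:
            if item["id"] in done:
                continue  # already answered in an earlier run
            result = call_model(item)
            fh.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            fh.flush()  # keep the file current in case the run is interrupted
            calls += 1
    return calls
```

Calling `run_with_resume` twice on the same file only pays for the items that were missing the first time, because each finished record is written and flushed before the next call starts.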

Project Statistics · System v1.3

| Component | Files | LoC |
|---|---|---|
| Code/runner (pipeline engine) | 59 .py | 15,087 |
| Code/analyzer (analysis & reports) | 21 .py | 9,554 |
| Public/benchmark (dataset utilities) | 34 .py | 5,211 |
| Tests (test suites) | 41 .py | 16,845 |
| docs (documentation) | 35 .md | 12,521 |

CoEval · Multi-Model LLM Evaluation Framework

Designed for LLM developers, integrators, and evaluation practitioners who require robust model evaluation and ranking using custom use-case data and metrics.

Copyright (c) 2026 Alexander Apartsin. All rights reserved.
