WaffleBench measures what programmatic ground truth cannot: craft. Every judgment is made by a calibrated, course-matched expert, scored blind, justified in writing, and published against a stated reliability standard. This page specifies the full method: the task, the rubric, the panel qualification, the statistics, and the outputs.
Objective-physics benchmarks attach exact ground truth to each test item, attributing every failure to a named law, object, and frame. Craft has no programmatic ground truth, so WaffleBench builds its attribution layer from structured expert judgment instead: every score on every axis requires a written, one-line craft observation, and every preference pick requires a rationale. A score without its reasoning is not accepted by the platform. The result is the craft analogue of law/object/frame attribution: which axis, which observation, which take.
Benchmarks that rest on averaged crowd judgment are capped by annotator disagreement. WaffleBench addresses this at the source: scorers are course-matched experts (graduates of the matching Curious Refuge program for each vertical), individually calibrated against a consensus anchor set before any scoring counts, and monitored against a published inter-rater reliability standard for the life of the benchmark.
The same scoring pass yields (a) chosen-versus-rejected preference pairs, the exact training format for RLHF and DPO, and (b) six-axis quality scores that aggregate into the public leaderboard. Nothing is scored twice; nothing is assembled manually.
For each anchor prompt, a model generates two takes under identical settings. A scorer reviews both blind, scores each take 1 to 5 on six craft axes with written reasoning, and picks the stronger take with a written rationale. The unit of work is the pair, not the clip: comparing two outputs of the same model isolates within-model variance and converts every judgment directly into a usable preference label.
Cross-model A/B picks confound model identity with prompt fit and style. Same-model comparison holds the model constant, so the pick is purely a craft judgment, and the resulting pair is exactly the chosen/rejected structure a lab feeds to a preference-optimization pipeline for that model. Cross-model ranking still emerges, but from quality scores, not from picks (section 5).
| AXIS | WHAT IT EVALUATES |
|---|---|
| Composition & Framing | Intentional framing, balance, use of the frame |
| Lighting Intentionality | Light feels motivated and shaped, not just present |
| Color & Tonal Grading | Palette cohesion and tonal control across the shot |
| Camera Motion Language | Moves are deliberate, motivated, and settled |
| Depth & Lens Character | Compression, depth separation, optical believability |
| Overall Cinematic Feel | Holistic: would this hold as a take in a finished film? |
Scale anchors: 1 broken · 2 amateur · 3 competent but unauthored · 4 deliberate professional craft · 5 reference grade. Each vertical (VFX, documentary, animation, commercial, and the two world-model tracks) carries its own six-axis rubric of the same shape.
A scorer qualifies once per vertical by scoring three reference clips spanning the scale (weak, competent, strong), each consensus-scored in advance by the most experienced course-matched experts. Grading is automatic and instant:
Qualification is unpaid; every subsequent evaluation is paid per completed pair. Each evaluator's calibration agreement, attempts, and ongoing pick agreement with final labels are tracked automatically and exportable as an anonymized panel profile.
The benchmark publishes only when the panel demonstrates measurable agreement. The headline statistic is Krippendorff's alpha with an ordinal distance metric, computed over every (pair, take, axis) unit that carries two or more independent ratings:
The ordinal metric matches how craft judgment behaves: adjacent scores are near-agreement, opposite ends are not. Alpha is computed live as scoring proceeds, and the standard is α ≥ 0.70, the conventional threshold for drawing tentative research conclusions from coded data. Below it, scoring pauses for a calibration discussion on the highest-disagreement pairs before resuming.
| STATISTIC | DEFINITION | HEALTHY RANGE |
|---|---|---|
| Pick agreement | Across all scorer couples on multi-scored pairs, the share that chose the same take | ≥ 75% |
| Unanimous rate | Share of fully scored pairs where all three scorers picked the same take | Reported, not gated |
| Per-evaluator label agreement | How often an individual's pick matched the final majority label | 70 to 90% |
All reliability figures, with per-pair detail and method notes, export as a machine-readable report that accompanies every dataset delivery and leaderboard publication.
Each fully scored pair resolves by majority into one training record:
{
"pair_id": "CIN-MA-EST-001",
"model": "(unmasked at delivery)",
"prompt": "A lone figure walks toward a cabin at dawn...",
"chosen": { "take": "B", "clip_url": "...", "mean_quality": 4.0 },
"rejected": { "take": "A", "clip_url": "...", "mean_quality": 2.7 },
"n_scorers": 3,
"unanimous": true,
"pick_rationales": ["Both render cleanly, so it comes down to intentionality..."]
}
A model's quality score is the mean of all six-axis scores across all of its takes from all scorers. The leaderboard ranks models on quality scores; identities are unmasked only at publication. Because picks and scores come from the same pass, the leaderboard and the preference datasets can never disagree about what the panel saw.
Every record carries its written reasoning. Aggregating notes by axis surfaces each model's failure modes in the panel's own words ("contact shadows detach on landing," "push-ins never settle"), which is simultaneously the eval insight and the spec for the corrective training data.
Objective physical realism (object permanence, gravity, impenetrability, momentum) is best measured with programmatic ground truth on synthetic scenes, and recent work shows current VLM judges remain unreliable even there. WaffleBench is the complementary instrument: it covers the dimensions that have no closed-form check, cinematic craft, art-directability, and expert-judged physical plausibility in real-world contexts, using calibrated human experts precisely where machine critics cannot yet be trusted. A complete evaluation of a video or world model uses both: programmatic benches for what is provable, WaffleBench for what must be judged.
There are three ways a model appears on or against the benchmark:
| TRACK | WHAT IT IS | TERMS |
|---|---|---|
| Public leaderboard | We generate outputs from publicly available models on the locked anchor set and publish quality scores and rankings, with the reliability report attached. | Free. No participation required. |
| Private evaluation | A lab submits an unreleased model or checkpoint. The full panel scores it blind against the same anchors; the lab receives the quality scores, six-axis diagnostics with written reasoning, and the minted preference pairs. Results stay private unless the lab opts into publication. | Paid, per run per vertical. Publication and badge rights included on opt-in. |
| Continuous monitoring | Recurring private runs per checkpoint or per release cycle, tracking quality-score movement axis by axis across training. | Paid subscription. |
Every evaluation also functions as a specification: the panel's written axis notes identify exactly which craft behaviors a model lacks, and Waffle Video supplies the corresponding training data, from targeted sensor-fused capture to the preference pairs minted by the evaluation itself. The diagnosis and the cure come from the same pipeline.