WAFFLEBENCH · METHODS · PILOT v1

Attributable expert evaluation for generative video and world models.

WaffleBench measures what programmatic ground truth cannot: craft. Every judgment is made by a calibrated, course-matched expert, scored blind, justified in writing, and published against a stated reliability standard. This page specifies the full method: the task, the rubric, the panel qualification, the statistics, and the outputs.

Same-model pairwise · 3 blind scorers per pair · Krippendorff α ≥ 0.70 · Written attribution per axis · Preference pairs out

1 · Design principles

Craft-first, but attributable

Objective-physics benchmarks attach exact ground truth to each test item, attributing every failure to a named law, object, and frame. Craft has no programmatic ground truth, so WaffleBench builds its attribution layer from structured expert judgment instead: every score on every axis requires a written, one-line craft observation, and every preference pick requires a rationale. A score without its reasoning is not accepted by the platform. The result is the craft analogue of law/object/frame attribution: which axis, which observation, which take.

Calibrated humans, not aggregated crowds

Benchmarks that rest on averaged crowd judgment are capped by annotator disagreement. WaffleBench addresses this at the source: scorers are course-matched experts (graduates of the matching Curious Refuge program for each vertical), individually calibrated against a consensus anchor set before any scoring counts, and monitored against a published inter-rater reliability standard for the life of the benchmark.

Two products from one pass

The same scoring pass yields (a) chosen-versus-rejected preference pairs, the exact training format for RLHF and DPO, and (b) six-axis quality scores that aggregate into the public leaderboard. Nothing is scored twice; nothing is assembled manually.

2 · The task: same-model pairwise

For each anchor prompt, a model generates two takes under identical settings. A scorer reviews both blind, scores each take 1 to 5 on six craft axes with written reasoning, and picks the stronger take with a written rationale. The unit of work is the pair, not the clip: comparing two outputs of the same model isolates within-model variance and converts every judgment directly into a usable preference label.

WHY SAME-MODEL, NOT CROSS-MODEL

Cross-model A/B picks confound model identity with prompt fit and style. Same-model comparison holds the model constant, so the pick is purely a craft judgment, and the resulting pair is exactly the chosen/rejected structure a lab feeds to a preference-optimization pipeline for that model. Cross-model ranking still emerges, but from quality scores, not from picks (section 5).

Blinding and ordering

The cinematic rubric (pilot vertical)

AXIS | WHAT IT EVALUATES
Composition & Framing | Intentional framing, balance, use of the frame
Lighting Intentionality | Light feels motivated and shaped, not just present
Color & Tonal Grading | Palette cohesion and tonal control across the shot
Camera Motion Language | Moves are deliberate, motivated, and settled
Depth & Lens Character | Compression, depth separation, optical believability
Overall Cinematic Feel | Holistic: would this hold as a take in a finished film?

Scale anchors: 1 broken · 2 amateur · 3 competent but unauthored · 4 deliberate professional craft · 5 reference grade. Each vertical (VFX, documentary, animation, commercial, and the two world-model tracks) carries its own six-axis rubric of the same shape.
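As an illustration of the task and rubric above, one scorer's judgment on one pair could be represented and checked as follows. This is a minimal sketch, not the platform's actual schema: the names `AxisScore`, `PairJudgment`, and `validate` are hypothetical, and the only rules it encodes are the ones stated on this page (six axes, scores from 1 to 5, a written observation per score, a written rationale per pick).

```python
from dataclasses import dataclass

# The six axes of the cinematic rubric (pilot vertical).
AXES = (
    "Composition & Framing",
    "Lighting Intentionality",
    "Color & Tonal Grading",
    "Camera Motion Language",
    "Depth & Lens Character",
    "Overall Cinematic Feel",
)


@dataclass(frozen=True)
class AxisScore:
    score: int  # 1 (broken) to 5 (reference grade)
    note: str   # written one-line craft observation


@dataclass(frozen=True)
class PairJudgment:
    # Hypothetical record layout, assumed for illustration only.
    pair_id: str
    scorer_id: str
    takes: dict     # {"A": {axis: AxisScore}, "B": {axis: AxisScore}}
    pick: str       # "A" or "B"
    rationale: str  # written reason for the pick


def validate(judgment: PairJudgment) -> list:
    """Return a list of problems; an empty list means the judgment is acceptable."""
    problems = []
    for take in ("A", "B"):
        scored = judgment.takes.get(take, {})
        for axis in AXES:
            entry = scored.get(axis)
            if entry is None:
                problems.append(f"{take} / {axis}: missing score")
            elif not 1 <= entry.score <= 5:
                problems.append(f"{take} / {axis}: score outside 1 to 5")
            elif not entry.note.strip():
                problems.append(f"{take} / {axis}: score has no written observation")
    if judgment.pick not in ("A", "B"):
        problems.append("pick must be take A or take B")
    if not judgment.rationale.strip():
        problems.append("pick has no written rationale")
    return problems
```

Returning a list of problems rather than raising on the first one lets a scoring form show every missing observation at once.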

3 · Panel qualification

A scorer qualifies once per vertical by scoring three reference clips spanning the scale (weak, competent, strong), each consensus-scored in advance by the most experienced course-matched experts. Grading is automatic and instant:

Qualification is unpaid; every subsequent evaluation is paid per completed pair. Each evaluator's calibration agreement, attempts, and ongoing pick agreement with final labels are tracked automatically and exportable as an anonymized panel profile.

4 · Reliability: the α ≥ 0.70 standard

The benchmark publishes only when the panel demonstrates measurable agreement. The headline statistic is Krippendorff's alpha with an ordinal distance metric, computed over every (pair, take, axis) unit that carries two or more independent ratings:

α = 1 − D_o / D_e, where D_o is observed disagreement and D_e is the disagreement expected by chance, both weighted by ordinal distance so that a 4-versus-5 split costs less than a 2-versus-5 split.

The ordinal metric matches how craft judgment behaves: adjacent scores are near-agreement, opposite ends are not. Alpha is computed live as scoring proceeds, and the standard is α ≥ 0.70, which falls inside the 0.667 to 0.80 band that Krippendorff treats as sufficient only for drawing tentative conclusions from coded data. Below 0.70, scoring pauses for a calibration discussion on the highest-disagreement pairs before resuming.
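The statistic described above can be sketched in a few lines. This is a minimal illustration using the standard coincidence-matrix formulation of Krippendorff's alpha with the ordinal distance metric, not WaffleBench's production code; the function name `krippendorff_alpha_ordinal` and the input layout (one list of ratings per (pair, take, axis) unit) are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units, levels=(1, 2, 3, 4, 5)):
    """Ordinal Krippendorff's alpha.

    units: iterable of rating lists, one list per (pair, take, axis) unit.
    Units with fewer than two ratings are skipped.
    Returns NaN when expected disagreement is zero (alpha is undefined).
    """
    # Coincidence matrix: each ordered pair of ratings from different
    # scorers within a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = {c: sum(coincidences[(c, k)] for k in levels) for c in levels}
    n = sum(totals.values())
    if n <= 1:
        return float("nan")

    def delta_sq(c, k):
        # Ordinal distance: mass of all ranks from c to k inclusive,
        # minus half the mass at the two endpoints, squared.
        lo, hi = min(c, k), max(c, k)
        span = sum(totals[g] for g in levels if lo <= g <= hi)
        return (span - (totals[c] + totals[k]) / 2.0) ** 2

    observed = sum(
        coincidences[(c, k)] * delta_sq(c, k) for c in levels for k in levels
    ) / n
    expected = sum(
        totals[c] * totals[k] * delta_sq(c, k) for c in levels for k in levels
    ) / (n * (n - 1))
    if expected == 0:
        return float("nan")
    return 1.0 - observed / expected
```

With this metric, a panel that splits 5, 5, 4 on one unit keeps a higher alpha than the same panel splitting 5, 5, 2, which is the behavior the ordinal weighting is meant to produce.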

Supporting statistics

STATISTIC | DEFINITION | HEALTHY RANGE
Pick agreement | Across all scorer couples on multi-scored pairs, the share that chose the same take | ≥ 75%
Unanimous rate | Share of fully scored pairs where all three scorers picked the same take | Reported, not gated
Per-evaluator label agreement | How often an individual's pick matched the final majority label | 70 to 90%

All reliability figures, with per-pair detail and method notes, export as a machine-readable report that accompanies every dataset delivery and leaderboard publication.
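The three supporting statistics can be computed from the picks alone. The sketch below assumes a hypothetical input layout, a mapping from pair ID to each scorer's pick, and treats a pair as fully scored once it has three picks. The function names are illustrative, not part of any published WaffleBench interface.

```python
from collections import Counter
from itertools import combinations


def pick_agreement(picks):
    """Share of scorer couples, across multi-scored pairs, that chose the same take."""
    same = total = 0
    for by_scorer in picks.values():
        for a, b in combinations(list(by_scorer.values()), 2):
            total += 1
            same += a == b
    return same / total if total else None


def unanimous_rate(picks, n_scorers=3):
    """Share of fully scored pairs where every scorer picked the same take."""
    full = [v for v in picks.values() if len(v) == n_scorers]
    if not full:
        return None
    return sum(len(set(v.values())) == 1 for v in full) / len(full)


def label_agreement(picks, n_scorers=3):
    """Per scorer: how often their pick matched the majority label on fully scored pairs."""
    hits, seen = Counter(), Counter()
    for by_scorer in picks.values():
        if len(by_scorer) != n_scorers:
            continue
        # With three scorers and two takes the majority is never tied.
        label = Counter(by_scorer.values()).most_common(1)[0][0]
        for scorer, pick in by_scorer.items():
            seen[scorer] += 1
            hits[scorer] += pick == label
    return {scorer: hits[scorer] / seen[scorer] for scorer in seen}
```

For example, with `picks = {"p1": {"s1": "A", "s2": "A", "s3": "A"}, "p2": {"s1": "A", "s2": "B", "s3": "A"}}`, the unanimous rate is 0.5 and scorer `s2` matches the final label on half of the pairs.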

5 · Outputs

Preference pairs

Each fully scored pair resolves by majority into one training record:

{
  "pair_id": "CIN-MA-EST-001",
  "model": "(unmasked at delivery)",
  "prompt": "A lone figure walks toward a cabin at dawn...",
  "chosen":   { "take": "B", "clip_url": "...", "mean_quality": 4.0 },
  "rejected": { "take": "A", "clip_url": "...", "mean_quality": 2.7 },
  "n_scorers": 3,
  "unanimous": true,
  "pick_rationales": ["Both render cleanly, so it comes down to intentionality..."]
}
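A rough sketch of the majority resolution that produces such a record follows. Several details are assumptions rather than documented behavior: `mean_quality` is taken to be the mean of all axis scores for a take across scorers, rounded to one decimal; every scorer's rationale is kept; and the `model` and `clip_url` fields are omitted because their values are not specified here. The function name `resolve_pair` and the judgment layout are hypothetical.

```python
from collections import Counter


def resolve_pair(pair_id, prompt, judgments):
    """Resolve one pair by majority pick.

    judgments: list of dicts shaped like
      {"pick": "A" or "B", "rationale": str,
       "scores": {"A": [six ints], "B": [six ints]}}
    """
    votes = Counter(j["pick"] for j in judgments)
    chosen, count = votes.most_common(1)[0]
    if count * 2 <= len(judgments):
        raise ValueError(f"{pair_id}: no majority pick")
    rejected = "B" if chosen == "A" else "A"

    def mean_quality(take):
        # Assumption: average every axis score for this take over all scorers.
        values = [s for j in judgments for s in j["scores"][take]]
        return round(sum(values) / len(values), 1)

    return {
        "pair_id": pair_id,
        "prompt": prompt,
        "chosen": {"take": chosen, "mean_quality": mean_quality(chosen)},
        "rejected": {"take": rejected, "mean_quality": mean_quality(rejected)},
        "n_scorers": len(judgments),
        "unanimous": count == len(judgments),
        "pick_rationales": [j["rationale"] for j in judgments],
    }
```

Raising on a tie keeps an evenly split pair from silently becoming a training label; with three scorers and two takes a tie cannot occur, so the check only matters for partially scored pairs.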

Quality scores and the leaderboard

A model's quality score is the mean of all six-axis scores across all of its takes from all scorers. The leaderboard ranks models on quality scores; identities are unmasked only at publication. Because picks and scores come from the same pass, the leaderboard and the preference datasets can never disagree about what the panel saw.
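The quality score as defined above is a flat mean, so every individual axis score carries equal weight. A minimal sketch, assuming a hypothetical input of `(model, score)` rows with one row per axis score per take per scorer:

```python
from collections import defaultdict


def leaderboard(score_rows):
    """Rank models by the mean of all their axis scores, highest first.

    score_rows: iterable of (model, score) tuples, one per axis score
    per take per scorer (an assumed layout, for illustration).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model, score in score_rows:
        sums[model] += score
        counts[model] += 1
    quality = {model: sums[model] / counts[model] for model in sums}
    return sorted(quality.items(), key=lambda item: item[1], reverse=True)
```

Because the mean is taken over individual scores rather than over per-take averages, a model with more scored takes contributes more rows but is not otherwise favored.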

The diagnostic layer

Every record carries its written reasoning. Aggregating notes by axis surfaces each model's failure modes in the panel's own words ("contact shadows detach on landing," "push-ins never settle"), which is simultaneously the eval insight and the spec for the corrective training data.

6 · Scope and relation to physics benchmarks

Objective physical realism (object permanence, gravity, impenetrability, momentum) is best measured with programmatic ground truth on synthetic scenes, and recent work shows current VLM judges remain unreliable even there. WaffleBench is the complementary instrument. It covers the dimensions that have no closed-form check: cinematic craft, art-directability, and expert-judged physical plausibility in real-world contexts. It uses calibrated human experts precisely where machine critics cannot yet be trusted. A complete evaluation of a video or world model uses both: programmatic benches for what is provable, WaffleBench for what must be judged.

7 · Running WaffleBench

There are three ways a model appears on or against the benchmark:

TRACK | WHAT IT IS | TERMS
Public leaderboard | We generate outputs from publicly available models on the locked anchor set and publish quality scores and rankings, with the reliability report attached. | Free. No participation required.
Private evaluation | A lab submits an unreleased model or checkpoint. The full panel scores it blind against the same anchors; the lab receives the quality scores, six-axis diagnostics with written reasoning, and the minted preference pairs. Results stay private unless the lab opts into publication. | Paid, per run per vertical. Publication and badge rights included on opt-in.
Continuous monitoring | Recurring private runs per checkpoint or per release cycle, tracking quality-score movement axis by axis across training. | Paid subscription.

Every evaluation also functions as a specification: the panel's written axis notes identify exactly which craft behaviors a model lacks, and Waffle Video supplies the corresponding training data, from targeted sensor-fused capture to the preference pairs minted by the evaluation itself. The diagnosis and the cure come from the same pipeline.