WaffleBench · Methods

1 · Design principles

Craft-first, but attributable

Objective-physics benchmarks attach exact ground truth to each test item, attributing every failure to a named law, object, and frame. Craft has no programmatic ground truth, so WaffleBench builds its attribution layer from structured expert judgment instead: every score on every axis requires a written, one-line craft observation, and every preference pick requires a rationale. A score without its reasoning is not accepted by the platform. The result is the craft analogue of law/object/frame attribution: which axis, which observation, which take.

Calibrated humans, not aggregated crowds

Benchmarks that rest on averaged crowd judgment are capped by annotator disagreement. WaffleBench addresses this at the source: scorers are course-matched experts (graduates of the matching Curious Refuge program for each vertical), individually calibrated against a consensus anchor set before any scoring counts, and monitored against a published inter-rater reliability standard for the life of the benchmark.

Two products from one pass

The same scoring pass yields (a) chosen-versus-rejected preference pairs, the format consumed by RLHF reward-model training and by DPO, and (b) six-axis quality scores that aggregate into the public leaderboard. Nothing is scored twice; nothing is assembled manually.

2 · The task: same-model pairwise

For each anchor prompt, a model generates two takes under identical settings. A scorer reviews both blind, scores each take 1 to 5 on six craft axes with written reasoning, and picks the stronger take with a written rationale. The unit of work is the pair, not the clip: comparing two outputs of the same model isolates within-model variance and converts every judgment directly into a usable preference label.

WHY SAME-MODEL, NOT CROSS-MODEL

Cross-model A/B picks confound model identity with prompt fit and style. Same-model comparison holds the model constant, so the pick is purely a craft judgment, and the resulting pair is exactly the chosen/rejected structure a lab feeds to a preference-optimization pipeline for that model. Cross-model ranking still emerges, but from quality scores, not from picks (section 5).

Blinding and ordering

Scorers never see model names. Pairs carry masked labels; identity is held admin-side and revealed only in the published leaderboard.
Take display order is randomized per scorer per pair to control position bias; the realized order is stored with each submission.
Three independent scorers see every pair. Assignment is automatic; pairs close at three submissions; scorers are instructed not to discuss open pairs.
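
The per-scorer, per-pair ordering could be sketched as follows. This is an illustration only: the function names, the `secret` salt, and the choice of a seeded (reproducible) shuffle are assumptions, not a description of the platform's implementation.

```python
import hashlib
import random

def display_order(pair_id: str, scorer_id: str, secret: str = "demo-salt") -> list:
    """Return the order in which takes A and B are shown to one scorer.

    Hypothetical design: the shuffle is seeded from (secret, pair, scorer),
    so the order differs across scorers but can be reproduced for audit.
    """
    digest = hashlib.sha256(f"{secret}:{pair_id}:{scorer_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    order = ["A", "B"]
    rng.shuffle(order)
    return order

def new_submission(pair_id: str, scorer_id: str) -> dict:
    """Start a submission record that stores the realized display order."""
    return {
        "pair_id": pair_id,
        "scorer_id": scorer_id,
        "display_order": display_order(pair_id, scorer_id),
    }
```

Storing the realized order with each submission is what later allows position bias to be checked, for example by comparing how often the first-shown take wins against 50%.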

The cinematic rubric (pilot vertical)

AXIS	WHAT IT EVALUATES
Composition & Framing	Intentional framing, balance, use of the frame
Lighting Intentionality	Light feels motivated and shaped, not just present
Color & Tonal Grading	Palette cohesion and tonal control across the shot
Camera Motion Language	Moves are deliberate, motivated, and settled
Depth & Lens Character	Compression, depth separation, optical believability
Overall Cinematic Feel	Holistic: would this hold as a take in a finished film?

Scale anchors: 1 broken · 2 amateur · 3 competent but unauthored · 4 deliberate professional craft · 5 reference grade. Every other vertical (VFX, documentary, animation, commercial, and the two world-model tracks) carries its own six-axis rubric of the same shape.
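
As a rough sketch of the rule that no score or pick is accepted without its reasoning, a submission check under this rubric might look like the following. The axis names come from the table above; the `AxisScore` type, the `validate_submission` function, and the dictionary layout are hypothetical.

```python
from dataclasses import dataclass

# Axis names from the cinematic rubric; the data layout is an assumption.
AXES = (
    "Composition & Framing",
    "Lighting Intentionality",
    "Color & Tonal Grading",
    "Camera Motion Language",
    "Depth & Lens Character",
    "Overall Cinematic Feel",
)

@dataclass(frozen=True)
class AxisScore:
    score: int  # 1 to 5 on the scale anchors
    note: str   # written one-line craft observation

def validate_submission(takes: dict, pick: str, rationale: str) -> list:
    """Return a list of problems; an empty list means the submission is acceptable."""
    errors = []
    for take in ("A", "B"):
        for axis in AXES:
            entry = takes.get(take, {}).get(axis)
            if entry is None:
                errors.append(f"take {take}: missing score for {axis}")
            elif not 1 <= entry.score <= 5:
                errors.append(f"take {take}: {axis} score must be 1 to 5")
            elif not entry.note.strip():
                errors.append(f"take {take}: {axis} score has no written note")
    if pick not in ("A", "B"):
        errors.append("pick must be A or B")
    if not rationale.strip():
        errors.append("pick requires a written rationale")
    return errors
```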

3 · Panel qualification

A scorer qualifies once per vertical by scoring three reference clips spanning the scale (weak, competent, strong), each consensus-scored in advance by the most experienced course-matched experts. Grading is automatic and instant:

Agreement: at least 80% of the scorer's 18 axis scores (3 clips × 6 axes), that is, at least 15 of the 18, must land within one point of consensus.
No reversed calls: on the Overall axis, scoring a consensus-strong clip (4 to 5) as weak (1 to 2), or the inverse, fails the round regardless of percentage.
Two attempts: a failed round produces axis-by-axis feedback against the consensus reasoning, then one re-test. A second miss locks the seat pending admin review.
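
The first two rules can be sketched as a small grading function. This is a minimal illustration that assumes integer consensus scores and a nested `{clip: {axis: score}}` layout; the function name and the returned fields are hypothetical, and the two-attempt bookkeeping is left out.

```python
def grade_round(scorer: dict, consensus: dict,
                overall_axis: str = "Overall Cinematic Feel",
                min_share: float = 0.80) -> dict:
    """Grade one qualification round against the consensus anchor scores."""
    within = 0
    total = 0
    reversed_call = False
    for clip, axes in consensus.items():
        for axis, ref in axes.items():
            total += 1
            if abs(scorer[clip][axis] - ref) <= 1:
                within += 1
        # Reversed call: consensus-strong (4 to 5) scored weak (1 to 2), or the inverse.
        ref_overall = axes[overall_axis]
        got_overall = scorer[clip][overall_axis]
        if (ref_overall >= 4 and got_overall <= 2) or (ref_overall <= 2 and got_overall >= 4):
            reversed_call = True
    share = within / total
    return {
        "share_within_one": share,
        "reversed_call": reversed_call,
        "passed": share >= min_share and not reversed_call,
    }
```

With 18 axis scores, 14 within one point gives 77.8% and fails, while 15 gives 83.3% and passes, provided there is no reversed call on the Overall axis.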

Qualification is unpaid; every subsequent evaluation is paid per completed pair. Each evaluator's calibration agreement, attempts, and ongoing pick agreement with final labels are tracked automatically and exportable as an anonymized panel profile.

4 · Reliability: the α ≥ 0.70 standard

The benchmark publishes only when the panel demonstrates measurable agreement. The headline statistic is Krippendorff's alpha with an ordinal distance metric, computed over every (pair, take, axis) unit that carries two or more independent ratings:

α = 1 − D_o / D_e, where D_o is observed disagreement and D_e is the disagreement expected by chance, both weighted by ordinal distance so that a 4-versus-5 split costs less than a 2-versus-5 split.

The ordinal metric matches how craft judgment behaves: adjacent scores are near-agreement, opposite ends are not. Alpha is computed live as scoring proceeds, and the standard is α ≥ 0.70, a commonly used floor for drawing tentative research conclusions from coded data (Krippendorff's own guidance places that floor at 0.667 and reserves 0.80 for firm conclusions). Below 0.70, scoring pauses for a calibration discussion on the highest-disagreement pairs before resuming.
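
For readers who want to reproduce the statistic, here is a minimal pure-Python sketch of Krippendorff's alpha with the standard rank-based ordinal metric, computed from a coincidence matrix. It follows the textbook definition rather than any confirmed platform code; the function name and the input layout (one list of ratings per (pair, take, axis) unit) are assumptions.

```python
import math
from itertools import permutations

def krippendorff_alpha_ordinal(units, levels=(1, 2, 3, 4, 5)):
    """units: one list of ratings per unit; units with fewer than two ratings are skipped."""
    index = {value: i for i, value in enumerate(levels)}
    k = len(levels)
    # Coincidence matrix: each ordered pair of ratings within a unit adds 1 / (m - 1).
    coincidences = [[0.0] * k for _ in range(k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[index[a]][index[b]] += 1.0 / (m - 1)
    n_c = [sum(row) for row in coincidences]  # marginal count per score level
    n = sum(n_c)                              # total number of pairable ratings
    if n <= 1:
        return math.nan

    def delta2(c, j):
        # Ordinal distance: levels are farther apart the more ratings lie between them.
        lo, hi = min(c, j), max(c, j)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    d_o = sum(coincidences[c][j] * delta2(c, j)
              for c in range(k) for j in range(k)) / n
    d_e = sum(n_c[c] * n_c[j] * delta2(c, j)
              for c in range(k) for j in range(k)) / (n * (n - 1))
    # Alpha is undefined when every rating is identical (no expected disagreement).
    return math.nan if d_e == 0 else 1.0 - d_o / d_e
```

On the same toy data, a single 4-versus-5 split lowers alpha less than a single 2-versus-5 split, which is the behavior the ordinal weighting is meant to capture:

```python
near = krippendorff_alpha_ordinal([[1, 1], [2, 2], [4, 5], [5, 5]])
far = krippendorff_alpha_ordinal([[1, 1], [2, 2], [2, 5], [5, 5]])
assert near > far
```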

Supporting statistics

STATISTIC	DEFINITION	HEALTHY RANGE
Pick agreement	Across every two-scorer combination on multi-scored pairs, the share that chose the same take	≥ 75%
Unanimous rate	Share of fully scored pairs where all three scorers picked the same take	Reported, not gated
Per-evaluator label agreement	How often an individual's pick matched the final majority label	70 to 90%
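
The three supporting statistics can be sketched from a table of picks. The input layout `{pair_id: {scorer_id: "A" or "B"}}` and the function name are hypothetical. The sketch also assumes that per-evaluator agreement is measured only on fully scored pairs, since only those have a final majority label.

```python
from collections import Counter
from itertools import combinations

def supporting_stats(picks_by_pair: dict, panel_size: int = 3) -> dict:
    """Compute pick agreement, unanimous rate, and per-evaluator label agreement."""
    same = combos = 0
    unanimous = full = 0
    hits, seen = Counter(), Counter()
    for picks in picks_by_pair.values():
        votes = list(picks.values())
        # Pick agreement: every two-scorer combination on pairs with 2+ picks.
        for a, b in combinations(votes, 2):
            combos += 1
            same += a == b
        if len(votes) == panel_size:
            full += 1
            unanimous += len(set(votes)) == 1
            majority = Counter(votes).most_common(1)[0][0]
            for scorer, pick in picks.items():
                seen[scorer] += 1
                hits[scorer] += pick == majority
    return {
        "pick_agreement": same / combos if combos else None,
        "unanimous_rate": unanimous / full if full else None,
        "per_evaluator": {s: hits[s] / seen[s] for s in seen},
    }
```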

All reliability figures, with per-pair detail and method notes, export as a machine-readable report that accompanies every dataset delivery and leaderboard publication.

5 · Outputs

Preference pairs

Each fully scored pair resolves by majority into one training record:

{
  "pair_id": "CIN-MA-EST-001",
  "model": "(unmasked at delivery)",
  "prompt": "A lone figure walks toward a cabin at dawn...",
  "chosen":   { "take": "B", "clip_url": "...", "mean_quality": 4.0 },
  "rejected": { "take": "A", "clip_url": "...", "mean_quality": 2.7 },
  "n_scorers": 3,
  "unanimous": true,
  "pick_rationales": ["Both render cleanly, so it comes down to intentionality..."]
}
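
A minimal sketch of how three submissions might resolve into such a record. The function name and the submission layout are hypothetical, the `model` and `clip_url` fields are omitted, and two readings of the sample are assumptions: that `mean_quality` is the mean of every axis score a take received across scorers, rounded to one decimal, and that `pick_rationales` lists every scorer's rationale.

```python
from collections import Counter

def resolve_pair(pair_id: str, prompt: str, submissions: list) -> dict:
    """submissions: dicts with 'pick', 'rationale', and 'scores' = {'A': [...], 'B': [...]}."""
    votes = Counter(s["pick"] for s in submissions)
    (chosen, top), *rest = votes.most_common()
    if rest and rest[0][1] == top:
        raise ValueError("no majority; an odd number of scorers is expected")
    rejected = "B" if chosen == "A" else "A"

    def mean_quality(take: str) -> float:
        # Pool every axis score this take received from every scorer.
        scores = [x for s in submissions for x in s["scores"][take]]
        return round(sum(scores) / len(scores), 1)

    return {
        "pair_id": pair_id,
        "prompt": prompt,
        "chosen": {"take": chosen, "mean_quality": mean_quality(chosen)},
        "rejected": {"take": rejected, "mean_quality": mean_quality(rejected)},
        "n_scorers": len(submissions),
        "unanimous": len(votes) == 1,
        "pick_rationales": [s["rationale"] for s in submissions],
    }
```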

Quality scores and the leaderboard

A model's quality score is the mean of every axis score given to any of its takes by any scorer. The leaderboard ranks models on quality scores; identities are unmasked only at publication. Because picks and scores come from the same pass, the leaderboard and the preference datasets are derived from the same submissions and cannot diverge on what the panel saw.
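
The aggregation is a flat mean followed by a sort, sketched here under the assumption that scores arrive as `(model, axis_score)` rows, one per scorer, take, and axis. The function name and row layout are hypothetical.

```python
from collections import defaultdict

def leaderboard(score_rows) -> list:
    """Return [(model, quality_score), ...] sorted from highest to lowest quality."""
    totals = defaultdict(lambda: [0, 0])  # model -> [sum of scores, count]
    for model, score in score_rows:
        totals[model][0] += score
        totals[model][1] += 1
    quality = {model: total / count for model, (total, count) in totals.items()}
    return sorted(quality.items(), key=lambda item: item[1], reverse=True)
```

Because the mean is taken over individual axis scores rather than over per-take averages, every scorer judgment carries equal weight.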

The diagnostic layer

Every record carries its written reasoning. Aggregating notes by axis surfaces each model's failure modes in the panel's own words ("contact shadows detach on landing," "push-ins never settle"), which is simultaneously the eval insight and the spec for the corrective training data.
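
One simple way to build that view is to group the notes attached to low scores by model and axis. The record layout, the function name, and the cutoff of 2 (the "amateur" anchor and below) are illustrative assumptions.

```python
from collections import defaultdict

def failure_notes(records, max_score: int = 2) -> dict:
    """Group written notes from low-scoring axis judgments by (model, axis)."""
    grouped = defaultdict(list)
    for record in records:
        if record["score"] <= max_score:
            grouped[(record["model"], record["axis"])].append(record["note"])
    return dict(grouped)
```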

6 · Scope and relation to physics benchmarks

Objective physical realism (object permanence, gravity, impenetrability, momentum) is best measured with programmatic ground truth on synthetic scenes, and recent work shows current VLM judges remain unreliable even there. WaffleBench is the complementary instrument: it covers the dimensions that have no closed-form check, cinematic craft, art-directability, and expert-judged physical plausibility in real-world contexts, using calibrated human experts precisely where machine critics cannot yet be trusted. A complete evaluation of a video or world model uses both: programmatic benches for what is provable, WaffleBench for what must be judged.

7 · Running WaffleBench

There are three ways a model appears on or against the benchmark:

TRACK	WHAT IT IS	TERMS
Public leaderboard	We generate outputs from publicly available models on the locked anchor set and publish quality scores and rankings, with the reliability report attached.	Free. No participation required.
Private evaluation	A lab submits an unreleased model or checkpoint. The full panel scores it blind against the same anchors; the lab receives the quality scores, six-axis diagnostics with written reasoning, and the minted preference pairs. Results stay private unless the lab opts into publication.	Paid, per run per vertical. Publication and badge rights included on opt-in.
Continuous monitoring	Recurring private runs per checkpoint or per release cycle, tracking quality-score movement axis by axis across training.	Paid subscription.

Every evaluation also functions as a specification: the panel's written axis notes identify exactly which craft behaviors a model lacks, and Waffle Video supplies the corresponding training data, from targeted sensor-fused capture to the preference pairs minted by the evaluation itself. The diagnosis and the cure come from the same pipeline.