WAFFLEBENCH · BY WAFFLE VIDEO

The expert eval layer for generative video and world models.

Calibrated professional filmmakers score frontier models blind on six-axis craft rubrics, under a published Krippendorff α ≥ 0.70 reliability standard. Every judgment is attributable: which axis, which observation, which take.

THE SCIENCE

Methods

The full methodology: same-model pairwise design, panel qualification, the reliability statistics, and the output schemas.
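
As a rough illustration of the reliability statistic named above, here is a minimal sketch of Krippendorff's α with the 0.70 threshold applied. It assumes rubric scores are treated as interval data (squared-difference distance); the metric WaffleBench actually uses is defined in the full methodology, not here. The function name, the `RELIABILITY_THRESHOLD` constant, and the sample scores are hypothetical.

```python
from itertools import permutations

RELIABILITY_THRESHOLD = 0.70  # the published standard quoted on this page


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha, assuming an interval (squared-difference) metric.

    units: one list per scored item, holding each scorer's value for that
    item. None marks a missing score. Items with fewer than two scores
    are not pairable and are dropped.
    """
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [unit for unit in pairable if len(unit) >= 2]
    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: squared differences within each item,
    # weighted by 1 / (m_u - 1), averaged over all pairable scores.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(unit, 2)) / (len(unit) - 1)
        for unit in pairable
    ) / n

    # Expected disagreement: squared differences across all pairable
    # scores, ignoring which item they belong to.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")

    return 1 - d_o / d_e


# Hypothetical scores: five items, three blind scorers each, on one axis.
scores = [[4, 4, 5], [2, 3, 2], [5, 5, 5], [1, 2, 1], [3, 3, 4]]
alpha = krippendorff_alpha_interval(scores)
print(f"alpha = {alpha:.3f}, meets standard: {alpha >= RELIABILITY_THRESHOLD}")
```

Perfect agreement yields α = 1, agreement at chance level yields α near 0, and systematic disagreement drives α negative.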

Read the methods →

FOR LABS

Submit your model

Private pre-release evaluation: quality scores, six-axis diagnostics, RLHF/DPO preference pairs, and the reliability report. Results are published only if you opt in.

Request a private run →

FOR THE PANEL

Evaluator platform

Panel members: sign in with your access code to qualify and score. Each pair is scored by three blind scorers, and scorers are paid per evaluation.

Open the platform →