Scoring Rubric - Spec-First Build
The deliberate message in the maths: “it works” is worth 6 points. The human review gate is worth 12. The test pyramid is worth 14. Doing it spec-first is worth 10. A team whose endpoint barely runs but is specified, contract-tested, and signed by a named non-author beats a team with a slick demo and no rigor. That ranking is the entire teaching point - say it before you score.
Scored per team, out of 100. Every line maps to the Definition of Done and a corporate-standards ID. Facilitators score live during the demo showdown; witness slips from Gate 2 settle the review points.
The scorecard
| # | Criterion | What earns it | Cite(s) | Max |
|---|---|---|---|---|
| 1 | Spec-first order observed | Acceptance criteria + frozen contract existed before any feature code (Gate 1 initialled). | QUAL-PRINC-SHIFT-LEFT, Way #1 | 10 |
| 2 | Machine-readable contract | OpenAPI is real and in sync: $ref-ed response, shared error envelope, explicit version, consistent conventions. | REQ-API-2, REQ-API-4, REQ-API-5 | 10 |
| 3 | Idempotency reasoning in ADR | ADR explains why GET is idempotent and what would change if it became a write. | REQ-API-6 | 4 |
| 4 | ADR quality | Captures the decision and the why - a real trade-off, durable, simplest-thing-that-works. | ENG-PRIN-DOC-STMT, ENG-PRIN-SIMPLE-STMT | 6 |
| 5 | Test pyramid | First commit = stub + failing contract test; behaviours tested at the lowest viable layer; no duplicated coverage; right-shaped pyramid. | QTEST-EARLY, QTEST-NO-DUP, REQ-API-7 | 14 |
| 6 | Human review gate | A named non-author approved against the DoD; AI assisted not approved; no self-merge. | ENG-PRIN-REVIEW-STMT | 12 |
| 7 | Cross-role handoff | Build visibly consumed Product’s criteria, Architecture’s ADR, QA’s tests - quality as a team property. | QUAL-PRINC-TEAM | 10 |
| 8 | UX states | Loading / empty / error / populated all handled; error state reads the contract’s Problem envelope. | QUAL-PRINC-SDLC, REQ-API-5 | 6 |
| 9 | Endpoint actually works | The happy path returns the agreed shape, live. | REQ-API-2 | 6 |
| 10 | Standards technique spotting | Facilitator-observed standards moves during build (see table below), 2 pts each. | (various) | 8 |
| 11 | ”Most Standards-Aligned” (audience) | Top-voted team after demos. | - | 5 |
| 12 | Sabotage survival | Survived the delivered card the right way (re-assigned reviewer / updated contract+test). | ENG-PRIN-REVIEW-STMT, REQ-API-2, QTEST-NO-DUP | 3 |
| TOTAL | 94 + 6 bonus |
Where the weight sits - on purpose: Review (12) + Tests (14) + Spec-first (10) = 36 points for rigor. “Endpoint actually works” = 6. Rigor outscores results 6-to-1. A team cannot demo their way to the top; they have to earn it through the process.
Scoring notes
- Gate is a multiplier on shipping, not just points. A team that never passes Gate 2 (no named human approval) forfeits criterion 6 (12 pts) and cannot be ranked as “shipped” - they’re scored on spec/test merit only. This is
ENG-PRIN-REVIEW-STMTmade literal. - No contract, capped low. If criterion 2 scores 0 (no machine-readable contract), cap criterion 9 at 0 too - you can’t have “it works to spec” with no spec.
- Partial credit is fine. A right-shaped pyramid with one missing layer still earns most of criterion 5. Award against the DoD MUST/SHOULD bands.
Standards technique spotting (criterion 10)
Watch the build. When you see one, call it out loudly: “Team B just wrote the failing contract test before any code - 2 points!” Max 4 techniques scored per team (8 pts).
| Technique | What to look for | Cite |
|---|---|---|
| Contract-first | Writes the OpenAPI $ref schema before touching implementation | REQ-API-2, Way #1 |
| Red-first | Commits a failing contract test as commit #1 | QTEST-CONTRACT, ENG-PRIN-INCR-STMT |
| Lowest-layer test | Pushes a rule down to a unit test instead of an E2E | QTEST-EARLY |
| No-dup discipline | Refuses to re-assert the same rule at three layers | QTEST-NO-DUP, QUAL-PRINC-WASTE |
| Idempotency call-out | Names idempotency / Idempotency-Key in the ADR unprompted | REQ-API-6 |
| Envelope reuse | Reuses Problem / Lesson rather than copying a shape | REQ-API-5 |
| AI-assisted, human-signed | Uses AI to review the diff, then a human types the approval | ENG-PRIN-REVIEW-STMT, Way #2 |
| Wrote down the why | Captures a real trade-off in the ADR, not just the choice | ENG-PRIN-DOC-STMT, Way #5 |
Leaderboard template
Project this. Update live and announce every change - don’t move numbers silently.
| Team | Stack / Substrate | 1. Spec-first /10 | 2. Contract /10 | 3. Idemp. /4 | 4. ADR /6 | 5. Tests /14 | 6. Review /12 | 7. Handoff /10 | 8. UX /6 | 9. Works /6 | 10. Techniques /8 | 11. Audience /5 | 12. Sabotage /3 | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Team 1 | ||||||||||||||
| Team 2 | ||||||||||||||
| Team 3 |
Spot prizes (award on the day):
- Most Standards-Aligned - audience vote (criterion 11)
- Best ADR - clearest why (criterion 4)
- Greenest Pyramid - best-shaped test suite (criterion 5)
- Cleanest Gate 2 - most rigorous human review against the DoD (criterion 6)
Update after: Gate 1 (mark spec-first locked), demos (criteria 1–10), audience vote (11), sabotage resolution (12). Make the rigor categories the dramatic ones - that’s where the lesson lands.
Part of Innovation Day. The five ways: ways-of-working.md. The bar being scored: definition-of-done.md. How to run it: facilitator-guide.md.