Scoring Rubric - Spec-First Build

The deliberate message in the maths: “it works” is worth 6 points. The human review gate is worth 12. The test pyramid is worth 14. Doing it spec-first is worth 10. A team whose endpoint barely runs but is specified, contract-tested, and signed by a named non-author beats a team with a slick demo and no rigor. That ranking is the entire teaching point - say it before you score.

Scored per team, out of 100. Every line maps to the Definition of Done and a corporate-standards ID. Facilitators score live during the demo showdown; witness slips from Gate 2 settle the review points.

The scorecard

#	Criterion	What earns it	Cite(s)	Max
1	Spec-first order observed	Acceptance criteria + frozen contract existed before any feature code (Gate 1 initialled).	QUAL-PRINC-SHIFT-LEFT, Way #1	10
2	Machine-readable contract	OpenAPI is real and in sync: `$ref`-ed response, shared error envelope, explicit version, consistent conventions.	REQ-API-2, REQ-API-4, REQ-API-5	10
3	Idempotency reasoning in ADR	ADR explains why GET is idempotent and what would change if it became a write.	REQ-API-6	4
4	ADR quality	Captures the decision and the why - a real trade-off, durable, simplest-thing-that-works.	ENG-PRIN-DOC-STMT, ENG-PRIN-SIMPLE-STMT	6
5	Test pyramid	First commit = stub + failing contract test; behaviours tested at the lowest viable layer; no duplicated coverage; right-shaped pyramid.	QTEST-EARLY, QTEST-NO-DUP, REQ-API-7	14
6	Human review gate	A named non-author approved against the DoD; AI assisted not approved; no self-merge.	ENG-PRIN-REVIEW-STMT	12
7	Cross-role handoff	Build visibly consumed Product’s criteria, Architecture’s ADR, QA’s tests - quality as a team property.	QUAL-PRINC-TEAM	10
8	UX states	Loading / empty / error / populated all handled; error state reads the contract’s `Problem` envelope.	QUAL-PRINC-SDLC, REQ-API-5	6
9	Endpoint actually works	The happy path returns the agreed shape, live.	REQ-API-2	6
10	Standards technique spotting	Facilitator-observed standards moves during build (see table below), 2 pts each.	(various)	8
11	”Most Standards-Aligned” (audience)	Top-voted team after demos.	-	5
12	Sabotage survival	Survived the delivered card the right way (re-assigned reviewer / updated contract+test).	ENG-PRIN-REVIEW-STMT, REQ-API-2, QTEST-NO-DUP	3
			TOTAL	94 + 6 bonus

Where the weight sits - on purpose: Review (12) + Tests (14) + Spec-first (10) = 36 points for rigor. “Endpoint actually works” = 6. Rigor outscores results 6-to-1. A team cannot demo their way to the top; they have to earn it through the process.

Scoring notes

Gate is a multiplier on shipping, not just points. A team that never passes Gate 2 (no named human approval) forfeits criterion 6 (12 pts) and cannot be ranked as “shipped” - they’re scored on spec/test merit only. This is ENG-PRIN-REVIEW-STMT made literal.
No contract, capped low. If criterion 2 scores 0 (no machine-readable contract), cap criterion 9 at 0 too - you can’t have “it works to spec” with no spec.
Partial credit is fine. A right-shaped pyramid with one missing layer still earns most of criterion 5. Award against the DoD MUST/SHOULD bands.

Standards technique spotting (criterion 10)

Watch the build. When you see one, call it out loudly: “Team B just wrote the failing contract test before any code - 2 points!” Max 4 techniques scored per team (8 pts).

Technique	What to look for	Cite
Contract-first	Writes the OpenAPI `$ref` schema before touching implementation	REQ-API-2, Way #1
Red-first	Commits a failing contract test as commit #1	QTEST-CONTRACT, ENG-PRIN-INCR-STMT
Lowest-layer test	Pushes a rule down to a unit test instead of an E2E	QTEST-EARLY
No-dup discipline	Refuses to re-assert the same rule at three layers	QTEST-NO-DUP, QUAL-PRINC-WASTE
Idempotency call-out	Names idempotency / `Idempotency-Key` in the ADR unprompted	REQ-API-6
Envelope reuse	Reuses `Problem` / `Lesson` rather than copying a shape	REQ-API-5
AI-assisted, human-signed	Uses AI to review the diff, then a human types the approval	ENG-PRIN-REVIEW-STMT, Way #2
Wrote down the why	Captures a real trade-off in the ADR, not just the choice	ENG-PRIN-DOC-STMT, Way #5

Leaderboard template

Project this. Update live and announce every change - don’t move numbers silently.

Team	Stack / Substrate	1. Spec-first /10	2. Contract /10	3. Idemp. /4	4. ADR /6	5. Tests /14	6. Review /12	7. Handoff /10	8. UX /6	9. Works /6	10. Techniques /8	11. Audience /5	12. Sabotage /3	TOTAL
Team 1
Team 2
Team 3

Spot prizes (award on the day):

Most Standards-Aligned - audience vote (criterion 11)
Best ADR - clearest why (criterion 4)
Greenest Pyramid - best-shaped test suite (criterion 5)
Cleanest Gate 2 - most rigorous human review against the DoD (criterion 6)

Update after: Gate 1 (mark spec-first locked), demos (criteria 1–10), audience vote (11), sabotage resolution (12). Make the rigor categories the dramatic ones - that’s where the lesson lands.

Part of Innovation Day. The five ways: ways-of-working.md. The bar being scored: definition-of-done.md. How to run it: facilitator-guide.md.

Source: docs/innovation-day/spec-first-build/scoring-rubric.md