Metric Decomposer Applied: Improving tldraw Diagram Quality

Step 1 — MART Pre-Check

Scoring "Diagram Quality" on MART

Before decomposing, score the top-level metric. "Diagram Quality" is vague — it fails MART immediately.

Property	Score	Problem
Measurable	2/5	What IS quality? Subjective, no formula
Actionable	1/5	"Make it better" — which lever? Layout? Colors? Content?
Relevant	4/5	Yes, we do want better diagrams
Timely	3/5	Can evaluate per-diagram, but manually
Total: 10/20		Decomposition needed

Verdict

"Diagram Quality" scores 10/20 — well below the threshold. Actionability (1/5) is the critical gap: there's no single lever to pull. This metric must be decomposed.

Steps 2-4 — Decompose

"Diagram Quality" → Five Measurable Dimensions

Using a custom decomposition (quality isn't a funnel or stock-flow), we identify five independent dimensions that jointly determine whether a diagram is "good."

Decomposition Tree with Evaluability Scores

Diagram Quality (MART: 10/20)

Quality = f(Structure, Layout, Semantics, Aesthetics, Completeness)

Leaf	Covers	Quadrant	Score
Structural Correctness	No overlaps, arrows bound, identity holds	Q1	0.92
Layout Quality	Alignment, spacing, no crossings	Q1	0.83
Semantic Clarity	Right shapes, labels, colors for meaning	Q2	0.67
Visual Aesthetics	Color harmony, fill consistency, style	Q3	0.58
Content Completeness	All components shown, nothing missing	Q3	0.42

Steps 5-6 — Score Evaluability

Scoring each leaf for auto-research readiness

Each leaf is scored on six evaluability criteria: clarity, data, latency, cost, alignment, and verification. The question: which leaves can an agent optimize autonomously?

Leaf Metric	Clarity	Data	Latency	Cost	Alignment	Verification	Score	Quadrant
Structural Correctness	1.0	1.0	1.0	1.0	1.0	0.5	0.92	Q1
Layout Quality	1.0	1.0	1.0	1.0	0.5	0.5	0.83	Q1
Semantic Clarity	0.5	1.0	1.0	0.5	0.5	0.5	0.67	Q2
Visual Aesthetics	0.5	1.0	1.0	0.5	0.5	0	0.58	Q3
Content Completeness	0.5	0.5	0.5	0.5	0.5	0	0.42	Q3
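
The Score column is consistent with an unweighted mean of the six criterion scores. The sketch below reproduces it from the table rows; the aggregation rule is inferred from the numbers rather than stated in the source, and the `evaluability` helper name is hypothetical.

```python
from statistics import mean

# Criterion scores copied from the table above, in column order:
# Clarity, Data, Latency, Cost, Alignment, Verification.
LEAVES = {
    "Structural Correctness": [1.0, 1.0, 1.0, 1.0, 1.0, 0.5],
    "Layout Quality": [1.0, 1.0, 1.0, 1.0, 0.5, 0.5],
    "Semantic Clarity": [0.5, 1.0, 1.0, 0.5, 0.5, 0.5],
    "Visual Aesthetics": [0.5, 1.0, 1.0, 0.5, 0.5, 0.0],
    "Content Completeness": [0.5, 0.5, 0.5, 0.5, 0.5, 0.0],
}


def evaluability(scores):
    """Unweighted mean of the six criterion scores (assumed aggregation)."""
    return mean(scores)


for name, scores in LEAVES.items():
    print(f"{name}: {evaluability(scores):.2f}")
```

Rounded to two decimals, the output matches the Score column for all five leaves.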

Key Finding

Two leaves are Q1 (auto-researchable): Structural Correctness and Layout Quality. These can be verified programmatically — the validate-diagram.sh script already checks overlaps and unbound arrows. An agent can iterate on these without human judgment. The other three need human-in-the-loop or further decomposition.

Deep Dive

What makes each leaf evaluable (or not)

Q1: Structural Correctness (0.92)

No overlapping shapes (programmatic check)
All arrows bound to shapes (API verifiable)
Shape count matches plan (countable)
No orphaned shapes (graph traversal)
Zoom-to-fit after drawing (deterministic)

Already automated via validate-diagram.sh
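
To make the checks above concrete, here is a minimal sketch of what such a validator might compute. It is not the contents of validate-diagram.sh. The shape and arrow dictionaries (`id`, `x`, `y`, `w`, `h`, `start`, `end`) are a hypothetical simplified format, not tldraw's record schema, and "orphaned" is approximated as "no arrow touches the shape" instead of a full graph traversal.

```python
from itertools import combinations


def boxes_overlap(a, b):
    """True if two axis-aligned boxes share interior area."""
    return (a["x"] < b["x"] + b["w"] and b["x"] < a["x"] + a["w"]
            and a["y"] < b["y"] + b["h"] and b["y"] < a["y"] + a["h"])


def structural_issues(shapes, arrows, planned_count):
    """Return a list of human-readable structural problems."""
    issues = []
    by_id = {s["id"]: s for s in shapes}

    # Overlapping shapes: pairwise bounding-box test.
    for a, b in combinations(shapes, 2):
        if boxes_overlap(a, b):
            issues.append(f"overlap: {a['id']} / {b['id']}")

    # Unbound arrows: either end missing or pointing at an unknown shape.
    connected = set()
    for arrow in arrows:
        ends = (arrow.get("start"), arrow.get("end"))
        if any(e not in by_id for e in ends):
            issues.append(f"unbound arrow: {arrow['id']}")
        connected.update(e for e in ends if e in by_id)

    # Orphans (simplified): shapes that no arrow touches.
    for s in shapes:
        if s["id"] not in connected:
            issues.append(f"orphan: {s['id']}")

    # Shape count must match the plan.
    if len(shapes) != planned_count:
        issues.append(f"shape count {len(shapes)} != planned {planned_count}")
    return issues
```

An empty list corresponds to `issue_count == 0`.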

Q1: Layout Quality (0.83)

Shapes aligned on grid (measurable offset)
Even spacing between shapes (calculable)
No arrow crossings (count of intersecting arrow segments)
Flow direction consistent (L-to-R or T-to-B)
Bounding box aspect ratio reasonable

Can be scored by reading shape coordinates
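
As an illustration of scoring layout from coordinates alone, the sketch below computes three of the measures listed above. The assumptions are all hypothetical: shapes are dictionaries with `x`, `y`, `w`, arrows are straight segments given as point pairs, the grid size is 20, and spacing is measured along a single left-to-right row.

```python
from itertools import combinations
from statistics import pstdev


def grid_offset(shapes, grid=20):
    """Mean distance of each shape's x and y from the nearest grid line."""
    def off(v):
        r = v % grid
        return min(r, grid - r)
    return sum(off(s["x"]) + off(s["y"]) for s in shapes) / (2 * len(shapes))


def gap_spread(shapes):
    """Standard deviation of horizontal gaps between neighbours in a row."""
    row = sorted(shapes, key=lambda s: s["x"])
    gaps = [b["x"] - (a["x"] + a["w"]) for a, b in zip(row, row[1:])]
    return pstdev(gaps) if len(gaps) > 1 else 0.0


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(s1, s2):
    """True if two segments properly intersect (touching ends do not count)."""
    (p1, p2), (p3, p4) = s1, s2
    return (_orient(p3, p4, p1) * _orient(p3, p4, p2) < 0
            and _orient(p1, p2, p3) * _orient(p1, p2, p4) < 0)


def crossing_count(segments):
    """Number of arrow pairs whose segments cross."""
    return sum(segments_cross(a, b) for a, b in combinations(segments, 2))
```

Curved or elbow arrows would need to be split into segments first; this sketch does not handle them.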

Q2: Semantic Clarity (0.67)

~ Right shape type for concept (diamonds for decisions)
~ Labels are descriptive, not truncated
~ Color encodes meaning (semantic color system)
~ Arrow labels explain relationships

Partially automatable — can check conventions, but "right shape for concept" requires understanding intent

Q3: Visual Aesthetics (0.58)

? Color harmony (max 4-5 colors)
? Consistent fill style across similar shapes
? Visual hierarchy guides the eye
? "Looks professional" (subjective)

Verification gap — can enforce rules (max colors, consistent fills) but can't verify "looks good" without human or vision model

Step 8 — Action Plan

How to improve tldraw skill at each quadrant

Q1 — Automate

Enhance validate-diagram.sh

Add layout quality checks: alignment deviation score, spacing variance, arrow crossing count, aspect ratio. Run automatically after every draw.

Fitness function: issue_count == 0 AND alignment_score > 0.9 AND spacing_variance < 15px

Q1 — Automate

Auto-fix layout pass

After validation, automatically apply align/stack/distribute to fix detected issues. Iterate until score converges.

Loop: draw → validate → fix → validate → screenshot (max 3 iterations)
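
A minimal sketch of this loop, combined with the fitness function from the previous card. The `draw`, `validate`, `fix`, and `screenshot` callables and the `Report` fields are hypothetical stand-ins for whatever the skill actually exposes, and "max 3 iterations" is read here as at most three fix passes.

```python
from dataclasses import dataclass


@dataclass
class Report:
    issue_count: int
    alignment_score: float
    spacing_variance: float  # in px, as in the fitness function


def is_fit(report):
    """issue_count == 0 AND alignment_score > 0.9 AND spacing_variance < 15px."""
    return (report.issue_count == 0
            and report.alignment_score > 0.9
            and report.spacing_variance < 15)


def draw_with_fixes(draw, validate, fix, screenshot, max_iterations=3):
    """draw -> validate -> fix -> validate ... -> screenshot."""
    diagram = draw()
    report = validate(diagram)
    for _ in range(max_iterations):
        if is_fit(report):
            break  # converged, no further fix pass needed
        diagram = fix(diagram, report)
        report = validate(diagram)
    # The screenshot is taken once, after the last validation.
    return screenshot(diagram), report
```

Returning the final report alongside the screenshot lets the caller see whether the loop converged or simply ran out of iterations.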

Q2 — Codify Rules

Semantic shape-type rules

Add a reference file mapping concept types to shapes: decisions = diamond, start/end = ellipse, external = cloud, data = pill. Enforce in SKILL.md workflow.

Partial automation: detect mismatches (rectangle for a decision) and suggest corrections
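
The mismatch detection could look roughly like the following. The mapping mirrors the four rules named above; the node format (`id`, `concept`, `shape`) and the concept keys are hypothetical, and tagging each node with a concept still has to happen upstream, which is the part that requires understanding intent.

```python
# Concept type -> expected shape, taken from the rules above.
EXPECTED_SHAPE = {
    "decision": "diamond",
    "start_end": "ellipse",
    "external": "cloud",
    "data": "pill",
}


def shape_mismatches(nodes):
    """Return (id, actual_shape, expected_shape) for every rule violation."""
    mismatches = []
    for node in nodes:
        expected = EXPECTED_SHAPE.get(node["concept"])
        # Concepts without a rule are skipped instead of flagged.
        if expected is not None and node["shape"] != expected:
            mismatches.append((node["id"], node["shape"], expected))
    return mismatches
```

For example, a node tagged `decision` but drawn as a `rectangle` yields one entry suggesting `diamond`.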

Q2 — Codify Rules

Color palette enforcement

Pre-define palettes per diagram type. Validate that actual colors match the palette. Flag violations.

Rule: max 5 colors per diagram, all from the semantic color system in SKILL.md
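
A sketch of how this rule might be checked. The `shape_colors` mapping (shape id to color name) and the example palette are hypothetical; the real palette would come from the semantic color system in SKILL.md.

```python
def palette_violations(shape_colors, palette, max_colors=5):
    """Flag colors outside the palette and diagrams using too many colors."""
    violations = [
        f"{shape_id}: {color} not in palette"
        for shape_id, color in sorted(shape_colors.items())
        if color not in palette
    ]
    used = set(shape_colors.values())
    if len(used) > max_colors:
        violations.append(f"{len(used)} colors used, max is {max_colors}")
    return violations
```

An empty result means the diagram satisfies both halves of the rule.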

Q3 — Screenshot Loop

Visual verification with LLM

After drawing, take a screenshot and ask the LLM to evaluate aesthetics. Score on a rubric. Iterate if score is low.

Proxy metric: LLM aesthetic score (1-5) on the screenshot. Goodhart risk: the model may optimize for what it thinks looks good, not for what humans prefer.

Q4 — Human Judgment

Content completeness

Whether all the right components are shown depends on understanding the system being diagrammed. This remains a human judgment.

Assist only: checklist in scaffolded doc ("have you included: auth? logging? error paths?")

Meta — ShinkaEvolve Application

Evolving the tldraw skill itself

The Q1 leaves have clear fitness functions. ShinkaEvolve could evolve the skill's layout parameters, color palettes, and spacing defaults by treating each diagram as a fitness evaluation.

Mutate: LLM proposes a change to SKILL.md or a script
→ Draw: generate a test diagram with the change
→ Validate: run structural and layout scoring
→ Screenshot: visual check via LLM scorer
→ Score: composite fitness across Q1 + Q2 + Q3
→ Select: UCB1 bandit keeps winners
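
For the selection step, the textbook UCB1 rule picks the candidate with the highest mean reward plus an exploration bonus that shrinks as the candidate is tried more often. The sketch below is a generic UCB1 implementation, not ShinkaEvolve's code; the `stats` format and the assumption that fitness is normalized to [0, 1] are hypothetical.

```python
import math


def ucb1_select(stats, c=math.sqrt(2)):
    """Pick the next variant to evaluate.

    stats maps variant name -> (pulls, total_reward), with rewards in [0, 1].
    """
    # Try every variant once before comparing confidence bounds.
    for name, (pulls, _) in stats.items():
        if pulls == 0:
            return name

    total_pulls = sum(pulls for pulls, _ in stats.values())

    def upper_bound(item):
        pulls, total_reward = item[1]
        mean_reward = total_reward / pulls
        bonus = c * math.sqrt(math.log(total_pulls) / pulls)
        return mean_reward + bonus

    return max(stats.items(), key=upper_bound)[0]
```

With `c = sqrt(2)` the bonus equals the standard `sqrt(2 ln t / n)` term; a smaller `c` favors exploiting known winners, a larger one favors exploring rarely tried variants.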

What could be evolved

Layout defaults — optimal gap sizes (currently hardcoded at 60px), aspect ratios, alignment strategies
Color palettes — which semantic color assignments produce the most readable diagrams
Diagram patterns — the recipes in references/diagram-patterns.md could be evolved against readability scores
Validation thresholds — what counts as "too close" or "misaligned" in validate-diagram.sh

The Punchline

Before and after decomposition

Before: "Make better diagrams"

Vague, subjective, no clear lever
Can't tell if a change helped
Can't automate improvement
Every diagram is a fresh struggle
Skill improvements are guesswork

→

After: Five evaluable dimensions

2 dimensions fully automatable (Q1)
2 dimensions rule-enforceable (Q2)
1 dimension needs human review (Q3-Q4)
Each has a specific fitness function
Skill can evolve via ShinkaEvolve loop

The Framework in Action

This is exactly what the metric-decomposer does: takes a vague goal ("better diagrams"), decomposes it into leaves with different evaluability levels, and tells you which parts an agent can optimize autonomously (structure, layout), which need rules (semantics, color), and which remain human (content completeness). The decomposition itself is the value — it turns an impossible optimization problem into five tractable ones.