Metric Decomposer — Applied

Improving tldraw Diagram Quality

Applying the decomposition framework to turn "make better diagrams" into evaluable, optimizable sub-metrics

Step 1 — MART Pre-Check

Scoring "Diagram Quality" on MART

Before decomposing, score the top-level metric. "Diagram Quality" is vague — it fails MART immediately.

Property | Score | Problem
Measurable | 2/5 | What IS quality? Subjective, no formula
Actionable | 1/5 | "Make it better": which lever? Layout? Colors? Content?
Relevant | 4/5 | Yes, we do want better diagrams
Timely | 3/5 | Can evaluate per-diagram, but manually
Total | 10/20 | Decomposition needed

Verdict

"Diagram Quality" scores 10/20 — well below the threshold. Actionability (1/5) is the critical gap: there's no single lever to pull. This metric must be decomposed.

Steps 2-4 — Decompose

"Diagram Quality" → Five Measurable Dimensions

Using a custom decomposition (quality isn't a funnel or stock-flow), we identify five independent dimensions that jointly determine whether a diagram is "good."

Decomposition Tree with Evaluability Scores
Diagram Quality (MART: 10/20)
Quality = f(Structure, Layout, Semantics, Aesthetics, Completeness)

  • Structural Correctness (Q1 · 0.92): no overlaps, arrows bound, identity holds
  • Layout Quality (Q1 · 0.83): alignment, spacing, no crossings
  • Semantic Clarity (Q2 · 0.67): right shapes, labels, colors for meaning
  • Visual Aesthetics (Q2 · 0.58): color harmony, fill consistency, style
  • Content Completeness (Q3 · 0.42): all components shown, nothing missing
Steps 5-6 — Score Evaluability

Scoring each leaf for auto-research readiness

Each leaf is scored on six evaluability criteria: clarity, data, latency, cost, alignment, and verification. The question: which leaves can an agent optimize autonomously?

Leaf Metric | Clarity | Data | Latency | Cost | Alignment | Verification | Score | Quadrant
Structural Correctness | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.92 | Q1
Layout Quality | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 | 0.83 | Q1
Semantic Clarity | 0.5 | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.67 | Q2
Visual Aesthetics | 0.5 | 1.0 | 1.0 | 0.5 | 0.5 | 0 | 0.58 | Q3
Content Completeness | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0.42 | Q3

Key Finding

Two leaves are Q1 (auto-researchable): Structural Correctness and Layout Quality. These can be verified programmatically — the validate-diagram.sh script already checks overlaps and unbound arrows. An agent can iterate on these without human judgment. The other three need human-in-the-loop or further decomposition.
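The Score column in the table above is consistent with an unweighted mean of the six criteria, rounded to two decimals. The sketch below reproduces it under that assumption; the `evaluability` function name is illustrative, and the rule that maps a score to a quadrant is not stated in the source, so it is left out.

```python
# Criterion values copied from the evaluability table, in column order:
# clarity, data, latency, cost, alignment, verification.
LEAVES = {
    "Structural Correctness": (1.0, 1.0, 1.0, 1.0, 1.0, 0.5),
    "Layout Quality":         (1.0, 1.0, 1.0, 1.0, 0.5, 0.5),
    "Semantic Clarity":       (0.5, 1.0, 1.0, 0.5, 0.5, 0.5),
    "Visual Aesthetics":      (0.5, 1.0, 1.0, 0.5, 0.5, 0.0),
    "Content Completeness":   (0.5, 0.5, 0.5, 0.5, 0.5, 0.0),
}


def evaluability(criteria):
    """Unweighted mean of the criterion values (assumed aggregation)."""
    return round(sum(criteria) / len(criteria), 2)


for name, criteria in LEAVES.items():
    print(f"{name}: {evaluability(criteria)}")
```

Run as written, this prints 0.92, 0.83, 0.67, 0.58 and 0.42, matching the table.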

Deep Dive

What makes each leaf evaluable (or not)

Q1: Structural Correctness (0.92)

  • No overlapping shapes (programmatic check)
  • All arrows bound to shapes (API verifiable)
  • Shape count matches plan (countable)
  • No orphaned shapes (graph traversal)
  • Zoom-to-fit after drawing (deterministic)

Already automated via validate-diagram.sh
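To make the checks above concrete, here is a minimal sketch in Python. It is not the contents of validate-diagram.sh, which is not shown here. The `Shape` and `Arrow` records and the `structural_issues` function are hypothetical, and "orphaned" is read in its simplest form as "no arrow attached".

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass
class Shape:          # hypothetical record: axis-aligned bounding box
    id: str
    x: float
    y: float
    w: float
    h: float


@dataclass
class Arrow:          # hypothetical record: ids of bound shapes, None if unbound
    start: Optional[str]
    end: Optional[str]


def overlaps(a, b):
    """True when the two bounding boxes share interior area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def structural_issues(shapes, arrows, planned_count):
    issues = []
    for a, b in combinations(shapes, 2):
        if overlaps(a, b):
            issues.append(f"overlap: {a.id}/{b.id}")
    ids = {s.id for s in shapes}
    connected = set()
    for i, arrow in enumerate(arrows):
        if arrow.start not in ids or arrow.end not in ids:
            issues.append(f"unbound arrow: #{i}")
        connected.update(e for e in (arrow.start, arrow.end) if e in ids)
    for s in shapes:
        if s.id not in connected:
            issues.append(f"orphan: {s.id}")
    if len(shapes) != planned_count:
        issues.append(f"shape count {len(shapes)} != planned {planned_count}")
    return issues
```

An empty list means the diagram passes every check, which is the `issue_count == 0` condition a fitness function can test.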

Q1: Layout Quality (0.83)

  • Shapes aligned on grid (measurable offset)
  • Even spacing between shapes (calculable)
  • No arrow crossings (pairwise segment-intersection count)
  • Flow direction consistent (L-to-R or T-to-B)
  • Bounding box aspect ratio reasonable

Can be scored by reading shape coordinates
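As an illustration of scoring from coordinates alone, the sketch below computes three of the quantities listed above. All function names are hypothetical, arrows are approximated as straight segments, and how these raw numbers are normalized into a 0-1 layout score is left open.

```python
from itertools import combinations
from statistics import pvariance


def grid_offset(coords, grid):
    """Mean distance from each coordinate to its nearest grid line."""
    return sum(min(c % grid, grid - c % grid) for c in coords) / len(coords)


def gap_variance(boxes):
    """Population variance of horizontal gaps; boxes are (x, width) pairs."""
    ordered = sorted(boxes)
    gaps = [b[0] - (a[0] + a[1]) for a, b in zip(ordered, ordered[1:])]
    return pvariance(gaps) if len(gaps) > 1 else 0.0


def _orient(p, q, r):
    """Sign of the turn p -> q -> r (cross product of the two edges)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(s1, s2):
    """Strict crossing test: segments that only touch do not count."""
    (a, b), (c, d) = s1, s2
    return (_orient(a, b, c) * _orient(a, b, d) < 0 and
            _orient(c, d, a) * _orient(c, d, b) < 0)


def crossing_count(segments):
    """Number of arrow pairs whose straight-line segments cross."""
    return sum(segments_cross(s, t) for s, t in combinations(segments, 2))
```

The strict inequality means two arrows leaving the same shape edge are not reported as a crossing.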

Q2: Semantic Clarity (0.67)

  • ~ Right shape type for concept (diamonds for decisions)
  • ~ Labels are descriptive, not truncated
  • ~ Color encodes meaning (semantic color system)
  • ~ Arrow labels explain relationships

Partially automatable — can check conventions, but "right shape for concept" requires understanding intent

Q3: Visual Aesthetics (0.58)

  • ? Color harmony (max 4-5 colors)
  • ? Consistent fill style across similar shapes
  • ? Visual hierarchy guides the eye
  • ? "Looks professional" (subjective)

Verification gap — can enforce rules (max colors, consistent fills) but can't verify "looks good" without human or vision model
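The enforceable half of this leaf can be sketched as a rule check. The example below covers fill consistency only, and it assumes "similar shapes" means shapes of the same type; the `inconsistent_fills` function and the (type, fill) input format are hypothetical.

```python
from collections import defaultdict


def inconsistent_fills(shapes):
    """shapes: iterable of (shape_type, fill) pairs.

    Returns {shape_type: sorted fills} for every type drawn with more
    than one fill style. An empty dict means fills are consistent.
    """
    fills = defaultdict(set)
    for shape_type, fill in shapes:
        fills[shape_type].add(fill)
    return {t: sorted(f) for t, f in fills.items() if len(f) > 1}
```

A check like this says nothing about whether the result looks good; it only removes one class of avoidable inconsistency before a human or vision model reviews the screenshot.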

Step 8 — Action Plan

How to improve tldraw skill at each quadrant

Q1 — Automate

Enhance validate-diagram.sh

Add layout quality checks: alignment deviation score, spacing variance, arrow crossing count, aspect ratio. Run automatically after every draw.

Fitness function: issue_count == 0 AND alignment_score > 0.9 AND spacing_variance < 15px
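Written as code, the fitness function above is a single predicate. This is a sketch: the `Report` record is hypothetical, and the thresholds are the ones quoted in the text, exposed as parameters so they can be tuned.

```python
from dataclasses import dataclass


@dataclass
class Report:                 # hypothetical container for validator output
    issue_count: int          # structural issues found
    alignment_score: float    # 0..1, higher is better
    spacing_variance: float   # in pixels, as quoted in the text


def passes_fitness(report, min_alignment=0.9, max_spacing_variance=15.0):
    """True when the diagram meets all three conditions."""
    return (report.issue_count == 0
            and report.alignment_score > min_alignment
            and report.spacing_variance < max_spacing_variance)
```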
Q1 — Automate

Auto-fix layout pass

After validation, automatically apply align/stack/distribute to fix detected issues. Iterate until score converges.

Loop: draw → validate → fix → validate → screenshot (max 3 iterations)
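The loop above can be sketched as a small driver. The four callables are placeholders for whatever the skill actually uses to draw, validate, fix and capture; none of them is a real tldraw API. Here "converged" is simplified to "no issues left".

```python
def refine(draw, validate, fix, screenshot, max_iterations=3):
    """Draw once, then alternate fix/validate until clean or the cap is hit.

    validate() must return a list of issues (empty when clean).
    Returns (remaining_issues, fix_passes_used).
    """
    draw()
    issues = validate()
    passes = 0
    while issues and passes < max_iterations:
        fix(issues)          # e.g. align / stack / distribute
        issues = validate()  # re-check after every fix pass
        passes += 1
    screenshot()             # always capture the final state
    return issues, passes
```

Returning the remaining issues lets the caller distinguish a clean result from one that stopped because it ran out of iterations.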
Q2 — Codify Rules

Semantic shape-type rules

Add a reference file mapping concept types to shapes: decisions = diamond, start/end = ellipse, external = cloud, data = pill. Enforce in SKILL.md workflow.

Partial automation: detect mismatches (rectangle for a decision) and suggest corrections
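A mismatch detector along these lines might look like the sketch below. The concept-to-shape mapping is copied from the text; the concept tags themselves (`"decision"`, `"start_end"`, and so on) and the `shape_mismatches` function are hypothetical, and something upstream still has to assign a concept to each node.

```python
# Mapping taken from the rule above; the key names are illustrative.
EXPECTED_SHAPE = {
    "decision": "diamond",
    "start_end": "ellipse",
    "external": "cloud",
    "data": "pill",
}


def shape_mismatches(nodes):
    """nodes: iterable of (label, concept, shape) triples.

    Returns (label, actual_shape, expected_shape) for each node whose
    concept has a rule and whose shape does not follow it.
    """
    mismatches = []
    for label, concept, shape in nodes:
        expected = EXPECTED_SHAPE.get(concept)
        if expected is not None and shape != expected:
            mismatches.append((label, shape, expected))
    return mismatches
```

Concepts without a rule are skipped rather than flagged, so the check only ever suggests corrections it can justify.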
Q2 — Codify Rules

Color palette enforcement

Pre-define palettes per diagram type. Validate that actual colors match the palette. Flag violations.

Rule: max 5 colors per diagram, all from the semantic color system in SKILL.md
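The rule above reduces to two set checks. In this sketch the palette is passed in by the caller, because the actual semantic color system in SKILL.md is not reproduced here; the `palette_violations` function is hypothetical.

```python
def palette_violations(used_colors, palette, max_colors=5):
    """Return a list of human-readable violations (empty when compliant)."""
    distinct = set(used_colors)
    violations = [f"off-palette color: {c}"
                  for c in sorted(distinct - set(palette))]
    if len(distinct) > max_colors:
        violations.append(f"too many colors: {len(distinct)} > {max_colors}")
    return violations
```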
Q3 — Screenshot Loop

Visual verification with LLM

After drawing, take a screenshot and ask the LLM to evaluate aesthetics. Score on a rubric. Iterate if score is low.

Proxy metric: LLM aesthetic score (1-5) on the screenshot. Goodhart risk: the loop may optimize for what the model thinks looks good rather than what humans prefer.
Q4 — Human Judgment

Content completeness

Whether all the right components are shown depends on understanding the system being diagrammed. This remains a human domain.

Assist only: checklist in scaffolded doc ("have you included: auth? logging? error paths?")
Meta — ShinkaEvolve Application

Evolving the tldraw skill itself

The Q1 leaves have clear fitness functions. ShinkaEvolve could evolve the skill's layout parameters, color palettes, and spacing defaults by treating each diagram as a fitness evaluation.

  • Mutate: LLM proposes change to SKILL.md or script
  • Draw: generate a test diagram with change
  • Validate: run structural + layout scoring
  • Screenshot: visual check via LLM scorer
  • Score: composite fitness Q1 + Q2 + Q3
  • Select: UCB1 bandit keeps winners
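For readers unfamiliar with the selection step, the sketch below shows textbook UCB1. It is not ShinkaEvolve's code, and how that tool defines its arms and rewards is not covered here. Each "arm" is assumed to be a candidate variant, and the reward is assumed to be its composite fitness scaled to 0-1.

```python
import math


def ucb1_select(counts, totals, c=math.sqrt(2)):
    """Return the index of the arm to evaluate next.

    counts[i]: how many times arm i has been evaluated
    totals[i]: sum of rewards observed for arm i
    """
    # Try every arm once before comparing bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total_pulls = sum(counts)

    def upper_bound(i):
        mean = totals[i] / counts[i]
        bonus = c * math.sqrt(math.log(total_pulls) / counts[i])
        return mean + bonus

    return max(range(len(counts)), key=upper_bound)
```

The bonus term shrinks as an arm is evaluated more often, so strong variants are revisited while rarely tested ones still get a chance.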
What could be evolved

Layout defaults — optimal gap sizes (currently hardcoded at 60px), aspect ratios, alignment strategies
Color palettes — which semantic color assignments produce the most readable diagrams
Diagram patterns — the recipes in references/diagram-patterns.md could be evolved against readability scores
Validation thresholds — what counts as "too close" or "misaligned" in validate-diagram.sh

The Punchline

Before and after decomposition

Before: "Make better diagrams"

  • Vague, subjective, no clear lever
  • Can't tell if a change helped
  • Can't automate improvement
  • Every diagram is a fresh struggle
  • Skill improvements are guesswork

After: Five evaluable dimensions

  • 2 dimensions fully automatable (Q1)
  • 2 dimensions rule-enforceable (Q2)
  • 1 dimension needs human review (Q3-Q4)
  • Each has a specific fitness function
  • Skill can evolve via ShinkaEvolve loop
The Framework in Action

This is exactly what the metric-decomposer does: takes a vague goal ("better diagrams"), decomposes it into leaves with different evaluability levels, and tells you which parts an agent can optimize autonomously (structure, layout), which need rules (semantics, color), and which remain human (content completeness). The decomposition itself is the value — it turns an impossible optimization problem into five tractable ones.