Applying the decomposition framework to turn "make better diagrams" into evaluable, optimizable sub-metrics
Before decomposing, score the top-level metric. "Diagram Quality" is vague — it fails MART immediately.
| Property | Score | Problem |
|---|---|---|
| Measurable | 2/5 | What IS quality? Subjective, no formula |
| Actionable | 1/5 | "Make it better" — which lever? Layout? Colors? Content? |
| Relevant | 4/5 | Yes, we do want better diagrams |
| Timely | 3/5 | Can evaluate per-diagram, but manually |
| Total | 10/20 | Decomposition needed |
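The tally in the table can be reproduced with a few lines of Python. This is a minimal sketch: the cutoff below which a metric gets decomposed is not given in the text, so `ASSUMED_THRESHOLD` is a hypothetical value, and `assess` is an illustrative helper rather than part of any existing tool.

```python
# MART ratings copied from the table above (each property is rated 1-5).
mart_scores = {
    "Measurable": 2,
    "Actionable": 1,
    "Relevant": 4,
    "Timely": 3,
}

# Hypothetical cutoff: the source does not state the actual threshold value.
ASSUMED_THRESHOLD = 14


def assess(scores, threshold, max_per_property=5):
    """Return total, maximum possible total, weakest property, and a verdict."""
    total = sum(scores.values())
    maximum = max_per_property * len(scores)
    weakest = min(scores, key=scores.get)  # property with the lowest rating
    needs_decomposition = total < threshold
    return total, maximum, weakest, needs_decomposition


total, maximum, weakest, needs_decomposition = assess(mart_scores, ASSUMED_THRESHOLD)
print(f"{total}/{maximum}, weakest: {weakest}, decompose: {needs_decomposition}")
```

Reporting the weakest property alongside the total is a deliberate choice: the total says whether to decompose, while the weakest property says which gap the decomposition has to close.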
"Diagram Quality" scores 10/20 — well below the threshold. Actionability (1/5) is the critical gap: there's no single lever to pull. This metric must be decomposed.
Using a custom decomposition (quality isn't a funnel or stock-flow), we identify five independent dimensions that jointly determine whether a diagram is "good."
Each leaf metric is scored on six evaluability criteria: clarity, data, latency, cost, alignment, and verification. The question: which leaves can an agent optimize autonomously?
| Leaf Metric | Clarity | Data | Latency | Cost | Alignment | Verification | Score | Quadrant |
|---|---|---|---|---|---|---|---|---|
| Structural Correctness | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.92 | Q1 |
| Layout Quality | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 | 0.83 | Q1 |
| Semantic Clarity | 0.5 | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.67 | Q2 |
| Visual Aesthetics | 0.5 | 1.0 | 1.0 | 0.5 | 0.5 | 0 | 0.58 | Q3 |
| Content Completeness | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0.42 | Q3 |
Two leaves are Q1 (auto-researchable): Structural Correctness and Layout Quality. These can be verified programmatically — the validate-diagram.sh script already checks overlaps and unbound arrows. An agent can iterate on these without human judgment. The other three need human-in-the-loop or further decomposition.
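The Score column is consistent with an unweighted mean of the six ratings, which the sketch below reproduces. The quadrant cutoffs (`ASSUMED_Q1_MIN`, `ASSUMED_Q2_MIN`) are assumptions chosen to agree with the table, since the text does not state where the quadrant boundaries lie.

```python
CRITERIA = ("clarity", "data", "latency", "cost", "alignment", "verification")

# Ratings copied from the table, in the order given by CRITERIA.
leaves = {
    "Structural Correctness": (1.0, 1.0, 1.0, 1.0, 1.0, 0.5),
    "Layout Quality": (1.0, 1.0, 1.0, 1.0, 0.5, 0.5),
    "Semantic Clarity": (0.5, 1.0, 1.0, 0.5, 0.5, 0.5),
    "Visual Aesthetics": (0.5, 1.0, 1.0, 0.5, 0.5, 0.0),
    "Content Completeness": (0.5, 0.5, 0.5, 0.5, 0.5, 0.0),
}

# Assumed cutoffs (not stated in the source); they reproduce the table's quadrants.
ASSUMED_Q1_MIN = 0.80
ASSUMED_Q2_MIN = 0.60


def evaluability(ratings):
    """Unweighted mean of the six criterion ratings."""
    return sum(ratings) / len(ratings)


def quadrant(score):
    """Map an evaluability score to a quadrant label using the assumed cutoffs."""
    if score >= ASSUMED_Q1_MIN:
        return "Q1"
    if score >= ASSUMED_Q2_MIN:
        return "Q2"
    return "Q3"


results = {
    name: (round(evaluability(r), 2), quadrant(evaluability(r)))
    for name, r in leaves.items()
}
for name, (score, quad) in results.items():
    print(f"{name}: {score} ({quad})")
```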
Structural Correctness: already automated via validate-diagram.sh.
Layout Quality: can be scored by reading shape coordinates.
Semantic Clarity: partially automatable. Conventions can be checked, but "right shape for concept" requires understanding intent.
Visual Aesthetics: verification gap. Rules can be enforced (max colors, consistent fills), but "looks good" cannot be verified without a human or a vision model.
Add layout quality checks: alignment deviation score, spacing variance, arrow crossing count, aspect ratio. Run automatically after every draw.
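The four checks named above can each be computed from shape coordinates alone. The following is a sketch under stated assumptions: the `Box` type, the representation of arrows as straight point pairs, and the exact metric definitions (standard deviation of vertical centers, variance of horizontal gaps) are illustrative choices, not what validate-diagram.sh actually does.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import pstdev, pvariance


@dataclass(frozen=True)
class Box:
    """Hypothetical shape record: top-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float


def alignment_deviation(row):
    """Std dev of vertical centers for shapes meant to share a row (0 = aligned)."""
    return pstdev([b.y + b.h / 2 for b in row])


def spacing_variance(row):
    """Variance of the horizontal gaps between neighboring shapes (0 = even)."""
    ordered = sorted(row, key=lambda b: b.x)
    gaps = [nxt.x - (cur.x + cur.w) for cur, nxt in zip(ordered, ordered[1:])]
    return pvariance(gaps) if gaps else 0.0


def _orient(p, q, r):
    """Signed area test: positive if r lies to the left of the line p -> q."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def _segments_cross(a, b):
    """True if two segments properly intersect (touching endpoints do not count)."""
    (p1, p2), (p3, p4) = a, b
    return (_orient(p3, p4, p1) * _orient(p3, p4, p2) < 0
            and _orient(p1, p2, p3) * _orient(p1, p2, p4) < 0)


def arrow_crossings(arrows):
    """Count pairs of straight arrows that cross each other."""
    return sum(_segments_cross(a, b) for a, b in combinations(arrows, 2))


def aspect_ratio(shapes):
    """Width divided by height of the bounding box around all shapes."""
    left = min(b.x for b in shapes)
    right = max(b.x + b.w for b in shapes)
    top = min(b.y for b in shapes)
    bottom = max(b.y + b.h for b in shapes)
    return (right - left) / (bottom - top)


# Three boxes with even 60px gaps; the third sits 4px lower than the others.
row = [Box(0, 0, 100, 50), Box(160, 0, 100, 50), Box(320, 4, 100, 50)]
arrows = [((0, 0), (10, 10)), ((0, 10), (10, 0)), ((20, 0), (30, 0))]
print(alignment_deviation(row), spacing_variance(row),
      arrow_crossings(arrows), aspect_ratio(row))
```

Keeping each check as a separate function that returns a number, rather than a pass/fail flag, leaves the choice of thresholds to the caller.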
After validation, automatically apply align/stack/distribute to fix detected issues. Iterate until score converges.
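One way to picture the fix-and-revalidate loop is shown below. It is a simplified sketch: shapes are reduced to their horizontal positions, `distribute` stands in for the skill's real align/stack/distribute operations, and the stopping rule (stop when the score no longer improves by more than `tolerance`) is an assumed definition of "converges".

```python
from statistics import pvariance


def gap_variance(xs):
    """Score to minimize: variance of the gaps between sorted positions."""
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pvariance(gaps) if gaps else 0.0


def distribute(xs):
    """Hypothetical fix: space positions evenly between the two outermost shapes."""
    ordered = sorted(xs)
    if len(ordered) < 3:
        return ordered
    step = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return [ordered[0] + i * step for i in range(len(ordered))]


def fix_until_converged(state, score, fix, tolerance=1e-6, max_rounds=10):
    """Apply `fix` repeatedly, keeping a result only if it lowers the score."""
    current = score(state)
    history = [current]
    for _ in range(max_rounds):
        candidate = fix(state)
        new = score(candidate)
        if current - new <= tolerance:  # no meaningful improvement: stop
            break
        state, current = candidate, new
        history.append(current)
    return state, history


positions, history = fix_until_converged([0, 50, 200, 300], gap_variance, distribute)
print(positions, history)
```

The `max_rounds` cap guards against fixes that oscillate, and rejecting non-improving candidates means the loop can never leave the diagram worse than it found it.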
Add a reference file mapping concept types to shapes: decisions = diamond, start/end = ellipse, external = cloud, data = pill. Enforce in SKILL.md workflow.
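The concept-to-shape mapping lends itself to a lookup table plus a check. In this sketch the mapping values come from the text, while the node format (dictionaries with `id`, `concept`, and `shape` keys) and the function name are assumptions made for illustration.

```python
# Mapping taken from the text: concept type -> expected shape.
SHAPE_FOR_CONCEPT = {
    "decision": "diamond",
    "start": "ellipse",
    "end": "ellipse",
    "external": "cloud",
    "data": "pill",
}


def shape_violations(nodes):
    """Return (node id, actual shape, expected shape) for every mismatch.

    Nodes whose concept type has no entry in the mapping are skipped,
    because there is no rule to enforce for them.
    """
    violations = []
    for node in nodes:
        expected = SHAPE_FOR_CONCEPT.get(node["concept"])
        if expected is not None and node["shape"] != expected:
            violations.append((node["id"], node["shape"], expected))
    return violations


# Hypothetical diagram: n2 is a decision drawn as a rectangle.
nodes = [
    {"id": "n1", "concept": "start", "shape": "ellipse"},
    {"id": "n2", "concept": "decision", "shape": "rectangle"},
    {"id": "n3", "concept": "process", "shape": "rectangle"},
]
violations = shape_violations(nodes)
print(violations)
```

Note that the check only works if each node carries a concept label, which is exactly the part that requires understanding intent.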
Pre-define palettes per diagram type. Validate that actual colors match the palette. Flag violations.
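Palette validation reduces to a set difference. The palettes below are placeholders: the diagram type names and hex values in `ASSUMED_PALETTES` are invented for the example and do not reflect the skill's real palettes.

```python
# Placeholder palettes: type names and hex values are assumptions for illustration.
ASSUMED_PALETTES = {
    "flowchart": {"#1f77b4", "#ff7f0e", "#2ca02c"},
    "architecture": {"#4c72b0", "#dd8452", "#55a868", "#c44e52"},
}


def palette_violations(diagram_type, used_colors):
    """Return the sorted colors used in a diagram that are not in its palette.

    Comparison is case-insensitive so "#1F77B4" and "#1f77b4" count as equal.
    """
    allowed = {c.lower() for c in ASSUMED_PALETTES[diagram_type]}
    return sorted({c.lower() for c in used_colors} - allowed)


off_palette = palette_violations("flowchart", ["#1F77B4", "#d62728", "#2ca02c"])
print(off_palette)
```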
After drawing, take a screenshot and ask the LLM to evaluate aesthetics. Score on a rubric. Iterate if score is low.
Whether all the right components are shown depends on understanding the system being diagrammed. This remains a human domain.
The Q1 leaves have clear fitness functions. ShinkaEvolve could evolve the skill's layout parameters, color palettes, and spacing defaults by treating each diagram as a fitness evaluation.
- Layout defaults: optimal gap sizes (currently hardcoded at 60px), aspect ratios, alignment strategies
- Color palettes: which semantic color assignments produce the most readable diagrams
- Diagram patterns: the recipes in references/diagram-patterns.md could be evolved against readability scores
- Validation thresholds: what counts as "too close" or "misaligned" in validate-diagram.sh
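To make the idea of evolving a layout default concrete, here is a generic mutate-and-keep-the-best loop over a single parameter, the gap size. It is not ShinkaEvolve's API, which the text does not describe. The fitness function is a toy stand-in with an arbitrary optimum; a real one would draw diagrams with the candidate gap and score them with the layout checks.

```python
import random


def toy_fitness(gap_px):
    """Stand-in fitness (higher is better).

    ASSUMED_BEST_GAP is arbitrary and exists only so the example has
    something to climb toward; it is not a measured result.
    """
    ASSUMED_BEST_GAP = 48.0
    return -(gap_px - ASSUMED_BEST_GAP) ** 2


def evolve(start, fitness, generations=200, sigma=4.0, seed=0):
    """Mutate the current best value and keep a child only if it scores higher."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    best, best_fit = start, fitness(start)
    for _ in range(generations):
        child = max(1.0, best + rng.gauss(0.0, sigma))  # gaps stay positive
        child_fit = fitness(child)
        if child_fit > best_fit:
            best, best_fit = child, child_fit
    return best, best_fit


# Start from the currently hardcoded default of 60px.
best_gap, best_fit = evolve(60.0, toy_fitness)
print(best_gap, best_fit)
```

Because a child replaces the parent only when it scores strictly higher, the result can never be worse than the starting default, which makes this a safe baseline before trying a full evolutionary framework.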
This is exactly what the metric-decomposer does: takes a vague goal ("better diagrams"), decomposes it into leaves with different evaluability levels, and tells you which parts an agent can optimize autonomously (structure, layout), which need rules (semantics, color), and which remain human (content completeness). The decomposition itself is the value — it turns an impossible optimization problem into five tractable ones.