What Claude-Generated Diagrams Actually Look Like Across Four Tools
I used Claude to generate diagram source code across Graphviz, D2, Pikchr, and Mermaid for the same ML subjects. This is what came out, what held up, and what broke.
The previous post covers the toolchain. This one is about using Claude to generate the actual diagram source, what that process looked like, and what the rendered outputs are.
The short version: Claude generates syntactically correct diagram code reliably. The quality of what it generates, meaning whether the layout is readable, whether the annotations are accurate, whether the diagram actually communicates the concept, varies a lot by tool and by subject.
How the generation worked
For each diagram I gave Claude the subject and the tool. For ML algorithm diagrams (CNN, SVM, Viterbi, backpropagation) I also gave the mathematical context. For infrastructure diagrams I described the architecture.
Claude produced complete source files. I fed them to the renderer, looked at the output, and iterated. Most diagrams took two or three rounds. A few took more. The Pikchr SVM diagram, which required placing scatter points at specific coordinates with a diagonal hyperplane at the right angle, took the most back-and-forth, because Pikchr requires exact geometry and there is no layout engine to fall back on.
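The render step in that loop is easy to script. The following is a minimal sketch, not the setup used for this post: the tool keys, the `build_command` and `render` helpers, and the output naming are assumptions made for illustration, and the flags are the commonly documented ones for each CLI, so check them against the versions you have installed.

```python
import subprocess
from pathlib import Path

# Hypothetical command templates, one per tool. Each takes the source path
# and the desired SVG path and returns an argv list.
RENDERERS = {
    "graphviz": lambda src, out: ["dot", "-Tsvg", src, "-o", out],
    "d2": lambda src, out: ["d2", "--layout=elk", src, out],
    "mermaid": lambda src, out: ["mmdc", "-i", src, "-o", out],
    # The pikchr CLI writes SVG to stdout, so no output path is passed.
    "pikchr": lambda src, out: ["pikchr", "--svg-only", src],
}


def build_command(tool: str, src: str) -> list:
    """Return the argv list that renders `src` to an SVG next to it."""
    out = str(Path(src).with_suffix(".svg"))
    return RENDERERS[tool](src, out)


def render(tool: str, src: str):
    """Run the renderer and return (exit code, stderr) for the next iteration."""
    result = subprocess.run(build_command(tool, src), capture_output=True, text=True)
    if tool == "pikchr" and result.returncode == 0:
        Path(src).with_suffix(".svg").write_text(result.stdout)
    return result.returncode, result.stderr
```

Feeding the returned stderr back into the next prompt is what makes the two or three rounds quick: syntax errors come back with line numbers, and layout problems are whatever is left once the exit code is zero.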
VGG-16 Convolutional Neural Network
The CNN architecture is a natural fit for Graphviz. It is a pure DAG, one direction, layered blocks, no geometry needed. Claude generated a clean DOT file on the first try. The subgraph clusters map directly onto conv blocks, parameter counts go in the labels, and the dot engine handles all placement.
digraph CNN {
rankdir=TB
fontname="monospace"
bgcolor="#fafafa"
nodesep=0.35
ranksep=0.7
label="VGG-16 Architecture — Convolutional Neural Network"
labelloc=t
node [fontname="monospace" fontsize=9 style=filled shape=rect]
edge [fontname="monospace" fontsize=8 color="#555555"]
node [fillcolor="#dbeafe"]
input [label="INPUT | 224 × 224 × 3 | RGB image"]
subgraph cluster_b1 {
label="Block 1 — 38,720 params"
style=filled fillcolor="#f0fdf4" color="#86efac"
node [fillcolor="#bbf7d0"]
b1c1 [label="Conv2D | 64 filters | 3×3 | ReLU | 224×224×64"]
b1c2 [label="Conv2D | 64 filters | 3×3 | ReLU | RF: 5×5"]
b1p [label="MaxPool | 2×2, stride=2 | 112×112×64"]
b1c1 -> b1c2 -> b1p
}
subgraph cluster_b2 {
label="Block 2 — 221,440 params"
style=filled fillcolor="#eff6ff" color="#93c5fd"
node [fillcolor="#bfdbfe"]
b2c1 [label="Conv2D | 128 filters | 3×3 | ReLU | 112×112×128"]
b2c2 [label="Conv2D | 128 filters | 3×3 | ReLU | RF: 14×14"]
b2p [label="MaxPool | 2×2, stride=2 | 56×56×128"]
b2c1 -> b2c2 -> b2p
}
subgraph cluster_b3 {
label="Block 3 — 1,475,328 params"
style=filled fillcolor="#fff7ed" color="#fdba74"
node [fillcolor="#fed7aa"]
b3c1 [label="Conv2D | 256 filters | 3×3 | ReLU | 56×56×256"]
b3c2 [label="Conv2D | 256 filters | 3×3 | ReLU"]
b3c3 [label="Conv2D | 256 filters | 3×3 | ReLU | RF: 40×40"]
b3p [label="MaxPool | 2×2, stride=2 | 28×28×256"]
b3c1 -> b3c2 -> b3c3 -> b3p
}
subgraph cluster_b4 {
label="Block 4 — 5,899,776 params"
style=filled fillcolor="#fdf4ff" color="#d8b4fe"
node [fillcolor="#e9d5ff"]
b4c1 [label="Conv2D | 512 filters | 3×3 | ReLU | 28×28×512"]
b4c2 [label="Conv2D | 512 filters | 3×3 | ReLU"]
b4c3 [label="Conv2D | 512 filters | 3×3 | ReLU | RF: 92×92"]
b4p [label="MaxPool | 2×2, stride=2 | 14×14×512"]
b4c1 -> b4c2 -> b4c3 -> b4p
}
subgraph cluster_b5 {
label="Block 5 — 7,079,424 params"
style=filled fillcolor="#fff1f2" color="#fda4af"
node [fillcolor="#fecdd3"]
b5c1 [label="Conv2D | 512 filters | 3×3 | ReLU | 14×14×512"]
b5c2 [label="Conv2D | 512 filters | 3×3 | ReLU"]
b5c3 [label="Conv2D | 512 filters | 3×3 | ReLU | RF: 196×196"]
b5p [label="MaxPool | 2×2, stride=2 | 7×7×512"]
b5c1 -> b5c2 -> b5c3 -> b5p
}
subgraph cluster_head {
label="Classifier Head — 123,642,856 params"
style=filled fillcolor="#fffbeb" color="#fbbf24"
node [fillcolor="#fef3c7"]
flat [label="Flatten | 25,088 units"]
fc1 [label="FC-4096 | ReLU | Dropout 0.5 | 102.8M params"]
fc2 [label="FC-4096 | ReLU | Dropout 0.5 | 16.8M params"]
fc3 [label="FC-1000 | ImageNet classes | 4.1M params"]
soft [label="Softmax | 1000-dim | Σ pᵢ = 1.0"]
flat -> fc1 -> fc2 -> fc3 -> soft
}
edge [penwidth=1.5]
input -> b1c1
b1p -> b2c1
b2p -> b3c1
b3p -> b4c1
b4p -> b5c1
b5p -> flat
}
The parameter counts in the labels are accurate. Claude computed them correctly, including the receptive field annotations at each block boundary. This is the kind of detail that a human drawing the same diagram by hand would likely skip or round.
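Label arithmetic like this can be re-derived in a few lines, which is a cheap check on any generated diagram. This is a minimal sketch, assuming the standard VGG-16 layout (3×3 convolutions with biases, 2×2 max-pooling with stride 2, three fully connected layers); the constant and function names are illustrative and do not come from the diagram source.

```python
# (in_channels, out_channels) for each 3x3 conv layer, grouped by block.
VGG16_BLOCKS = [
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
    [(512, 512), (512, 512), (512, 512)],
]
# (in_features, out_features) for the classifier head.
VGG16_HEAD = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]


def conv_params(c_in, c_out, k=3):
    """Weights plus one bias per output channel."""
    return k * k * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


def block_param_counts():
    return [sum(conv_params(i, o) for i, o in block) for block in VGG16_BLOCKS]


def head_param_count():
    return sum(fc_params(i, o) for i, o in VGG16_HEAD)


def receptive_fields():
    """Receptive field side length at the last conv layer of each block."""
    rf, jump, out = 1, 1, []
    for block in VGG16_BLOCKS:
        for _ in block:
            rf += 2 * jump  # 3x3 conv, stride 1: grows by (k - 1) * jump
        out.append(rf)
        rf += jump          # 2x2 pool, stride 2: grows by (k - 1) * jump
        jump *= 2           # stride 2 doubles the spacing between samples
    return out


if __name__ == "__main__":
    for n, (params, rf) in enumerate(zip(block_param_counts(), receptive_fields()), 1):
        print(f"Block {n}: {params:,} params, RF {rf}x{rf}")
    print(f"Head: {head_param_count():,} params")
```

Summing the block and head counts gives 138,357,544, the commonly quoted total for VGG-16, which is a quick sanity check on the layer table itself.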
The D2 version of the same diagram is more verbose but renders well. The Mermaid and Pikchr versions exist too. None of them add information the Graphviz version does not have. For a pure DAG like this, Graphviz is the right tool and produces the cleanest result.
Support Vector Machine
The SVM diagram is where the tool choice matters most. The diagram needs a diagonal hyperplane at a specific angle, scatter points in two classes positioned relative to the hyperplane, support vectors annotated, margin width indicated with a two-headed arrow, and reference boxes for the optimisation objective and kernels. That is a geometric drawing problem.
# SVM: Hard and Soft Margin Classification
scale = 1.0
text "SVM — Support Vector Machine: Geometry, Margins, Kernels" bold at 3.8,9.6
arrow from 0.3,0.4 to 5.8,0.4
text "x₁ (feature 1)" small at 6.15,0.4
arrow from 0.3,0.4 to 0.3,5.2
text "x₂" small with .s at 0.3,5.35
line from 1.3,0.35 to 1.3,0.45
line from 2.3,0.35 to 2.3,0.45
line from 3.3,0.35 to 3.3,0.45
line from 4.3,0.35 to 4.3,0.45
line from 0.25,1.4 to 0.35,1.4
line from 0.25,2.4 to 0.35,2.4
line from 0.25,3.4 to 0.35,3.4
line from 0.25,4.4 to 0.35,4.4
line from 0.9,0.5 to 4.9,4.9 color 0x1e40af thick
line from 0.2,0.5 to 3.8,4.7 dashed color 0x3b82f6
line from 1.8,0.5 to 5.8,4.7 dashed color 0x3b82f6
text "w·x+b=0" small bold with .w at 5.0,5.1
text "w·x+b=+1" small italic with .e at 3.55,4.85
text "w·x+b=−1" small italic with .w at 6.0,4.85
arrow from 2.8,2.5 to 3.5,2.5
arrow from 3.4,2.5 to 2.7,2.5
text "2/||w||" small bold with .s at 3.1,2.7
circle radius 0.13 fill 0x86efac at 4.3,4.5
circle radius 0.13 fill 0x86efac at 4.8,4.0
circle radius 0.13 fill 0x86efac at 5.1,3.5
circle radius 0.13 fill 0x86efac at 4.6,3.2
circle radius 0.13 fill 0x86efac at 5.3,4.3
circle radius 0.13 fill 0x86efac at 4.9,2.8
circle radius 0.13 fill 0x86efac at 5.5,3.8
box width 0.22 height 0.22 fill 0xfca5a5 at 1.0,1.0
box width 0.22 height 0.22 fill 0xfca5a5 at 1.6,1.6
box width 0.22 height 0.22 fill 0xfca5a5 at 0.7,1.9
box width 0.22 height 0.22 fill 0xfca5a5 at 1.3,2.4
box width 0.22 height 0.22 fill 0xfca5a5 at 0.9,0.7
box width 0.22 height 0.22 fill 0xfca5a5 at 1.8,2.0
box width 0.22 height 0.22 fill 0xfca5a5 at 0.6,2.8
circle radius 0.18 fill 0x4ade80 color 0x166534 thickness 0.06 at 3.8,4.2
circle radius 0.18 fill 0x4ade80 color 0x166534 thickness 0.06 at 4.3,3.5
box width 0.28 height 0.28 fill 0xf87171 color 0x7f1d1d thickness 0.06 at 1.5,0.7
box width 0.28 height 0.28 fill 0xf87171 color 0x7f1d1d thickness 0.06 at 2.0,1.8
arrow from 4.8,4.6 to 4.0,4.28 color 0x166534
text "support vector" small italic with .w at 4.85,4.65
arrow from 0.6,0.5 to 1.38,0.68 color 0x7f1d1d
text "support vector" small italic with .e at 0.55,0.5
circle radius 0.14 fill 0xfbbf24 color 0x92400e thickness 0.05 at 2.5,1.2
arrow from 2.5,1.34 to 2.5,1.65 color 0x92400e
text "ξ > 0" small bold with .w at 2.65,1.5
text "(slack: inside margin)" small with .w at 2.65,1.28
box "min ½||w||² + C Σξᵢ" small bold "s.t. yᵢ(w·xᵢ+b) ≥ 1−ξᵢ, ξᵢ≥0" small width 2.6 height 0.6 fill 0xfef9c3 color 0xca8a04 at 3.8,7.8
box "Kernels: K(x,z) = φ(x)·φ(z)" small bold "Linear: x·z" small "Poly: (γx·z+r)^d" small "RBF: exp(−γ||x−z||²)" small width 2.6 height 0.8 fill 0xf3e8ff color 0xa855f7 at 3.8,6.8
box "Legend" bold "● Circle = positive (y=+1)" small "■ Square = negative (y=−1)" small "★ Larger border = support vector" small "✕ Yellow = slack ξᵢ > 0" small width 2.6 height 0.9 fill 0xf1f5f9 color 0x64748b at 3.8,5.7
Claude handled the Pikchr geometry correctly after a couple of iterations. The initial version had the hyperplane at the wrong slope and the scatter points too tightly clustered. After adjusting the endpoint coordinates for the hyperplane line and spreading the class distributions, the diagram reads correctly.
The Graphviz SVM version uses neato with pinned pos= coordinates to approximate the same layout. It works, but you can see it fighting the tool. Pikchr is simply the right choice for this kind of figure.
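Because Pikchr has no layout engine, coordinates like the margin lines have to be computed somewhere. One option is to derive them from the hyperplane endpoints instead of adjusting them by eye. This is a minimal sketch under that assumption; `parallel_offset` and `pikchr_line` are hypothetical helpers, and the 0.5 offset in the usage note is an arbitrary example value.

```python
import math


def parallel_offset(p0, p1, distance):
    """Return the endpoints of the line parallel to p0 -> p1, shifted by
    `distance` along the left-hand unit normal (negative shifts right)."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length  # unit normal, 90 degrees counter-clockwise
    return (
        (x0 + distance * nx, y0 + distance * ny),
        (x1 + distance * nx, y1 + distance * ny),
    )


def pikchr_line(p0, p1, attrs=""):
    """Format a Pikchr `line` statement with two-decimal coordinates."""
    (x0, y0), (x1, y1) = p0, p1
    return f"line from {x0:.2f},{y0:.2f} to {x1:.2f},{y1:.2f} {attrs}".rstrip()
```

Calling `parallel_offset((0.9, 0.5), (4.9, 4.9), 0.5)` and `parallel_offset((0.9, 0.5), (4.9, 4.9), -0.5)` gives two margin lines that are parallel to the hyperplane by construction, so changing the slope later means editing one pair of endpoints rather than three.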
Viterbi Algorithm
The Viterbi trellis is a grid: states as rows, time steps as columns, transition edges between every state pair at each step, with the optimal path highlighted. Graphviz handles this well with rankdir=LR and explicit rank=same groupings.
digraph Viterbi {
rankdir=LR
fontname="monospace"
fontsize=12
bgcolor="#fafafa"
splines=line
nodesep=0.55
ranksep=1.5
label="Viterbi Algorithm — Hidden Markov Model (Ice-Cream Weather)\nStates: Hot / Warm / Cold Observations: ice-cream count each day"
labelloc=t
node [fontname="monospace" fontsize=10 shape=circle style=filled width=0.85]
edge [fontname="monospace" fontsize=7.5]
node [shape=none style=invis width=0 height=0 fontsize=11]
lbl [label="State"]
t1 [label="t = 1"] t2 [label="t = 2"] t3 [label="t = 3"]
t4 [label="t = 4"] t5 [label="t = 5"] t6 [label="t = 6"]
lbl -> t1 -> t2 -> t3 -> t4 -> t5 -> t6 [style=invis]
node [shape=rect style="filled,rounded" fillcolor="#fef9c3" fontsize=9 width=0.9 height=0.55]
obs1 [label="O₁=3\n🍦🍦🍦"] obs2 [label="O₂=1\n🍦"] obs3 [label="O₃=2\n🍦🍦"]
obs4 [label="O₄=1\n🍦"] obs5 [label="O₅=2\n🍦🍦"] obs6 [label="O₆=1\n🍦"]
node [shape=circle style=filled width=0.85 fontsize=9]
node [fillcolor="#fca5a5"] H1 [label="Hot\n δ=0.240\n b(3)=0.4"]
node [fillcolor="#fde68a"] W1 [label="Warm\n δ=0.080\n b(3)=0.4"]
node [fillcolor="#93c5fd"] C1 [label="Cold\n δ=0.020\n b(3)=0.1"]
node [fillcolor="#fca5a5"] H2 [label="Hot\n δ=0.168\n b(1)=0.2"]
node [fillcolor="#fde68a"] W2 [label="Warm\n δ=0.064\n b(1)=0.2"]
node [fillcolor="#93c5fd"] C2 [label="Cold\n δ=0.028\n b(1)=0.5"]
node [fillcolor="#fca5a5"] H3 [label="Hot\n δ=0.034\n b(2)=0.3"]
node [fillcolor="#fde68a"] W3 [label="Warm\n δ=0.067\n b(2)=0.4"]
node [fillcolor="#93c5fd"] C3 [label="Cold\n δ=0.020\n b(2)=0.2"]
node [fillcolor="#fca5a5"] H4 [label="Hot\n δ=0.024\n b(1)=0.2"]
node [fillcolor="#fde68a"] W4 [label="Warm\n δ=0.027\n b(1)=0.2"]
node [fillcolor="#93c5fd"] C4 [label="Cold\n δ=0.033\n b(1)=0.5"]
node [fillcolor="#fca5a5"] H5 [label="Hot\n δ=0.007\n b(2)=0.3"]
node [fillcolor="#fde68a"] W5 [label="Warm\n δ=0.013\n b(2)=0.4"]
node [fillcolor="#93c5fd"] C5 [label="Cold\n δ=0.016\n b(2)=0.2"]
node [fillcolor="#fca5a5"] H6 [label="Hot\n δ=0.005\n b(1)=0.2"]
node [fillcolor="#fde68a"] W6 [label="Warm\n δ=0.006\n b(1)=0.2"]
node [fillcolor="#93c5fd" penwidth=3 color="#1d4ed8"] C6 [label="Cold\n δ=0.008\n b(1)=0.5"]
{ rank=same; lbl; obs1; H1; W1; C1 }
{ rank=same; t1; obs2; H2; W2; C2 }
{ rank=same; t2; obs3; H3; W3; C3 }
{ rank=same; t3; obs4; H4; W4; C4 }
{ rank=same; t4; obs5; H5; W5; C5 }
{ rank=same; t5; obs6; H6; W6; C6 }
edge [style=dotted color="#94a3b8" arrowhead=none penwidth=1.0]
obs1 -> H1 obs1 -> W1 obs1 -> C1
obs2 -> H2 obs2 -> W2 obs2 -> C2
obs3 -> H3 obs3 -> W3 obs3 -> C3
obs4 -> H4 obs4 -> W4 obs4 -> C4
obs5 -> H5 obs5 -> W5 obs5 -> C5
obs6 -> H6 obs6 -> W6 obs6 -> C6
edge [color="#d1d5db" fontcolor="#6b7280" style=solid penwidth=0.8 arrowhead=vee]
H1->H2 [label="0.70"] H1->W2 [label="0.20"] H1->C2 [label="0.10"]
W1->H2 [label="0.30"] W1->W2 [label="0.40"] W1->C2 [label="0.30"]
C1->H2 [label="0.10"] C1->W2 [label="0.30"] C1->C2 [label="0.60"]
H2->H3 [label="0.70"] H2->W3 [label="0.20"] H2->C3 [label="0.10"]
W2->H3 [label="0.30"] W2->W3 [label="0.40"] W2->C3 [label="0.30"]
C2->H3 [label="0.10"] C2->W3 [label="0.30"] C2->C3 [label="0.60"]
H3->H4 [label="0.70"] H3->W4 [label="0.20"] H3->C4 [label="0.10"]
W3->H4 [label="0.30"] W3->W4 [label="0.40"] W3->C4 [label="0.30"]
C3->H4 [label="0.10"] C3->W4 [label="0.30"] C3->C4 [label="0.60"]
H4->H5 [label="0.70"] H4->W5 [label="0.20"] H4->C5 [label="0.10"]
W4->H5 [label="0.30"] W4->W5 [label="0.40"] W4->C5 [label="0.30"]
C4->H5 [label="0.10"] C4->W5 [label="0.30"] C4->C5 [label="0.60"]
H5->H6 [label="0.70"] H5->W6 [label="0.20"] H5->C6 [label="0.10"]
W5->H6 [label="0.30"] W5->W6 [label="0.40"] W5->C6 [label="0.30"]
C5->H6 [label="0.10"] C5->W6 [label="0.30"] C5->C6 [label="0.60"]
edge [color="#dc2626" penwidth=3.5 style=bold fontcolor="#dc2626" fontsize=9 arrowhead=vee]
H1->H2 [label="BEST"] H2->W3 [label="BEST"]
W3->C4 [label="BEST"] C4->C5 [label="BEST"] C5->C6 [label="BEST"]
node [shape=rect style="filled,rounded" fillcolor="#f1f5f9" fontsize=8 width=2.2 height=0.9]
init [label="Initial π:\nπ(Hot)=0.6 π(Warm)=0.2\nπ(Cold)=0.2"]
node [shape=rect style="filled,rounded" fillcolor="#f1f5f9" fontsize=8 width=2.2 height=1.0]
emis [label="Emission B (count→state):\nB(1|H)=0.2 B(2|H)=0.3 B(3|H)=0.4\nB(1|W)=0.2 B(2|W)=0.4 B(3|W)=0.4\nB(1|C)=0.5 B(2|C)=0.2 B(3|C)=0.1"]
init -> H1 [style=dotted arrowhead=none color="#94a3b8"]
emis -> C1 [style=invis]
{ rank=same; init; emis; H1; W1; C1 }
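}
The first column of delta values is correct (δ₁ = π · b), but the later columns and the highlighted path do not hold up. Recomputing with the π, transition, and emission values shown in the diagram gives δ₂(Hot) = 0.24 × 0.7 × 0.2 = 0.0336, where the label shows 0.168, which is the product before the emission factor b(1) = 0.2 is applied. Carried through all six steps, the most likely path under these parameters is Hot at every step, not the highlighted H H W C C C.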
One thing Claude did well here: the observation nodes are tied to each column using rank=same, which keeps the column structure readable even with all the crossing transition edges. That is a non-obvious Graphviz trick.
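Trellis annotations are mechanical to re-derive, so they are worth checking in code before publishing. The sketch below is a plain dictionary-based Viterbi pass using the π, transition, and emission values printed in the diagram; the variable and function names are illustrative, and it makes no attempt at log-space arithmetic because the sequence is only six steps long.

```python
STATES = ["Hot", "Warm", "Cold"]
PI = {"Hot": 0.6, "Warm": 0.2, "Cold": 0.2}
A = {  # A[prev][next], copied from the transition edge labels
    "Hot": {"Hot": 0.7, "Warm": 0.2, "Cold": 0.1},
    "Warm": {"Hot": 0.3, "Warm": 0.4, "Cold": 0.3},
    "Cold": {"Hot": 0.1, "Warm": 0.3, "Cold": 0.6},
}
B = {  # B[state][ice-cream count], copied from the emission box
    "Hot": {1: 0.2, 2: 0.3, 3: 0.4},
    "Warm": {1: 0.2, 2: 0.4, 3: 0.4},
    "Cold": {1: 0.5, 2: 0.2, 3: 0.1},
}
OBS = [3, 1, 2, 1, 2, 1]


def viterbi(obs, states, pi, a, b):
    """Return (delta, path): per-step best-path probabilities and the decoded states."""
    delta = [{s: pi[s] * b[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t.
            prev = max(states, key=lambda r: delta[t - 1][r] * a[r][s])
            delta[t][s] = delta[t - 1][prev] * a[prev][s] * b[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: delta[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return delta, path[::-1]


if __name__ == "__main__":
    delta, path = viterbi(OBS, STATES, PI, A, B)
    for t, column in enumerate(delta, 1):
        print(t, {s: f"{p:.6f}" for s, p in column.items()})
    print(" ".join(path))
```

Running it prints one δ column per time step plus the decoded path, which can be compared directly against the node labels and the red edges.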
Backpropagation
The backprop diagram is dense. Forward pass edges go left to right in blue, backward pass edges go right to left in red dashed. Every layer is fully connected to the next, so the edge count is high.
digraph Backprop {
rankdir=LR
fontname="monospace"
fontsize=12
bgcolor="#fafafa"
nodesep=0.45
ranksep=1.8
label="Backpropagation — MLP 3→4→3→2 | Forward: blue | Backward: red dashed\nChain rule: ∂L/∂W¹ = (∂L/∂h²)(∂h²/∂a²)(∂a²/∂h¹)(∂h¹/∂a¹)(∂a¹/∂W¹)"
labelloc=t
node [fontname="monospace" fontsize=9 style=filled shape=circle]
edge [fontname="monospace" fontsize=8]
subgraph cluster_in {
label="Input\nLayer" fontname="monospace" fontsize=9
style=filled fillcolor="#eff6ff" color="#93c5fd"
node [fillcolor="#dbeafe" width=0.75]
x1 [label="x₁\nfeat 1"] x2 [label="x₂\nfeat 2"] x3 [label="x₃\nfeat 3"]
}
subgraph cluster_h1 {
label="Hidden Layer 1\na¹ = W¹x+b¹\nh¹ = ReLU(a¹)" fontname="monospace" fontsize=9
style=filled fillcolor="#f0fdf4" color="#86efac"
node [fillcolor="#bbf7d0" width=0.82]
h11 [label="h₁¹\nReLU\na¹₁"] h12 [label="h₂¹\nReLU\na¹₂"]
h13 [label="h₃¹\nReLU\na¹₃"] h14 [label="h₄¹\nReLU\na¹₄"]
}
subgraph cluster_h2 {
label="Hidden Layer 2\na² = W²h¹+b²\nh² = ReLU(a²)" fontname="monospace" fontsize=9
style=filled fillcolor="#fff7ed" color="#fdba74"
node [fillcolor="#fed7aa" width=0.82]
h21 [label="h₁²\nReLU\na²₁"] h22 [label="h₂²\nReLU\na²₂"] h23 [label="h₃²\nReLU\na²₃"]
}
subgraph cluster_out {
label="Output Layer\na³ = W³h²+b³\nŷ = σ(a³)" fontname="monospace" fontsize=9
style=filled fillcolor="#fdf4ff" color="#d8b4fe"
node [fillcolor="#e9d5ff" width=0.85]
y1 [label="ŷ₁\nσ(a³₁)\np(c₁)"] y2 [label="ŷ₂\nσ(a³₂)\np(c₂)"]
}
node [shape=rect style="filled,rounded" fillcolor="#fee2e2" width=1.7 height=1.0]
L [label="Loss L\nCross-Entropy:\nL = −Σ yᵢ log ŷᵢ\n∂L/∂ŷ = ŷ − y"]
edge [color="#1d4ed8" penwidth=1.3 style=solid arrowhead=vee]
x1->h11 [label="W¹₁₁"] x1->h12 [label="W¹₁₂"] x1->h13 x1->h14
x2->h11 [label="W¹₂₁"] x2->h12 x2->h13 x2->h14
x3->h11 x3->h12 x3->h13 [label="W¹₃₃"] x3->h14
h11->h21 [label="W²₁₁"] h11->h22 h11->h23
h12->h21 h12->h22 [label="W²₂₂"] h12->h23
h13->h21 h13->h22 h13->h23 [label="W²₃₃"]
h14->h21 h14->h22 h14->h23
h21->y1 [label="W³₁₁"] h21->y2
h22->y1 h22->y2 [label="W³₂₂"]
h23->y1 h23->y2
y1->L y2->L
edge [color="#dc2626" penwidth=1.8 style=dashed arrowhead=vee constraint=false fontcolor="#dc2626"]
L->y1 [label="∂L/∂ŷ₁\n=ŷ₁−y₁"] L->y2 [label="∂L/∂ŷ₂\n=ŷ₂−y₂"]
y1->h21 [label="∂L/∂a³₁"] y1->h22 y1->h23
y2->h21 y2->h22 [label="∂L/∂a³₂"] y2->h23
h21->h11 [label="∂L/∂h¹₁"] h21->h12 h21->h13 h21->h14
h22->h11 h22->h12 [label="∂L/∂h¹₂"] h22->h13 h22->h14
h23->h11 h23->h12 h23->h13 [label="∂L/∂h¹₃"] h23->h14
h11->x1 [label="∂L/∂W¹₁₁"] h12->x2 [label="∂L/∂W¹₂₂"]
h13->x3 [label="∂L/∂W¹₃₃"] h14->x1 [label="∂L/∂W¹·₄"]
node [shape=rect style="filled,rounded" fillcolor="#fef9c3" fontsize=8 width=1.8 height=0.9 color="#ca8a04"]
gd [label="Gradient Descent:\nW ← W − η ∂L/∂W\nη = learning rate"]
node [shape=rect style="filled,rounded" fillcolor="#f0f9ff" fontsize=8 width=1.9 height=0.9 color="#0284c7"]
act [label="ReLU: f(a)=max(0,a)\nf'(a)=1 if a>0 else 0\nσ: f(a)=1/(1+e^-a)"]
gd -> L [style=dotted color="#ca8a04" arrowhead=none]
act -> h11 [style=dotted color="#0284c7" arrowhead=none]
{ rank=same; gd; L }
{ rank=same; act; h11 }
}
This one required the most iteration on layout. Setting constraint=false on the backward edges tells dot to ignore them when assigning ranks, so they do not disturb the forward-pass layout, but on a dense fully connected graph the result is still visually noisy. The key gradient labels (∂L/∂W¹₁₁ etc.) are annotated on representative edges only, not all of them.
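With this many edges, generating the edge list from the layer sizes is less error-prone than writing it out by hand. This is a minimal sketch that reuses the node names from the diagram; the helper names are hypothetical, and unlike the DOT source above it emits a backward edge for every forward edge instead of a representative subset.

```python
# Node names per layer, matching the 3 -> 4 -> 3 -> 2 MLP in the diagram.
LAYERS = [
    ["x1", "x2", "x3"],
    ["h11", "h12", "h13", "h14"],
    ["h21", "h22", "h23"],
    ["y1", "y2"],
]


def forward_edges(layers):
    """Every (source, target) pair between consecutive layers."""
    return [(s, d) for a, b in zip(layers, layers[1:]) for s in a for d in b]


def dot_edges(layers):
    """DOT statements: forward edges first, then reversed backward edges
    marked constraint=false so they do not affect rank assignment."""
    fwd = forward_edges(layers)
    lines = ['edge [color="#1d4ed8" style=solid]']
    lines += [f"{s} -> {d}" for s, d in fwd]
    lines.append('edge [color="#dc2626" style=dashed constraint=false]')
    lines += [f"{d} -> {s}" for s, d in reversed(fwd)]
    return lines


if __name__ == "__main__":
    print("\n".join(dot_edges(LAYERS)))
```

The forward edge count is 3·4 + 4·3 + 3·2 = 30, which gives a sense of why the rendered result is noisy once the backward pass doubles it.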
Hidden Markov Model structure
The HMM diagram is the model itself, not the algorithm run on it. Hidden states with transition probabilities (including self-loops), emission probabilities to observation nodes, and initial distribution.
digraph HMM {
rankdir=TB
fontname="monospace"
fontsize=13
nodesep=1.2
ranksep=1.4
label="Hidden Markov Model — Ice Cream Weather Example"
labelloc=t
node [fontname="monospace" fontsize=12]
edge [fontname="monospace" fontsize=10]
node [shape=diamond style=filled fillcolor="#f3e5f5" width=0.7]
pi [label="π\nStart"]
node [shape=circle style=filled fillcolor="#bbdefb" width=1.1]
H [label="HOT\nπ=0.8"] C [label="COLD\nπ=0.2"] W [label="WARM\nπ=0.0"]
node [shape=rect style="filled,rounded" fillcolor="#fff9c4" width=1.2]
O1 [label="1 ice cream\nP(1|H)=0.2\nP(1|C)=0.5\nP(1|W)=0.4"]
O2 [label="2 ice creams\nP(2|H)=0.4\nP(2|C)=0.4\nP(2|W)=0.4"]
O3 [label="3 ice creams\nP(3|H)=0.4\nP(3|C)=0.1\nP(3|W)=0.2"]
edge [color="#9c27b0" penwidth=1.5 fontcolor="#9c27b0"]
pi -> H [label="0.8"] pi -> C [label="0.2"] pi -> W [label="0.0"]
edge [color="#1565c0" penwidth=1.5 fontcolor="#1565c0"]
H -> H [label="0.7"] H -> C [label="0.1"] H -> W [label="0.2"]
C -> C [label="0.5"] C -> H [label="0.1"] C -> W [label="0.4"]
W -> W [label="0.3"] W -> H [label="0.4"] W -> C [label="0.3"]
edge [color="#e65100" style=dashed penwidth=1.2 fontcolor="#e65100"]
H -> O1 [label="0.2"] H -> O2 [label="0.4"] H -> O3 [label="0.4"]
C -> O1 [label="0.5"] C -> O2 [label="0.4"] C -> O3 [label="0.1"]
W -> O1 [label="0.4"] W -> O2 [label="0.4"] W -> O3 [label="0.2"]
{ rank=same; H; C; W }
{ rank=same; O1; O2; O3 }
}
The self-loops (H→H, C→C, W→W) render cleanly in Graphviz. Most other tools handle self-loops poorly or not at all. This is one area where DOT’s maturity shows.
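A model diagram like this carries three probability tables, and each row should sum to 1. The sketch below checks that and emits the DOT edge statements from the same tables, so the labels cannot drift from the data. The values are copied from the diagram; the table and function names are illustrative.

```python
import math

INITIAL = {"H": 0.8, "C": 0.2, "W": 0.0}
TRANSITIONS = {
    "H": {"H": 0.7, "C": 0.1, "W": 0.2},
    "C": {"C": 0.5, "H": 0.1, "W": 0.4},
    "W": {"W": 0.3, "H": 0.4, "C": 0.3},
}
EMISSIONS = {
    "H": {"O1": 0.2, "O2": 0.4, "O3": 0.4},
    "C": {"O1": 0.5, "O2": 0.4, "O3": 0.1},
    "W": {"O1": 0.4, "O2": 0.4, "O3": 0.2},
}


def bad_rows(table):
    """Return the keys of rows whose probabilities do not sum to 1."""
    return [
        row for row, probs in table.items()
        if not math.isclose(sum(probs.values()), 1.0, abs_tol=1e-9)
    ]


def dot_edge_lines(table):
    """One labelled DOT edge per table entry, self-loops included."""
    return [
        f'{src} -> {dst} [label="{p}"]'
        for src, row in table.items()
        for dst, p in row.items()
    ]


if __name__ == "__main__":
    assert not bad_rows(TRANSITIONS) and not bad_rows(EMISSIONS)
    print("\n".join(dot_edge_lines(TRANSITIONS) + dot_edge_lines(EMISSIONS)))
```

The same check applies to any HMM figure: a row that sums to 0.9 usually means a dropped or mistyped entry rather than a modelling choice.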
ML training pipeline
For an end-to-end pipeline with a feedback loop, D2 with ELK produces a cleaner result than Graphviz. The retrain trigger edge from monitoring back to ingestion crosses several other elements, and ELK routes it without collisions.
title: ML Training Pipeline {
near: top-center
shape: text
style.font-size: 20
}
direction: right
raw_data: Raw Data {
shape: cylinder
style.fill: "#e3f2fd"
label: "Raw Data\nCSV / JSON / Parquet"
}
ingestion: Data Ingestion {
shape: rectangle
style.fill: "#bbdefb"
label: "Data Ingestion\n• schema validation\n• type coercion\n• dedup"
}
split: Train/Val/Test Split {
shape: rectangle
style.fill: "#c8e6c9"
label: "Split\n70% train\n15% val\n15% test"
}
preprocessing: Feature Engineering {
shape: rectangle
style.fill: "#fff9c4"
label: "Feature Engineering\n• normalisation\n• one-hot encoding\n• imputation\n• PCA"
}
training: Model Training {
shape: rectangle
style.fill: "#ffe0b2"
label: "Training Loop\n• forward pass\n• loss compute\n• backprop\n• optimiser step"
epochs: Epochs {
shape: rectangle
style.fill: "#ffcc80"
label: "N epochs\nbatch_size=32\nlr=1e-3"
}
checkpoints: Checkpoints {
shape: rectangle
style.fill: "#ffcc80"
label: "Checkpoint\nbest val loss\nearly stopping"
}
epochs -> checkpoints: "save if improved"
}
evaluation: Evaluation {
shape: rectangle
style.fill: "#e8eaf6"
label: "Evaluation\n• accuracy\n• precision/recall\n• F1, AUC-ROC\n• confusion matrix"
}
registry: Model Registry {
shape: cylinder
style.fill: "#f3e5f5"
label: "Model Registry\nversioned artifacts\nmetadata + metrics"
}
serving: Serving {
shape: rectangle
style.fill: "#fce4ec"
label: "Inference API\nREST / gRPC\nbatching\nlatency SLA"
}
monitoring: Monitoring {
shape: rectangle
style.fill: "#efebe9"
label: "Monitoring\n• data drift\n• prediction drift\n• latency\n• error rate"
}
raw_data -> ingestion: "load"
ingestion -> split: "clean data"
split -> preprocessing: "train set\nval set\ntest set"
preprocessing -> training: "feature matrix X\nlabels y"
training -> evaluation: "trained model"
evaluation -> registry: "passes threshold?" {
style.stroke-dash: 4
}
registry -> serving: "deploy"
serving -> monitoring: "live traffic"
monitoring -> ingestion: "retrain trigger" {
style.stroke: "#e53935"
style.stroke-dash: 4
}
The nested training container with epochs and checkpoints inside it renders cleanly in D2. Graphviz could do this with subgraphs but the ELK routing handles the backward retrain edge better.
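The backward edge is the part a layered layout has to treat specially, and it can be identified mechanically. This is a small sketch that runs a depth-first search over the top-level pipeline edges from the D2 source and reports any edge pointing back to a node still on the search stack; the `back_edges` helper is hypothetical and is not part of D2 or ELK.

```python
# Top-level edges from the D2 source, without labels.
PIPELINE_EDGES = [
    ("raw_data", "ingestion"),
    ("ingestion", "split"),
    ("split", "preprocessing"),
    ("preprocessing", "training"),
    ("training", "evaluation"),
    ("evaluation", "registry"),
    ("registry", "serving"),
    ("serving", "monitoring"),
    ("monitoring", "ingestion"),
]


def back_edges(edges, root):
    """Return edges that close a cycle, found by DFS from `root`."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    on_stack, done, found = set(), set(), []

    def visit(node):
        on_stack.add(node)
        for nxt in graph.get(node, []):
            if nxt in on_stack:
                found.append((node, nxt))  # points at an ancestor: feedback edge
            elif nxt not in done:
                visit(nxt)
        on_stack.discard(node)
        done.add(node)

    visit(root)
    return found


if __name__ == "__main__":
    print(back_edges(PIPELINE_EDGES, "raw_data"))
```

For this pipeline the only edge reported is the retrain trigger from monitoring to ingestion, which is the one styled red and dashed in the diagram.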
Transformer architecture
The transformer encoder-decoder has enough nested structure (six-layer encoder, six-layer decoder, cross-attention connecting them) that D2’s container model is a natural fit.
title: Transformer Architecture — Encoder-Decoder {
near: top-center
shape: text
style.font-size: 20
}
direction: right
input_tokens: Input Tokens {
shape: rectangle
style.fill: "#e3f2fd"
label: "Input Tokens\n[BOS, x₁, x₂, ..., xₙ, EOS]"
}
encoder: Encoder {
style.fill: "#e8f5e9"
style.stroke: "#388e3c"
embed: Input Embedding {
shape: rectangle
style.fill: "#c8e6c9"
label: "Token Embedding\nd_model=512"
}
pos_enc: Positional Encoding {
shape: rectangle
style.fill: "#c8e6c9"
label: "Positional Encoding\nsin/cos, max_len=512"
}
layer1: Encoder Layer × 6 {
style.fill: "#a5d6a7"
mha: Multi-Head Attention {
shape: rectangle
style.fill: "#81c784"
label: "Multi-Head Self-Attention\nh=8 heads, d_k=64\nQ, K, V projections"
}
add_norm1: Add & Norm {
shape: rectangle
style.fill: "#e8f5e9"
label: "Add & LayerNorm\nresidual connection"
}
ffn: Feed-Forward {
shape: rectangle
style.fill: "#81c784"
label: "FFN\nd_ff=2048\nReLU(W₁x+b₁)W₂+b₂"
}
add_norm2: Add & Norm {
shape: rectangle
style.fill: "#e8f5e9"
label: "Add & LayerNorm\nresidual connection"
}
mha -> add_norm1
add_norm1 -> ffn
ffn -> add_norm2
}
embed -> pos_enc
pos_enc -> layer1
}
decoder: Decoder {
style.fill: "#fff3e0"
style.stroke: "#e65100"
embed: Output Embedding {
shape: rectangle
style.fill: "#ffe0b2"
label: "Token Embedding\n(shifted right)"
}
pos_enc: Positional Encoding {
shape: rectangle
style.fill: "#ffe0b2"
label: "Positional Encoding"
}
layer1: Decoder Layer × 6 {
style.fill: "#ffcc80"
masked_mha: Masked Multi-Head Attention {
shape: rectangle
style.fill: "#ffa726"
label: "Masked Self-Attention\ncausal mask\nh=8 heads"
}
add_norm1: Add & Norm {
shape: rectangle
style.fill: "#fff3e0"
label: "Add & LayerNorm"
}
cross_mha: Cross-Attention {
shape: rectangle
style.fill: "#ffa726"
label: "Cross-Attention\nQ from decoder\nK,V from encoder"
}
add_norm2: Add & Norm {
shape: rectangle
style.fill: "#fff3e0"
label: "Add & LayerNorm"
}
ffn: Feed-Forward {
shape: rectangle
style.fill: "#ffa726"
label: "FFN\nd_ff=2048"
}
add_norm3: Add & Norm {
shape: rectangle
style.fill: "#fff3e0"
label: "Add & LayerNorm"
}
masked_mha -> add_norm1
add_norm1 -> cross_mha
cross_mha -> add_norm2
add_norm2 -> ffn
ffn -> add_norm3
}
embed -> pos_enc
pos_enc -> layer1
}
linear: Linear + Softmax {
shape: rectangle
style.fill: "#fce4ec"
label: "Linear\nVocab projection\nd_model → |V|\n+ Softmax"
}
output_tokens: Output Tokens {
shape: rectangle
style.fill: "#f3e5f5"
label: "Output Probabilities\nP(token | context)"
}
input_tokens -> encoder
encoder -> decoder: "encoder output\nK, V keys/values"
decoder -> linear
linear -> output_tokens
Microservices architecture
The same architecture drawn in both D2 and Mermaid. D2 with ELK routes the dense cross-layer edges more cleanly. Mermaid’s dagre lays it out more compactly but the edge routing gets crowded in the middle.
title: ML Platform — Microservices Architecture {
near: top-center
shape: text
style.font-size: 20
}
direction: down
client: Client Layer {
style.fill: "#e3f2fd"
style.stroke: "#1565c0"
web: Web App { shape: rectangle; style.fill: "#bbdefb"; label: "Web App\nReact / Next.js" }
mobile: Mobile App { shape: rectangle; style.fill: "#bbdefb"; label: "Mobile App\niOS / Android" }
sdk: Python SDK { shape: rectangle; style.fill: "#bbdefb"; label: "Python SDK\napi client" }
}
gateway: API Gateway {
shape: rectangle
style.fill: "#fff9c4"
style.stroke: "#f9a825"
label: "API Gateway\nauth, rate limiting\nrouting, logging\nnginx / Kong"
}
services: Core Services {
style.fill: "#e8f5e9"
style.stroke: "#388e3c"
auth: Auth Service { shape: rectangle; style.fill: "#c8e6c9"; label: "Auth Service\nJWT / OAuth2\nuser management" }
experiment: Experiment Service { shape: rectangle; style.fill: "#c8e6c9"; label: "Experiment Service\nhyperparameter search\nrun tracking\nMLflow" }
training: Training Service { shape: rectangle; style.fill: "#c8e6c9"; label: "Training Service\njob scheduling\nGPU allocation\nk8s jobs" }
inference: Inference Service { shape: rectangle; style.fill: "#c8e6c9"; label: "Inference Service\nmodel serving\nbatch + realtime\ntriton" }
feature: Feature Store { shape: rectangle; style.fill: "#c8e6c9"; label: "Feature Store\nFeast / Tecton\nonline + offline" }
}
data: Data Layer {
style.fill: "#fff3e0"
style.stroke: "#e65100"
postgres: PostgreSQL { shape: cylinder; style.fill: "#ffe0b2"; label: "PostgreSQL\nmetadata\nusers, runs" }
s3: Object Storage { shape: cylinder; style.fill: "#ffe0b2"; label: "S3 / GCS\nmodels, datasets\nartefacts" }
redis: Redis { shape: cylinder; style.fill: "#ffe0b2"; label: "Redis\nfeature cache\nsession store" }
kafka: Kafka { shape: queue; style.fill: "#ffe0b2"; label: "Kafka\nevent streaming\nprediction logs" }
}
monitoring: Observability {
style.fill: "#fce4ec"
style.stroke: "#c62828"
prometheus: Prometheus { shape: rectangle; style.fill: "#f8bbd0"; label: "Prometheus\nmetrics scraping" }
grafana: Grafana { shape: rectangle; style.fill: "#f8bbd0"; label: "Grafana\ndashboards\nalerting" }
drift: Drift Monitor { shape: rectangle; style.fill: "#f8bbd0"; label: "Drift Monitor\nEvidentlyAI\ndata + model drift" }
}
client.web -> gateway: "HTTPS"
client.mobile -> gateway: "HTTPS"
client.sdk -> gateway: "HTTPS"
gateway -> services.auth: "authenticate"
gateway -> services.experiment: "REST"
gateway -> services.training: "REST"
gateway -> services.inference: "REST / gRPC"
services.training -> data.s3: "save artefacts"
services.training -> data.postgres: "log runs"
services.training -> data.kafka: "publish events"
services.inference -> data.redis: "feature lookup"
services.inference -> data.kafka: "log predictions"
services.feature -> data.redis: "populate cache"
services.feature -> data.s3: "offline features"
data.kafka -> monitoring.drift: "stream"
services.inference -> monitoring.prometheus: "metrics"
services.training -> monitoring.prometheus: "metrics"
monitoring.prometheus -> monitoring.grafana
---
title: "ML Platform — Microservices Architecture"
---
flowchart TD
classDef client fill:#bbdefb,stroke:#1565c0,color:#0d2a6e,font-size:12px
classDef gateway fill:#fff9c4,stroke:#f9a825,color:#5d4037,font-size:12px
classDef service fill:#c8e6c9,stroke:#388e3c,color:#1b5e20,font-size:12px
classDef data fill:#ffe0b2,stroke:#e65100,color:#4e2a00,font-size:12px
classDef monitor fill:#f8bbd0,stroke:#c62828,color:#4a0000,font-size:12px
classDef clusterClient fill:#e3f2fd,stroke:#1565c0
classDef clusterSvc fill:#e8f5e9,stroke:#388e3c
classDef clusterData fill:#fff3e0,stroke:#e65100
classDef clusterMonitor fill:#fce4ec,stroke:#c62828
subgraph CLIENT["Client Layer"]
WEB["**Web App**<br/>React / Next.js"]:::client
MOB["**Mobile App**<br/>iOS / Android"]:::client
SDK["**Python SDK**<br/>api client"]:::client
end
GW["**API Gateway**<br/>auth · rate limiting<br/>routing · logging<br/>nginx / Kong"]:::gateway
subgraph SERVICES["Core Services"]
AUTH["**Auth Service**<br/>JWT / OAuth2<br/>user management"]:::service
EXP["**Experiment Service**<br/>hyperparameter search<br/>run tracking · MLflow"]:::service
TRAIN["**Training Service**<br/>job scheduling<br/>GPU allocation · k8s jobs"]:::service
INF["**Inference Service**<br/>model serving<br/>batch + realtime · triton"]:::service
FEAT["**Feature Store**<br/>Feast / Tecton<br/>online + offline"]:::service
end
subgraph DATA["Data Layer"]
PG[("**PostgreSQL**<br/>metadata<br/>users · runs")]:::data
S3[("**S3 / GCS**<br/>models · datasets<br/>artefacts")]:::data
REDIS[("**Redis**<br/>feature cache<br/>session store")]:::data
KAFKA[/"**Kafka**<br/>event streaming<br/>prediction logs"\]:::data
end
subgraph OBS["Observability"]
PROM["**Prometheus**<br/>metrics scraping"]:::monitor
GRAF["**Grafana**<br/>dashboards · alerting"]:::monitor
DRIFT["**Drift Monitor**<br/>EvidentlyAI<br/>data + model drift"]:::monitor
end
WEB -->|HTTPS| GW
MOB -->|HTTPS| GW
SDK -->|HTTPS| GW
GW -->|authenticate| AUTH
GW -->|REST| EXP
GW -->|REST| TRAIN
GW -->|REST / gRPC| INF
TRAIN -->|save artefacts| S3
TRAIN -->|log runs| PG
TRAIN -->|publish events| KAFKA
INF -->|feature lookup| REDIS
INF -->|log predictions| KAFKA
FEAT -->|populate cache| REDIS
FEAT -->|offline features| S3
KAFKA -->|stream| DRIFT
INF -->|metrics| PROM
TRAIN -->|metrics| PROM
PROM --> GRAF
class CLIENT clusterClient
class SERVICES clusterSvc
class DATA clusterData
class OBS clusterMonitor
Mermaid won on this particular diagram for use in a blog post. The output is more compact and fits the page width better. D2’s ELK layout produces a taller diagram that requires more scrolling.
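When the same architecture is maintained in two syntaxes, the edge list is the part most likely to drift. One way to avoid that is to keep the edges in a single table and emit both forms from it. The sketch below does this for a few of the edges above; the `EDGES` table, the id mapping, and the two emitter functions are assumptions made for illustration, not something either tool provides.

```python
# (source, target, label) using the D2 dotted paths; label may be None.
EDGES = [
    ("gateway", "services.auth", "authenticate"),
    ("services.training", "data.s3", "save artefacts"),
    ("data.kafka", "monitoring.drift", "stream"),
    ("monitoring.prometheus", "monitoring.grafana", None),
]

# Mermaid node ids are flat, so each D2 path maps to the id used in the flowchart.
MERMAID_IDS = {
    "gateway": "GW",
    "services.auth": "AUTH",
    "services.training": "TRAIN",
    "data.s3": "S3",
    "data.kafka": "KAFKA",
    "monitoring.drift": "DRIFT",
    "monitoring.prometheus": "PROM",
    "monitoring.grafana": "GRAF",
}


def to_d2(edges):
    """D2 connection statements, quoting the label when there is one."""
    return [
        f'{s} -> {d}: "{label}"' if label else f"{s} -> {d}"
        for s, d, label in edges
    ]


def to_mermaid(edges, ids):
    """Mermaid flowchart links, using the |label| form when there is one."""
    return [
        f"{ids[s]} -->|{label}| {ids[d]}" if label else f"{ids[s]} --> {ids[d]}"
        for s, d, label in edges
    ]


if __name__ == "__main__":
    print("\n".join(to_d2(EDGES)))
    print("\n".join(to_mermaid(EDGES, MERMAID_IDS)))
```

Node declarations and styling still have to be written per tool, since shapes and class definitions do not map one to one, but the connectivity stays in one place.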
ML training loop sequence
Sequence diagrams with loop and alt blocks are Mermaid’s strongest feature. The D2 version of the same diagram annotates the loop semantics on individual messages instead, because those constructs do not exist in D2’s sequence syntax.
sequenceDiagram
participant U as User / Researcher
participant T as Trainer Process
participant DS as DataLoader
participant M as Model
participant O as Optimiser
participant V as Validator
participant R as Registry
U->>T: train(config, dataset)
T->>DS: build DataLoader(batch_size=32, shuffle=True)
DS-->>T: batched iterator
loop Every Epoch
T->>DS: next batch (X, y)
DS-->>T: X ∈ ℝ^(32×d), y ∈ ℝ^32
T->>M: forward(X)
M-->>T: ŷ = σ(Wx + b)
T->>T: loss = CrossEntropy(ŷ, y)
Note over T: L = −Σ yᵢ log(ŷᵢ)
T->>O: zero_grad()
T->>T: loss.backward()
Note over T,M: Compute ∂L/∂W via backprop
T->>T: clip_grad_norm_(max=1.0)
T->>O: step() [W ← W − η∇L]
O-->>T: updated weights
end
T->>V: evaluate(val_loader)
V->>M: forward(X_val) for all batches
M-->>V: predictions ŷ_val
V-->>T: val_loss, accuracy, F1
alt val_loss improved
T->>R: save_checkpoint(model, epoch, metrics)
R-->>T: checkpoint path
else no improvement for patience=5
T->>T: early_stop()
T-->>U: best model at epoch k
end
T->>R: register_model(name, version, metrics)
R-->>U: model URI
Gradient descent flowchart
flowchart TD
A([Start]) --> B[Initialize weights W\nrandomly or zeros]
B --> C["Compute forward pass\nŷ = f(X, W)"]
C --> D[Compute loss\nL = ½‖ŷ − y‖²]
D --> E[Compute gradients\n∂L/∂W via backprop]
E --> F{Gradient\ncheck OK?}
F -- No --> G[Debug: check\nnumerical gradient]
G --> C
F -- Yes --> H[Update weights\nW ← W − η·∂L/∂W]
H --> I{Scheduler\ntype?}
I -- StepLR --> J[η ← η · γ\nevery k steps]
I -- CosineAnnealing --> K["η ← ηₘᵢₙ + ½(ηₘₐₓ−ηₘᵢₙ)\n·(1+cos(πt/T))"]
I -- ReduceOnPlateau --> L[Monitor val loss\nreduce if stagnant]
J --> M[Increment step\nt ← t + 1]
K --> M
L --> M
M --> N{Convergence\ncriteria?}
N -- ‖∇L‖ < ε --> O[Check val loss\nfor overfitting]
N -- max epochs --> O
N -- No --> C
O --> P{Val loss\nstill falling?}
P -- Yes --> C
P -- No\nearly stop --> Q([Return best W])
style A fill:#dbeafe,stroke:#1d4ed8
style Q fill:#dcfce7,stroke:#15803d
style G fill:#fef9c3,stroke:#ca8a04
style D fill:#fce7f3,stroke:#be185d
style E fill:#fce7f3,stroke:#be185d
style H fill:#ede9fe,stroke:#7c3aed
Kalman filter
The Kalman predict-update cycle is a two-box diagram with input arrows and a feedback loop. Pikchr handles it cleanly with named box anchors and the "then left until even with" arrow-routing syntax.
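For reference, the equations the two boxes encode can be written in conventional notation. This assumes, as the ASCII labels suggest, that x' and P' are the predicted state and covariance while F' and H' are transposes; the minus superscript for predicted quantities is a notational choice made here, not something in the diagram source:

```latex
\begin{aligned}
\text{Predict:}\quad & \hat{x}^{-} = F\,\hat{x} + B\,u, \qquad P^{-} = F P F^{\top} + Q \\
\text{Update:}\quad & K = P^{-} H^{\top}\left(H P^{-} H^{\top} + R\right)^{-1} \\
& \hat{x} = \hat{x}^{-} + K\left(z - H\hat{x}^{-}\right), \qquad P = (I - K H)\,P^{-}
\end{aligned}
```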
# Kalman Filter — Predict/Update Cycle
scale = 1.0
B1: box rad 0.12 wid 2.8 ht 1.6 fill 0xdbeafe \
"PREDICT" bold big \
"x' = F*x + B*u" small \
"P' = F*P*F' + Q" small \
at 1.4,2.5
B2: box rad 0.12 wid 2.8 ht 1.6 fill 0xdcfce7 \
"UPDATE" bold big \
"K = P'*H'*inv(H*P'*H'+R)" small \
"x = x' + K*(z - H*x')" small \
"P = (I - K*H)*P'" small \
at 5.0,2.5
arrow from B1.e to B2.w "x', P'" above small
arrow from B2.s down 0.5 then left until even with B1.s then to B1.s \
"updated x, P" below small
arrow <- from B1.n+(-.5,0) up 0.5 "u (control)" above small
arrow <- from B1.n+(.5,0) up 0.5 "Q (process noise)" above small
arrow <- from B2.n+(-.5,0) up 0.5 "z (measurement)" above small
arrow <- from B2.n+(.5,0) up 0.5 "R (meas. noise)" above small
box wid 5.0 ht 0.6 rad 0.08 fill 0xfef9c3 \
"F: transition H: observation K: Kalman gain Q,R: noise covariances" small \
at 3.2,0.35
text "Kalman Filter: Predict / Update Cycle" bold with .s at B1.nw+(1.4,0.5)
Decision tree
# Decision Tree — Loan Approval (depth=3)
scale = 1.0
R: box rad 0.1 wid 1.9 ht 0.65 fill 0xdbeafe \
"Age <= 30?" bold "Gini=0.48 n=200" small \
at 3.0,5.2
L1: box rad 0.1 wid 1.8 ht 0.65 fill 0xe0f2fe \
"Income <= 50k?" bold "Gini=0.42 n=110" small \
at 1.3,3.8
R1: box rad 0.1 wid 1.7 ht 0.65 fill 0xe0f2fe \
"Credit > 700?" bold "Gini=0.30 n=90" small \
at 4.7,3.8
LL: box rad 0.1 wid 1.4 ht 0.6 fill 0xfce4ec \
"DENY" bold "n=68 p=0.82" small \
at 0.4,2.4
LR: box rad 0.1 wid 1.4 ht 0.6 fill 0xf0fdf4 \
"APPROVE" bold "n=42 p=0.76" small \
at 2.2,2.4
RL: box rad 0.1 wid 1.5 ht 0.6 fill 0xfef9c3 \
"Yrs Emp?" bold "Gini=0.22 n=55" small \
at 3.8,2.4
RR: box rad 0.1 wid 1.4 ht 0.6 fill 0xf0fdf4 \
"APPROVE" bold "n=35 p=0.91" small \
at 5.5,2.4
RLL: box rad 0.1 wid 1.35 ht 0.55 fill 0xfce4ec \
"DENY" bold "n=28 p=0.79" small \
at 3.1,1.0
RLR: box rad 0.1 wid 1.35 ht 0.55 fill 0xf0fdf4 \
"APPROVE" bold "n=27 p=0.85" small \
at 4.7,1.0
arrow from R.sw to L1.n
arrow from R.se to R1.n
arrow from L1.sw to LL.n
arrow from L1.se to LR.n
arrow from R1.sw to RL.n
arrow from R1.se to RR.n
arrow from RL.sw to RLL.n
arrow from RL.se to RLR.n
text "Yes" small with .e at 0.5 way between R.sw and L1.n
text "No" small with .w at 0.5 way between R.se and R1.n
text "Yes" small with .e at 0.5 way between L1.sw and LL.n
text "No" small with .w at 0.5 way between L1.se and LR.n
text "No" small with .e at 0.5 way between R1.sw and RL.n
text "Yes" small with .w at 0.5 way between R1.se and RR.n
text "<2y" small with .e at 0.5 way between RL.sw and RLL.n
text ">=2y" small with .w at 0.5 way between RL.se and RLR.n
text "Decision Tree: Loan Approval (depth=3)" bold with .s at R.n+(0,0.4)
What held up and what did not
Claude generated syntactically correct code for all four tools without exception. The mathematical content (parameter counts, probability values, gradient expressions) was consistently accurate and would have taken significant time to write by hand.
The failure modes were layout-related, not content-related. Pikchr requires knowing the exact coordinates you want before you start. Claude’s initial coordinate estimates for the SVM scatter plot put the hyperplane at the wrong angle and the support vectors too close to the margin. Fixing it meant specifying the endpoint coordinates explicitly and iterating.
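Since Pikchr will not compute this geometry, one way to shorten the iteration is to derive the endpoints outside the diagram and paste them in. A minimal Python sketch, assuming the hyperplane is specified by a centre point, an angle, and a length, and that the margins are parallel lines at a fixed offset; the helper names and the example numbers are hypothetical, not the values used in the SVM diagram:

```python
import math


def hyperplane_endpoints(cx, cy, angle_deg, length):
    """Return ((x1, y1), (x2, y2)) for a segment of `length` centred at (cx, cy)."""
    theta = math.radians(angle_deg)
    dx = 0.5 * length * math.cos(theta)
    dy = 0.5 * length * math.sin(theta)
    return (cx - dx, cy - dy), (cx + dx, cy + dy)


def offset_parallel(p1, p2, distance):
    """Shift a segment sideways by `distance` along its unit normal."""
    (x1, y1), (x2, y2) = p1, p2
    seg_len = math.hypot(x2 - x1, y2 - y1)
    nx, ny = -(y2 - y1) / seg_len, (x2 - x1) / seg_len
    return (x1 + distance * nx, y1 + distance * ny), (x2 + distance * nx, y2 + distance * ny)


def pikchr_line(p1, p2, attrs=""):
    """Format a segment as a Pikchr line statement with absolute coordinates."""
    (x1, y1), (x2, y2) = p1, p2
    prefix = f"line {attrs} " if attrs else "line "
    return f"{prefix}from {x1:.3f},{y1:.3f} to {x2:.3f},{y2:.3f}"


if __name__ == "__main__":
    # Hypothetical geometry: a 4-unit hyperplane at 35 degrees through (2.5, 2.0).
    a, b = hyperplane_endpoints(2.5, 2.0, 35.0, 4.0)
    print(pikchr_line(a, b))
    for side in (+0.4, -0.4):  # margin lines on either side
        print(pikchr_line(*offset_parallel(a, b, side), attrs="dashed"))
```

The printed statements can be pasted into the Pikchr source as they are; changing the angle then means rerunning the script rather than re-estimating four coordinates by eye.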
For graph tools, Claude sometimes generated too many edge labels on dense graphs, making the result unreadable. The solution was to label only representative edges, not every one.
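One way to apply that rule mechanically is to generate the edge list and attach a label only to every k-th edge. A minimal Python sketch, assuming a fully connected pair of layers; the node names, the label text, and the label_every parameter are illustrative:

```python
def dense_edges_dot(src_nodes, dst_nodes, label_every=4):
    """Return DOT edge statements, labelling only every `label_every`-th edge."""
    lines = []
    index = 0
    for s in src_nodes:
        for d in dst_nodes:
            if index % label_every == 0:
                lines.append(f'  {s} -> {d} [label="w({s},{d})"];')
            else:
                lines.append(f"  {s} -> {d};")
            index += 1
    return lines


def to_digraph(name, edge_lines):
    """Wrap edge statements in a digraph block."""
    return "digraph " + name + " {\n" + "\n".join(edge_lines) + "\n}"


if __name__ == "__main__":
    hidden = ["h1", "h2", "h3"]
    output = ["o1", "o2", "o3", "o4"]
    print(to_digraph("dense", dense_edges_dot(hidden, output)))
```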
The Pikchr backprop diagram does not exist in this set. Pikchr can draw it (circles at coordinates, arrows between them), but doing so means manually placing every neuron and every edge. For a fully-connected network that is around 150 arrows. Claude can generate that, but it is not a good use of Pikchr. The Graphviz version is better for that specific diagram.
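The arrow count is the sum of the products of adjacent layer widths. A quick check in Python, assuming hypothetical layer sizes of 4, 8, 8, 6, 2, which are not necessarily those of the backprop diagram but land in the same range:

```python
def dense_edge_count(layer_sizes):
    """Number of edges in a fully connected feed-forward network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


layers = [4, 8, 8, 6, 2]          # hypothetical widths, input to output
edges = dense_edge_count(layers)  # 4*8 + 8*8 + 8*6 + 6*2 = 156
nodes = sum(layers)               # 28 circles to position by hand

if __name__ == "__main__":
    print(f"{nodes} nodes, {edges} arrows")
```

Each of those arrows and circles would need an explicit position in Pikchr, whereas dot places them from the edge list alone.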