11. ControlNet image-to-image: extra rails for edges, depth, and pose
~12 min read. The thing that stops prompt-only generation from drifting away from the structure we need.
Built on the ELI5 in 00-eli5.md. The blueprint — the text instruction — is useful, but here we add another guide like edges or pose so the denoiser has stronger structural rails to follow.
1) Why prompt-only generation drifts
A prompt can say,
"make a person dancing on a stage."
Fine.
But it does not pin exact limb angles, edge placement, or room geometry.
So prompt-only generation drifts.
You ask for the same scene twice.
You get two different poses.
That is often good for creativity.
It is bad when the user needs structure.
┌──────────── prompt only ────────────┐
text ──→ denoiser ──→ plausible but drifting layout
└─────────────────────────────────────┘
┌────────── prompt + control ─────────┐
text + edge/depth/pose ──→ denoiser ──→ layout on rails
└─────────────────────────────────────┘
This is why ControlNet exists.
The prompt still gives semantics.
The control map gives spatial discipline.
OpenPose skeletons, canny edges, depth maps, and scribbles are common examples.
A shoe catalogue tool cannot let the lace outline drift.
A comics pipeline cannot let the hero's pose mutate on every retry.
That is why structural conditioning moved from research toy to product necessity.
Image-to-image workflows pair naturally with this because there is already a visual starting point.
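To make "control map" concrete, here is a minimal sketch of pulling an edge map out of a source image. It is a crude gradient threshold, not the Canny detector, and the function name `crude_edge_map` and its `threshold` parameter are made up for illustration. Real pipelines would normally use a proper detector from an image library.

```python
from typing import List

def crude_edge_map(image: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Mark pixels whose right/down intensity jump exceeds a threshold.

    A stand-in for a real edge detector such as Canny (assumption:
    grayscale image given as rows of floats in [0, 1]).
    """
    height = len(image)
    width = len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            here = image[y][x]
            # At the border, compare the pixel with itself (zero gradient).
            right = image[y][x + 1] if x + 1 < width else here
            down = image[y + 1][x] if y + 1 < height else here
            gradient = abs(right - here) + abs(down - here)
            edges[y][x] = 1 if gradient > threshold else 0
    return edges

# A dark left half and a bright right half: one vertical edge.
source = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
edge_map = crude_edge_map(source)
for row in edge_map:
    print(row)
```

The edge map keeps only where the structure is, not what it looks like. That is the kind of signal a control branch consumes.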
2) A residual example with all intermediates
ControlNet usually adds residual feature corrections into a pretrained U-Net.
Use a toy two-element feature vector.
Base feature from the main U-Net: [1.0, 0.5].
Control residual from the control branch: [0.4, -0.2].
Control scale: 0.5.
First scale the residual:
$$0.5 \times [0.4,\ -0.2] = [0.2,\ -0.1]$$
Then add it to the base feature:
$$[1.0,\ 0.5] + [0.2,\ -0.1] = [1.2,\ 0.4]$$
Good.
The combined feature is [1.2, 0.4].
That is the recall answer.
Notice the control branch did not redraw the whole network.
It nudged an existing feature stream in a structured direction.
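The toy numbers above can be checked in a few lines of Python. This is only a sketch of the arithmetic; the helper name `apply_control` is hypothetical and does not come from any library.

```python
from typing import List, Tuple

def apply_control(
    base: List[float], residual: List[float], control_scale: float
) -> Tuple[List[float], List[float]]:
    """Scale a control residual, then add it to the base feature."""
    if len(base) != len(residual):
        raise ValueError("base and residual must have the same length")
    scaled = [control_scale * r for r in residual]
    combined = [b + s for b, s in zip(base, scaled)]
    return scaled, combined

# Values from the toy example in this section.
scaled, combined = apply_control([1.0, 0.5], [0.4, -0.2], 0.5)
print("scaled residual:", scaled)
print("combined feature:", combined)
```

Up to floating-point rounding, the printed values match the scaled residual and the combined feature worked out by hand.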
3) What the control branches are really doing
Do not think of ControlNet as a second painter.
Think of it as a rail system.
The pretrained denoiser already knows texture, lighting, and natural image statistics.
The control branch adds spatial hints at matching resolutions.
So the main model keeps its artistic prior.
The control branch says,
"Please do it with these edges, this pose, or this depth layout."
pretrained U-Net prior ──→ realism and texture
control residuals ──→ geometry and structure
combined result ──→ guided denoising with both
That is why residual injection works so well.
We reuse the strong base model.
We add structure without throwing away the prior.
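The "matching resolutions" idea can be sketched with toy feature maps. Everything here is an assumption made for illustration: the level names, the single-channel nested-list layout, and the function `inject_residuals`. A real U-Net carries many channels per level and the control branch is a trained network.

```python
from typing import Dict, List

FeatureMap = List[List[float]]  # toy single-channel map: rows of floats

def inject_residuals(
    base_features: Dict[str, FeatureMap],
    control_residuals: Dict[str, FeatureMap],
    control_scale: float = 1.0,
) -> Dict[str, FeatureMap]:
    """Add scaled control residuals to base features, level by level."""
    guided: Dict[str, FeatureMap] = {}
    for level, base in base_features.items():
        residual = control_residuals.get(level)
        if residual is None:
            # No hint at this level: the pretrained feature passes through.
            guided[level] = [row[:] for row in base]
            continue
        same_shape = len(residual) == len(base) and all(
            len(r) == len(b) for r, b in zip(residual, base)
        )
        if not same_shape:
            raise ValueError(f"shape mismatch at level {level!r}")
        guided[level] = [
            [b + control_scale * r for b, r in zip(base_row, res_row)]
            for base_row, res_row in zip(base, residual)
        ]
    return guided

base = {
    "high_res": [[1.0, 1.0], [1.0, 1.0]],
    "low_res": [[2.0]],
}
control = {"high_res": [[1.0, 0.0], [0.0, -1.0]]}
guided = inject_residuals(base, control, control_scale=0.5)
print(guided)
```

The base features are never overwritten. The control branch only adds to them, which is the sense in which the prior is kept.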
4) How image-to-image strength and control strength interact
There are two knobs now.
Image-to-image denoise strength decides how far we move away from the source image.
Control strength decides how hard we obey the control map.
Low denoise strength plus high control strength means,
"stay near the source and obey structure tightly."
High denoise strength plus low control strength means,
"change a lot, but keep only loose rails."
denoise 0.2 + control 1.5 ──→ careful edit, strong structure preservation
denoise 0.8 + control 0.5 ──→ bigger rewrite, looser structure retention
This is why Photoshop-style edits, pose transfer, and product mockups need both knobs tuned together.
One knob alone cannot explain the result.
Structure and freedom are bargaining with each other.
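A rough way to see the two knobs side by side is sketched below. It assumes one common image-to-image convention, in which denoise strength sets the fraction of the sampling schedule that actually runs. That mapping, the 50-step default, and the name `plan_edit` are assumptions for illustration, not something this file specifies.

```python
from typing import Dict, Union

def plan_edit(
    denoise_strength: float, control_scale: float, num_steps: int = 50
) -> Dict[str, Union[int, float]]:
    """Summarise how far an edit moves and how hard it obeys control.

    Assumption: denoise strength in [0, 1] is the fraction of the
    schedule that is run, so low strength stays near the source image.
    """
    if not 0.0 <= denoise_strength <= 1.0:
        raise ValueError("denoise_strength must be in [0, 1]")
    if control_scale < 0.0:
        raise ValueError("control_scale must be non-negative")
    steps_to_run = min(int(num_steps * denoise_strength), num_steps)
    return {
        "steps_to_run": steps_to_run,              # freedom to change
        "steps_skipped": num_steps - steps_to_run,  # source kept as-is
        "control_scale": control_scale,             # residual multiplier
    }

careful_edit = plan_edit(denoise_strength=0.2, control_scale=1.5)
bigger_rewrite = plan_edit(denoise_strength=0.8, control_scale=0.5)
print(careful_edit)
print(bigger_rewrite)
```

Under this assumption the careful edit runs few steps with a strong residual multiplier, while the bigger rewrite runs most of the schedule with a weak one. Neither number alone describes the outcome.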
Where this lives in the wild
- AUTOMATIC1111 ControlNet extension: exposes canny, depth, normal, and pose control as everyday user tools.
- ComfyUI pose-transfer workflows: artists feed skeletons or depth maps to lock structure while changing style.
- InvokeAI control layers: product pipelines use explicit control maps to reduce prompt-only drift.
- Leonardo AI structure references: users guide composition with external images rather than text alone.
- Stability AI image-to-image APIs: practical generation often mixes prompt conditioning with edge or depth control.
Pause and recall
- Why does prompt-only text-to-image drift on detailed structure?
- In the toy example, what combined feature did we get after adding the scaled control residual?
- What kinds of control maps are common in production?
- Why must denoise strength and control strength be tuned together?
Interview Q&A
Q: Why is ControlNet useful when CFG already exists? A: Because CFG strengthens prompt adherence, while ControlNet injects explicit spatial structure such as edges, depth, or pose. Common wrong answer to avoid: "CFG already provides exact geometry, so ControlNet is redundant."
Q: Why does a residual control branch work well? A: Because it lets the model keep the strong pretrained denoiser while adding structured guidance through additive feature corrections. Common wrong answer to avoid: "ControlNet replaces the whole U-Net with a brand-new model."
Q: Why can too much control make images worse? A: Because the network may obey the control map so rigidly that natural variation and texture quality suffer. Common wrong answer to avoid: "More control is always strictly better."
Q: Why does image-to-image pair naturally with ControlNet? A: Because image-to-image already starts from an existing image, so a control map such as edges, depth, or pose can guide the edit rather than leave the model to invent the structure from scratch. Common wrong answer to avoid: "ControlNet only matters for pure text-to-image, not image-to-image."
Apply now (5 min)
Quick exercise. Take any two-element base feature and any two-element control residual, then apply a control scale and add them.
Now imagine what happens if the control scale becomes 0, 0.5, or 2.0.
Sketch from memory the flow prompt + noisy latent + control map → ControlNet + U-Net → guided image.
Under the sketch, write one line on how the blueprint becomes more precise when a structural rail is added.
Bridge. Good. We can control structure now. But many-step denoising still costs time. The next file studies a stronger shortcut: distilling diffusion into a near one-step generator. → 12-consistency-models-distillation.md