09. Image editing and control: steering the painter without repainting everything

15 minutes. Learn how masks, maps, and references push generation where you want.

Built on the ELI5 in 00-eli5.md. The eye (the vision encoder that sees raw pixels and outputs numbers) matters because control signals like depth, pose, and masks must be read before they can steer generation.

1) Why free generation is not enough

A plain text-to-image model is powerful. But it is also stubborn. You ask for “a dancer on a stage.” The model may give a dancer. But maybe the arm pose is wrong. Maybe the camera angle is wrong. Maybe the background region you liked gets destroyed. So what do people need in real work? Control. Not just creativity.

Think like a designer. Sometimes you want freedom. Sometimes you want obedience. A mask says, “Touch only this region.” A depth map says, “Keep far and near structure like this.” A pose skeleton says, “Place joints like this.” A reference image says, “Borrow this identity or style.”

So the core question changes. Not “Can the model generate?” But “Can the model generate while respecting constraints?” That is where steering tools enter. They push the model while keeping the same underlying painter. The painter still works on the canvas. But now extra guides are placed around that canvas. Simple, no?

2) ControlNet: freeze the base, add a trainable guide

ControlNet is a very practical idea. Do not retrain the whole base model from scratch. Keep the base diffusion model frozen. Then add a trainable copy branch for control signals. That control branch reads extra inputs. Edge maps. Depth maps. Pose skeletons. Segmentation maps. It produces residual features. Those residuals get injected into the frozen base denoiser. So the base keeps its image prior, while the new branch teaches obedience to structure.

Look at the architecture.

            ┌───────────────────────────── Control input ─────────────────────────────┐
            │ edge map / depth map / pose skeleton / segmentation                     │
            └───────────────────────────────┬─────────────────────────────────────────┘
                                            │
                                            ▼
                                 ┌──────────────────────┐
                                 │ trainable copy path  │
                                 │ learns control cues  │
                                 └──────────┬───────────┘
                                            │ residual features
                                            ▼
    latent noise ──→ ┌──────────────────────────────────────────────────────┐
                     │ frozen base U-Net / denoiser                         │
    text embeddings ─┤ cross-attention with prompt                          │
                     │ + added residuals from control branch                │
                     └──────────────────────┬───────────────────────────────┘
                                            │
                                            ▼
                                 predicted noise update

Why freeze the base? Because the base already knows how to paint. You do not want to destroy that knowledge. Why add a trainable copy? Because edge and depth conditioning need new behavior. The base was not born knowing those extra rules.

So what does the control branch really do? It converts structured hints into feature nudges. Those nudges reshape denoising at each level. This is where the eye matters again. The system must read a depth map or pose map as meaningful structure, not as random pixels.

ControlNet is strong because it is modular. One base model. Many control types. That is good engineering.
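The residual injection described above can be sketched in a few lines. This is a toy illustration, not a real ControlNet: features are plain Python lists, one list per resolution level, and the names `inject_control`, `base_feats`, `control_residuals`, and `weight` are hypothetical. It only shows the idea that the frozen base features stay as they are and the control branch adds scaled nudges at each level.

```python
from typing import List


def inject_control(
    base_feats: List[List[float]],
    control_residuals: List[List[float]],
    weight: float = 1.0,
) -> List[List[float]]:
    """Add scaled control residuals to the frozen base features, level by level.

    Both arguments hold one feature list per resolution level (toy stand-ins
    for real feature maps). The base features are never modified in place.
    """
    if len(base_feats) != len(control_residuals):
        raise ValueError("base and control must have the same number of levels")
    guided = []
    for feats, residual in zip(base_feats, control_residuals):
        if len(feats) != len(residual):
            raise ValueError("feature and residual sizes must match per level")
        guided.append([f + weight * r for f, r in zip(feats, residual)])
    return guided


# Two toy levels: a coarse one (2 values) and a finer one (3 values).
base = [[0.5, -0.2], [0.1, 0.3, 0.0]]
residuals = [[0.1, 0.1], [-0.2, 0.0, 0.4]]

unguided = inject_control(base, residuals, weight=0.0)  # control switched off
guided = inject_control(base, residuals, weight=1.0)    # full control strength
```

With `weight=0.0` the output equals the base features, which is the modularity argument in miniature: remove the branch and the original painter is untouched.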

3) Inpainting and image-to-image: partial change, not full restart

Now two very common editing modes. Inpainting. And image-to-image.

Inpainting means only part of the image should change. You provide a mask. White region means regenerate here. Black region means keep this untouched. So if a user likes the room, but wants a new sofa, inpainting is perfect. The room stays. The sofa area changes. The system typically encodes the existing image into latent space. Then it adds noise mainly where editing is allowed. During denoising, it protects the unmasked regions. That protection is the whole product value. Otherwise every edit would wreck the full picture.

Image-to-image is a cousin of this idea. Instead of starting from pure random noise, you start from an existing image, encode it, and add some noise. The amount of added noise matters a lot. Low noise keeps the source close. High noise gives more freedom. So there is a slider hidden in the workflow. Preserve structure versus allow creativity.

Here is the intuition.

    ┌──────────────┐      encode      ┌──────────────┐
    │ source image │ ───────────────→ │ latent z     │
    └──────┬───────┘                  └──────┬───────┘
           │                                 │ add noise
           │ mask for edit                   ▼
           │                          ┌──────────────┐
           └────────────────────────→ │ noisy latent │
                                      └──────┬───────┘
                                             │ denoise with prompt
                                             ▼
                                      ┌──────────────┐
                                      │ edited image │
                                      └──────────────┘

See. Inpainting is local control. Image-to-image is global-but-guided control. Both reduce randomness. Both respect existing visual context. And yes, they still operate on the canvas, not directly by painting final pixels at every step.
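Both ideas above fit in a small sketch. It rests on two assumptions that the text does not spell out: that the "noise slider" is a `strength` value in [0, 1] which decides how many of the scheduled denoising steps actually run (a convention used by some open-source image-to-image pipelines), and that mask protection is done by blending latents after each step. The function names and the flat-list "latents" are hypothetical.

```python
from typing import List


def img2img_steps_to_skip(num_steps: int, strength: float) -> int:
    """Return how many scheduled denoising steps are skipped for a given strength.

    strength=1.0 means start from (almost) pure noise and run every step;
    strength=0.0 means add no noise and run no steps, so the source is kept.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_steps * strength), num_steps)
    return num_steps - steps_to_run


def blend_masked(
    denoised: List[float], noised_source: List[float], mask: List[float]
) -> List[float]:
    """Protect unmasked regions during inpainting.

    mask follows the convention in the text: 1.0 (white) = regenerate,
    0.0 (black) = keep the source. Where the mask is 0 the source latent
    is copied back, so the denoiser cannot change it.
    """
    return [m * d + (1.0 - m) * s for d, s, m in zip(denoised, noised_source, mask)]


# Low strength keeps the source close: only 5 of 20 steps run.
skip_low = img2img_steps_to_skip(20, 0.25)
# High strength gives more freedom: 15 of 20 steps run.
skip_high = img2img_steps_to_skip(20, 0.75)

# First position is inside the mask (edited), second is protected.
blended = blend_masked(denoised=[0.9, 0.9], noised_source=[0.1, 0.1], mask=[1.0, 0.0])
```

Applying `blend_masked` after every denoising step is what keeps the room intact while the sofa area changes.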

4) IP-Adapter: steer with another image

Sometimes text is not enough. A user says, “Make it in this style.” Or, “Keep this person’s face.” Now a reference image becomes useful. That is where IP-Adapter enters.

IP-Adapter injects image embeddings into generation. So the model gets text conditioning, and image-reference conditioning together. One prompt may describe content. The reference image may provide style, identity, composition mood, or product design language. This is powerful in brand work. It is powerful in character consistency too.

Why not fine-tune every time? Because that is slow and expensive. A reference adapter is lighter. It can steer generation without retraining the full model. Again, the eye has a job here. The system must read the reference image into embeddings first. Those embeddings then influence denoising.

A rough flow looks like this.

    reference image ──→ image encoder ──→ image embeddings ─┐
                                                            │
    text prompt ──────→ text encoder ───→ text embeddings ──┼──→ denoiser
                                                            │
    latent noise / source latent ───────────────────────────┘

So what is the gain? Text gives semantics. Image embeddings give visual anchors. Together they reduce ambiguity.
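One way to picture how image embeddings "influence denoising" is decoupled cross-attention, the mechanism described in the IP-Adapter paper: the denoiser attends to text tokens and to image tokens separately, then adds the two results, with a scale on the image part. The sketch below is a toy version under loud assumptions: a single query vector, no learned projection matrices (keys and values are the tokens themselves), and hypothetical names throughout.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decoupled_cross_attention(
    query: Vector,
    text_tokens: List[Vector],
    image_tokens: List[Vector],
    image_scale: float = 1.0,
) -> Vector:
    """Attend to text and image tokens separately, then sum the two outputs.

    Assumption: tokens act as both keys and values; a real adapter would
    use separate learned projections for the image branch.
    """
    text_out = attend(query, text_tokens, text_tokens)
    image_out = attend(query, image_tokens, image_tokens)
    return [t + image_scale * i for t, i in zip(text_out, image_out)]


query = [1.0, 0.0]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.0, 2.0]]  # one toy reference-image token

text_only = decoupled_cross_attention(query, text_tokens, image_tokens, image_scale=0.0)
with_reference = decoupled_cross_attention(query, text_tokens, image_tokens, image_scale=0.5)
```

Setting `image_scale` to 0 recovers plain text conditioning, which is why a reference adapter can be bolted on without retraining the base.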

5) Worked example: how a depth map constrains denoising

Now let us make this concrete.

Prompt: “a cyclist riding toward the camera on a road.” Control input: a depth map. Suppose the depth map is simplified into three bands.

    Far background       = 0.2
    Mid cyclist body     = 0.5
    Near road foreground = 0.9

These numbers are not the image itself. They are spatial structure hints. They say what should feel far, mid, and near.

Now imagine one denoising block predicts base noise for three regions.

    Base prediction from frozen model:
    background =  0.60
    cyclist    =  0.20
    foreground = -0.10

The ControlNet branch reads the depth map, and outputs residual corrections.

    Residual from depth control:
    background = -0.30
    cyclist    =  0.10
    foreground =  0.40

Suppose control weight = 0.8. Now multiply each residual by 0.8.

    background: 0.8 × (-0.30) = -0.24
    cyclist:    0.8 × 0.10    =  0.08
    foreground: 0.8 × 0.40    =  0.32

Now add these to the base predictions.

    background:  0.60 + (-0.24) = 0.36
    cyclist:     0.20 + 0.08    = 0.28
    foreground: -0.10 + 0.32    = 0.22

So the final guided predictions become: [0.36, 0.28, 0.22]

What changed conceptually? The background prediction was pulled down by 0.24. The cyclist got a small correction of +0.08. The foreground was pushed up by 0.32, the largest change of the three. That means the model is being told, “Keep the road region behaving like something close.” “Keep the background behaving like something far.”

Now watch the same idea across denoising time. The steps count down from the noisiest to the cleanest.

    Step 20 of 20: Only coarse mass matters. The depth map locks rough front-versus-back layout.
    Step 12 of 20: The cyclist silhouette aligns better with road perspective. Wheels stop floating into the sky.
    Step 4 of 20: Texture appears, but the near-far ordering stays stable. The pose may vary, yet perspective remains believable.

So the depth map is not drawing the final image. It is constraining the search path on the canvas. That is the key insight. The control signal acts at every useful stage. Not once. Again and again. Yes?

Look. Control is not replacing the painter. It is giving the painter rails. That is why edits stay believable.
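The arithmetic of this worked example can be checked with a short script. The numbers come straight from the example; the function name `guided_prediction` is hypothetical, and rounding to two decimals is added only to hide floating-point noise.

```python
from typing import List

regions = ["background", "cyclist", "foreground"]
base_noise = [0.60, 0.20, -0.10]      # prediction from the frozen base model
depth_residual = [-0.30, 0.10, 0.40]  # correction from the depth control branch


def guided_prediction(
    base: List[float], residual: List[float], control_weight: float
) -> List[float]:
    """Return base + control_weight * residual per region, rounded to 2 decimals."""
    return [round(b + control_weight * r, 2) for b, r in zip(base, residual)]


guided = guided_prediction(base_noise, depth_residual, control_weight=0.8)
no_control = guided_prediction(base_noise, depth_residual, control_weight=0.0)

for name, before, after in zip(regions, base_noise, guided):
    print(f"{name:>10}: {before:+.2f} -> {after:+.2f}")
```

Running it with a control weight of 0.8 gives [0.36, 0.28, 0.22], matching the hand calculation. With a weight of 0.0 the base predictions come back unchanged, so the weight behaves like a dial between the free painter and the guided one.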

Where this lives in the wild

Adobe Photoshop Generative Fill uses mask-based inpainting so selected regions change while the rest of the scene stays intact.
AUTOMATIC1111 with the ControlNet extension lets creators drive Stable Diffusion with canny edges, depth maps, and OpenPose skeletons.
ComfyUI workflows use IP-Adapter nodes to keep face identity or visual style from reference images during generation.
Leonardo AI offers pose, depth, and image-to-image controls for creators who need stronger structural steering than plain prompting.
OpenAI image editing tools and APIs support masked edits, where the user preserves most of an image and regenerates only the chosen area.

Pause and recall

Why is ControlNet usually built as a frozen base plus a trainable side branch?
What is the core difference between inpainting and image-to-image?
Why does a depth map help perspective consistency?
Why is IP-Adapter useful when text alone is too vague?

Interview Q&A

Q1. Why use ControlNet and not fine-tune the whole base model for every new control signal?

Because the base model already contains broad image knowledge. Freezing it preserves that prior, while the side branch learns only the new control behavior. This is cheaper, safer, and more modular. Common wrong answer to avoid: “Freezing is only about saving GPU memory.”

Q2. Why use inpainting and not simply regenerate the whole image with a better prompt?

Because users often want local edits with global consistency preserved. Regenerating the whole image risks changing lighting, identity, composition, and background details they already approved. Common wrong answer to avoid: “A stronger prompt can replace masking in all cases.”

Q3. Why use image-to-image instead of pure text-to-image when a source draft already exists?

Because the source image gives a structural prior. That prior reduces search space and keeps layout closer to user intent. Text alone would force the model to rediscover the whole composition. Common wrong answer to avoid: “Image-to-image is just text-to-image with extra decoration.”

Q4. Why use IP-Adapter and not only text prompts for style or identity matching?

Because many visual properties are hard to describe precisely in words. Reference-image embeddings carry richer style and identity information than prompt language alone. That gives much tighter control. Common wrong answer to avoid: “If the prompt is detailed enough, image references add no value.”

Apply now (5 min)

Quick exercise. Take one source image in your mind. Now choose three edits. Replace the sky. Keep the person pose. Match a reference poster style. Write which tool you would pick for each, and why. Now sketch from memory. Draw the ControlNet diagram. Show a frozen base denoiser. Show a trainable control branch. Show residuals flowing into the base. Then add one mask box beside it. If you can redraw that, you now understand steering, not just generation.

Bridge. Still images are conquered. But video adds a brutal new dimension — time. Every frame must agree with its neighbors. → 10-video-tokenization-temporal.md