
How Stable Diffusion Builds an Image

2 min read · ai, stable-diffusion, visualization

Every image Stable Diffusion generates starts the same way — as random noise. Pure static. Like a TV with no signal.

The model doesn't "draw" an image the way you or I would. It does the opposite. It starts with chaos and slowly removes noise until an image appears. Like a sculptor chipping away at marble, except the sculptor is a neural network and the marble is Gaussian noise.

This process is called denoising, and it happens over many small steps. Here's what it actually looks like:

Click through the steps. Watch what changes between each one.

Some things worth paying attention to: When does the image stop looking like noise and start looking like something? Is it gradual, or is there a moment where it clicks? And how different does the final image really look from a few steps before it?

Why does this work?

The core idea is surprisingly simple.

During training, the model learns to do one thing: given a noisy image, predict what noise was added. That's it. It doesn't learn to "generate images." It learns to "remove noise."
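In code, that training objective is nothing more than a regression loss. Here's a minimal sketch using plain lists of floats as a stand-in for images — the real model is a U-Net trained across a schedule of noise levels, but the loss itself really is this simple:

```python
import random

def add_noise(clean, rng):
    """Corrupt a 'clean image' (a list of floats) with Gaussian noise.
    Returns both the noisy version and the noise itself, because the
    noise is the training target."""
    noise = [rng.gauss(0, 1) for _ in clean]
    noisy = [c + n for c, n in zip(clean, noise)]
    return noisy, noise

def training_loss(predicted_noise, true_noise):
    # The entire objective: mean squared error between the noise the
    # model predicts and the noise that was actually added.
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

rng = random.Random(0)
noisy, noise = add_noise([0.5, -0.2, 0.8], rng)
# A perfect model would predict `noise` exactly and score a loss of 0.
```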

At generation time, we start from pure noise and ask the model: "what noise do you see here?" It tells us, we subtract that noise, and the image gets slightly cleaner. Repeat 50 times and you have an image.
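The generation loop can be sketched in a few lines. This is a toy: the "model" below is a hypothetical stand-in that pretends the noise is a fixed fraction of the current values, where the real thing is a U-Net plus a scheduler that decides how much predicted noise to remove at each step. The shape of the loop is the point:

```python
import random

def predict_noise(noisy_image, step):
    """Hypothetical stand-in for the trained U-Net. It 'sees' 10% of
    the current values as noise; a real model predicts the actual
    Gaussian noise present at this step."""
    return [v * 0.1 for v in noisy_image]

def denoise(steps=50, size=4, seed=0):
    rng = random.Random(seed)
    # Step 0: start from pure noise, just like Stable Diffusion.
    image = [rng.gauss(0, 1) for _ in range(size)]
    for step in range(steps):
        noise = predict_noise(image, step)              # "what noise do you see here?"
        image = [v - n for v, n in zip(image, noise)]   # subtract it; slightly cleaner
    return image
```

After 50 iterations the toy values have shrunk toward zero; in the real model, what's left behind isn't zero but the image the noise was hiding.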

The prompt guides this process. At each step, the model's noise prediction is influenced by your text — nudging the denoising toward an image that matches your description. Without a prompt, you'd still get an image. Just not one you asked for.
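The "nudging" has a name: classifier-free guidance. At each step the model makes two noise predictions — one conditioned on your prompt, one unconditional — and the final prediction is pushed away from the unconditional one, toward the prompt. A minimal sketch of that combination step (7.5 is a common default scale in Stable Diffusion; the list-of-floats inputs are toy stand-ins for the model's outputs):

```python
def guided_noise(noise_uncond, noise_cond, guidance_scale=7.5):
    """Classifier-free guidance: start from the unconditional prediction
    and amplify the direction that the prompt-conditioned prediction
    points in. scale=1.0 would just use the conditioned prediction;
    larger scales follow the prompt more aggressively."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]
```

Turning the scale up makes images match the prompt more literally, at the cost of variety — which is why it's exposed as a user-facing knob in most Stable Diffusion UIs.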

The latent space trick

One detail worth knowing: the model doesn't actually work with pixels directly.

A 512×512 RGB image has ~786,000 values. That's expensive to denoise. So Stable Diffusion first compresses the image into a much smaller "latent" representation — 64×64 with 4 channels, about 16,000 values — using an encoder. All 50 denoising steps happen in this compressed space. Only at the very end does a decoder expand it back to full-resolution pixels.
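The arithmetic makes the savings concrete (using SD v1.5's shapes: 8× spatial downsampling, 4 latent channels):

```python
pixel_values = 512 * 512 * 3    # RGB image: 786,432 numbers to denoise
latent_values = 64 * 64 * 4     # SD v1.5 latent: 4 channels at 64x64 = 16,384
reduction = pixel_values / latent_values
print(pixel_values, latent_values, reduction)  # 786432 16384 48.0
```

Every denoising step touches ~48× fewer values, which is the difference between running on a consumer GPU and not.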

This is why it's called Latent Diffusion. Same idea, cheaper to run.


The visualization above uses real outputs from Stable Diffusion v1.5. Each frame is the decoded latent at that denoising step — what the model "sees" at that point in the process.