Variational Autoencoders

The VAE Idea

A Variational Autoencoder solves the latent-space regularity problem by changing what the encoder outputs. Instead of mapping an input x to a single point z in latent space, the encoder maps x to a distribution over latent space, described by a mean μ and a standard deviation σ (equivalently, a variance σ²). To get z, we sample from that distribution. Then we decode z as usual.

Because the encoder outputs a distribution and z is sampled from it, the decoder sees a slightly different z each time it reconstructs the same x. To keep reconstruction error low, the model has to learn latent representations where nearby points decode to similar outputs. The latent space becomes smoother and better organized as a result.
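
As a rough illustration of the sampling step, the sketch below draws z from a per-dimension Gaussian defined by the encoder's outputs. The function name `sample_latent`, the choice to have the encoder emit a log-variance instead of σ directly, and the example numbers are assumptions made for illustration, not the API of any particular library.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) per dimension, with sigma = exp(0.5 * log_var)."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)
        eps = rng.gauss(0.0, 1.0)   # standard normal noise
        z.append(m + sigma * eps)   # shift and scale the noise
    return z

rng = random.Random(0)
mu = [0.5, -1.0]         # hypothetical encoder output for one input x
log_var = [-2.0, -2.0]   # hypothetical log-variance, so sigma is about 0.37
samples = [sample_latent(mu, log_var, rng) for _ in range(3)]
for z in samples:
    print(z)
```

Repeated calls with the same μ and log-variance return different z values scattered around μ, which is why the decoder has to give similar outputs across that whole neighborhood.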

Figure: Architecture comparison of a simple autoencoder and a VAE, showing the forward pass through each stage (input/output, encoder/decoder, latent space). In the simple autoencoder, the encoder maps x to a single fixed point z, with no randomness.

KL Divergence: The Regularizer

To make this really work, we add one more term to the loss: a regularizer that pushes the learned latent distribution toward a simple prior — usually a standard normal distribution. We measure how far the learned distribution is from the prior using Kullback-Leibler (KL) divergence, also called relative entropy.

For discrete distributions P and Q, the KL divergence is:

$$\text{KL}(P \| Q) = \sum_x P(x) \cdot \log\frac{P(x)}{Q(x)}$$

It's not symmetric — KL(P ‖ Q) ≠ KL(Q ‖ P) — and it equals zero exactly when the distributions are identical. By penalizing the learned latent distribution for deviating from a standard normal, we shape the entire latent space to be simple, continuous, and complete. We can now confidently sample a random vector from a standard normal at inference time, hand it to the decoder, and get a meaningful output.
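
To make the asymmetry concrete, here is a small sketch that evaluates the discrete formula directly. The function name `kl_divergence` and the two example distributions are made up for illustration; the sketch also assumes Q(x) > 0 wherever P(x) > 0, since the divergence is infinite otherwise.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as lists of probabilities.

    Terms with P(x) = 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]   # illustrative distribution P
q = [0.8, 0.1, 0.1]   # illustrative distribution Q

kl_pq = kl_divergence(p, q)   # roughly 0.320 nats
kl_qp = kl_divergence(q, p)   # roughly 0.237 nats
kl_pp = kl_divergence(p, p)   # 0.0, since the distributions are identical

print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments changes the result, and comparing a distribution with itself gives zero.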

The Two-Term Loss

The full VAE training loss, which we minimize, is the negative of the evidence lower bound (ELBO):

$$\mathcal{L} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x | z)]}_{\text{reconstruction}} + \underbrace{\text{KL}(q(z|x) \| p(z))}_{\text{regularization}}$$

The reconstruction term keeps outputs faithful to inputs. The regularization term keeps the latent space well-behaved. We can tune the balance between them with a coefficient on the KL term, often written β; setting β = 1 gives the standard VAE loss.
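
For the common case where the encoder outputs a diagonal Gaussian and the prior is a standard normal, the KL term has a closed form, so it does not need to be estimated by sampling. The block below is a sketch under those two assumptions; d (the latent dimension) and β (the balancing coefficient) are symbols introduced here for illustration. The second line is written as a quantity to minimize.

```latex
\text{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)

\mathcal{L}_\beta
  = -\mathbb{E}_{q(z|x)}[\log p(x | z)] + \beta \cdot \text{KL}(q(z|x) \| p(z))
```

As a quick check, μ = 0 and σ = 1 make every term in the sum vanish, matching the fact that the KL divergence is zero when the two distributions are identical.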

The full training procedure:

  1. The encoder takes input x and outputs the parameters of a latent distribution (μ and σ).
  2. We sample a point z from that distribution using the reparameterization trick.
  3. The decoder reconstructs x' from z.
  4. We compute reconstruction error between x and x'.
  5. We compute KL divergence between the learned distribution and the prior.
  6. We backpropagate the sum of those two losses through the entire network.
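
The steps above can be sketched as a single forward pass. This is a minimal illustration, not a definitive implementation: the helper names (`linear`, `vae_forward`), the parameter keys, the tiny linear encoder and decoder, and the hand-picked weights are all hypothetical. It also assumes a diagonal Gaussian encoder with a standard normal prior (so the KL term has a closed form) and uses squared error as the reconstruction term. Step 6 is only noted in a comment, because backpropagation would normally be delegated to an automatic differentiation library.

```python
import math
import random

def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def vae_forward(x, params, rng, beta=1.0):
    # Step 1: the encoder outputs the parameters of q(z|x).
    mu = linear(params["W_mu"], params["b_mu"], x)
    log_var = linear(params["W_lv"], params["b_lv"], x)
    # Step 2: reparameterization trick, z = mu + sigma * eps with eps ~ N(0, 1).
    # The randomness lives in eps, so z stays a differentiable function of mu and sigma.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mu, log_var)]
    # Step 3: the decoder reconstructs x' from z.
    x_rec = linear(params["W_dec"], params["b_dec"], z)
    # Step 4: reconstruction error (squared error is an assumed choice).
    recon = sum((a - b) ** 2 for a, b in zip(x, x_rec))
    # Step 5: closed-form KL between N(mu, sigma^2) and N(0, I).
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                   for m, lv in zip(mu, log_var))
    # Step 6 would backpropagate this total through encoder and decoder.
    return recon + beta * kl, recon, kl

# Hypothetical parameters: 3-dimensional input, 2-dimensional latent space.
params = {
    "W_mu": [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]], "b_mu": [0.0, 0.0],
    "W_lv": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], "b_lv": [-1.0, -1.0],
    "W_dec": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], "b_dec": [0.0, 0.0, 0.0],
}
x = [1.0, 2.0, 3.0]
total, recon, kl = vae_forward(x, params, random.Random(0))
print(total, recon, kl)
```

The `beta` argument is the balancing coefficient on the KL term; with `beta=0.0` the total reduces to the reconstruction error alone.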

Figure: VAE latent space trained on handwritten digits. A 2D latent space (axes z₁ and z₂) with clusters colored by digit class (0 to 9). The decoder produces a digit for any point in the continuous latent space, and the output changes smoothly as the point moves.

Real-World VAE Applications

VAEs aren't always the right tool for the prettiest pictures — GANs and diffusion models usually beat them on raw image quality. But VAEs shine where you care about the structure of the latent space itself:

  • Anomaly detection. Because VAEs learn an approximate probability distribution over the training data, you can ask: "How likely is this new input under the learned distribution?" If the answer is "extremely unlikely," it might be an anomaly. In practice the score is usually the reconstruction error or the ELBO rather than an exact likelihood. Used in fraud detection, manufacturing defect detection, and medical imaging.
  • Compression. A VAE compresses inputs to a small latent representation and can reconstruct them — a learned compression algorithm.
  • Controlled generation. The well-structured latent space lets you find directions that correspond to interpretable attributes — "smiling-ness," "hair length," "age" — and move along those directions to edit outputs. This underlies many "AI photo editor" features.
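
As one possible reading of the anomaly-detection use case, the sketch below turns per-input scores into a yes/no decision by thresholding at a high quantile of the scores seen on normal training data. The function names, the quantile, and the score values are all made up for illustration; in a real system the scores would come from a trained VAE, for example as reconstruction errors.

```python
def fit_threshold(train_scores, quantile=0.9):
    """Pick a cutoff at roughly the given quantile of scores on normal data."""
    ranked = sorted(train_scores)
    index = min(len(ranked) - 1, int(quantile * len(ranked)))
    return ranked[index]

def is_anomaly(score, threshold):
    """Flag an input whose score exceeds the cutoff."""
    return score > threshold

# Made-up scores (e.g. reconstruction errors) on known-normal inputs.
train_scores = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 0.85]
threshold = fit_threshold(train_scores)

print(is_anomaly(4.2, threshold))   # a score far above the normal range
print(is_anomaly(1.0, threshold))   # a score inside the normal range
```

The quantile trades false alarms against missed anomalies, so it is a tuning choice and not something the VAE itself decides.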

Checkpoint

In a VAE, what happens if you set the KL divergence coefficient to zero (effectively removing the regularization term from the loss)?