Autoencoders for Collaborative Filtering
An autoencoder is a neural network trained to compress data into a lower-dimensional latent representation and then reconstruct it. The architecture is symmetric: an encoder maps input → latent code z, and a decoder maps z → reconstructed output. The training objective is to minimize the reconstruction error between input and output.
Autoencoders can model non-linear structure in the data because each layer applies a non-linear activation function. This makes them more expressive dimensionality reduction tools than linear methods such as PCA or matrix factorization, which matters for complex, high-dimensional user interaction vectors.
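To make the encode/decode shapes concrete, here is a minimal untrained sketch in NumPy. The sizes (N items, hidden width H, latent dimension D) and the random weights are illustrative only; a real model would learn the weights by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

N, H, D = 6, 4, 2  # items, hidden width, latent dimension (toy sizes)

# Randomly initialized weights: a sketch to show the shapes, not a trained model.
W_enc1, W_enc2 = rng.normal(size=(N, H)), rng.normal(size=(H, D))
W_dec1, W_dec2 = rng.normal(size=(D, H)), rng.normal(size=(H, N))

relu = lambda a: np.maximum(a, 0.0)  # the non-linearity between layers

def encode(x):
    return relu(relu(x @ W_enc1) @ W_enc2)  # N -> H -> D

def decode(z):
    # No activation on the output layer, so reconstructions can take any value.
    return relu(z @ W_dec1) @ W_dec2        # D -> H -> N

x = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 4.0])  # sparse user vector
z = encode(x)          # dense latent code, shape (2,)
x_hat = decode(z)      # reconstruction over all items, shape (6,)
```

The compression then expansion in layer widths (6 → 4 → 2 → 4 → 6 here) is the bottleneck that forces the model to learn a compact representation.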
[Interactive figure: step through the autoencoder pipeline: encode a user's sparse interaction vector, compress it to latent code z, decode it to reconstructed predictions, and see which items the decoder fills in. Layer width represents vector dimensionality, compression then expansion.]
Autoencoders for Recommendations: The Pipeline
Here is how an autoencoder becomes a recommender:
- User as input vector. Each user is represented as a sparse vector of length N (number of items), where entry i contains their rating of item i (0 if no interaction).
- Encoding. The encoder is a stack of fully connected layers that compresses this N-dimensional sparse vector into a dense latent vector z.
- Decoding. The decoder expands z back to size N, producing predicted interaction values for all items, including those the user hasn't yet interacted with.
- Masked loss. Training uses a masked loss that only computes reconstruction error on observed (non-zero) interactions. We don't want the model to be penalized for failing to reconstruct the zeros (which are unknowns, not explicit negatives).
- Recommendation. After training, we pass a user's interaction vector through the full autoencoder. The output values for items the user hasn't interacted with are their predicted preference scores. The top-K items by predicted score become recommendations.
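The final recommendation step can be sketched in a few lines. The score values below stand in for the output of a hypothetically trained autoencoder; the point is the masking-and-ranking logic, which only ranks items the user has not yet interacted with.

```python
import numpy as np

# Toy values: x is the user's observed ratings, x_hat the model's reconstruction.
x = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 4.0])
x_hat = np.array([4.8, 2.1, 3.2, 4.5, 1.0, 3.9])

def recommend(x, x_hat, k=2):
    # Exclude already-seen items from the ranking, then take the top-K.
    scores = np.where(x == 0, x_hat, -np.inf)
    return np.argsort(scores)[::-1][:k]

print(recommend(x, x_hat))  # [3 1]: the best-scoring unseen items
```

Note that the reconstruction also produces values for items the user has rated (entries 0, 2, 5 here); those are simply ignored at recommendation time.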
Loss Function: Masked Reconstruction Error
The autoencoder is trained to minimize the reconstruction error between the input vector x and its reconstruction x̂. The loss is summed only over Ω, the set of observed (non-zero) interactions. This masking is critical: including the unobserved zeros would mean the loss is dominated by entries we know nothing about, and the model would collapse to predicting zero everywhere.
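A masked MSE is a one-liner. The example below contrasts it with an unmasked MSE on the same vectors to show how the zeros would otherwise dilute the signal; the numbers are made up for illustration.

```python
import numpy as np

def masked_mse(x, x_hat):
    """Reconstruction error over observed (non-zero) entries of x only."""
    mask = x != 0                      # Ω: positions with a known rating
    return np.mean((x[mask] - x_hat[mask]) ** 2)

x     = np.array([5.0, 0.0, 3.0, 0.0, 4.0])
x_hat = np.array([4.0, 2.0, 3.0, 1.0, 4.0])

print(masked_mse(x, x_hat))            # 1/3: only entries 0, 2, 4 contribute
print(np.mean((x - x_hat) ** 2))       # 1.2: unmasked, zeros treated as targets
```

With the unmasked loss, the model is rewarded for driving predictions toward 0 at unknown entries, which is exactly the collapse the masking prevents.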
Denoising and Variational Autoencoders
Denoising autoencoders (DAEs) randomly mask some known interactions during training, forcing the model to infer them from context: this acts as regularization and more closely simulates the recommendation task, where we always have incomplete information. Variational autoencoders (VAEs) constrain the latent space to follow a probability distribution, enabling the model to sample diverse recommendations and better handle uncertainty in sparse data. Mult-VAE, a VAE-based recommender, has shown strong results on standard benchmarks.
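The denoising idea amounts to a corruption step on the input. A minimal sketch, assuming a dropout-style corruption with a made-up `drop_prob` of 0.3: some known ratings are zeroed in the input, while the loss is still computed against the uncorrupted vector.

```python
import numpy as np

rng = np.random.default_rng(42)

def corrupt(x, drop_prob=0.3):
    """DAE-style input corruption: randomly hide some *known* interactions.
    The training target remains the uncorrupted x, so the model must infer
    the dropped ratings from the ones that survive."""
    keep = rng.random(x.shape) >= drop_prob
    return np.where((x != 0) & ~keep, 0.0, x)

x = np.array([5.0, 0.0, 3.0, 4.0, 0.0, 2.0])
x_noisy = corrupt(x)
# Zeros stay zero; each known rating is either kept or hidden at random.
```

Because the hidden entries have known true values, they provide exactly the kind of supervised signal the final recommendation task needs: predicting a rating the model was not shown.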
Why does an autoencoder for collaborative filtering use a masked loss rather than computing MSE over all entries of the user vector including zeros?