Autoencoders for Collaborative Filtering
An autoencoder is a neural network trained to compress data into a lower-dimensional latent representation and then reconstruct it. The architecture is symmetric: an encoder maps input → latent code z, and a decoder maps z → reconstructed output. The training objective is to minimize the reconstruction error between input and output.
Autoencoders can model non-linear structure in the data because each layer applies a non-linear activation function. This makes them more expressive dimensionality reduction tools than linear methods such as PCA or matrix factorization, which matters for complex, high-dimensional user interaction vectors.
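To make the encode/decode shapes concrete, here is a minimal untrained sketch in NumPy. The sizes (N items, hidden width H, latent dimension D) and the random weights are illustrative only; a real model would learn the weights by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

N, H, D = 6, 4, 2  # items, hidden width, latent dimension (toy sizes)

# Randomly initialized weights: a sketch to show the shapes, not a trained model.
W_enc1, W_enc2 = rng.normal(size=(N, H)), rng.normal(size=(H, D))
W_dec1, W_dec2 = rng.normal(size=(D, H)), rng.normal(size=(H, N))

relu = lambda a: np.maximum(a, 0.0)  # the non-linearity between layers

def encode(x):
    return relu(relu(x @ W_enc1) @ W_enc2)  # N -> H -> D

def decode(z):
    # No activation on the output layer, so reconstructions can take any value.
    return relu(z @ W_dec1) @ W_dec2        # D -> H -> N

x = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 4.0])  # sparse user vector
z = encode(x)          # dense latent code, shape (2,)
x_hat = decode(z)      # reconstruction over all items, shape (6,)
```

The compression then expansion in layer widths (6 → 4 → 2 → 4 → 6 here) is the bottleneck that forces the model to learn a compact representation.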
[Interactive figure: step through the autoencoder pipeline: encode a user's sparse interaction vector, compress it to latent code z, decode it to reconstructed predictions, and see which items the decoder fills in. Layer width represents vector dimensionality, compression then expansion.]
Autoencoders for Recommendations: The Pipeline
Here is how an autoencoder becomes a recommender:
- User as input vector. Each user is represented as a sparse vector of length N (number of items), where entry i contains their rating of item i (0 if no interaction).
- Encoding. The encoder is a stack of fully connected layers that compresses this N-dimensional sparse vector into a dense latent vector z.
- Decoding. The decoder expands z back to size N, producing predicted interaction values for all items, including those the user hasn't yet interacted with.
- Masked loss. Training uses a masked loss that only computes reconstruction error on observed (non-zero) interactions. We don't want the model to be penalized for failing to reconstruct the zeros (which are unknowns, not explicit negatives).
- Recommendation. After training, we pass a user's interaction vector through the full autoencoder. The output values for items the user hasn't interacted with are their predicted preference scores. The top-K items by predicted score become recommendations.
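The final recommendation step can be sketched in a few lines. The score values below stand in for the output of a hypothetically trained autoencoder; the point is the masking-and-ranking logic, which only ranks items the user has not yet interacted with.

```python
import numpy as np

# Toy values: x is the user's observed ratings, x_hat the model's reconstruction.
x = np.array([5.0, 0.0, 3.0, 0.0, 0.0, 4.0])
x_hat = np.array([4.8, 2.1, 3.2, 4.5, 1.0, 3.9])

def recommend(x, x_hat, k=2):
    # Exclude already-seen items from the ranking, then take the top-K.
    scores = np.where(x == 0, x_hat, -np.inf)
    return np.argsort(scores)[::-1][:k]

print(recommend(x, x_hat))  # [3 1]: the best-scoring unseen items
```

Note that the reconstruction also produces values for items the user has rated (entries 0, 2, 5 here); those are simply ignored at recommendation time.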
Loss Function: Masked Reconstruction Error
The autoencoder is trained to minimize the reconstruction error between the input vector x and its reconstruction x̂. The loss is summed only over Ω, the set of observed (non-zero) interactions. This masking is critical: including the unobserved zeros would mean the loss is dominated by entries we know nothing about, and the model would collapse to predicting zero everywhere.
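A masked MSE is a one-liner. The example below contrasts it with an unmasked MSE on the same vectors to show how the zeros would otherwise dilute the signal; the numbers are made up for illustration.

```python
import numpy as np

def masked_mse(x, x_hat):
    """Reconstruction error over observed (non-zero) entries of x only."""
    mask = x != 0                      # Ω: positions with a known rating
    return np.mean((x[mask] - x_hat[mask]) ** 2)

x     = np.array([5.0, 0.0, 3.0, 0.0, 4.0])
x_hat = np.array([4.0, 2.0, 3.0, 1.0, 4.0])

print(masked_mse(x, x_hat))            # 1/3: only entries 0, 2, 4 contribute
print(np.mean((x - x_hat) ** 2))       # 1.2: unmasked, zeros treated as targets
```

With the unmasked loss, the model is rewarded for driving predictions toward 0 at unknown entries, which is exactly the collapse the masking prevents.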
Denoising and Variational Autoencoders
Denoising autoencoders (DAEs) randomly mask some known interactions during training, forcing the model to infer them from context: this acts as regularization and more closely simulates the recommendation task, where we always have incomplete information. Variational autoencoders (VAEs) constrain the latent space to follow a probability distribution, enabling the model to sample diverse recommendations and better handle uncertainty in sparse data. Mult-VAE, a VAE-based recommender, has shown strong results on standard benchmarks.
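The denoising idea amounts to a corruption step on the input. A minimal sketch, assuming a dropout-style corruption with a made-up `drop_prob` of 0.3: some known ratings are zeroed in the input, while the loss is still computed against the uncorrupted vector.

```python
import numpy as np

rng = np.random.default_rng(42)

def corrupt(x, drop_prob=0.3):
    """DAE-style input corruption: randomly hide some *known* interactions.
    The training target remains the uncorrupted x, so the model must infer
    the dropped ratings from the ones that survive."""
    keep = rng.random(x.shape) >= drop_prob
    return np.where((x != 0) & ~keep, 0.0, x)

x = np.array([5.0, 0.0, 3.0, 4.0, 0.0, 2.0])
x_noisy = corrupt(x)
# Zeros stay zero; each known rating is either kept or hidden at random.
```

Because the hidden entries have known true values, they provide exactly the kind of supervised signal the final recommendation task needs: predicting a rating the model was not shown.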
Why does an autoencoder for collaborative filtering use a masked loss rather than computing MSE over all entries of the user vector including zeros?