Sequential Recommendations and Transformers
All the methods covered so far treat the user-item interaction history as a set where order doesn't matter. But user behavior often has strong sequential structure. A user streaming a TV show is almost certainly going to want the next episode. A customer who just bought running shoes might want performance socks. A user who just listened to three jazz albums is probably in a jazz mood right now.

Sequential recommendation frames the problem as: given a user's ordered interaction history (i_1, i_2, ..., i_t), predict the next item i_{t+1}.
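This framing turns a single ordered history into many supervised examples: each prefix of the sequence predicts the item that follows it. A minimal sketch (the item IDs are illustrative):

```python
# Expand one ordered interaction history into (prefix, next-item)
# training examples for a sequential recommender.

def make_training_pairs(history):
    """Each prefix (i_1, ..., i_k) predicts the next item i_{k+1}."""
    return [(history[:k], history[k]) for k in range(1, len(history))]

pairs = make_training_pairs(["shoes", "socks", "bottle", "watch"])
# -> [(["shoes"], "socks"),
#     (["shoes", "socks"], "bottle"),
#     (["shoes", "socks", "bottle"], "watch")]
```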
GNNs for Sequential Recommendations
Standard GNNs treat the interaction graph as static. But in practice, the order of interactions matters: a user who watched a sci-fi trilogy is probably on a sci-fi binge, not looking for rom-coms. Sequential GNNs augment the bipartite graph with temporal edges between consecutively-interacted items, allowing the model to capture both collaborative signals (who else liked this?) and sequential patterns (what do people watch next after this?).
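As a rough sketch of that augmentation (the users, items, and edge representation here are illustrative, not any particular library's API): from each user's time-ordered history we keep the usual user-item edges and additionally add directed item-to-item edges between consecutive interactions.

```python
from collections import defaultdict

# Toy data: each user's interactions, in time order.
interactions = {
    "u1": ["alien", "aliens", "alien3"],
    "u2": ["alien", "aliens"],
}

user_item_edges = set()             # collaborative signal
next_item_edges = defaultdict(int)  # sequential signal, with counts

for user, items in interactions.items():
    for item in items:
        user_item_edges.add((user, item))
    # Directed temporal edge between each consecutive pair of items.
    for prev, nxt in zip(items, items[1:]):
        next_item_edges[(prev, nxt)] += 1

# ("alien", "aliens") now carries a temporal edge with count 2;
# message passing can use both edge types.
```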

Transformers for Sequential Recommendations
Transformers have proven extremely effective for sequential recommendation. An interaction history is a sequence, items are tokens, and predicting the next item is analogous to next-token prediction in language modeling.

Key transformer components in this context:
- Item embeddings: Each item ID is embedded into a dense vector, just like word embeddings in NLP
- Positional encoding: Since transformers are permutation-invariant, positional encodings are added to preserve sequence order (critical for recommendations)
- Self-attention: Allows the model to weigh the importance of each past interaction when predicting the next item, capturing long-range dependencies (e.g., a user's genre preference from 20 sessions ago)
- Causal masking: Prevents the model from "looking ahead" during training, ensuring it only uses past interactions to predict future ones
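The interaction of self-attention and causal masking can be shown with a toy example. This is a pure-Python sketch of the masking-then-softmax step only (real models work on query/key dot products over embedding vectors; the raw score matrix here is made up): each query position t has its scores for future positions set to negative infinity before the softmax, so those positions receive zero attention weight.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention_weights(scores):
    """scores[t][s] is the raw attention of query t to key s.
    Future positions (s > t) are masked to -inf before softmax,
    so each position attends only to itself and the past."""
    n = len(scores)
    weights = []
    for t in range(n):
        masked = [scores[t][s] if s <= t else float("-inf") for s in range(n)]
        weights.append(softmax(masked))
    return weights

w = causal_attention_weights([[0.0, 9.0, 9.0],
                              [1.0, 2.0, 9.0],
                              [1.0, 1.0, 1.0]])
# w[0] attends only to itself; w[1][2] is exactly 0.0.
```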

Transformers4Rec
Transformers4Rec is an open-source library built on top of HuggingFace Transformers that adapts the transformer architecture specifically for recommendation tasks. It adds recommendation-specific components including input feature normalization, multi-hot encoding for categorical features, and specialized prediction heads. It supports both session-based and long-term sequential recommendation.
Loss Functions for Recommendation
The choice of loss function matters enormously for recommendation quality. Three are most common:
- Mean Squared Error (MSE): Used for explicit feedback (ratings). Penalizes large prediction errors. Straightforward but doesn't directly optimize ranking quality.
- Binary Cross-Entropy (BCE): Used for implicit feedback (clicked/not clicked). Models the probability of interaction as a binary classification problem. Requires negative sampling: designating some unobserved interactions as negatives.
- Bayesian Personalized Ranking (BPR): A pairwise ranking loss. Instead of predicting absolute scores, BPR trains the model to rank a positive item above a negative item. Training data consists of triplets (user u, positive item i, negative item j), and the loss penalizes the model when the positive item's score doesn't exceed the negative item's score by a sufficient margin. BPR directly optimizes the ranking objective, making it excellent for top-K recommendation.
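The standard BPR loss for one triplet is L = -ln(sigma(s_ui - s_uj)), where s_ui and s_uj are the model's scores for the positive and negative item. A minimal sketch (the scores are illustrative stand-ins for model outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(pos_score, neg_score):
    """Small when the positive item outranks the negative by a wide
    margin; large when the ranking is inverted."""
    return -math.log(sigmoid(pos_score - neg_score))

bpr_loss(3.0, 1.0)  # correctly ranked -> small loss
bpr_loss(1.0, 3.0)  # mis-ranked -> large loss
```

Note that only the score *difference* matters, which is exactly the pairwise-ranking property: shifting both scores by a constant leaves the loss unchanged.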