Sequential Recommendations and Transformers
All the methods covered so far treat the user-item interaction history as a set where order doesn't matter. But user behavior often has strong sequential structure. A user streaming a TV show is almost certainly going to want the next episode. A customer who just bought running shoes might want performance socks. A user who just listened to three jazz albums is probably in a jazz mood right now.

Sequential recommendation frames the problem as: given a user's ordered interaction history (i_1, i_2, ..., i_t), predict the next item i_{t+1}.
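This framing turns a single ordered history into many supervised examples: each prefix of the sequence predicts the item that follows it. A minimal sketch (the item IDs are illustrative):

```python
# Expand one ordered interaction history into (prefix, next-item)
# training examples for a sequential recommender.

def make_training_pairs(history):
    """Each prefix (i_1, ..., i_k) predicts the next item i_{k+1}."""
    return [(history[:k], history[k]) for k in range(1, len(history))]

pairs = make_training_pairs(["shoes", "socks", "bottle", "watch"])
# -> [(["shoes"], "socks"),
#     (["shoes", "socks"], "bottle"),
#     (["shoes", "socks", "bottle"], "watch")]
```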
GNNs for Sequential Recommendations
Standard GNNs treat the interaction graph as static. But in practice, the order of interactions matters: a user who watched a sci-fi trilogy is probably on a sci-fi binge, not looking for rom-coms. Sequential GNNs augment the bipartite graph with temporal edges between consecutively-interacted items, allowing the model to capture both collaborative signals (who else liked this?) and sequential patterns (what do people watch next after this?).
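As a rough sketch of that augmentation (the users, items, and edge representation here are illustrative, not any particular library's API): from each user's time-ordered history we keep the usual user-item edges and additionally add directed item-to-item edges between consecutive interactions.

```python
from collections import defaultdict

# Toy data: each user's interactions, in time order.
interactions = {
    "u1": ["alien", "aliens", "alien3"],
    "u2": ["alien", "aliens"],
}

user_item_edges = set()             # collaborative signal
next_item_edges = defaultdict(int)  # sequential signal, with counts

for user, items in interactions.items():
    for item in items:
        user_item_edges.add((user, item))
    # Directed temporal edge between each consecutive pair of items.
    for prev, nxt in zip(items, items[1:]):
        next_item_edges[(prev, nxt)] += 1

# ("alien", "aliens") now carries a temporal edge with count 2;
# message passing can use both edge types.
```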

Transformers for Sequential Recommendations
Transformers have proven extremely effective for sequential recommendation. An interaction history is a sequence, items are tokens, and predicting the next item is analogous to next-token prediction in language modeling.

Key transformer components in this context:
- Item embeddings: Each item ID is embedded into a dense vector, just like word embeddings in NLP
- Positional encoding: Since transformers are permutation-invariant, positional encodings are added to preserve sequence order (critical for recommendations)
- Self-attention: Allows the model to weigh the importance of each past interaction when predicting the next item, capturing long-range dependencies (e.g., a user's genre preference from 20 sessions ago)
- Causal masking: Prevents the model from "looking ahead" during training, ensuring it only uses past interactions to predict future ones
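The interaction of self-attention and causal masking can be shown with a toy example. This is a pure-Python sketch of the masking-then-softmax step only (real models work on query/key dot products over embedding vectors; the raw score matrix here is made up): each query position t has its scores for future positions set to negative infinity before the softmax, so those positions receive zero attention weight.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention_weights(scores):
    """scores[t][s] is the raw attention of query t to key s.
    Future positions (s > t) are masked to -inf before softmax,
    so each position attends only to itself and the past."""
    n = len(scores)
    weights = []
    for t in range(n):
        masked = [scores[t][s] if s <= t else float("-inf") for s in range(n)]
        weights.append(softmax(masked))
    return weights

w = causal_attention_weights([[0.0, 9.0, 9.0],
                              [1.0, 2.0, 9.0],
                              [1.0, 1.0, 1.0]])
# w[0] attends only to itself; w[1][2] is exactly 0.0.
```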

Transformers4Rec
Transformers4Rec is an open-source library built on top of HuggingFace Transformers that adapts the transformer architecture specifically for recommendation tasks. It adds recommendation-specific components including input feature normalization, multi-hot encoding for categorical features, and specialized prediction heads. It supports both session-based and long-term sequential recommendation.
Loss Functions for Recommendation
The choice of loss function matters enormously for recommendation quality. Three are most common:
- Mean Squared Error (MSE): Used for explicit feedback (ratings). Penalizes large prediction errors. Straightforward but doesn't directly optimize ranking quality.
- Binary Cross-Entropy (BCE): Used for implicit feedback (clicked/not clicked). Models the probability of interaction as a binary classification problem. Requires negative sampling: designating some unobserved interactions as negatives.
- Bayesian Personalized Ranking (BPR): A pairwise ranking loss. Instead of predicting absolute scores, BPR trains the model to rank a positive item above a negative item. Training data consists of triplets (user u, positive item i, negative item j), and the loss penalizes the model when the positive item's score doesn't exceed the negative item's score by a sufficient margin. BPR directly optimizes the ranking objective, making it excellent for top-K recommendation.
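The standard BPR loss for one triplet is L = -ln(sigma(s_ui - s_uj)), where s_ui and s_uj are the model's scores for the positive and negative item. A minimal sketch (the scores are illustrative stand-ins for model outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_loss(pos_score, neg_score):
    """Small when the positive item outranks the negative by a wide
    margin; large when the ranking is inverted."""
    return -math.log(sigmoid(pos_score - neg_score))

bpr_loss(3.0, 1.0)  # correctly ranked -> small loss
bpr_loss(1.0, 3.0)  # mis-ranked -> large loss
```

Note that only the score *difference* matters, which is exactly the pairwise-ranking property: shifting both scores by a constant leaves the loss unchanged.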