Limitations and Opportunities
Even with LSTMs and GRUs, practical RNN training has serious problems:
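- They can't be parallelized across time steps. The hidden state at time t depends on the hidden state at time t-1, so each sequence must be processed in order. Only the examples within a batch can run in parallel.
- They're slow. A direct consequence of sequential computation: wall-clock time grows with sequence length, and adding more parallel hardware doesn't remove the step-by-step dependency.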
- Long-range dependencies are still hard, even for LSTMs and GRUs. Just less hard than for vanilla RNNs.
- Hyperparameter tuning is painful. Learning rate, batch size, hidden size, dropout, gradient clipping threshold, sequence length: they all interact, and a small change to one can destabilize training.
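The sequential dependency in the first two points is easiest to see in code. The following is a minimal NumPy sketch of a vanilla RNN forward pass, not a production implementation; the function name `rnn_forward`, the weight names `W_xh`, `W_hh`, `b_h`, and the toy dimensions are all assumptions chosen for illustration.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over xs of shape (T, input_dim).

    Returns the hidden state at every step, shape (T, hidden_dim).
    """
    T = xs.shape[0]
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)
    hs = np.empty((T, hidden_dim))

    # The input projection has no dependence across time,
    # so it can be computed for all steps in one matrix multiply.
    x_proj = xs @ W_xh + b_h

    for t in range(T):
        # The recurrence needs h from step t-1, so this loop
        # cannot be vectorized over t.
        h = np.tanh(x_proj[t] + h @ W_hh)
        hs[t] = h
    return hs

# Toy dimensions and random weights (illustrative only).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 6, 4, 3
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

hs = rnn_forward(xs, W_xh, W_hh, b_h)
print(hs.shape)  # (6, 3)
```

Only the input projection is batched over time here. The `for` loop is the part that stays sequential no matter how much hardware is available.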
Real World: When RNNs Are Still the Right Call
RNNs and LSTMs are still production architectures for time series forecasting (energy demand, financial prices), wearable signal processing, and streaming/online inference where you don't have the full sequence in memory and can't afford the memory cost of a transformer's self-attention, which grows quadratically with sequence length. For batch NLP work, transformers have eaten their lunch.
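To make the streaming case concrete, here is a rough sketch of online inference with a recurrent cell, assuming a plain tanh RNN rather than an LSTM for brevity. The class name `StreamingRNN`, its `step` method, and the random weights are hypothetical. The point is that the model keeps only a fixed-size hidden state, so memory does not grow with the number of inputs seen.

```python
import numpy as np

class StreamingRNN:
    """Hypothetical tanh RNN cell that processes one input at a time."""

    def __init__(self, W_xh, W_hh, b_h):
        self.W_xh, self.W_hh, self.b_h = W_xh, W_hh, b_h
        # The only state carried between inputs: one vector of size hidden_dim.
        self.h = np.zeros(W_hh.shape[0])

    def step(self, x):
        # Consume a single input and update the carried hidden state.
        self.h = np.tanh(x @ self.W_xh + self.h @ self.W_hh + self.b_h)
        return self.h

# Toy dimensions and random weights (illustrative only).
rng = np.random.default_rng(1)
input_dim, hidden_dim = 4, 3
model = StreamingRNN(
    rng.normal(scale=0.5, size=(input_dim, hidden_dim)),
    rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
    np.zeros(hidden_dim),
)

# A generator stands in for a live feed: inputs arrive one by one
# and are never buffered as a full sequence.
stream = (rng.normal(size=input_dim) for _ in range(1000))
for x in stream:
    h = model.step(x)

print(h.shape)  # (3,)
```

After 1,000 inputs the carried state is still a single vector of length `hidden_dim`, which is why this pattern suits signals that arrive continuously.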
Checkpoint
A model needs to predict the next word in a streaming audio transcript in real time, processing tokens as they arrive without buffering the full sequence. Which architecture is best suited to this constraint?