Limitations and Opportunities

Even with LSTMs and GRUs, practical RNN training has serious problems:

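  • They can't be parallelized across time steps. The hidden state at time t depends on the hidden state at time t-1, so you compute the sequence one step at a time. (You can still parallelize across the examples in a batch.)
  • They're slow to train. A direct consequence of sequential computation: a long sequence can't take full advantage of parallel hardware like GPUs.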
  • Long-range dependencies are still hard, even for LSTMs and GRUs. Just less hard than for vanilla RNNs.
  • Hyperparameter tuning is painful. Learning rate, batch size, hidden size, dropout, gradient clipping threshold, sequence length: they all interact, and a small change to one can destabilize training.
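
To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in plain Python. It assumes a scalar hidden state and made-up weights (w_xh, w_hh and the function name rnn_forward are illustrative, not from any library). The point is only the shape of the loop: each step needs the result of the one before it.

```python
import math
import random

def rnn_forward(inputs, w_xh, w_hh, h0=0.0):
    """Run a scalar-state vanilla RNN over a sequence, one step at a time."""
    h = h0
    states = []
    for x_t in inputs:
        # h_t depends on h_{t-1}: this step cannot start until the previous one finishes.
        h = math.tanh(w_xh * x_t + w_hh * h)
        states.append(h)
    return states

random.seed(0)
sequence = [random.uniform(-1, 1) for _ in range(5)]  # toy input sequence
states = rnn_forward(sequence, w_xh=0.5, w_hh=0.9)    # illustrative weights
print(len(states))  # one hidden state per time step
```

No amount of hardware removes the loop over time steps here. LSTMs and GRUs change what happens inside each step, not the fact that the steps run in order.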

Real World: When RNNs Are Still the Right Call

RNNs and LSTMs are still production architectures for time series forecasting (energy demand, financial prices), wearable signal processing, and streaming/online inference, where you don't have the full sequence in memory and can't afford the memory cost of a transformer's self-attention, which grows quadratically with sequence length. For batch NLP work, transformers have eaten their lunch.
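
A rough sketch of why the streaming case favors a recurrent model, under simplifying assumptions: a scalar hidden state, illustrative weights, and full self-attention with no caching or windowing tricks. StreamingRNNCell and attention_score_entries are hypothetical names made up for this example.

```python
import math

class StreamingRNNCell:
    """Hypothetical scalar-state RNN cell that consumes one input at a time."""

    def __init__(self, w_xh, w_hh):
        self.w_xh = w_xh
        self.w_hh = w_hh
        self.h = 0.0  # the only memory carried between steps

    def step(self, x_t):
        # Update the state from the new input alone; past inputs are never revisited.
        self.h = math.tanh(self.w_xh * x_t + self.w_hh * self.h)
        return self.h

def attention_score_entries(seq_len):
    """Entries in a full self-attention score matrix: one per pair of positions."""
    return seq_len * seq_len

cell = StreamingRNNCell(w_xh=0.5, w_hh=0.9)
for x_t in [0.2, -0.1, 0.7]:  # stand-in for samples arriving from a stream
    out = cell.step(x_t)

print(attention_score_entries(1_000))   # 1000000
print(attention_score_entries(10_000))  # 100000000
```

The recurrent cell keeps the same fixed-size state no matter how long the stream runs, while the attention score matrix grows 100x when the sequence grows 10x.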

Checkpoint

A model needs to predict the next word in a streaming audio transcript in real time, processing tokens as they arrive without buffering the full sequence. Which architecture is best suited to this constraint?