Handling Variable-Length Sequences

Natural text comes in every possible length. A batch, though, is a single tensor, so every sequence in it has to be the same length. Two fixes:

Padding. Append a padding token ([PAD], often ID 0) to shorter sequences until every sequence in the batch is the same length. Then provide a padding mask so the model knows which positions to ignore. The mask is applied wherever padding could leak in: the loss calculation, RNN hidden-state updates, and attention scores. Without a mask, your model will attend to padding tokens as if they were meaningful, and the loss signal will be polluted.
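As a rough illustration of padding plus masking, here is a minimal pure-Python sketch. The names `PAD_ID`, `pad_batch`, and `masked_mean`, the token IDs, and the loss values are all made up for this example; a real pipeline would do the same thing with tensors.

```python
PAD_ID = 0  # assumed padding ID; real vocabularies reserve their own


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token IDs to the longest length in the batch.

    Returns (padded, mask), where mask is 1 for real tokens and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # Build the mask from the lengths, not by comparing against pad_id,
    # so a real token that happens to share the ID is never masked out.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_mean(per_token_loss, mask):
    """Average per-token losses over real positions only."""
    total = sum(
        loss * keep
        for loss_row, mask_row in zip(per_token_loss, mask)
        for loss, keep in zip(loss_row, mask_row)
    )
    count = sum(sum(mask_row) for mask_row in mask)
    return total / count


# Hypothetical token IDs for "The cat sat" and "Hello world".
batch = [[11, 12, 13], [21, 22]]
padded, mask = pad_batch(batch)
print(padded)  # [[11, 12, 13], [21, 22, 0]]
print(mask)    # [[1, 1, 1], [1, 1, 0]]

# Made-up per-token losses; the value at the padded slot is deliberately large.
losses = [[1.0, 2.0, 3.0], [4.0, 5.0, 100.0]]
print(masked_mean(losses, mask))  # 3.0
```

An unmasked average over all six slots would be about 19.2 here, dominated by the meaningless loss at the padded position.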

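Packed Sequences (PyTorch). An optimization for RNNs that processes only the real tokens, not the padding. The sequences are ordered from longest to shortest, and at each time step the RNN runs only on the sequences that still have tokens left, so the effective batch size shrinks as shorter sequences finish. Same correctness, less wasted compute.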
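To make the shrinking batch size concrete, the sketch below computes, in plain Python, how many sequences are still active at each time step. It assumes a batch of four sequences with lengths 3, 5, 2, and 4; `packed_batch_sizes` is a hypothetical helper that mimics the bookkeeping, not a PyTorch function.

```python
def packed_batch_sizes(lengths):
    """Count the sequences that still have a token at each time step."""
    return [sum(1 for n in lengths if n > t) for t in range(max(lengths))]


lengths = [3, 5, 2, 4]  # assumed sequence lengths for this example

batch_sizes = packed_batch_sizes(lengths)
print(batch_sizes)  # [4, 4, 3, 2, 1]

real_tokens = sum(batch_sizes)              # steps an RNN runs when packed
padded_slots = len(lengths) * max(lengths)  # steps it runs when padded
waste = 1 - real_tokens / padded_slots

print(real_tokens, padded_slots)  # 14 20
print(f"{waste:.0%}")             # 30%
```

Packing processes 14 token positions where a padded batch would process 20.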

Padding vs. Packing: 14 real tokens · 20 slots if padded · 30% waste

Seq A: The cat sat [PAD] [PAD] (3/5 real)

Seq B: I love natural language processing (5/5 real)

Seq C: Hello world [PAD] [PAD] [PAD] (2/5 real)

Seq D: Deep learning is fun [PAD] (4/5 real)

Real tokens are always processed. The six [PAD] tokens are present in the batch too, and without a mask the model sees them as real input. A padding mask excludes them from computation.

Padding

Simple. Works with any architecture. Requires a padding mask to avoid corrupting loss and attention.

Packed Sequences

No wasted compute. For recurrent layers only (RNN, LSTM, GRU), not for Transformers. PyTorch provides pack_padded_sequence to pack a padded batch and pad_packed_sequence to unpack the result.
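A minimal sketch of the round trip through these two functions, assuming PyTorch is installed. The token IDs, vocabulary size, layer sizes, and the choice of a GRU are arbitrary placeholders, and 0 is assumed to be the padding ID.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import (
    pack_padded_sequence,
    pad_packed_sequence,
    pad_sequence,
)

torch.manual_seed(0)

# Hypothetical token IDs for four sequences of lengths 3, 5, 2, and 4.
sequences = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5, 6, 7, 8]),
    torch.tensor([9, 10]),
    torch.tensor([11, 12, 13, 14]),
]
lengths = torch.tensor([len(s) for s in sequences])  # must live on the CPU

# Pad to a rectangular (batch, max_len) tensor first.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

embedding = nn.Embedding(num_embeddings=20, embedding_dim=8, padding_idx=0)
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# enforce_sorted=False lets PyTorch sort by length internally and
# restore the original batch order afterwards.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
packed_out, h_n = rnn(packed)

# Back to a padded tensor; padded positions are filled with zeros.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(packed.batch_sizes.tolist())  # [4, 4, 3, 2, 1]
print(tuple(output.shape))          # (4, 5, 16)
print(out_lengths.tolist())         # [3, 5, 2, 4]
```

With packed input, `h_n` holds each sequence's state after its last real token rather than after trailing padding, which is usually what you want when the final hidden state feeds a classifier.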
