Beyond Language — Multimodal Models

The deepest insight of the last few years of NLP is that the transformer isn't really a language architecture. It's a sequence-of-tokens architecture. If you can chop your data into tokens, you can feed it to a transformer.

Examples already in production:

  • Images → ViT (tokens are 16×16 pixel patches)
  • Audio / spectrograms → AST (tokens are time-frequency patches)
  • Time series → Timer, Informer (tokens are time windows or individual time steps)
  • Video → CogVideoX (tokens are spatiotemporal patches)
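
To make "chop your data into tokens" concrete for the image case, here is a minimal sketch of patch tokenization in NumPy. The function name `patchify` and the 224×224 RGB input are illustrative assumptions, not taken from any particular library. A real ViT would also pass each flattened patch through a learned linear projection and add position embeddings, which this sketch leaves out.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch * patch * C).
    Hypothetical helper for illustration only.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by the patch size")
    # (H, W, C) -> (H/p, p, W/p, p, C): split both spatial axes into blocks.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two block indices to the front: (H/p, W/p, p, p, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector: (N, p*p*C), in row-major patch order.
    return grid.reshape(-1, patch * patch * c)


# Assumed input: a blank 224x224 RGB image standing in for real pixel data.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, 16*16*3 values each
```

From the transformer's point of view, those 196 rows are just a sequence, no different in kind from 196 word embeddings. The other modalities in the list follow the same pattern with a different slicing rule.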

Real World: Multimodal Systems Are Everywhere

The product analytics tool that ingests screenshots and writes summaries. The retail platform that lets you upload a photo and find similar items. The medical imaging system that reads scans and writes draft radiology reports. The customer support bot that ingests screenshots from frustrated users. Systems like these are now typically transformer-based, and many of them build on CLIP-style models that embed images and text in a shared vector space.
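
As a rough illustration of the "upload a photo and find similar items" case, the sketch below ranks catalog items by cosine similarity between embedding vectors. Everything here is an assumption for illustration: the embeddings are random stand-ins rather than outputs of a real image encoder, the 512-dimensional size is arbitrary, and `top_k_similar` is a hypothetical helper. In a CLIP-style system the vectors would come from the trained encoder.

```python
import numpy as np


def top_k_similar(query: np.ndarray, catalog: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k catalog rows most similar to the query.

    Similarity is cosine similarity. Hypothetical helper for illustration only.
    """
    # Normalize to unit length so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    scores = c @ q
    # Sort by descending score and keep the first k indices.
    return np.argsort(-scores)[:k]


rng = np.random.default_rng(0)
# Stand-in for precomputed embeddings of 1000 catalog photos.
catalog = rng.normal(size=(1000, 512))
# Stand-in for the uploaded photo: a slightly perturbed copy of item 42.
query = catalog[42] + 0.05 * rng.normal(size=512)

print(top_k_similar(query, catalog))  # item 42 should rank first
```

A production system would usually replace the brute-force matrix product with an approximate nearest-neighbor index, but the ranking idea is the same.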