What Is a Large Language Model?

Checkpoint

What is a Large Language Model? Write your own definition before reading on.

The honest answer is that the definition is moving. GPT-1 (2018) is usually considered the first "large" language model. It had 117 million parameters. By today's standards, that's barely a small language model.

When OpenAI announced GPT-2 in 2019, it was worried enough about misuse that it initially withheld the full model, releasing a smaller version and a technical paper instead. Reading that announcement today is surreal. By modern standards, GPT-2 is not capable of fooling anyone, yet at the time the discourse was about responsible disclosure of dangerous AI. That tells you something about how fast the field has moved, and about how perceptions of "dangerous capability" shift relative to the frontier.

  1. 2018

    GPT-1 — 117M parameters

    The first 'large' language model by the field's standards at the time. Demonstrated that unsupervised pretraining followed by fine-tuning could transfer across NLP tasks.

  2. 2019

    GPT-2 — 1.5B parameters

    Caused a responsible-disclosure controversy despite producing outputs that look obviously AI-generated today. Demonstrates how fast the definition of 'dangerous capability' moves relative to the frontier.

  3. 2020

    GPT-3 — 175B parameters

    Few-shot prompting. The model could perform tasks from a handful of examples in the prompt with no fine-tuning. A conceptual shift: the same model, different prompts.

  4. 2022–2023

    ChatGPT, Claude, Gemini

    Instruction-tuned models with reinforcement learning from human feedback (RLHF). Suddenly, these models could be used in conversation by non-technical users. The public inflection point.

  5. 2024–present

    Multimodal, Mixture of Experts, Long Context

    Models handle images, audio, and video. Context windows have grown from about 2K tokens in the GPT-3 era to 1M+ tokens. Mixture of Experts increases total model capacity while keeping inference cost low, because only a few experts run for each token.
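
The Mixture of Experts idea in the last timeline entry can be illustrated with a toy routing sketch. This is a minimal illustration under stated assumptions, not how any particular production model implements it: the experts are hypothetical scaling functions standing in for feed-forward networks, the router weights are hand-picked rather than learned, and the names `route`, `moe_forward`, `EXPERTS`, and `ROUTER_WEIGHTS` are invented for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k):
    """Pick the k experts with the highest router score for this token.

    Returns (expert_index, weight) pairs; weights are renormalized to sum to 1.
    """
    # One router logit per expert: dot product of that expert's row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, experts, router_weights, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    output = [0.0] * len(token)
    for index, weight in route(token, router_weights, k):
        expert_out = experts[index](token)  # unselected experts are never called
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


def make_scaling_expert(scale):
    """Toy stand-in for an expert feed-forward network (hypothetical)."""
    return lambda token: [scale * value for value in token]


# Hypothetical toy setup: 4 experts, 2-dimensional tokens, top-2 routing.
EXPERTS = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
ROUTER_WEIGHTS = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

if __name__ == "__main__":
    token = [1.0, 0.5]
    print(route(token, ROUTER_WEIGHTS, k=2))
    print(moe_forward(token, EXPERTS, ROUTER_WEIGHTS, k=2))
    # Fraction of expert parameters touched per token: k / number of experts.
    print(2 / len(EXPERTS))
```

In this toy setup only two of the four experts run for any given token, so roughly half of the expert parameters are active per token even though all four contribute to total capacity. That ratio is the sense in which capacity can grow faster than inference cost.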

What You Can Do with Language Models

  • Rely on the fundamentals. Every model has limitations. Knowing the architecture tells you what those limitations are.
  • Fine-tune small models for specific tasks. For a specific, well-defined task, a fine-tuned small model is often more accurate, faster, and cheaper than a giant model with prompt engineering.
  • Prompt engineering is a real skill. People with backgrounds in psychology and linguistics have turned out to be remarkably good at it, because they understand how human thinking is reflected in the human-generated text the model learned from. (Fun fact: at one point it was discovered that typing prompts in all caps sometimes improved performance, which says more about the internet than about the model.)
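  • Retrieval-Augmented Generation (RAG) is a systems-level approach: relevant documents are retrieved at query time and placed in the prompt, so the model's response is grounded in real data rather than only in what it memorized during training.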
  • Agent-ish architectures are everywhere right now — though "agent" in the modern LLM sense is a loose term and doesn't quite match the formal RL meaning.
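
To make the RAG bullet above concrete, here is a minimal sketch of the retrieve-then-prompt pattern. It rests on loud assumptions: real systems typically rank passages with embeddings and a vector index, whereas this example uses plain word overlap as a stand-in, and the function names (`retrieve`, `build_grounded_prompt`) and the prompt wording are hypothetical. The sample passages restate facts from the timeline above. No model is called; the output is only the prompt that would be sent to one.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count tokens shared by query and passage (stand-in for embedding similarity)."""
    query_counts = Counter(tokenize(query))
    passage_counts = Counter(tokenize(passage))
    return sum(min(count, passage_counts[tok]) for tok, count in query_counts.items())


def retrieve(query, documents, k=2):
    """Return the k passages with the highest overlap score for the query."""
    ranked = sorted(documents, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_grounded_prompt(query, documents, k=2):
    """Assemble a prompt that places retrieved passages ahead of the question."""
    lines = ["Answer using only the context below.", ""]
    for number, passage in enumerate(retrieve(query, documents, k), start=1):
        lines.append(f"[{number}] {passage}")
    lines += ["", f"Question: {query}", "Answer:"]
    return "\n".join(lines)


# Sample passages restating the timeline; a real corpus would be far larger.
DOCUMENTS = [
    "GPT-1 was released in 2018 with 117M parameters.",
    "GPT-2 was released in 2019 with 1.5B parameters.",
    "GPT-3 was released in 2020 with 175B parameters.",
]

if __name__ == "__main__":
    print(build_grounded_prompt("How many parameters did GPT-3 have?", DOCUMENTS, k=1))
```

The design point is that grounding happens outside the model: swapping the document store changes what the model can answer without any retraining.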