What Is a Large Language Model?

Checkpoint

What is a Large Language Model? Write your own definition before reading on.

The honest answer is that the definition is moving. GPT-1 (2018) is usually considered the first "large" language model. It had 117 million parameters. By today's standards, that's barely a small language model.

When OpenAI announced GPT-2 in 2019, it was worried enough about misuse that it delayed the full model and initially released only a smaller version alongside a technical paper. Reading that announcement today is surreal. By modern standards, GPT-2 could not fool anyone, yet at the time the discourse was about responsible disclosure of dangerous AI. That tells you how fast the field has moved, and how perceptions of "dangerous capability" shift relative to the frontier.

2018
GPT-1 — 117M parameters
The first 'large' language model by the field's standards at the time. Demonstrated that unsupervised pretraining followed by fine-tuning could transfer across NLP tasks.
2019
GPT-2 — 1.5B parameters
Caused a responsible-disclosure controversy despite producing outputs that look obviously AI-generated today. Demonstrates how fast the definition of 'dangerous capability' moves relative to the frontier.
2020
GPT-3 — 175B parameters
Few-shot prompting. The model could perform tasks from a handful of examples in the prompt with no fine-tuning. A conceptual shift: the same model, different prompts.
2022–2023
ChatGPT, Claude, Gemini
Instruction-tuned models with reinforcement learning from human feedback (RLHF). Suddenly, these models could be used in conversation by non-technical users. The public inflection point.
2024–present
Multimodal, Mixture of Experts, Long Context
Models handle images, audio, and video. Context windows grow from 2K to 1M+ tokens. Mixture of Experts enables larger capacity at lower inference cost.
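
The few-shot idea from the GPT-3 entry above ("the same model, different prompts") can be made concrete with a small sketch. It only builds prompt strings and does not call any model. The function name build_few_shot_prompt, the "Input:/Output:" layout, and the example sentences are assumptions chosen for illustration, not a format any particular model requires.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, worked examples, then the new input.

    The "Input:/Output:" layout is an illustrative convention, not a required format.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The prompt ends on an open "Output:" so the model's continuation is the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Task 1: sentiment classification, specified only through examples.
sentiment_prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [("I loved this film.", "positive"), ("The plot was a mess.", "negative")],
    "The acting was wonderful.",
)

# Task 2: translation, using the same function and (hypothetically) the same model.
translation_prompt = build_few_shot_prompt(
    "Translate each input from English to French.",
    [("Good morning.", "Bonjour."), ("Thank you.", "Merci.")],
    "Good night.",
)

print(sentiment_prompt)
```

Nothing about the model changes between the two tasks; only the text of the prompt does, which is the conceptual shift the timeline entry describes.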

What You Can Do with Language Models

Rely on the fundamentals. Every model has limitations. Knowing the architecture tells you what those limitations are.
Fine-tune small models for specific tasks. A fine-tuned small model is often more accurate, faster, and cheaper than a giant model with prompt engineering — for a specific, well-defined task.
Prompt engineering is a real skill. People with backgrounds in psychology and linguistics have turned out to be remarkably good at it, because they understand how the corpus of human-generated text the model learned from reflects human thinking. (Fun fact: at one point it was discovered that typing prompts in all caps sometimes improved performance. Which says more about the internet than about the model.)
Retrieval-Augmented Generation (RAG) is a systems-level approach: relevant documents are retrieved at query time and placed in the prompt, which grounds the LLM's responses in real data.
Agent-ish architectures are everywhere right now — though "agent" in the modern LLM sense is a loose term and doesn't quite match the formal RL meaning.
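
To illustrate the RAG item above, here is a minimal sketch of the retrieve-then-prompt pattern using only the standard library. It assumes a toy bag-of-words cosine similarity as the retriever, where real systems typically use learned embeddings and a vector index. The function names (tokenize, cosine, retrieve, build_grounded_prompt) and the prompt wording are hypothetical, and the final call to a model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query (toy retriever)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_grounded_prompt(query, passages):
    """Place the retrieved passages in the prompt ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


# A tiny document store built from facts stated earlier on this page.
documents = [
    "GPT-1 was released in 2018 with 117 million parameters.",
    "GPT-2 was released in 2019 with 1.5 billion parameters.",
    "GPT-3 was released in 2020 with 175 billion parameters.",
]

query = "How many parameters did GPT-3 have?"
passages = retrieve(query, documents, k=1)
prompt = build_grounded_prompt(query, passages)
print(prompt)
```

The prompt string would then be sent to whichever LLM you are using. The design point is that the facts come from the document store at query time and not from the model's weights, so updating the store updates the answers without retraining.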
