The Multi-Layer Perceptron

The solution to the non-linearity problem is almost anticlimactically simple: use more neurons, arranged in layers.

The key insight of the Multi-Layer Perceptron (MLP) is that you can approximate any non-linear boundary using multiple linear boundaries. Think of it geometrically: a circle cannot be drawn with a single straight line, but you can approximate it with enough line segments. With 8 perceptrons, you get a rough octagon. With 16, you get something that looks almost circular. With enough, you can approximate any boundary you need.

Figure: approximating a non-linear decision boundary with multiple linear (perceptron) boundaries.
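
To make the geometry concrete, here is a minimal NumPy sketch (the tangent-line construction and the sampling region are illustrative choices, not from the original): each of 8 perceptrons checks one linear inequality, and AND-ing their outputs, which is itself just one more perceptron, carves out an octagon that closely matches the unit circle.

```python
import numpy as np

# Minimal sketch: approximate the unit disc with 8 perceptrons.
# Perceptron i fires when a point lies on the inner side of the tangent
# line to the circle at angle theta_i; AND-ing all 8 yields an octagon.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # line normals

def inside_octagon(points):
    # Each column is one perceptron's linear decision: w . x <= 1
    decisions = points @ normals.T <= 1.0
    # The AND is itself a perceptron (all weights 1, threshold 8),
    # which is exactly what a second layer provides.
    return decisions.all(axis=1)

rng = np.random.default_rng(0)
pts = rng.uniform(-1.5, 1.5, size=(100_000, 2))
octagon = inside_octagon(pts)
disc = np.linalg.norm(pts, axis=1) <= 1.0
print("agreement with the true circle:", (octagon == disc).mean())  # ~0.98
```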

The Universal Approximation Theorem

The Universal Approximation Theorem states that a multi-layer perceptron with even a single hidden layer, a non-linear activation function, and a sufficient number of neurons can approximate any continuous function on a bounded domain to any desired level of accuracy. This is one of the most important theoretical results in deep learning, and it is why neural networks are such flexible tools.
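
Why is this plausible? The sketch below hand-constructs a one-hidden-layer ReLU network (the sine target, the knot placement, and the ReLU activation are my illustrative choices, not part of the theorem's statement): such a network can reproduce any piecewise-linear interpolant exactly, and adding knots drives the error toward zero.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

f = np.sin                                     # target function on [0, 2*pi]
knots = np.linspace(0.0, 2 * np.pi, 17)        # 16 linear segments
slopes = np.diff(f(knots)) / np.diff(knots)    # slope of each segment

# One hidden ReLU unit per knot; its output weight is the change in
# slope there, so the sum is the piecewise-linear interpolant of f.
out_w = np.diff(np.concatenate(([0.0], slopes)))
bias = f(knots[0])

def one_hidden_layer_mlp(x):
    hidden = relu(x[:, None] - knots[:-1][None, :])  # shape (n_points, 16)
    return hidden @ out_w + bias                     # linear output unit

x = np.linspace(0.0, 2 * np.pi, 500)
print("max |error|:", np.abs(one_hidden_layer_mlp(x) - f(x)).max())  # ~0.02
```

Doubling the number of knots roughly quarters the worst-case error, which is the theorem's "any desired level of accuracy" playing out in practice.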

Anatomy of a Neural Network

A multi-layer perceptron has three types of layers:

  • Input Layer — This layer simply receives your raw features. If you are working with a 28×28 grayscale image, your input layer has 784 nodes, one for each pixel. No computation happens here.
  • Hidden Layer(s) — These are the layers that never directly see the input or produce the final output. Each node in a hidden layer takes a weighted sum of all nodes from the previous layer, passes it through an activation function, and sends the result forward. The hidden layers are where the network learns to represent increasingly abstract features of the data.
  • Output Layer — This layer produces the final prediction. Its structure depends on your task: a single node for regression or binary classification, or one node per class for multi-class classification.
Figure: a 3-layer neural network.

The weight connecting node $i$ in layer 1 to node $j$ in layer 2 is denoted $w_{i,j,2}$. The first subscript is the source node, the second is the destination node, and the third is the destination layer.
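
To tie the anatomy and the notation together, here is a minimal NumPy forward pass (the layer sizes, sigmoid hidden activation, and softmax output are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Illustrative sizes: 784 input pixels -> 32 hidden units -> 10 classes.
# W2[i, j] plays the role of w_{i,j,2}: row i is the source node in
# layer 1, column j the destination node in layer 2.
W2, b2 = rng.normal(0.0, 0.01, (784, 32)), np.zeros(32)
W3, b3 = rng.normal(0.0, 0.01, (32, 10)), np.zeros(10)

def forward(x):
    h = sigmoid(x @ W2 + b2)     # hidden layer: weighted sum, then activation
    return softmax(h @ W3 + b3)  # output layer: one score per class

x = rng.random(784)              # a flattened 28x28 "image"
print(forward(x).shape)          # (10,)
```

Storing each layer's weights as one (source, destination) matrix means an entire layer's computation reduces to a single matrix multiplication.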

Real-World Application: Choosing Network Depth

The depth of your network — the number of hidden layers — is one of the most consequential architectural decisions you will make.

  • Shallow networks (1–2 hidden layers) are often sufficient for tabular data: customer churn prediction, sales forecasting, risk scoring (a starter configuration is sketched after this list).
  • Deeper networks (10+, 100+, or even 1000+ layers) are needed for tasks like image classification, speech recognition, and language modeling, where features exist at multiple levels of abstraction.
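
For the tabular case above, a minimal starting point might look like the following scikit-learn sketch (the synthetic data, layer sizes, and hyperparameters are placeholder assumptions, not recommendations from the original):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder tabular data: 1,000 rows, 20 structured features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A shallow network: two modest hidden layers is a common first try
# for tabular data; depth is grown only if validation scores stall.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

The usual workflow is to start shallow, watch validation performance, and add depth only when it demonstrably helps.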
💭 Reflection

You are building a model to predict whether a bank loan will default, using 20 structured features (income, debt ratio, credit score, etc.). How many hidden layers would you start with, and why?