Loss Functions

To improve, you need to know how wrong you are. That is the job of the loss function (also called a cost function, though technically the loss function applies to a single example while the cost function is the average loss across the entire training set — in practice, these terms are used interchangeably).

Consider a simple example: you are training a model to count blueberries in a muffin from an image. Your model looks at the image and predicts 2 blueberries. The actual label is 4. Your error is 2. The loss function formalizes and quantifies that error.

[Figure: The loss function tells you how wrong you are]
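
To make the arithmetic concrete, here is a tiny plain-Python sketch that scores this single prediction in two common ways (the variable names are mine, for illustration only):

```python
prediction, label = 2, 4  # model predicts 2 blueberries; the truth is 4

absolute_error = abs(label - prediction)   # 2 -- how MAE scores it
squared_error = (label - prediction) ** 2  # 4 -- how MSE scores it

print(absolute_error, squared_error)
```

The next section covers exactly these two ways of quantifying error, plus a hybrid of the two.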

Common Loss Functions for Regression

  • Mean Squared Error (MSE)

    $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

    Squares the error, so large errors are penalized heavily. Use this when outlier errors are genuinely costly — house price prediction, energy consumption forecasting.

  • Mean Absolute Error (MAE)

    $$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

    Takes the absolute value of the error, weighting every error in proportion to its size. Use when outlier errors should not dominate training, such as food delivery time prediction, where one extremely late order should not distort the entire model.

  • Huber Loss

    $$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}(f(x) - y)^2 & \text{if } |f(x) - y| \leq \delta \\ \delta\,|f(x) - y| - \frac{1}{2}\delta^2 & \text{if } |f(x) - y| > \delta \end{cases}$$

    A hybrid that behaves like MSE for small errors and MAE for large errors. More robust to outliers than MSE, but it adds a threshold hyperparameter $\delta$ that you must tune. All three regression losses are sketched in code after this list.
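
A minimal NumPy sketch of all three, written directly from the formulas above (the function names are mine, not a library API):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error: average of squared residuals.
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean absolute error: average of absolute residuals.
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Huber loss: quadratic within delta of zero, linear beyond it.
    residual = np.abs(y - y_hat)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.mean(np.where(residual <= delta, quadratic, linear))

y_true = np.array([10.0, 12.0, 11.0, 30.0])   # last target is an outlier
y_pred = np.array([10.5, 11.0, 11.5, 12.0])

print(mse(y_true, y_pred))    # 81.375 -- dominated by the outlier
print(mae(y_true, y_pred))    # 5.0    -- outlier counts proportionally
print(huber(y_true, y_pred))  # 4.5625 -- linear penalty past delta
```

Notice how the single outlier inflates MSE far more than MAE or Huber; that difference is exactly what the choice between them controls.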

Common Loss Functions for Classification

  • Binary Cross-Entropy

    $$L(y, f(x)) = -\left[ y \log(f(x)) + (1 - y) \log(1 - f(x)) \right]$$

    Used when you have exactly two classes (spam/not spam, dog/not dog). Measures the divergence between the predicted probability distribution and the true distribution.

  • Categorical Cross-Entropy

    $$L(y, f(x)) = -\sum_{i} y_i \log(f(x)_i)$$

    Used when you have more than two classes. Generalizes binary cross-entropy to the multi-class case. Both cross-entropy variants are sketched in code after this list.
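
Here is a matching NumPy sketch of both cross-entropy variants, taken straight from the formulas above (function names are mine; $f(x)$ is assumed to already be a probability, e.g. a sigmoid or softmax output):

```python
import numpy as np

EPS = 1e-12  # guard against log(0)

def binary_cross_entropy(y, p):
    # y is 0 or 1; p is the predicted probability of class 1.
    p = np.clip(p, EPS, 1 - EPS)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y, p):
    # y is a one-hot vector; p is a vector of class probabilities.
    p = np.clip(p, EPS, 1 - EPS)
    return -np.sum(y * np.log(p))

print(binary_cross_entropy(1, 0.9))   # ~0.105: confident and correct
print(binary_cross_entropy(1, 0.1))   # ~2.303: confident and wrong

y_onehot = np.array([0, 1, 0])        # true class is index 1
p_vector = np.array([0.2, 0.7, 0.1])
print(categorical_cross_entropy(y_onehot, p_vector))  # ~0.357
```

Both losses explode as the predicted probability of the true class approaches zero, which is what pushes the model toward confident, correct predictions.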

On Custom Loss Functions

In industry, custom loss functions are often where the real performance gains hide. The standard loss functions treat all errors symmetrically. In many real problems, errors are not symmetric — missing a fraud case is far worse than a false alarm. When you are comfortable with the basics, building custom loss functions that encode your problem's actual cost structure can dramatically improve model alignment with real-world objectives. PyTorch and TensorFlow both make custom loss functions straightforward to implement.
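
As a sketch of what this can look like, here is a hypothetical asymmetric loss in PyTorch; the class name and the 2x under-prediction penalty are illustrative assumptions, not a standard API:

```python
import torch
import torch.nn as nn

class AsymmetricMSELoss(nn.Module):
    # Hypothetical custom loss: squared error, but under-predictions
    # are penalized `under_weight` times more than over-predictions.
    def __init__(self, under_weight: float = 2.0):
        super().__init__()
        self.under_weight = under_weight

    def forward(self, y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        error = y_hat - y
        # under_weight where the model under-predicted, 1.0 otherwise
        weight = torch.where(error < 0,
                             torch.full_like(error, self.under_weight),
                             torch.ones_like(error))
        return torch.mean(weight * error ** 2)

loss_fn = AsymmetricMSELoss(under_weight=2.0)
y_hat = torch.tensor([3.0, 5.0])  # one under-, one over-prediction
y_true = torch.tensor([4.0, 4.0])
print(loss_fn(y_hat, y_true))     # tensor(1.5000): the miss below costs double
```

Because everything here is differentiable, gradient descent optimizes this custom objective just like any built-in loss.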

Loss landscapes, the surfaces these functions trace over a model's parameters, can be genuinely wild. Check out losslandscapes.com for some striking visualizations.

Real-World Application

The choice of loss function is a design decision that encodes what you care about. Choosing MSE implicitly says "I care a lot about large errors." Choosing MAE says "I want errors weighted in proportion to their size, with no extra penalty for being far off." If you are building a medical model that predicts disease severity, and a missed severe case is catastrophic while a false alarm is merely inconvenient, your loss function should reflect that asymmetry.
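
For binary cases like that, one common option that avoids writing a loss from scratch is PyTorch's `pos_weight` argument on `BCEWithLogitsLoss`, which scales the penalty on positive-class mistakes; the factor of 10 below is an illustrative assumption, not a recommendation:

```python
import torch
import torch.nn as nn

# Errors on positive cases (e.g. severe disease) cost 10x more
# than errors on negative cases.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([10.0]))

logits = torch.tensor([-2.0])   # model leans toward "not severe"
target = torch.tensor([1.0])    # the case actually is severe
print(loss_fn(logits, target))  # ~21.3, i.e. 10x the unweighted ~2.13
```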

Checkpoint: Multiple Choice

You are training a model to predict food delivery times in minutes. Occasionally, one delivery is 2 hours late due to a restaurant error — an outlier that has nothing to do with the model. Which loss function is the better choice, and why?