Activation Functions

You can think of the activation function as the network's decision-maker at each node. After computing the weighted sum of its inputs, a neuron asks: "How much of this signal should I pass forward?" The activation function answers that question.

Without activation functions, no matter how many layers you stack, your network would still be computing a linear function. A linear function of a linear function is still linear. Activation functions introduce the non-linearity that makes deep networks capable of approximating complex patterns.
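To make that claim concrete, here is a small NumPy sketch (the shapes and weights are arbitrary, chosen only for illustration) showing that two stacked linear layers collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear layer: y = W @ x + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: stacking linear layers adds no expressive power
```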

Interactive · Activation Functions

[Interactive plot of each activation function's output as the input x varies. Toggle functions on and off, then drag the slider to see how each activation responds to different input values.]

Sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}

Range

(0, 1)

When to use

Output layers for binary classification tasks

Characteristics

Smooth, gradual transition from 0 to 1. Historically popular for binary classification tasks.

Issue

Suffers from the vanishing gradient problem, making it less effective in deep networks.
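A minimal NumPy sketch of sigmoid and its gradient (the helper names are illustrative, not from any particular library) makes the vanishing-gradient issue visible: the derivative never exceeds 0.25 and shrinks quickly away from zero.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative sigma(z) * (1 - sigma(z)); it peaks at 0.25 and decays toward 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(z))       # outputs approach 0 and 1 at the extremes
print(sigmoid_grad(z))  # gradients shrink rapidly away from z = 0 (vanishing gradient)
```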

Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Range

(−1, 1)

When to use

Output layers for bounded tasks

Characteristics

Similar to sigmoid but outputs values between −1 and 1, making it zero-centered, which often helps gradient-based training converge faster.

Issue

Like sigmoid, also suffers from the vanishing gradient problem.
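A quick NumPy check (input values chosen only for illustration) shows the zero-centered behaviour:

```python
import numpy as np

z = np.linspace(-3, 3, 7)
out = np.tanh(z)   # tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

print(out)         # bounded in (-1, 1), saturating at both ends
print(out.mean())  # approximately 0: zero-centered outputs, unlike sigmoid's (0, 1) range
```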

ReLU

\text{ReLU}(z) = \max(0, z)

Range

[0, ∞)

When to use

Default choice for hidden layers due to its simplicity and efficiency

Characteristics

Outputs 0 for negative inputs and a linear relationship for positive inputs. Computationally efficient and allows the model to converge faster.

Issue

Neurons can "die" — they stop activating for any input due to the 0 slope for negative values (dying ReLU problem).
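A rough sketch of ReLU and its gradient in NumPy (helper names are illustrative) shows why neurons can die: the gradient is exactly zero wherever the input is negative.

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient is 1 for positive inputs and 0 elsewhere."""
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))       # negative inputs are clipped to 0
print(relu_grad(z))  # a neuron whose pre-activation stays negative gets zero gradient and "dies"
```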

Leaky ReLU

\text{LeakyReLU}(z) = \max(\alpha z,\, z), \quad \alpha \approx 0.01

Range

(−∞, ∞)

When to use

Try when dead neurons become an issue

Characteristics

Similar to ReLU but allows a small, positive slope for negative values, addressing the dying ReLU issue.

Issue

The slope for negative values (α) needs to be carefully chosen.
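One possible NumPy implementation (the default alpha of 0.01 simply mirrors the formula above; it is a hyperparameter, not a fixed constant):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): a small slope alpha keeps gradients flowing when z < 0."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))             # negative inputs are scaled by alpha instead of zeroed out
print(leaky_relu(z, alpha=0.1))  # alpha must still be chosen; there is no universally right value
```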

Softmax

\text{softmax}(z_i) = \frac{e^{z_i}}{\displaystyle\sum_j e^{z_j}}

Range

(0, 1) per class; all sum to 1

When to use

Output layer for multi-class classification tasks

Characteristics

Converts a vector of raw scores into a probability distribution. Often used in the final layer of a classifier.

Issue

Exponentiating large logits can overflow numerically; implementations typically subtract the maximum logit before exponentiating, which does not change the result.
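A sketch of softmax in NumPy; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and leaves the output unchanged:

```python
import numpy as np

def softmax(z):
    """softmax(z_i) = e^(z_i) / sum_j e^(z_j), with the max subtracted for stability."""
    shifted = z - np.max(z)   # shifting by a constant leaves the result unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores for three classes
probs = softmax(logits)
print(probs)        # each value lies in (0, 1)
print(probs.sum())  # 1.0: a valid probability distribution over classes
```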
Checkpoint 1 / 5 · Multiple Choice

You are building a classifier that identifies one of ten different manufacturing defect types from sensor readings. Which activation function should you use in the output layer?