Activation Functions
You can think of the activation function as the network's decision-maker at each node. After computing the weighted sum of its inputs, a neuron asks: "How much of this signal should I pass forward?" The activation function answers that question.
Without activation functions, no matter how many layers you stack, your network would still be computing a linear function. A linear function of a linear function is still linear. Activation functions introduce the non-linearity that makes deep networks capable of approximating complex patterns.
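To make that concrete, here is a minimal NumPy sketch (the layer sizes and random weights are arbitrary, chosen only for illustration) showing that two stacked layers with no activation between them collapse into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "layers" with no activation function between them.
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # second layer: 4 units -> 2 outputs
x = rng.normal(size=3)

# Passing the input through both layers...
two_layer_output = W2 @ (W1 @ x)

# ...gives exactly the same result as one linear layer with weights W2 @ W1.
one_layer_output = (W2 @ W1) @ x

print(np.allclose(two_layer_output, one_layer_output))  # True
```

Inserting a non-linear activation between the two layers breaks this equivalence, which is exactly what gives depth its expressive power.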
Sigmoid
Range: (0, 1)
When to use: Output layers for binary classification tasks
Characteristics: Smooth, gradual transition from 0 to 1, so outputs can be read as probabilities; historically one of the most popular activations for binary classification.
Issue: Suffers from the vanishing gradient problem: its gradient is at most 0.25 and shrinks toward 0 for large inputs in either direction, making it less effective in deep networks.
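A minimal NumPy sketch (the function name and test values are illustrative):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)) squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
s = sigmoid(x)
print(s)              # ~[0.007, 0.5, 0.993]
print(s * (1.0 - s))  # gradient: ~[0.007, 0.25, 0.007], never above 0.25
```

The tiny gradients at the tails are precisely where the vanishing gradient problem comes from.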
Tanh
Range: (−1, 1)
When to use: Output layers for tasks with targets bounded in (−1, 1)
Characteristics: Similar to sigmoid but outputs values between −1 and 1, making it zero-centered, which often helps training converge faster.
Issue: Like sigmoid, it saturates at its extremes and suffers from the vanishing gradient problem.
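NumPy provides tanh directly; a quick sketch comparing it with sigmoid (test values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))                    # ~[-0.964, 0.0, 0.964], zero-centered
# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(2.0 * sigmoid(2.0 * x) - 1.0)  # same values
```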
ReLU
Range: [0, ∞)
When to use: Default choice for hidden layers due to its simplicity and efficiency
Characteristics: Outputs 0 for negative inputs and passes positive inputs through unchanged. Computationally cheap, and it typically lets models converge faster than saturating activations.
Issue: Neurons can "die": the gradient is 0 for all negative inputs, so a neuron stuck in that region stops updating and never activates again (the dying ReLU problem).
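A minimal sketch (the function name is illustrative; deep learning frameworks ship their own implementations):

```python
import numpy as np

def relu(x):
    # max(0, x): zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

x = np.array([-3.0, 0.0, 3.0])
print(relu(x))  # [0. 0. 3.]
# The gradient is 0 everywhere x < 0, which is what allows neurons to "die".
```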
Leaky ReLU
Range: (−∞, ∞)
When to use: Try when dead neurons become an issue
Characteristics: Similar to ReLU but applies a small positive slope to negative inputs, addressing the dying ReLU issue.
Issue: The slope for negative inputs (α) is a hyperparameter that needs to be chosen carefully.
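A minimal sketch, assuming the commonly used default α = 0.01 (the value is a tunable hyperparameter, not a prescription):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is the small slope applied to negative inputs; unlike ReLU,
    # the gradient on the negative side is alpha instead of 0.
    return np.where(x > 0.0, x, alpha * x)

x = np.array([-3.0, 0.0, 3.0])
print(leaky_relu(x))  # [-0.03  0.    3.  ]
```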
Softmax
Range: (0, 1) per class; outputs sum to 1
When to use: Output layer for multi-class classification tasks
Characteristics: Converts a vector of raw scores (logits) into a probability distribution over the classes. Often used in the final layer of a classifier.
Issue: —
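A minimal NumPy sketch (the scores are made up for illustration); subtracting the maximum score before exponentiating is a standard trick that avoids overflow without changing the result:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability, then normalize.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw scores (logits) for 3 classes
probs = softmax(scores)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```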
You are building a classifier that identifies one of ten different manufacturing defect types from sensor readings. Which activation function should you use in the output layer?