What is layer normalization?
by Stephen M. Walker II, CoFounder / CEO
Layer normalization (LayerNorm) is a technique used in deep learning to normalize the activations of intermediate layers. It was proposed in 2016 by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Its primary goal is to stabilize the learning process and accelerate the training of deep neural networks.
Unlike batch normalization, which computes statistics for each feature across the batch dimension, layer normalization computes them for each input across the feature dimension. For a given input, all neurons in a layer are therefore normalized with the same mean and variance, and those statistics do not depend on any other input in the batch.
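To make the two normalization axes concrete, here is a minimal NumPy sketch. The array `x` and its values are made up for illustration; the only point is which axis the statistics are taken over.

```python
import numpy as np

# Toy activations (hypothetical values): 4 samples in the batch, 3 features each.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])

# Batch-norm-style statistics: one mean per feature, taken over the batch axis.
batch_mean = x.mean(axis=0)   # shape (3,), values 5.5, 6.5, 7.5

# Layer-norm-style statistics: one mean per sample, taken over the feature axis.
layer_mean = x.mean(axis=1)   # shape (4,), values 2, 5, 8, 11

print(batch_mean.shape, layer_mean.shape)
```

Adding or removing a row of `x` changes `batch_mean` but leaves the `layer_mean` entries of the remaining rows untouched.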
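Mathematically, layer normalization computes the mean and variance of each input over the feature dimension, subtracts the mean, divides by the standard deviation, and then applies a learned per-feature gain and bias.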
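Written out for a single input, the computation looks roughly as follows. The notation is an assumption made for this sketch: $a_i$ is the $i$-th of $H$ activations in the layer, $\gamma_i$ and $\beta_i$ are the learned gain and bias, and $\epsilon$ is a small constant that implementations commonly add for numerical stability.

```latex
\[
\begin{aligned}
\mu      &= \frac{1}{H} \sum_{i=1}^{H} a_i \\
\sigma^2 &= \frac{1}{H} \sum_{i=1}^{H} (a_i - \mu)^2 \\
y_i      &= \gamma_i \, \frac{a_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
\end{aligned}
\]
```

Both sums run over the units of one layer for one input, so no quantity here involves any other example in the batch.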
One of the key advantages of layer normalization is that it performs the same computation at training and test times, and it does not impose any constraint on the size of the minibatch. This makes it particularly suitable for sequence models such as transformers and recurrent neural networks, where the sequence length can vary.
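The batch-size independence can be checked directly. The sketch below is a plain NumPy version written for illustration, not any particular library's implementation; the function name `layer_norm`, the `eps` default, and the random inputs are all assumptions.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each row of x over its last (feature) axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
features = 8
gain = np.ones(features)    # gain and bias would normally be learned parameters
bias = np.zeros(features)

batch = rng.normal(size=(5, features))  # a minibatch of 5 samples
single = batch[:1]                       # the first sample as a batch of 1

out_batch = layer_norm(batch, gain, bias)
out_single = layer_norm(single, gain, bias)

# The first sample gets the same output with or without the other four.
print(np.allclose(out_batch[0], out_single[0]))  # True
```

Because the function has no training flag and keeps no state between calls, the same code path serves both training and inference.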
In convolutional layers, layer normalization is still computed separately for each sample. A common choice is to normalize over the channel dimension, sometimes together with the spatial dimensions, rather than over the batch.
Overall, layer normalization tends to produce smoother gradients and faster, more stable training, and it can improve generalization.
How does layer normalization differ from batch normalization?
Layer normalization and batch normalization are both techniques used in deep learning to normalize the distributions of intermediate layers, but they differ in several key aspects:

Normalization Dimension — Batch normalization normalizes the input across the batch dimension, meaning for each feature, it calculates the mean and variance across all instances in the batch. On the other hand, layer normalization operates over the feature dimension, calculating the mean and variance for each instance separately, over all the features.

Batch Size Dependency — Batch normalization depends on the batch size and needs reasonably large batches for the batch statistics to approximate the population statistics well. This causes problems with small batches and with recurrent models, where statistics have to be tracked separately for each time step and fewer sequences may remain at later time steps. Layer normalization is independent of the batch size, so it works with any batch size, including a single example.

Training and Inference Processing — Batch normalization behaves differently at training and inference time. During training it uses the mean and variance of the current batch; during inference it uses running averages of those statistics accumulated during training. Layer normalization, in contrast, performs the same computation at training and test times.

Application — Batch normalization is widely used in Convolutional Neural Networks (CNNs) as it can accelerate training and improve generalization. Layer normalization, however, is often used in recurrent models and transformers where batch normalization performs poorly due to varying sequence lengths.

Normalization Statistics — In batch normalization, each input in the current minibatch is transformed by subtracting the input mean in the batch and dividing by the standard deviation. In layer normalization, all the hidden units in a layer share the same normalization terms, but different training cases have different normalization terms.
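The differences listed above can be seen side by side in a toy NumPy sketch. It is an illustration under stated assumptions, not a reference implementation: the class name `BatchNorm1D`, the momentum-style running-average update, and the omission of learned gain and bias are all simplifications chosen for brevity.

```python
import numpy as np

class BatchNorm1D:
    """Toy batch normalization over the batch axis, with running statistics."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Training: use the statistics of the current minibatch ...
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # ... and fold them into the running averages (assumed update rule).
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: use the running averages gathered during training.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

def layer_norm(x, eps=1e-5):
    """Per-sample normalization over the feature axis; no modes, no stored state."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(16, 4))  # 16 samples, 4 features

bn = BatchNorm1D(num_features=4)
train_out = bn(x, training=True)
eval_out = bn(x, training=False)

# Batch norm gives different outputs for the same input in the two modes.
bn_modes_differ = not np.allclose(train_out, eval_out)
print(bn_modes_differ)  # True

# Layer norm has a single code path shared by training and inference.
ln_out = layer_norm(x)
```

After a single training step the running averages are still far from the batch statistics, which is why the two batch norm outputs disagree here; with more training steps they would move closer but the two code paths would remain distinct.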