Making Neural Networks Learn Better: Understanding Activation Functions, Xavier Initialization, He…


Deep Neural Networks have achieved remarkable success in tasks like image classification, object detection, natural language processing, and speech recognition. However, building a deep neural network is only half the challenge; the real difficulty lies in training it efficiently.

As networks become deeper, they often suffer from problems such as vanishing gradients, exploding gradients, and slow convergence, making optimization difficult. Even with a good architecture and a large dataset, these issues can prevent the model from learning effectively.

Over the years, several techniques have been developed to address these challenges. Activation functions introduce non-linearity, allowing neural networks to learn complex patterns. Weight initialization methods, such as Xavier and He initialization, help gradients flow smoothly during training. Finally, Batch Normalization stabilizes the distribution of activations, leading to faster and more reliable convergence.

In this article, we’ll build an intuitive understanding of these concepts and see how they work together to make neural networks train faster, more stably, and achieve better performance.

Imagine building a neural network with multiple hidden layers but without using any activation functions. At first glance, it might seem that adding more layers would automatically make the network more powerful. Surprisingly, that’s not true.

Without activation functions, no matter how many hidden layers we add, the entire neural network behaves like a single linear model. In other words, stacking multiple linear layers still produces a linear transformation, which means the network cannot learn complex relationships present in real-world data. This is where activation functions become essential.
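The collapse described above can be checked numerically. The following is a minimal sketch in plain Python; the weight values, the helper names (`matvec`, `matmul`, `add`) and the 2x2 layer size are illustrative assumptions, not part of the original article. It shows that two stacked linear layers give exactly the same output as one linear layer with weights W = W2·W1 and bias b = W2·b1 + b2.

```python
# Two stacked layers with NO activation function: y = W2 (W1 x + b1) + b2.
# All numbers below are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]

def two_layer(x):
    hidden = add(matvec(W1, x), b1)      # first linear layer
    return add(matvec(W2, hidden), b2)   # second linear layer

# The equivalent single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [3.0, -2.0]
print(two_layer(x), one_layer(x))  # both print [-2.0, 8.5]
```

Inserting any non-linear function between the two layers breaks this equivalence, which is exactly what an activation function is for.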

An activation function is a mathematical function applied to the output of every neuron before passing it to the next layer. Its primary purpose is to introduce non-linearity into the network, allowing it to learn complex patterns that a simple linear model cannot.

For example, tasks like image classification, speech recognition, and language translation involve highly non-linear relationships. Without activation functions, even a very deep neural network would struggle to solve these problems effectively.

Although different activation functions are designed for different purposes, an ideal activation function generally has the following characteristics:

No single activation function satisfies every requirement perfectly, which is why different activation functions are used in different layers and applications.

The Sigmoid activation function was one of the earliest activation functions used in neural networks. It takes any real-valued input and maps it to a value between 0 and 1, making its output easy to interpret as a probability. Mathematically, it is defined as:

σ(x) = 1 / (1 + e^(-x))

As the input becomes very large and positive, the output approaches 1. Similarly, for very large negative inputs, the output approaches 0. This smooth “S-shaped” curve is why it is also known as the logistic activation function.

One of the biggest advantages of the Sigmoid function is that it produces bounded outputs between 0 and 1. Because of this property, it is widely used in the output layer of binary classification models, where the network predicts the probability of belonging to a particular class.

For example, if the output of a sigmoid neuron is:

This probabilistic interpretation makes Sigmoid a natural choice for binary classification tasks.
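As a rough illustration of this probabilistic reading, here is a small sketch of a sigmoid output unit. The function names, the example logit of 2.0 and the 0.5 decision threshold are assumptions made for the example; the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # safe: x is negative, so exp(x) <= 1
    return z / (1.0 + z)

def predict(logit, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    p = sigmoid(logit)
    return p, int(p >= threshold)

probability, label = predict(2.0)   # 2.0 is an arbitrary example logit
print(round(probability, 3), label)  # 0.881 1
```

A raw network output of 2.0 is read here as roughly an 88% probability of the positive class.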

Although Sigmoid works well in output layers, it has a major drawback when used in deep hidden layers. For very large positive or negative inputs, the curve becomes almost flat, so the gradient in these regions is extremely small. Even at its steepest point (an input of 0), the derivative of Sigmoid is only 0.25. During backpropagation, gradients are multiplied across layers, so these small values keep shrinking as they move backward through the network. As a result, the earlier layers receive almost no learning signal, making training extremely slow. This phenomenon is known as the Vanishing Gradient Problem.

Another limitation is that the output of Sigmoid is not zero-centered. Since all outputs lie between 0 and 1, every input to the next layer is positive, which pushes the gradients of a neuron's incoming weights to share the same sign. Updates then tend to zig-zag, and convergence may slow down.
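To get a feel for how quickly the signal shrinks, the sketch below multiplies the Sigmoid derivative across ten layers. It is a deliberate simplification: it ignores the weight matrices that also appear in the chain rule and only tracks the activation-derivative factors. The function names and the chosen pre-activation values are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_gradient(pre_activations):
    """Product of the per-layer activation derivatives (weights ignored)."""
    g = 1.0
    for z in pre_activations:
        g *= sigmoid_grad(z)
    return g

# Best case: every neuron sits at x = 0, where the derivative peaks at 0.25.
best_case = chained_gradient([0.0] * 10)
# Saturated case: every neuron sits far out on the flat part of the curve.
saturated = chained_gradient([5.0] * 10)

print(best_case)   # 0.25 ** 10, about 9.5e-07
print(saturated)   # many orders of magnitude smaller still
```

Even in the best case, ten Sigmoid layers scale the gradient by less than one millionth.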

Today, Sigmoid is rarely used in hidden layers of deep neural networks because of the vanishing gradient problem. However, it is still the preferred choice in the output layer of binary classification problems, where predicting probabilities between 0 and 1 is exactly what we need.

To address some of the limitations of the Sigmoid function, practitioners turned to another activation function called Tanh (Hyperbolic Tangent). At first glance, Tanh looks very similar to Sigmoid because it also produces a smooth S-shaped curve. However, there is one important difference. Instead of producing outputs between 0 and 1, the Tanh function maps its inputs to values between -1 and 1.

Mathematically, it is defined as:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Because its outputs are centered around zero, Tanh provides a more balanced representation of positive and negative values.

One of the biggest improvements offered by Tanh is that it is zero-centered. Unlike Sigmoid, whose outputs are always positive, Tanh produces both positive and negative values. This allows gradient updates to become more balanced during optimization, often leading to faster convergence. As a result, Tanh became a preferred choice for hidden layers in many neural networks before ReLU gained popularity.

Although Tanh improves upon Sigmoid, it does not eliminate the Vanishing Gradient Problem. For very large positive or negative inputs, the curve becomes almost flat, causing gradients to approach zero. During backpropagation, these tiny gradients keep shrinking as they pass through multiple layers, making it difficult for the earlier layers of a deep network to learn effectively. So while Tanh trains better than Sigmoid in many cases, it is still not ideal for very deep neural networks.
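Both properties of Tanh discussed above, zero-centered outputs and saturating gradients, can be seen in a few lines. This is a small sketch with arbitrarily chosen sample inputs; the helper names are assumptions for illustration.

```python
import math

def tanh_from_exponentials(x):
    """tanh written out as (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # symmetric sample inputs

# Zero-centering: Tanh outputs average to about 0, Sigmoid outputs to about 0.5.
mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)

# Saturation: the gradient is 1 at the origin but tiny for large |x|.
grad_at_zero = tanh_grad(0.0)
grad_at_five = tanh_grad(5.0)

print(mean_tanh, mean_sigmoid)
print(grad_at_zero, grad_at_five)
```

The peak gradient of Tanh (1.0) is four times that of Sigmoid (0.25), which helps, but the gradient at an input of 5 is already below one thousandth.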

Tanh is often used in hidden layers where zero-centered activations are beneficial. It has also been widely used in recurrent neural networks (RNNs) and older deep learning architectures. However, in modern deep neural networks, ReLU and its variants have largely replaced Tanh because they help overcome many of its training limitations.

As neural networks became deeper, the limitations of Sigmoid and Tanh became more apparent. Both activation functions suffered from the Vanishing Gradient Problem, making it difficult for deep networks to learn effectively. To address this challenge, researchers introduced the Rectified Linear Unit (ReLU), which has since become the most widely used activation function in deep learning. Unlike Sigmoid and Tanh, ReLU is remarkably simple. It is defined as:

f(x)=max(0,x)

This means that positive inputs pass through unchanged, while zero and negative inputs are mapped to 0.

Although this function looks almost too simple, it dramatically improved the training of deep neural networks.

One of the biggest reasons behind ReLU’s success is that it significantly reduces the Vanishing Gradient Problem. For positive inputs, the gradient is always exactly 1 instead of becoming extremely small. As a result, gradients can flow more easily through deep networks, allowing earlier layers to continue learning during backpropagation. Another advantage is its computational simplicity. Unlike Sigmoid and Tanh, which involve exponential calculations, ReLU only performs a simple comparison with zero. This makes training both faster and more efficient.
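A short sketch makes the contrast with Sigmoid concrete. As before, only the activation-derivative factors of the chain rule are tracked and the weights are ignored; the function names and input values are assumptions for illustration.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 for negative ones.
    # The value at exactly 0 is a convention; 0 is used here.
    return 1.0 if x > 0 else 0.0

def chained_gradient(pre_activations):
    """Product of the per-layer ReLU derivatives (weights ignored)."""
    g = 1.0
    for z in pre_activations:
        g *= relu_grad(z)
    return g

# 20 layers whose pre-activations are all positive: nothing is lost.
active_path = chained_gradient([0.5, 2.0, 7.0, 100.0] * 5)
# A single negative pre-activation on the path blocks the gradient entirely.
blocked_path = chained_gradient([0.5, -1.0, 3.0])

print(active_path, blocked_path)  # 1.0 0.0
```

The second result hints at the downside of ReLU: a unit that is switched off passes no gradient at all.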

Despite its advantages, ReLU is not perfect. One well-known limitation is called the Dying ReLU Problem. If a neuron’s input becomes negative, ReLU outputs 0. Since the gradient is also 0 in this region, that neuron receives no update for that input. If this happens for every input in the training data, the neuron may never activate again. In other words, it becomes a dead neuron, contributing nothing to the learning process. This problem becomes more common when the learning rate is too high or when a neuron ends up with a large negative bias.

Despite the dying ReLU problem, ReLU remains the default activation function for hidden layers in most modern neural networks. Its simplicity, computational efficiency, and ability to train deep networks effectively make it a strong choice for a wide range of applications.
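The Dying ReLU Problem can be reproduced with a single neuron. The sketch below is a toy setup under stated assumptions: a one-input ReLU neuron, a squared-error loss, a hand-written gradient step, and made-up data. The bias of -10.0 is chosen so that the neuron is inactive on every training example.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def sgd_step(w, b, xs, targets, lr=0.1):
    """One gradient step for a = relu(w * x + b) under mean squared error."""
    grad_w = 0.0
    grad_b = 0.0
    for x, t in zip(xs, targets):
        z = w * x + b
        a = max(0.0, z)
        upstream = 2.0 * (a - t)      # d(loss)/d(a) for squared error
        local = relu_grad(z)          # d(a)/d(z), zero when z is negative
        grad_w += upstream * local * x
        grad_b += upstream * local
    n = len(xs)
    return w - lr * grad_w / n, b - lr * grad_b / n

xs = [0.5, 1.0, 1.5]
targets = [1.0, 2.0, 3.0]

# Large negative bias: z < 0 for every example, so every gradient is zero.
dead = sgd_step(1.0, -10.0, xs, targets)
# Same neuron with zero bias: it is active and its parameters move.
alive = sgd_step(1.0, 0.0, xs, targets)

print(dead)   # (1.0, -10.0), unchanged
print(alive)  # both parameters have increased
```

No matter how many steps are taken, the dead neuron's parameters stay where they are, because the loss never sends it a non-zero gradient.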

Although ReLU largely mitigated the vanishing gradient problem for deep neural networks, it introduced a new challenge: the Dying ReLU Problem, where neurons stop learning if they consistently receive negative inputs.

To overcome this limitation, several improved versions of ReLU were proposed.

Leaky ReLU allows a small, non-zero output for negative inputs instead of returning zero. As a result, gradients can still flow through the neuron even when the input is negative, reducing the chances of neurons becoming permanently inactive.

PReLU is an extension of Leaky ReLU. Instead of using a fixed slope for negative inputs, the network learns the slope automatically during training. This gives the model greater flexibility, although it also introduces additional learnable parameters.

ELU also allows negative outputs, but instead of using a straight line, it applies an exponential curve for negative values. This helps activations become more centered around zero, which can improve convergence during training while still avoiding dead neurons.

SELU builds upon ELU by introducing a scaling factor. Its most notable property is self-normalization, meaning the activations naturally remain close to a stable mean and variance as they propagate through the network. This can make training deep networks more stable under specific conditions.
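The four variants differ only in how they treat negative inputs, which a scalar sketch makes easy to compare. The default slope of 0.01 for Leaky ReLU and α = 1.0 for ELU are common choices assumed here, not values stated in the article. In PReLU the slope is a learned parameter, so this sketch simply takes it as an argument. The two SELU constants are the standard published values.

```python
import math

def leaky_relu(x, negative_slope=0.01):
    """Small fixed slope for negative inputs instead of a hard zero."""
    return x if x > 0 else negative_slope * x

def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha would be learned during training."""
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    """Exponential curve for negative inputs, saturating at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """ELU with a specific alpha, multiplied by a fixed scale factor."""
    return SELU_SCALE * elu(x, SELU_ALPHA)

for f in (leaky_relu, elu, selu):
    print(f.__name__, f(-2.0), f(2.0))
print("prelu", prelu(-2.0, 0.25), prelu(2.0, 0.25))
```

All four return a non-zero value for an input of -2.0, so a gradient can still reach the neuron's weights.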

After selecting the right activation function, one might think that the neural network is ready to start learning. However, there is another factor that plays a crucial role even before training begins: weight initialization. Every neural network starts with randomly initialized weights. These weights determine the initial predictions made by the network and influence how gradients flow during backpropagation. Although the initial weights are updated during training, choosing poor initial values can make learning extremely slow or even prevent the network from learning altogether.

Imagine a network with several hidden layers. During forward propagation, each layer transforms the input using its current weights. If the weights are initialized too large, the activations can become excessively large as they move through the network. On the other hand, if the weights are initialized too small, the activations gradually shrink. Both situations make training difficult. Large activations may lead to exploding gradients, while extremely small activations can result in vanishing gradients. As a result, the model struggles to converge efficiently.
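The effect of the weight scale can be simulated directly. The sketch below pushes a random input through ten purely linear layers (no activation function, to isolate the effect of the weights) and reports the root-mean-square size of the final activations. The layer width of 64, the depth of 10, the fixed seed and the three standard deviations are all assumptions chosen for illustration.

```python
import math
import random

def rms(values):
    """Root-mean-square magnitude of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def final_rms(weight_std, width=64, depth=10, seed=0):
    """Propagate a random input through `depth` linear layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output is a weighted sum of all inputs with fresh random weights.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x)
             for _ in range(width)]
    return rms(x)

too_small = final_rms(0.01)                  # activations collapse toward 0
balanced = final_rms(1.0 / math.sqrt(64))    # variance 1 / width
too_large = final_rms(1.0)                   # activations blow up

print(too_small, balanced, too_large)
```

Only the middle setting, where the weight variance is tied to the layer width, keeps the signal at roughly its original size.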

The goal of weight initialization is therefore simple: Start the network with weights that keep the flow of information stable across all layers.

A common question beginners ask is: “Why don’t we initialize every weight to zero?”

Although this sounds reasonable, it creates a serious problem known as the Symmetry Problem. If every neuron starts with exactly the same weights, they all receive identical gradients during backpropagation. As a result, every neuron learns the same features and updates its weights in exactly the same way. Instead of learning diverse representations, the entire layer behaves like a single neuron repeated multiple times. This defeats the purpose of having multiple neurons in the first place. For this reason, neural networks are initialized with small random values rather than identical ones.
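The Symmetry Problem is easy to verify on a tiny network. The following sketch assumes a network with two inputs, two Tanh hidden neurons and one linear output trained with squared error; the specific numbers are arbitrary. It computes the gradient of the loss with respect to each hidden neuron's weights by hand.

```python
import math

def hidden_weight_gradients(W, v, x, target):
    """Gradient of (y - target)^2 with respect to each hidden weight row.

    W: hidden-layer weights, one row per hidden neuron
    v: output-layer weights, one per hidden neuron
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    y = sum(vj * hj for vj, hj in zip(v, h))
    dy = 2.0 * (y - target)
    # Chain rule: d(loss)/d(W[j][i]) = dy * v[j] * (1 - h[j]^2) * x[i]
    return [[dy * vj * (1.0 - hj * hj) * xi for xi in x]
            for vj, hj in zip(v, h)]

x, target = [1.0, -2.0], 1.0

# Every weight identical: both hidden neurons get exactly the same gradient.
same = hidden_weight_gradients([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, target)
# Different starting weights: the gradients differ, so the neurons can diverge.
different = hidden_weight_gradients([[0.5, -0.3], [0.1, 0.8]], [0.5, 0.5], x, target)

print(same[0] == same[1])            # True
print(different[0] == different[1])  # False
```

Because identical neurons receive identical updates, they remain identical after every training step, which is why random starting values are needed.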

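A good initialization strategy should break the symmetry between neurons, keep activations from growing or shrinking as they pass through the layers, and keep gradients from exploding or vanishing during backpropagation.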

Finding an initialization method that satisfies these goals led to the development of two widely used techniques: Xavier (Glorot) Initialization and He Initialization.

Both methods are designed to maintain stable activations during training, but they are optimized for different activation functions.

After understanding why proper weight initialization is important, the next question is: How do we choose the initial weights?

One of the earliest and most influential solutions was Xavier Initialization, also known as Glorot Initialization. The main idea behind Xavier Initialization is simple: Initialize the weights so that the variance of activations remains nearly constant as data passes through the network.

If the activations keep growing larger at every layer, gradients may explode. If they keep shrinking, gradients may vanish. Xavier Initialization tries to maintain a healthy balance so that information flows smoothly during both forward and backward propagation. For a layer with n_in input neurons and n_out output neurons, Xavier Initialization samples weights from a zero-mean distribution with variance:

Var(W) = 2 / (n_in + n_out)

By scaling the weights according to the size of the layer, Xavier Initialization helps stabilize the learning process and speeds up convergence. Its derivation assumes an activation that is roughly linear and symmetric around zero, which is why it is usually paired with Tanh and, in practice, Sigmoid.
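As a concrete sketch, the uniform form of Xavier (Glorot) Initialization draws each weight from the range [-limit, limit] with limit = sqrt(6 / (n_in + n_out)), which gives a variance of 2 / (n_in + n_out). The function names, the layer size of 300 by 100 and the fixed seed below are assumptions for illustration, not a reference implementation.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix with variance 2 / (n_in + n_out)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
n_in, n_out = 300, 100
W = xavier_uniform(n_in, n_out, rng)

flat = [w for row in W for w in row]
empirical = variance(flat)
expected = 2.0 / (n_in + n_out)   # 0.005

print(empirical, expected)
```

The uniform range works because a uniform distribution on [-a, a] has variance a²/3, so a limit of sqrt(6 / (n_in + n_out)) yields the target variance.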

Although Xavier Initialization greatly improved training, researchers later realized that it was not the best choice for ReLU. The reason is that ReLU sets all negative values to zero. As activations move through multiple layers, roughly half of them become zero, reducing the overall variance. To compensate for this loss, He Initialization starts with larger weights than Xavier Initialization (twice the variance when a layer has as many inputs as outputs). Its variance is given by the following, where n_in is the number of input neurons to the layer:

Var(W) = 2 / n_in

This simple modification allows ReLU-based networks to preserve the flow of information much more effectively during training. As a result, gradients remain stable even in deeper architectures, making optimization faster and more reliable. Today, He Initialization is the standard choice for neural networks that use ReLU or its variants, while Xavier Initialization remains a good choice for networks using Sigmoid or Tanh.
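The difference between the two schemes shows up clearly in a small ReLU simulation. The sketch below is an illustration under assumptions: square layers of width 128, depth 10, Gaussian weights, a fixed seed, and no biases. With equal input and output sizes, the Xavier variance 2 / (n_in + n_out) becomes 1 / width, while the He variance is 2 / width.

```python
import math
import random

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def relu_network_rms(weight_std, width=128, depth=10, seed=0):
    """RMS of the activations after `depth` ReLU layers with random weights."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        x = [max(0.0, sum(rng.gauss(0.0, weight_std) * xi for xi in x))
             for _ in range(width)]
    return rms(x)

width = 128
xavier_std = math.sqrt(2.0 / (width + width))  # variance 2 / (n_in + n_out)
he_std = math.sqrt(2.0 / width)                # variance 2 / n_in

xavier_rms = relu_network_rms(xavier_std)
he_rms = relu_network_rms(he_std)

print(xavier_rms, he_rms, he_rms / xavier_rms)
```

With Xavier scaling, the signal loses about half of its mean square at every ReLU layer, so after ten layers its RMS is roughly 32 times smaller than with He scaling, which keeps it near its starting size.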

Even with the right activation functions and a good weight initialization strategy, training deep neural networks can still become unstable as activations change from one layer to the next. To address this challenge, researchers introduced another powerful technique: Batch Normalization, which has become a standard component in many modern neural network architectures.

By now, we’ve learned that choosing the right activation function and initializing weights properly can make training much more stable. However, another challenge still remains. As data passes through multiple layers of a neural network, the distribution of activations keeps changing because the weights are continuously updated during training. This means that every layer has to constantly adapt to a moving target, making optimization slower and less stable.

To address this problem, researchers introduced Batch Normalization (BN). The main idea is surprisingly simple: normalize the activations of each mini-batch before passing them to the next layer. By keeping the activations within a more consistent range, Batch Normalization allows the network to learn faster and more reliably.

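During training, Batch Normalization performs four simple steps for every mini-batch:

1. Compute the mean of each activation across the mini-batch.
2. Compute the variance of each activation across the mini-batch.
3. Normalize each activation by subtracting the mean and dividing by the standard deviation (with a small constant added for numerical stability).
4. Scale and shift the normalized value using two trainable parameters, γ and β.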

Although the activations are normalized, the network still has the flexibility to learn the most suitable scale and shift through these trainable parameters.
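The training-time computation can be sketched in plain Python as follows. This is a minimal illustration, not a full layer: it covers only the forward pass on one mini-batch, omits the running averages that are used at inference time, and the small stabilizing constant `eps = 1e-5` and the function name are assumptions.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of samples, each a list of feature values
    gamma, beta: one trainable scale and shift per feature
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n                           # mini-batch mean
        var = sum((v - mean) ** 2 for v in column) / n   # mini-batch variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i, v in enumerate(column):
            x_hat = (v - mean) * inv_std                 # normalize
            out[i][j] = gamma[j] * x_hat + beta[j]       # scale and shift
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
shifted = batch_norm_forward(batch, gamma=[2.0, 2.0], beta=[5.0, 5.0])

for row in normalized:
    print(row)
```

Although the two input features differ in scale by a factor of ten, both columns of `normalized` come out with a mean of about 0 and a variance of about 1.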

A common question is: “If we’re normalizing the activations anyway, why do we need γ and β?”

The answer is flexibility. Completely fixing the activations to have zero mean and unit variance might not always be optimal for learning. By introducing γ (scale) and β (shift), the network can learn the most appropriate distribution for each layer while still enjoying the stability provided by normalization. In other words, Batch Normalization doesn’t force every layer to remain standardized forever; it simply gives the model a better starting point and lets it adjust when necessary.
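One way to see this flexibility is that γ and β can, in the extreme case, undo the normalization entirely. The sketch below uses made-up activation values and an assumed stabilizing constant `eps`; it shows that setting γ to the batch standard deviation and β to the batch mean recovers the original activations.

```python
import math

def normalize(values, eps=1e-5):
    """Return the standardized values plus the batch mean and variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    x_hat = [(v - mean) / math.sqrt(var + eps) for v in values]
    return x_hat, mean, var

def scale_and_shift(x_hat, gamma, beta):
    return [gamma * v + beta for v in x_hat]

activations = [2.0, 4.0, 6.0, 8.0]   # one feature across a mini-batch
eps = 1e-5
x_hat, mean, var = normalize(activations, eps)

# gamma = 1, beta = 0 keeps the standardized values.
standardized = scale_and_shift(x_hat, 1.0, 0.0)
# gamma = sqrt(var + eps), beta = mean restores the original activations.
restored = scale_and_shift(x_hat, math.sqrt(var + eps), mean)

print(standardized)
print(restored)   # approximately [2.0, 4.0, 6.0, 8.0]
```

Any setting between these two extremes is also reachable, so the layer is free to pick whatever scale and offset training favors.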

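Batch Normalization offers several practical advantages: it speeds up convergence, allows higher learning rates, reduces sensitivity to weight initialization, and adds a mild regularizing effect.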

Because of these benefits, Batch Normalization has become a standard component in many modern deep learning architectures.

In most neural networks, the common order is:

Linear Layer → Batch Normalization → Activation Function

This allows the activations to be normalized before applying the non-linear activation function, helping maintain a stable flow of information throughout the network. Batch Normalization doesn’t replace good activation functions or proper weight initialization. Instead, it works alongside them. Activation functions help the network learn complex patterns. Weight initialization gives training a good starting point. Batch Normalization keeps learning stable as training progresses. Together, these techniques make deep neural networks easier to optimize and significantly improve training efficiency.
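The ordering above can be sketched as three small functions applied in sequence. This is a toy example under assumptions: a linear layer with a single output unit, Batch Normalization with γ = 1 and β = 0, ReLU as the activation, and made-up inputs and weights.

```python
import math

def linear(batch, weights, bias):
    """One output unit: weighted sum of each sample's features plus a bias."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in batch]

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def relu(values):
    return [max(0.0, v) for v in values]

def block(batch, weights, bias):
    """Linear Layer -> Batch Normalization -> Activation Function."""
    return relu(batch_norm(linear(batch, weights, bias)))

batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
outputs = block(batch, weights=[0.5, -0.25], bias=100.0)
outputs_no_bias = block(batch, weights=[0.5, -0.25], bias=0.0)

print(outputs)
print(outputs_no_bias)
```

The two printed lists match: subtracting the batch mean cancels any constant bias added by the linear layer, which is why the bias of a layer placed directly before Batch Normalization is often dropped and β takes over its role.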

Training a deep neural network involves much more than choosing an architecture and providing data. Small design choices, such as selecting the right activation function, initializing weights properly, and normalizing activations, can have a significant impact on how effectively a model learns.

While these techniques may seem like independent improvements, they all share the same goal: making learning more stable and efficient. Activation functions help the network model complex relationships, Xavier and He initialization ensure a healthy flow of information from the very beginning, and Batch Normalization keeps training consistent as the network grows deeper.

Making Neural Networks Learn Better: Understanding Activation Functions, Xavier Initialization, He… was originally published in Towards AI on Medium.

source & further reading

pub.towardsai.net (original article)