How To Initialize Weights in a Neural Network

Minakshi Mathpal
10 min read · Sep 28, 2021

Every node in a neural network has a few parameters associated with it. These parameters are referred to as weights, and they are used to calculate a weighted sum of the inputs.

Neural network models are trained to make useful predictions using an optimization algorithm called stochastic gradient descent, which iteratively adjusts the network weights to minimize a loss function. This process gives the model the set of weights that makes it capable of useful predictions.

Stochastic gradient descent requires a starting point in the space of possible weight values from which to begin the optimization process.

Weight initialization is the procedure of setting the weights of a neural network to small random values that define the starting point for the optimization (learning or training) of the neural network model.

Why Initialize Weights

The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass through a deep neural network. If either occurs, loss gradients will either be too large or too small to flow backwards beneficially, and the network will take longer to converge, or may not converge at all.

Let’s see this with an example.

Let’s suppose we have a vector x that contains some network inputs. When training neural networks, the input values should be scaled so that they have a mean of 0 and a standard deviation of 1.

So we have our data with a mean of 0 and a standard deviation of 1.
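As a quick illustration, here is a minimal NumPy sketch of this z-scaling on a made-up feature matrix (the numbers are arbitrary):

```python
import numpy as np

# made-up raw features with very different scales
raw = np.array([[120.0, 0.2],
                [140.0, 0.8],
                [160.0, 0.5]])

# z-scale each column: subtract the mean, divide by the standard deviation
x = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(x.mean(axis=0))  # ~[0. 0.]
print(x.std(axis=0))   # [1. 1.]
```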

Let’s assume we have a simple 250-layer network with no activations, and that each layer has a matrix w that contains the layer’s weights. In order to complete a single forward pass we’ll have to perform a matrix multiplication between layer inputs and weights at each of the 250 layers, which will make for a grand total of 250 consecutive matrix multiplications.

It turns out that initializing the values of weights from the same standard normal distribution to which we scaled our inputs is never a good idea. To see why, we can simulate a forward pass through our hypothetical network.
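Here is a minimal sketch of that simulation, assuming a layer width of 512 and using plain NumPy (no deep learning framework needed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                 # inputs: mean 0, std 1

for _ in range(250):                         # 250 layers, no activations
    w = rng.standard_normal((512, 512))      # weights from the same standard normal
    x = w @ x                                # one "layer" = one matrix multiplication

print(x.mean(), x.std())                     # nan nan -- the outputs overflowed
```

Each multiplication scales the standard deviation by roughly sqrt(512) ≈ 22.6, so the values overflow long before layer 250.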

Somewhere during those 250 multiplications, the layer outputs grew so large that the computer could no longer represent their mean and standard deviation as numbers; they came out as nan. This shows that we initialized our weights too large.

In addition to this, we also have to take care of preventing layer outputs from vanishing. To see what happens when we initialize network weights to be too small — we’ll scale our weight values such that, while they still fall inside a normal distribution with a mean of 0, they have a standard deviation of 0.01.

During the course of the above hypothetical forward pass, the activation outputs completely vanished.
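The same sketch with the smaller weights shows the opposite failure (here I use float32, an assumption on my part, to mirror typical training precision):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512).astype(np.float32)

for _ in range(250):
    # weights still Gaussian with mean 0, but std 0.01: far too small
    w = (rng.standard_normal((512, 512)) * 0.01).astype(np.float32)
    x = w @ x

print(x.mean(), x.std())   # 0.0 0.0 -- every output underflowed to zero
```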

As an example, suppose we apply the sigmoid activation function at the output layer. Looking at the sigmoid function and its derivative, note how, when the input to the sigmoid becomes larger or smaller (when |x| grows), the derivative becomes close to zero.
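A few concrete values make this easy to see, using the fact that the derivative of the sigmoid is sigma(x) · (1 − sigma(x)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (0.0, 2.0, 5.0, 10.0):
    s = sigmoid(z)
    print(z, s * (1 - s))   # the derivative sigma'(z) = sigma(z) * (1 - sigma(z))

# 0.0  -> 0.25
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~0.000045   (the saturated region: gradients effectively vanish)
```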

Thus during the forward pass, the activations (and then the gradients) can quickly get really big or really small — this is due to the fact that we repeat a lot of matrix multiplications. More specifically, we might get either:

  • very big activations and hence large gradients that shoot towards infinity
  • very small activations and hence infinitesimal gradients, which may be cancelled to zero due to numerical precision

To sum it up, if weights are initialized too large, the network won’t learn well. The same happens when weights are initialized too small. Either of these effects is fatal for training.

How to initialize your network

Recall that the goal of a good initialization is to:

  • get random weights
  • keep the activations in a good range during the first forward passes (and so for the gradients in the backward passes)

What is a good range in practice? Quantitatively speaking, it means that multiplying the input vector by the weight matrix should produce an output vector (i.e. activations) with a mean near 0 and a standard deviation near 1. Each layer then propagates these statistics forward, so even in a deep network you will have stable statistics on the first iterations.
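As a preview of where this is heading, here is the same 250-layer simulation with the weights scaled by 1/sqrt(512), the idea behind the initialization schemes discussed below. This keeps the statistics stable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)

for _ in range(250):
    # scale the standard normal weights by 1/sqrt(512), i.e. Var(w) = 1/512
    w = rng.standard_normal((512, 512)) / np.sqrt(512)
    x = w @ x

print(x.mean(), x.std())   # no overflow or underflow; std stays on the order of 1
```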

Why Initialize a Neural Network with Random Weights?

When we use deterministic algorithms to solve problems, these algorithms can make guarantees about best, worst, and average running time. The problem is that they are not suitable for all problems.

Some problems are hard to model with deterministic algorithms, for example because of the number of combinations involved or the sheer size of the data. A deterministic algorithm may run, but it would continue running until the heat death of the universe.

An alternate solution is to use nondeterministic algorithms. These algorithms use elements of randomness when making decisions during execution, so every rerun of the algorithm follows a different order of steps.

Stochastic search algorithms are not random per se; instead, they make careful use of randomness, meaning they are random within a bound. Stochastic gradient descent is a member of this family of stochastic search algorithms, which:

  • Use randomness during initialization.
  • Use randomness during the progression of the search.

These two elements of random initialization and randomness during the search work together.

While training our neural network we don’t know anything about the structure of the search space, so to remove bias from the search process we start from a randomly chosen position. As the search unfolds, there is a risk of getting stuck in an unfavorable area of the search space. Using randomness during the search gives some likelihood of getting unstuck and finding a better final candidate solution. The word candidate is used because we cannot guarantee that the solution we have found is the best one; it may only be a local optimum. Randomness therefore gives the stochastic search process multiple opportunities to start and traverse the space of candidate solutions in search of a better candidate solution, a so-called global optimum.

In the context of artificial neural networks: they are trained with stochastic gradient descent, which uses randomness to find a good set of weights for modelling the problem space. Thus the weights of the network are initialized to small random values (random, but close to zero, such as in [0.0, 0.1]).

Zero Initialization

Why not set all weights to zero? In this case the learning algorithm will never change the network weights, and the model will be stuck. If all weights are initialized to 0, the derivative of the loss function with respect to every weight w_ij in a layer's weight matrix W is the same, so all weights keep the same value in subsequent iterations. This makes the hidden units symmetric, and the symmetry persists for all n iterations; in other words, initializing the weights to 0 makes the network no better than a linear model. (It is worth noting that the bias of each neuron is typically set to zero by default, not to a small random value.)

Specifically, nodes that sit side by side in a hidden layer and are connected to the same inputs must have different weights for the learning algorithm to be able to update them differently.

This is often referred to as the need to break symmetry during training. It means that if two hidden units with the same activation function are connected to the same inputs, then these units must have different initial parameters. If they have the same initial parameters, then a deterministic learning algorithm applied to a deterministic cost and model will constantly update both of these units in the same way.
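A small NumPy demonstration of this symmetry, on a toy two-layer network made up entirely for illustration: no matter how many gradient steps we take, the zero-initialized hidden units never differentiate from each other.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))      # toy inputs
t = rng.standard_normal((8, 1))      # toy targets
W1 = np.zeros((3, 4))                # zero-initialized hidden layer
W2 = np.zeros((4, 1))                # zero-initialized output layer
lr = 0.1

for _ in range(5):                   # a few steps of plain gradient descent
    h = sigmoid(x @ W1)              # every hidden unit computes the same value
    y = h @ W2
    grad_y = (y - t) / len(x)        # gradient of a mean squared error loss
    grad_W2 = h.T @ grad_y
    grad_W1 = x.T @ ((grad_y @ W2.T) * h * (1 - h))
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# all columns of W1 are still identical: the symmetry was never broken
print(np.allclose(W1, W1[:, [0]]))   # True
```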

In general

We almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The scale of the initial distribution has a large effect both on the outcome of the optimization procedure and on the ability of the network to generalize.

Initialization Techniques

As we saw above, we want to initialize the weights with random values that are neither “too small” nor “too large”, to avoid the problems of vanishing and exploding gradients.

How do we avoid getting stuck in saturated regions? Recall that the activation output a = f(w_1 x_1 + … + w_N x_N + b) depends on the weights w_ij and the input x. To avoid the output of the activation function being too large or too small, it makes sense to keep the weights w_ij and the input x in a sensible range. We can restrict x, which comes from our data, by normalizing the dataset using z-scaling or other methods (ensuring the data has zero mean and unit variance). But what about the weights w_ij?

This is where Xavier initialization comes in: it suggests choosing the variance of the weights Var(w_ij) so that the variance of each layer’s output stays close to unity. By keeping the output variance near unity, we reduce the likelihood of being stuck in saturated regions, which keeps the signal from exploding to a high value or vanishing to zero. In other words, we need to initialize the weights in such a way that the variance remains the same for x and a.

How to perform Xavier initialization

a) Normal Distribution

Just to reiterate, we want the variance to remain the same as we pass through each layer. Consider a single output y = w_1 x_1 + w_2 x_2 + … + w_N x_N + b. Let’s go ahead and compute the variance of y:

Var(y) = Var(w_1 x_1 + w_2 x_2 + … + w_N x_N + b)

Let’s compute the variance of the terms inside the parentheses on the right-hand side of the above equation. For a general term, we have:

Var(w_i x_i) = E(x_i)² Var(w_i) + E(w_i)² Var(x_i) + Var(w_i) Var(x_i)

Here, E() stands for the expectation of a given variable, which basically represents its mean value. We have assumed that the inputs and weights come from a Gaussian distribution with zero mean, hence the E() terms vanish and we get:

Var(w_i x_i) = Var(w_i) Var(x_i)

Note that ‘b’ is a constant and has zero variance, so it vanishes. Let’s substitute back into the original equation:

Var(y) = Var(w_1) Var(x_1) + Var(w_2) Var(x_2) + … + Var(w_N) Var(x_N)

Since the terms are all identically distributed, we can write:

Var(y) = N Var(w_i) Var(x_i)

So if we want the variance of a to be the same as that of x, the term N Var(w_i) should be equal to 1. Hence:

Var(w_i) = 1/N

There we go! We arrived at the Xavier initialization formula: we pick the weights from a Gaussian distribution with zero mean and a variance of 1/N, where N is the number of input neurons. In the original paper, the authors take the average of the number of input neurons and output neurons, so the formula becomes:

Var(w_i) = 2 / (N_in + N_out)
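A minimal NumPy sketch of Xavier initialization with a normal distribution (the (n_out, n_in) shape convention is an assumption on my part):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    # zero mean, variance 2 / (n_in + n_out), i.e. 1 / N_avg
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

w = xavier_normal(512, 256)
print(w.std())   # close to sqrt(2 / 768) ~ 0.051
```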

Plot of Range of Xavier Weight Initialization With Inputs From One to One Hundred

We can see that with very few inputs the range is large, such as between -1 and 1 or -0.7 to 0.7. We can then see that the range rapidly drops, settling near -0.1 and 0.1 after about 20 inputs, where it remains reasonably constant.

b) Uniform Distribution

What if we want to use a uniform distribution instead? Sampling from a uniform distribution translates to sampling the interval [-r, r], where:

r = sqrt(6) / sqrt(N_in + N_out)

The weird-looking sqrt(6) factor comes from the fact that the variance of a uniform distribution over the interval [-r, r] is r²/3 (in general, (b−a)²/12 for a random variable following Uniform(a, b)). Setting r²/3 equal to the target variance 2/(N_in + N_out) and solving for r gives the formula above.
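The corresponding sketch for the normalized (uniform) version, under the same assumed conventions:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(n_in, n_out):
    # U(-r, r) with r = sqrt(6 / (n_in + n_out));
    # the variance is r^2 / 3 = 2 / (n_in + n_out), matching the normal version
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_out, n_in))

w = xavier_uniform(512, 256)
print(w.min(), w.max())   # within [-r, r], r = sqrt(6/768) ~ 0.088
```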

Plot of Range of Normalized Xavier Weight Initialization With Inputs From One to One Hundred

Compared to the non-normalized version in the previous section, the range is initially smaller.

He Initialization

Glorot and Bengio designed their weight initialization scheme around the logistic sigmoid activation function, which was the default choice at the time. Later, the sigmoid was surpassed by ReLU, which helped mitigate the vanishing/exploding gradients problem. However, it turns out that Xavier (Glorot) initialization isn’t quite optimal for ReLU. Consequently, a new initialization technique appeared that applies the same idea (balancing the variance of the activations) to this new activation function; it is now often referred to as He initialization, and it is the usual strategy for ReLU and its variants. There is only one tiny adjustment to make: double the variance of the weights (equivalently, multiply the standard deviation by sqrt(2)), giving Var(w_i) = 2/N.
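And the corresponding sketch for He initialization, with the same assumed conventions as above:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(n_in, n_out):
    # zero mean, variance 2 / n_in: double the "1/N" Xavier variance,
    # compensating for ReLU zeroing out half of the activations
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

w = he_normal(512, 256)
print(w.std())   # close to sqrt(2 / 512) ~ 0.0625
```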

Plot of Range of He Weight Initialization with Inputs from One to One Hundred

We can see that with very few inputs the range is large, such as near -1.5 and 1.5 or -1.0 to 1.0. We can then see that the range rapidly drops, settling near -0.1 and 0.1 after about 20 inputs, where it remains reasonably constant.

Different Activation Functions

Some papers in the literature have provided similar strategies for different activation functions. The most common pairings are:

  • Xavier (Glorot) initialization: logistic sigmoid, tanh, softmax
  • He initialization: ReLU and its variants (e.g. Leaky ReLU)
  • LeCun initialization: SELU

This blog is inspired by machinelearningmastery.com.
