## Hyper-parameters in Action! Activation Functions

### Introduction

This is the first of a series of posts aiming at presenting visually, in a clear and concise way, some of the fundamental moving parts of training a neural network: the **hyper-parameters**.

Originally posted on Towards Data Science.

### Motivation

Deep Learning is all about **hyper-parameters**! Maybe this is an exaggeration, but having a sound understanding of the effects of different hyper-parameters on training a deep neural network is definitely going to make your life easier.

While studying Deep Learning, you’re likely to find lots of information on the importance of properly setting the network’s hyper-parameters: **activation functions, weight initializer, optimizer, learning rate, mini-batch size**, and the network architecture itself, like the *number of hidden layers* and the *number of units* in each layer.

So, you learn all the best practices, you set up your network, define the hyper-parameters (or just use their default values), start training and monitor the progress of your model’s **losses** and *metrics*.

Perhaps the experiment doesn’t go as well as you’d expect, so you **iterate** over it, **tweaking** the network, until you find the set of values that will do the trick for your particular problem.

### Looking for a deeper understanding (no pun intended!)

Have you ever wondered **what exactly is going on under the hood**? I did, and it turns out that some simple experiments may shed quite some light on this matter.

Take **activation functions**, for instance, the topic of this post. You and I know that the role of activation functions is to introduce a **non-linearity**, otherwise the whole neural network could be simply replaced by a corresponding **affine transformation** (that is, a **linear transformation**, such as **rotating**, **scaling** or **skewing**, followed by a **translation**), no matter how deep the network is.
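To make the "no matter how deep" part concrete, here is a minimal NumPy sketch (not part of the original experiments). It stacks two layers without any activation and checks that they collapse into a single affine transformation. The weights and points are random, made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical "layers" with no activation: each one is just x @ W + b
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)

# Five made-up points in a two-dimensional feature space
x = rng.uniform(-1.0, 1.0, size=(5, 2))

# Passing the points through both layers, one after the other
two_layers = (x @ W1 + b1) @ W2 + b2

# The very same result, obtained from a single affine transformation
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```

The same algebra applies to any number of stacked layers, which is why depth alone does not help without a non-linearity.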

A neural network having only *linear* **activations** (that is, **no activation**!) would have a hard time handling even a quite simple classification problem like this (each line has 1,000 points, generated for *x* values equally spaced between -1.0 and 1.0):

If the **only** thing a network can do is to perform an **affine transformation**, this is likely what it would be able to come up with as a solution:

Clearly, this is not even close! Some examples of **much** better solutions are:

These are three fine examples of what **non-linear activation functions** bring to the table! Can you guess which one of the images corresponds to a *ReLU*?

### Non-linear boundaries (or are they?)

How do these non-linear boundaries come to be? Well, the actual role of the non-linearity is to **twist and turn the feature space** so much so that the boundary turns out to be… **LINEAR**!

OK, things are getting more interesting now (at least, I thought so the first time I laid my eyes on it in Chris Olah’s **awesome** blog post, from which I drew my inspiration to write this). So, let’s investigate it further!

The next step is to build the **simplest** possible neural network to tackle this particular classification problem. There are **two dimensions** in our **feature space** (*x1* and *x2*), and the network has a single hidden layer with *two units*, so we preserve the number of dimensions when it comes to the outputs of the hidden layer (*z1* and *z2*).

Up to this point, we are still in the realm of **affine transformations**… so, it is time for a **non-linear activation function**, represented by the Greek letter **sigma**, resulting in the **activation values** (*a1* and *a2*) for the hidden layer.

These **activation values** represent the **twisted and turned feature space** I referred to in the first paragraph of this section. This is a preview of what it looks like, when using a **sigmoid** as **activation function**:

As promised, the boundary is **LINEAR**! By the way, the plot above corresponds to the left-most solution with a non-linear boundary on the original feature space (**Figure 3**).
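Why does the boundary have to be a straight line in this space? Here is a short sketch of the reasoning. It assumes a single sigmoid output unit sitting on top of *a1* and *a2*, and a 0.5 threshold on the predicted probability. The symbols $w_1$, $w_2$ and $c$ are names picked here for the output unit's weights and bias; they do not appear in the original diagrams.

$$
\hat{y} = \sigma(w_1 a_1 + w_2 a_2 + c), \qquad \hat{y} = 0.5 \iff w_1 a_1 + w_2 a_2 + c = 0
$$

Since $\sigma(0) = 0.5$, the boundary is the set of points where the argument is zero, which is the equation of a straight line in the (*a1*, *a2*) plane.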

### Neural network’s basic math recap

Just to make sure you and I are on the same page, I am showing you below four representations of the very basic matrix arithmetic performed by the neural network up to the hidden layer, **BEFORE** applying the **activation function** (that is, just an **affine transformation** such as *xW + b*).

Time to apply the **activation function**, represented by the Greek letter **sigma** on the network diagram.

**Voilà!** We went from the **inputs** to the **activation values** of the hidden layer!
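For those who prefer code to diagrams, the same two steps can be sketched in NumPy. The input points, the weights `W` and the biases `b` below are made-up values, not the ones learned by the network in this post, and a sigmoid stands in for sigma.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Three made-up input points, one per row, with features (x1, x2)
x = np.array([[-1.0, 0.5],
              [0.0, 0.0],
              [1.0, -0.5]])

# Hypothetical hidden layer parameters: 2 inputs -> 2 units
W = np.array([[0.5, -1.0],
              [1.5, 2.0]])
b = np.array([0.1, -0.2])

z = x @ W + b   # affine transformation: (z1, z2) for each point
a = sigmoid(z)  # activation values: (a1, a2) for each point

print(z)
print(a)
```

Each row of `a` is one point of the twisted and turned feature space.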

### Implementing the network in Keras

For the implementation of this simple network, I used the **Keras Sequential model API**. Apart from distinct **activation functions**, every model trained used the very same **hyper-parameters**:

- *weight initializers*: Glorot (Xavier) normal (hidden layer) and random normal (output layer);
- *optimizer*: Stochastic Gradient Descent (SGD);
- *learning rate*: 0.05;
- *mini-batch size*: 16;
- *number of hidden layers*: 1;
- *number of units* (in the hidden layer): 2.

Given that this is a **binary classification task**, the **output layer** has a **single unit** with a **sigmoid** *activation function* and the *loss* is given by **binary cross-entropy**.

Code: simple neural network with a single 2-unit hidden layer
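The original code listing did not survive on this page. What follows is a minimal sketch of what such a model might look like, assuming the Keras bundled with TensorFlow. The function name `build_model` and its arguments are hypothetical; only the hyper-parameter values come from the list above.

```python
from tensorflow import keras

def build_model(activation="sigmoid", seed=42):
    """Builds a 2-2-1 network; only the hidden activation changes between runs."""
    model = keras.Sequential([
        keras.Input(shape=(2,)),
        # Hidden layer: 2 units, Glorot (Xavier) normal initializer
        keras.layers.Dense(
            2,
            activation=activation,
            kernel_initializer=keras.initializers.GlorotNormal(seed=seed),
        ),
        # Output layer: single sigmoid unit, random normal initializer
        keras.layers.Dense(
            1,
            activation="sigmoid",
            kernel_initializer=keras.initializers.RandomNormal(seed=seed),
        ),
    ])
    model.compile(
        loss="binary_crossentropy",
        optimizer=keras.optimizers.SGD(learning_rate=0.05),
        metrics=["accuracy"],
    )
    return model

model = build_model("tanh")
model.summary()

# Training would then look something like this, where X and y are
# placeholders for the two-line dataset (not defined in this sketch):
# model.fit(X, y, epochs=150, batch_size=16)
```

Swapping `"tanh"` for `"sigmoid"` or `"relu"` is the only change needed between the three experiments.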

### Activation functions in action!

Now, for the juicy part, **visualizing** the twisted and turned feature space as the network trains, using a different **activation function** each time: **sigmoid**, **tanh** and **ReLU**.

In addition to showing changes in the feature space, the animations also contain:

- **histograms of predicted probabilities** for both negative (blue line) and positive cases (green line), with **misclassified** cases shown in **red bars** (using threshold = 0.5);
- line plots of **accuracy** and **average loss**;
- **histogram of losses** for every element in the dataset.
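The quantities behind those plots are easy to compute by hand. Here is a minimal NumPy sketch, using four made-up predictions instead of the actual model outputs. The helper `per_sample_bce` is a hypothetical name, and the clipping constant `eps` is an assumption added only to avoid taking the log of zero.

```python
import numpy as np

def per_sample_bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy for each element (no averaging)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.6, 0.8, 0.4])

losses = per_sample_bce(y_true, y_prob)   # feeds the histogram of losses
y_pred = (y_prob >= 0.5).astype(int)      # threshold = 0.5
misclassified = y_pred != y_true          # these would be the red bars
accuracy = 1.0 - misclassified.mean()
average_loss = losses.mean()

print(misclassified)
print(accuracy, average_loss)
```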

### Sigmoid

Let’s start with the most traditional of the **activation functions**, the **sigmoid**, even though, nowadays, its usage is pretty much limited to the output layer in classification tasks.

As you can see in **Figure 6**, a **sigmoid** **activation function** “*squashes*” the input values into the **range (0, 1)** (the same range probabilities can take, which is why it is used in the output layer for classification tasks). Also, remember that the activation values of any given layer are the inputs of the following layer and, given the range for the **sigmoid**, the **activation values** are going to be **centered around 0.5**, instead of zero (as is usually the case for normalized inputs).

It is also possible to verify that its **gradient** peaks at 0.25 (for *z* = 0) and that it is already close to zero by the time |*z*| reaches a value of 5.
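These two numbers can be checked with a few lines of NumPy. This is a quick sketch that uses the closed form of the derivative, sigmoid(z) * (1 - sigmoid(z)); the grid of *z* values is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# An arbitrary grid of z values, symmetric around zero
z = np.linspace(-10.0, 10.0, 2001)
grad = sigmoid_grad(z)

print(grad.max(), z[grad.argmax()])  # the peak: about 0.25, at z = 0
print(sigmoid_grad(5.0))             # already below 0.01
```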

So, how does using a **sigmoid** **activation function** work for this simple network? Let’s take a look at the animation:

There are a couple of observations to be made:

- **epochs 15–40**: the typical **sigmoid** “*squashing*” is noticeable on the horizontal axis;
- **epochs 40–65**: the *loss* stays at a plateau, and there is a “*widening*” of the transformed feature space on the vertical axis;
- **epoch 65**: at this point, **negative cases** (blue line) are all **correctly classified**, even though their associated probabilities are still distributed up to 0.5, while the **positive cases on the edges** are still **misclassified**;
- **epochs 65–100**: the aforementioned “*widening*” becomes more and more intense, up to the point where pretty much all of the feature space is covered again, while the **loss** falls steadily;
- **epoch 103**: thanks to the “*widening*”, all **positive cases** are now lying **within the proper boundary**, although some still have probabilities barely above the 0.5 threshold;
- **epochs 100–150**: there is now some “*squashing*” happening on the vertical axis as well, the **loss** falls a bit more to what seems to be a new plateau and, except for a few of the positive edge cases, the network is pretty **confident** in its predictions.

So, the **sigmoid activation function** succeeds in separating both lines, but the *loss* **declines** **slowly**, while staying at **plateaus** for a significant portion of the training time.

Can we do better with a **different activation function**?

### Tanh

The **tanh** **activation function** was the evolution of the **sigmoid**, as it outputs values with a **zero mean**, unlike its predecessor.

As you can see in **Figure 7**, the **tanh** **activation function** “*squashes*” the input values into the **range (-1, 1)**. Therefore, being **centered at zero**, the activation values are already (somewhat) normalized inputs for the next layer.

Regarding the **gradient**, it has a much bigger peak value of 1.0 (again, for *z* = 0), but its decrease is even faster, approaching zero for values of |*z*| as low as 3. This is the underlying cause of what is referred to as the problem of **vanishing gradients**, which causes the training of the network to be progressively slower.
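Here is a small NumPy sketch of those numbers, using the closed form of the derivative, 1 - tanh(z)². The last part is a deliberately simplified illustration of vanishing gradients: it multiplies the local gradients of five hypothetical stacked units, all sitting at *z* = 2, and ignores the weights that would also enter the chain rule.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

print(tanh_grad(0.0))  # the peak value, 1.0
print(tanh_grad(3.0))  # already below 0.01

# Simplified chain rule: the gradient reaching the first layer is scaled
# by the product of the local gradients along the way (weights ignored).
local_grads = tanh_grad(np.full(5, 2.0))  # five hypothetical layers at z = 2
print(local_grads.prod())                 # a very small number
```

Even with moderately saturated units, the product shrinks quickly as layers are added, which is what slows training down.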

Now, for the corresponding animation, using **tanh** as the **activation function**:

There are a couple of observations to be made:

- **epochs 10–40**: there is a **tanh** “*squashing*” happening on the horizontal axis, though it is less pronounced, while the **loss** stays at a plateau;
- **epochs 40–55**: there is still no improvement in the *loss*, but there is a “*widening*” of the transformed feature space on the vertical axis;
- **epoch 55**: at this point, **negative cases** (blue line) are all **correctly classified**, even though their associated probabilities are still distributed up to 0.5, while the **positive cases on the edges** are still **misclassified**;
- **epochs 55–65**: the aforementioned “*widening*” quickly reaches the point where pretty much all of the feature space is covered again, while the **loss** falls abruptly;
- **epoch 69**: thanks to the “*widening*”, all **positive cases** are now lying **within the proper boundary**, although some still have probabilities barely above the 0.5 threshold;
- **epochs 65–90**: there is now some “*squashing*” happening on the vertical axis as well, the **loss** keeps falling until reaching a new plateau and the network exhibits a **high level of confidence** for all predictions;
- **epochs 90–150**: only small improvements in the predicted probabilities happen at this point.

OK, it seems a bit better… the **tanh** **activation function** reached a **correct classification** for all cases **faster**, with the *loss* **also declining** **faster** (when declining, that is), but it also spends a lot of time in **plateaus**.

What if we **get rid of** all the “*squashing*”?

### ReLU

**Re**ctified **L**inear **U**nits, or **ReLUs** for short, are the commonplace choice of **activation function** these days. A **ReLU** addresses the problem of **vanishing gradients** so common in its two predecessors, while also being the **fastest** to compute gradients for.

As you can see in **Figure 8**, the **ReLU** is a totally different beast: it does not “squash” the values into a range — it simply **preserves positive values** and turns all **negative values into zero**.

The upside of using a **ReLU** is that its **gradient** is either 1 (for positive values) or 0 (for negative values) — **no more vanishing gradients**! This pattern leads to a **faster convergence** of the network.

On the other hand, this behavior can lead to what is called a **“dead neuron”**, that is, a neuron whose inputs are consistently negative and which, therefore, always has an activation value of zero.
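Both sides of the coin fit in a few lines of NumPy. In this sketch, the gradient at exactly *z* = 0 is set to 0, which is a common convention and an assumption here; the *z* values are made up.

```python
import numpy as np

def relu(z):
    """Keeps positive values, turns negative values into zero."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """1 for positive inputs, 0 otherwise (0 at z = 0 by convention)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(z))       # negatives become zero, positives are preserved
print(relu_grad(z))  # gradient is 0 for negatives, 1 for positives

# A "dead neuron": its pre-activations are negative for every input,
# so both its outputs and its gradients are always zero.
z_dead = np.array([-3.0, -1.2, -0.1])
print(relu(z_dead), relu_grad(z_dead))
```

With a zero gradient for every input, no update ever flows back through that unit, which is why it cannot recover on its own.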

Time for the last of the animations, which is quite different from the previous two, thanks to the **absence** of “*squashing*” in the **ReLU** **activation function**:

ReLU in action!

There are a couple of observations to be made:

- **epochs 0–10**: the loss falls steadily from the very beginning;
- **epoch 10**: at this point, **negative cases** (blue line) are all **correctly classified**, even though their associated probabilities are still distributed up to 0.5, while the **positive cases on the edges** are still **misclassified**;
- **epochs 10–60**: **loss** falls until reaching a plateau, **all cases** are already **correctly classified** since **epoch 52**, and the network already exhibits a **high level of confidence** for all predictions;
- **epochs 60–150**: only small improvements in the predicted probabilities happen at this point.

Well, no wonder the **ReLUs** are the *de facto* standard for **activation functions** nowadays. The **loss** kept **falling steadily** from the beginning and only **plateaued** at a level **close to zero**, reaching **correct classification** for all cases in **about 75% of the time** it took **tanh** to do it.

### Showdown

The animations are cool (ok, I am biased, I made them!), but not very handy to compare the **overall effect** of each and every different **activation function** on the **feature space**. So, to make it easier for you to compare them, here they are, side by side:

What about side-by-side **accuracy** and *loss* curves, so I can also compare the training speeds? Sure, here we go:

### Final Thoughts

The example I used to illustrate this post is almost as simple as it could possibly be, and the **patterns depicted** in the animations are intended to give you just a **general idea** of the **underlying mechanics** of each one of the *activation functions.*

Besides, I got “*lucky*” with my **initialization of the weights** (maybe using 42 as seed is a good omen?!) and all three networks learned to classify all the cases correctly within 150 epochs of training. It turns out, training is **VERY** **sensitive** to the initialization, but this is a topic for a future post.

Nonetheless, I truly hope this post and its animations can give you some **insights** and maybe even some **“a-ha!”** moments while learning about this fascinating topic that is **Deep Learning**.