HandySpark: bringing pandas-like capabilities to Spark DataFrames

“Panda statues on gray concrete stairs during daytime” by chuttersnap on Unsplash

Originally posted on Towards Data Science.


HandySpark is a new Python package designed to improve PySpark user experience, especially when it comes to exploratory data analysis, including visualization capabilities.

Try it yourself using Google Colab:

Check the repository:


Apache Spark is the most popular cluster computing framework. It is listed as a required skill by about 30% of job listings (link).

The majority of Data Scientists uses Python and Pandas, the de facto standard for manipulating data. Therefore, it is only logical that they will want to use PySpark — Spark Python API and, of course, Spark DataFrames.

But, the transition from Pandas to Spark DataFrames may not be as smooth as one could hope…


I’ve been teaching Applied Machine Learning using Apache Spark at Data Science Retreat to more than 100 students over the course of 2 years.

My students were quite often puzzled with some of the quirks of PySpark and, some other times, baffled by the lack of some functionalities Data Scientists take for granted while using the traditional Pandas/Scikit-Learn combo.

I decided to address these problems by developing a Python package that would make exploratory data analysis much easier in PySpark

Introducing HandySpark

HandySpark is really easy to install and to integrate into your PySpark workflow. It takes only 3 steps to make your DataFrame a HandyFrame:

    1. Install HandySpark using pip install handyspark
    1. Import HandySpark with from handyspark import *
  1. Make your DataFrame a HandyFrame with hdf = df.toHandy()

After importing HandySpark, the method toHandy is added to Spark’s DataFrame as an extension, so you’re able to call it straight away.

Let’s take a quick look at everything you can do with HandySpark :-)

1. Fetching Data

No more cumbersome column selection, collection and manual extraction from Row objects!

Now you can fetch data just like you do in Pandas, using cols :


0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Much, much easier, right? The result is a Pandas Series!

Just keep in mind that, due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given HandyFrame — so, no, you still cannot do things like [3:5] or [-1] and so on… only [:N].

There are also other pandas-like methods available:

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

If you haven’t guessed yet, the examples above (and all others in this post) are built using the famous Titanic dataset :-)

2. Plotting Data

The lack of an easy way of visualizing data always puzzled my students. And, when one searches the web for examples of plotting data using PySpark, it is even worse: many, many tutorials simply convert the WHOLE dataset to Pandas and then plot it the traditional way.

Please, DON’T EVER DO IT! It will surely work with toy datasets, but it would fail miserably if used with a really big dataset (the ones you likely handle if you’re using Spark).

HandySpark addresses this problem by properly computing statistics using Spark’s distributed computing capabilities and only then turning the results into plots. Then, it turns out to be easy like that:

fig, axs = plt.subplots(1, 4, figsize=(12, 4))
hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])

Plotting with HandySpark!

Yes, there is even a scatterplot! How is that possible?! HandySpark splits both features into 30 bins each, computes frequencies for each and every one of the 900 combinations and plots circles which are sized accordingly.

3. Stratify

What if you want to perform stratified operations, using a split-apply-combine approach? The first idea that may come to mind is to use a groupby operation… but groupby operations trigger the dreaded data shuffling in Spark, so they should be avoided.

HandySpark handles this issue by filtering rows accordingly, performing computations on each subset of the data and then combining the results. For instance:

Pclass  Embarked
1       C            85
        Q             2
        S           127
2       C            17
        Q             3
        S           164
3       C            66
        Q            72
        S           353
Name: value_counts, dtype: int64

You can also stratify it with non-categorical columns by leveraging on either Bucket or Quantile objects. And then use it in a stratified plot:

hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist()

Stratified histogram

4. Imputing Missing Values

“Thou shall impute missing values”

First things first, though. How many missing values are there?

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
Name: missing(ratio), dtype: float64

OK, now we know there are 3 columns with missing values. Let’s drop Cabin(after all, 77% of its values are missing) and focus on the imputation of values for the other two columns: Age and Embarked.

The imputation of missing values could not be integrated into a Spark pipeline until version 2.2.0, when the Imputer transformer was released. But it still does not handle categorical variables (like Embarked), let alone stratified imputation…

Let’s see how HandySpark can help us with this task:

hdf_filled = hdf.fill(categorical=['Embarked'])
hdf_filled = (hdf_filled.stratify(['Pclass', 'Sex'])
              .fill(continuous=['Age'], strategy=['mean']))

First, it uses the most common value to fill missing values of our categorical column. Then, it stratifies the dataset according to Pclass and Sex to compute the mean value for Age , which is going to be used in the imputation.

Which values did it use for the imputation?

{'Embarked': 'S',
 'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
 'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
 'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
 'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
 'Pclass == "3" and Sex == "female"': {'Age': 21.75},
 'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}

So far, so good! Time to integrate it into a Spark pipeline, generating a custom transformer with transformers:

imputer = hdf_filled.transformers.imputer()

The imputer object is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your pipeline and save / load at will :-)

5. Detecting Outliers

“You shall not pass!”

How many outliers should we not allow to pass?

hdf_filled.outliers(method='tukey', k=3.)
PassengerId      0.0
Survived         0.0
Pclass           0.0
Age              1.0
SibSp           12.0
Parch          213.0
Fare            53.0
dtype: float64

Currently, only Tukey’s method is available (I am working on Mahalanobis distance!). This method takes an optional k argument, which you can set to larger values (like 3) to allow for a more loose detection.

Take the Fare column, for instance. There are, according to Tukey’s method, 53 outliers. Let’s fence them!

hdf_fenced = hdf_filled.fence(['Fare'])

What are the lower and upper fence values?

{'Fare': [-26.7605, 65.6563]}

Remember that, if you want to, you can also perform a stratified fencing :-)

As you’d probably guessed already, you can also integrate this step into your pipeline, generating the corresponding transformer:

fencer = hdf_fenced.transformers.fencer()

6. Pandas Functions

In Spark 2.3, Pandas UDFs were released! This turned out to be a major improvement for us, PySpark users, as we could finally overcome the performance bottleneck imposed by traditional User Defined Functions (UDFs). Awesome!

HandySpark takes it one step further, by doing all the heavy lifting for you :-) You only need to use its pandas object and voilà — lots of functions from Pandas are immediately available!

For instance, let’s use isin as you’d use with a regular Pandas Series:

some_ports = hdf_fenced.pandas['Embarked'].isin(values=['C', 'Q'])
Column<b'udf(Embarked) AS `<lambda>(Embarked,)`'>

But, remember Spark has lazy evaluation, so the result is a column expression which leverages the power of Pandas UDFs. The only thing left to do is to actually assign the results to a new column, right?

hdf_fenced = hdf_fenced.assign(is_c_or_q=some_ports)
# What's in there?
0     True
1    False
2    False
3     True
4     True
Name: is_c_or_q, dtype: bool

You got that right! HandyFrame has a very convenient assign method, just like in Pandas!

And this is not all! Both specialized str and dt objects from Pandas are available as well! For instance, what if you want to find if a given string contains another substring?

col_mrs = hdf_fenced.pandas['Name'].str.find(sub='Mrs.')
hdf_fenced = hdf_fenced.assign(is_mrs=col_mrs > 0)

For a complete list of all supported functions, please check the repository.

7. Your Own UDFs

The sky is the limit! You can create regular Python functions and use assign to create new columns :-) And they will be turned into Pandas UDFs for you!

The arguments of your function (or lambda) should have the names of the columns you want to use. For instance, to take the log of Fare:

import numpy as np
hdf_fenced = hdf_fenced.assign(logFare=lambda Fare: np.log(Fare + 1))

You can also use functions that take multiple columns as arguments. Keep in mind that the default return type, that is, the data type of the new column, will be the same as the first column used (Fare, in the example).

It is also possible to specify different return types — please check the repository for examples on that.

8. Nicer Exceptions

Spark exceptions are loooong… whenever something breaks, the error bubbles up through a seemingly infinite number of layers!

I always advise my students to scroll all the way down and work their way up trying to figure out the source of the problem… but, not anymore!

HandySpark will parse the error and show you a nice and bold red summary at the very top :-) It may not be perfect, but it will surely help!

Handy Exception

9. Safety First

Some dataframe operations, like collect or toPandas will trigger the retrieval of ALL rows of the dataframe!

To prevent the undesirable side effects of these actions, HandySpark implements a safety mechanism! It will automatically limit the output to 1,000 rows:

Safety mechanism in action!

Of course, you can specify a different limit using set_safety_limit or throw caution to the wind and tell your HandyFrame to ignore the safety using safety_off. Turning the safety mechanism off is good for a single action, though, as it will kick back in after returning the requested unlimited result.

Final Thoughts

My goal is to improve PySpark user experience and allow for a smoother transition from Pandas to Spark DataFrames, making it easier to perform exploratory data analysis and visualize the data. Needless to say, this is a work in progress, and I have many more improvements already planned.

If you are a Data Scientist using PySpark, I hope you give HandySpark a try and let me know your thoughts on it :-)

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Hyper-parameters in Action! Introducing DeepReplay

Photo by Immo Wegmann on Unsplash

Originally posted on Towards Data Science.


In my previous post, I invited you to wonder what exactly is going on under the hood when you train a neural network. Then I investigated the role of activation functions, illustrating the effect they have on the feature space using plots and animations.

Now, I invite you to play an active role on the investigation!

It turns out these plots and animations drew quite some attention. So I decided to organize my code and structure it into a proper Python package, so you can plot and animate your own Deep Learning models!

How do they look like, you ask? Well, if you haven’t checked the original post yet, here it is a quick peek at it:

This is what animating with DeepReplay looks like :-)

So, without further ado, I present you… DeepReplay!


The package is called DeepReplay because this is exactly what it allows you to do: REPLAY the process of training your Deep Learning Model, plotting and animating several aspects of it.

The process is simple enough, consisting of five steps:

  1. It all starts with creating an instance of a callback!
  2. Then, business as usual: build and train your model.
  3. Next, load the collected data into Replay.
  4. Finally, create a figure and attach the visualizations to it.
  5. Plot and/or animate it!

Let’s go through each one of these steps!

1. Creating an instance of a callback

The callback should be an instance of ReplayData.

[gist id=”61394f6733e33ec72522a58614d1425a” /]

The callback takes, as arguments, the model inputs (X and y), as well as the filename and group name where you want to store the collected training data.

Two things to keep in mind:

  • For toy datasets, it is fine to use the same X and y as in your model fitting. These are the examples that will be plot —so, you can choose a random subset of your dataset to keep computation times reasonable, if you are using a bigger dataset.
  • The data is stored in a HDF5 file, and you can use the same file several times over, but never the same group! If you try running it twice using the same group name, you will get an error.

2. Build and train your model

Like I said, business as usual, nothing to see here… just don’t forget to add your callback instance to the list of callbacks when fitting!

[gist id=”86591c9796731c21f920e01ed2376b23″ /]

3. Load collected data into Replay

So, the part that gives the whole thing its name… time to replay it!

It should be straightforward enough: create an instance of Replay, providing the filename and the group name you chose in Step 1.

[gist id=”019637d6d041fdbd269db9a78a2311b6″ /]

4. Create a figure and attach visualizations to it

This is the step where things get interesting, actually. Just use Matplotlib to create a figure, as simple as the one in the example, or as complex as subplot2grid allows you to make it, and start attaching visualizations from your Replay object to the figure.

[gist id=”ba49bdca40a2abaa68af39922e78a556″ /]

The example above builds a feature space based on the output of the layer named, suggestively, hidden.

But there are five types of visualizations available:

  • Feature Space: plot representing the twisted and turned feature space, corresponding to the output of a hidden layer (only 2-unit hidden layers supported for now), including grid lines for 2-dimensional inputs;
  • Decision Boundary: plot of a 2-D grid representing the original feature space, together with the decision boundary (only 2-dimensional inputs supported for now);
  • Probability Histogram: two histograms of the resulting classification probabilities for the inputs, one for each class, corresponding to the model output (only binary classification supported for now);
  • Loss and Metric: line plot for both the loss and a chosen metric, computed over all the inputs you passed as arguments to the callback;
  • Loss Histogram: histogram of the losses computed over all the inputs you passed as arguments to the callback (only binary cross-entropy loss supported for now).

5. Plot and/or animate it!

For this example, with a single visualization, you can use its plot and animate methods directly. These methods will return, respectively, a figure and an animation, which you can then save to a file.

[gist id=”83ef91da63de149f5a58f6e428ab37f3″ /]

If you decide to go with multiple simultaneous visualizations, there are two helper methods that return composed plots and animations, respectively: compose_plots and compose_animations.

To illustrate these methods, here is a gist that comes from the “canonicalexample I used in my original post. There are four visualizations and five plots (Probability Histogram has two plots, for negative and positive cases).

The animated GIF at the beginning of this post is actually the result of this composed animation!

[gist id=”6ad78608f5ae7ebe2c31f84f9b001625″ /]


At this point, you probably noticed that the two coolest visualizations, Feature Space and Decision Boundary, are limited to two dimensions.

I plan on adding support for visualizations in three dimensions also, but most of datasets and models have either more inputs or hidden layers with many more units.

So, these are the options you have:

  • 2D inputs, 2-unit hidden layer: Feature Space with optional grid (check the Activation Functions example);
  • 3D+ inputs, 2-unit hidden layer: Feature Space, but no grid;
  • 2D inputs, hidden layer with 3+ units: Decision Boundary with optional grid (check the Circles example);
  • nothing is two dimensional: well… there is always a workaround, right?

Working around multidimensionality

What do we want to achieve? Since we can only do 2-dimensional plots, we want 2-dimensional outputs — simple enough.

How to get 2-dimensional outputs? Adding an extra hidden layer with two units, of course! OK, I know this is suboptimal, as it is actually modifying the model (did I mention this is a workaround?!). We can then use the outputs of this extra layer for plotting.

You can check either the Moons or the UCI Spambase notebooks, for examples on adding an extra hidden layer and plotting it.

NOTE: The following part is a bit more advanced, it delves deeper into the reasoning behind adding the extra hidden layer and what it represents. Proceed at your own risk :-)

What are we doing with the model, anyway? By adding an extra hidden layer, we can think of our model as having two components: an encoder and a decoder. Let’s dive just a bit deeper into those:

  • Encoder: the encoder goes from the inputs all the way to our extra hidden layer. Let’s consider its 2-dimensional output as features and call them f1 and f2.
  • Decoder: the decoder, in this case, is just a plain and simple logistic regression, which takes two inputs, say, f1 and f2, and outputs a classification probability.

Let me try to make it more clear with a network diagram:

Encoder / Decoder after adding an extra hidden layer

What do we have here? A 9-dimensional input, an original hidden layer with 5 units, an extra hidden layer with two units, its corresponding two outputs (features) and a single unit output layer.

So, what happens with the inputs along the way? Let’s see:

  1. Inputs (x1 through x9) are fed into the encoder part of the model.
  2. The original hidden layer twists and turns the inputs. The outputs of the hidden layer can also be thought of as features (these would be the outputs of units h1 through h5 in the diagram), but these are assumed to be n-dimensional and therefore not suited for plotting. So far, business as usual.
  3. Then comes the extra hidden layer. Its weights matrix has shape (n, 2) (in the diagram, n = 5 and we can count 10 arrows between h and e nodes). If we assume a linear activation function, this layer is actually performing an affine transformation, mapping points from a n-dimensional to a 2-dimensional feature space. These are our features, f1 and f2, the output of the encoder part.
  4. Since we assumed a linear activation function for the extra hidden layer, f1 and f2 are going to be directly fed to the decoder (output layer), that is, to a single unit with a sigmoid activation function. This is a plain and simple logistic regression.

What does it all mean? It means that our model is also learning a latent space with two latent factors (f1 and f2) now! Fancy, uh?! Don’t get intimidated by the fanciness of these terms, though… it basically means the model learned to best compress the information to only two features, given the task at hand — a binary classification.

This is the basic underlying principle of auto-encoders, the major difference being the fact that the auto-encoder’s task is to reconstruct its inputs, not classify them in any way.

Final Thoughts

I hope this post enticed you to try DeepReplay out :-)

If you come up with nice and cool visualizations for different datasets, or using different network architectures or hyper-parameters, please share it on the comments section. I am considering starting a Gallery page, if there is enough interest in it.

For more information about the DeepReplay package, like installation, documentation, examples and notebooks (which you can play with using Google Colab), please go to my GitHub repository:

Have fun animating your models! :-)

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Hyper-parameters in Action! Activation Functions


This is the first of a series of posts aiming at presenting visually, in a clear and concise way, some of the fundamental moving parts of training a neural network: the hyper-parameters.

Originally posted on Towards Data Science.


Deep Learning is all about hyper-parameters! Maybe this is an exaggeration, but having a sound understanding of the effects of different hyper-parameters on training a deep neural network is definitely going to make your life easier.

While studying Deep Learning, you’re likely to find lots of information on the importance of properly setting the network’s hyper-parameters: activation functions, weight initializer, optimizer, learning rate, mini-batch size, and the network architecture itself, like the number of hidden layers and the number of units in each layer.

So, you learn all the best practices, you set up your network, define the hyper-parameters (or just use its default values), start training and monitor the progress of your model’s losses and metrics.

Perhaps the experiment doesn’t go so well as you’d expect, so you iterate over it, tweaking the network, until you find out the set of values that will do the trick for your particular problem.

Looking for a deeper understanding (no pun intended!)

Have you ever wondered what exactly is going on under the hood? I did, and it turns out that some simple experiments may shed quite some light on this matter.

Take activation functions, for instance, the topic of this post. You and I know that the role of activation functions is to introduce a non-linearity, otherwise the whole neural network could be simply replaced by a corresponding affine transformation (that is, a linear transformation, such as rotating, scaling or skewing, followed by a translation), no matter how deep the network is.

A neural network having only linear activations (that is, no activation!) would have a hard time handling even a quite simple classification problem like this (each line has 1,000 points, generated for x values equally spaced between -1.0 and 1.0):

Figure 1: in this two-dimensional feature space, the blue line represents the negative cases (y = 0), while the green line represents the positive cases (y= 1).

If the only thing a network can do is to perform an affine transformation, this is likely what it would be able to come up with as a solution:

Figure 2: linear boundary — doesn’t look so good, right?

Clearly, this is not even close! Some examples of much better solutions are:

Figure 3: Non-linearities to the rescue!

These are three fine examples of what non-linear activation functions bring to the table! Can you guess which one of the images corresponds to a ReLU?

Non-linear boundaries (or are they?)

How does these non-linear boundaries come to be? Well, the actual role of the non-linearity is to twist and turn the feature space so much so that the boundary turns out to be… LINEAR!

OK, things are getting more interesting now (at least, I thought so first time I laid my eyes on it in this awesome Chris Olah’s blog post, from which I drew my inspiration to write this). So, let’s investigate it further!

Next step is to build the simplest possible neural network to tackle this particular classification problem. There are two dimensions in our feature space (x1 and x2), and the network has a single hidden layer with two units, so we preserve the number of dimensions when it comes to the outputs of the hidden layer (z1 and z2).

Figure 4: diagram of a simple neural network with a single 2-unit hidden layer

Up to this point, we are still on the realm of affine transformations… so, it is time for a non-linear activation function, represented by the Greek letter sigma, resulting in the activation values (a1 and a2) for the hidden layer.

These activation values represented the twisted and turned feature space I referred to in the first paragraph of this section. This is a preview of what it looks like, when using a sigmoid as activation function:

Figure 5: two-dimensional feature space: twisted and turned!

As promised, the boundary is LINEAR! By the way, the plot above corresponds to the left-most solution with a non-linear boundary on the original feature space (Figure 3).

Neural network’s basic math recap

Just to make sure you and I are on the same page, I am showing you below four representations of the very basic matrix arithmetic performed by the neural network up to the hidden layer, BEFORE applying the activation function (that is, just an affine transformation such as xW + b)

Basic matrix arithmetic: 4 ways of representing the same thing in the network

Time to apply the activation function, represented by the Greek letter sigma on the network diagram.

Activation function: applied on the results of the affine transformations

Voilà! We went from the inputs to the activation values of the hidden layer!

Implementing the network in Keras

For the implementation of this simple network, I used Keras Sequential model API. Apart from distinct activation functions, every model trained used the very same hyper-parameters:

  • weight initializers: Glorot (Xavier) normal (hidden layer) and random normal (output layer);
  • optimizer: Stochastic Gradient Descent (SGD);
  • learning rate: 0.05;
  • mini-batch size: 16;
  • number of hidden layers: 1;
  • number of units (in the hidden layer): 2.

Given that this is a binary classification task, the output layer has a single unit with a sigmoid activation function and the loss is given by binary cross-entropy.

[gist id=”e2536b9f45c4884f90d20d68e1b3d8c3″]

Code: simple neural network with a single 2-unit hidden layer

Activation functions in action!

Now, for the juicy part — visualizing the twisted and turned feature space as the network trains, using a different activation function each time: sigmoid, tanh and ReLU.

In addition to showing changes in the feature space, the animations also contain:

  • histograms of predicted probabilities for both negative (blue line) and positive cases (green line), with misclassified cases shown in red bars (using threshold = 0.5);
  • line plots of accuracy and average loss;
  • histogram of losses for every element in the dataset.


Let’s start with the most traditional of the activation functions, the sigmoid, even though, nowadays, its usage is pretty much limited to the output layer in classification tasks.

Figure 6: sigmoid activation function and its gradient

As you can see in Figure 6, a sigmoid activation functionsquashes” the inputs values into the range (0, 1) (same range probabilities can take, the reason why it is used in the output layer for classification tasks). Also, remember that the activation values of any given layer are the inputs of the following layer and, given the range for the sigmoid, the activation values are going to be centered around 0.5, instead of zero (as it usually is the case for normalized inputs).

It is also possible to verify that its gradient peak value is 0.25 (for z = 0) and that it gets already close to zero as |z| reaches a value of 5.

So, how does using a sigmoid activation function work for this simple network? Let’s take a look at the animation:

Sigmoid in action!

There are a couple of observations to be made:

  • epochs 15–40: it is noticeable the typical sigmoidsquashing” happening on the horizontal axis;
  • epochs 40–65: the loss stays at a plateau, and there is a “widening” of the transformed feature space on the vertical axis;
  • epoch 65: at this point, negative cases (blue line) are all correctly classified, even though its associated probabilities still are distributed up to 0.5; while the positive cases on the edges are still misclassified;
  • epochs 65–100: the aforementioned “widening” becomes more and more intense, up to the point pretty much all feature space is covered again, while the loss falls steadily;
  • epoch 103: thanks to the “widening”, all positive cases are now lying within the proper boundary, although some still have probabilities barely above the 0.5 threshold;
  • epoch 100–150: there is now some “squashing” happening on the vertical axis as well, the loss falls a bit more to what seems to be a new plateau and, except for a few of the positive edge cases, the network is pretty confident on its predictions.

So, the sigmoid activation function succeeds in separating both lines, but the loss declines slowly, while staying at plateaus for a significant portion of the training time.

Can we do better with a different activation function?


The tanh activation function was the evolution of the sigmoid, as it outputs values with a zero mean, differently from its predecessor.

Figure 7: tanh activation function and its gradient

As you can see in Figure 7, the tanh activation functionsquashes” the input values into the range (-1, 1). Therefore, being centered at zero, the activation values are already (somewhat) normalized inputs for the next layer.

Regarding the gradient, it has a much bigger peak value of 1.0 (again, for z = 0), but its decrease is even faster, approaching zero to values of |z| as low as 3. This is the underlying cause to what is referred to as the problem of vanishing gradients, which causes the training of the network to be progressively slower.

Now, for the corresponding animation, using tanh as activation function:

Tanh in action!

There are a couple of observations to be made:

  • epochs 10–40: there is a tanhsquashing” happening on the horizontal axis, though it less pronounced, while the loss stays at a plateau;
  • epochs 40–55: there is still no improvement in the loss, but there is a “widening” of the transformed feature space on the vertical axis;
  • epoch 55: at this point, negative cases (blue line) are all correctly classified, even though its associated probabilities still are distributed up to 0.5; while the positive cases on the edges are still misclassified;
  • epochs 55–65: the aforementioned “widening” quickly reaches the point where pretty much all feature space is covered again, while the loss falls abruptly;
  • epoch 69: thanks to the “widening”, all positive cases are now lying within the proper boundary, although some still have probabilities barely above the 0.5 threshold;
  • epochs 65–90: there is now some “squashing” happening on the vertical axis as well, the loss keeps falling until reaching a new plateau and the network exhibits a high level of confidence for all predictions;
  • epochs 90–150: only small improvements in the predicted probabilities happen at this point.

OK, it seems a bit better… the tanh activation function reached a correct classification for all cases faster, with the loss also declining faster (when declining, that is), but it also spends a lot of time in plateaus.

What if we get rid of all the “squashing”?


Rectified Linear Units, or ReLUs for short, are the commonplace choice of activation function these days. A ReLU addresses the problem of vanishing gradients so common in its two predecessors, while also being the fastest to compute gradients for.

Figure 8: ReLU activation function and its gradient

As you can see in Figure 8, the ReLU is a totally different beast: it does not “squash” the values into a range — it simply preserves positive values and turns all negative values into zero.

The upside of using a ReLU is that its gradient is either 1 (for positive values) or 0 (for negative values) — no more vanishing gradients! This pattern leads to a faster convergence of the network.

On the other hand, this behavior can lead to what it is called a “dead neuron”, that is, a neuron whose inputs are consistently negative and, therefore, always has an activation value of zero.

Time for the last of the animations, which is quite different from the previous two, thanks to the absence of “squashing” in the ReLU activation function:

ReLU in action!

There are a couple of observations to be made:

  • epochs 0–10: the loss falls steadily from the very beginning
  • epoch 10: at this point, negative cases (blue line) are all correctly classified, even though its associated probabilities still are distributed up to 0.5; while the positive cases on the edges are still misclassified;
  • epochs 10–60: loss falls until reaching a plateau, all cases are already correctly classified since epoch 52, and the network already exhibits a high level of confidence for all predictions;
  • epochs 60–150: only small improvements in the predicted probabilities happen at this point.

Well, no wonder the ReLUs are the de facto standard for activation functions nowadays. The loss kept falling steadily from the beginning and only plateaued at a level close to zero, reaching correct classification for all cases in about 75% the time it took tanh to do it.


The animations are cool (ok, I am biased, I made them!), but not very handy to compare the overall effect of each and every different activation function on the feature space. So, to make it easier for you to compare them, there they are, side by side:

Figure 9: linear boundaries on transformed feature space (top row), non-linear boundaries on original feature space (bottom row)

What about side-by-side accuracy and loss curves, so I can also compare the training speeds? Sure, here we go:

Figure 10: accuracy and loss curves for each activation function

Final Thoughts

The example I used to illustrate this post is almost as simple as it could possibly be, and the patterns depicted in the animations are intended to give you just a general idea of the underlying mechanics of each one of the activation functions.

Besides, I got “lucky” with my initialization of the weights (maybe using 42 as seed is a good omen?!) and all three networks learned to classify correctly all the cases within 150 epochs of training. It turns out, training is VERY sensitive to the initialization, but this is a topic for a future post.

Nonetheless, I truly hope this post and its animations can give you some insights and maybe even some “a-ha!” moments while learning about this fascinating topic that is Deep Learning.