How to pick a successful AI project, part 1: Finding the problem and collecting data

This post is part of a series on ‘how to pick a successful AI project’. Part 2, Part 3.

Imagine that you have 200 hrs of your life to improve your career prospects in ML. What is the best use of this time? I’m betting on doing a portfolio project.

Let’s compare different options:

1. You could go to meetups

2. You could do tutorials, follow template code, and be GitHub user #2245 with the very same exercise in their repo

3. You could take classes or MOOCs

Option 1 relies on serendipity. Assume you come across a potential job opportunity every 5 meetups (which could be a bit optimistic!). You may well spend a year going to meetups and getting exactly zero job opportunities if:

  • you don’t look amazing on paper/online, and
  • you don’t communicate very well in person

Assuming commuting to and attending a meetup takes 5 hours, how many meetups can you fit into those 200 hours? Forty. At one opportunity every 5 meetups, that’s 8 conversations worth having. Still, without some strong signal that you can perform (such as work experience, or a tangible ML project you built from scratch), you will not convert these random conversations into job offers.

Options 2 and 3 (tutorials and MOOCs) don’t differentiate you enough from the masses. They are a prerequisite for reaching the level that would make a hiring manager take notice. Because work experience is hard to get when you are trying to break into a new field (a chicken-and-egg problem), that leaves you with my preferred option, and the reason I wrote this series of posts: show the world what you can do with an AI portfolio project.

A substantial project strongly dominates the other options

Once I asked Ted Dunning: “What is the one thing you care about as an interviewer?”. His answer: “I only care about one thing: What you have done when nobody told you what to do.”

Having a portfolio of projects shows creativity, which is extremely important in data problems.

Data cleaning, and even model fitting, is somewhat grunt work. We will likely automate much of it in the future.

What is not automatable (and where you want to excel) is:

• finding a problem

• that is worth solving (produces business value; helps someone)

• and that is solvable with current tech

Ok, so you want to do a good AI project

The rest of this post is what I’ve learned after mentoring more than 150 AI projects over five years at the companies I’ve founded.

I’ve grouped what I know into four classes:

• Finding the problem

• Collecting data

• Working with data

• Working with models

Finding the Problem

What human task is your project helping or displacing? If the task’s decisions take longer than a second for a human, pick another one

Machines are good at helping humans with boring tasks. The rule of thumb (from Andrew Ng) is that you want to tackle tasks that take a human less than one second. Driving, for example, is a long series of sub-second decisions. There’s little deep thinking there. Writing a novel is not like that. While NLP progress may look like we are getting closer and closer, writing a novel with AI is not a good project.

Myth: you need a lot of data; you need to be Google to get value out of machine learning

When looking for problems, you are looking for data too. Often you feel there’s no data to address the problem you found. Or not enough data. Data is valuable, and those who have it don’t make it public. There’s plenty of public data, but nothing that matches what you need.

You come up with hacks, of course. Maybe you can combine different sources? Maybe you generate and label the data yourself?

Then you read online that you need giant datasets. You are screwed. You are not going to label millions of images just by yourself.

It’s true that for very complex deep learning models, you need those giant datasets. A major boost for ML in the 21st century is that we now have a lot more data, and compute power, to train more complex models.

Many of the revolutionary results in computer vision and NLP come from having a lot of data.

Does this mean that you cannot do productive machine learning if you have no big compute clusters or datasets in the Terabyte range? Of course not.

Deep learning libraries ship with pretrained models (like Inception V3, ResNet, AlexNet). You can use them to save computation time.
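As a quick, hedged illustration (not from the original post), this is how you might load a pretrained Inception V3 through Keras and use it out of the box; the image path is just a placeholder.

```python
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from keras.preprocessing import image

# Weights pretrained on ImageNet: no training, no giant dataset needed
model = InceptionV3(weights='imagenet')

# Classify a single image (the path is a placeholder)
img = image.load_img('some_photo.jpg', target_size=(299, 299))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])
```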

Take, for example, YOLO (You Only Look Once). Anyone with an off-the-shelf GPU can do real-time object detection and segmentation in video. That is amazing. How many projects can you conceive with this feature? Make it an exercise: list 10 projects that you could build based on YOLO. By the time you read this, it’s likely there’s a new algorithm that does the task better. It’s a beautiful time to be alive. You don’t need terabyte-sized datasets to get value out of machine learning.

Pick a problem that passes ‘the eyebrow test’

The skill of picking a problem is as useful as the skill to solve it. Some problems will capture your imagination, and some will leave you unmoved. Pick the first.

And then there are those projects that make people think that what you are saying is impossible. You want one of those.

How do you know you picked the right problem? Use ‘the eyebrow test.’ You want to see the eyebrows of the other person going up when they hear it. If they don’t, keep looking.

Some problems are mere curiosities and can suck your resources. Don’t go for those. There’s tremendous potential for impact using ML right now. Why spend your time creating a translation algorithm from Klingon to Dothraki (both invented languages) when you can make people’s lives easier?

At Data Science Retreat we believe that because there are plenty of ‘good problems,’ there’s no excuse to pick one that doesn’t pass the eyebrow test. It takes time to find a good problem. How long should you spend? As much as needed to pass the eyebrow test.

It helps to be around people who have found a good problem before. Sometimes, those with deep domain knowledge in ML are not as useful as you may think for picking up ideas.

You are selling the value of AI. When you do ‘eyebrow test’ projects, you help not only the people-who-have-that-problem (always keep them in mind) but… your fellow data scientists. The more impressive your projects are, the more the general public will appreciate how transformative AI is for society. And the more opportunities you create for everyone else in the field.

You only need to be in the ballpark of a good idea

One participant came to me saying that he was running late to pick an idea and that he had exhausted all his possibilities. He had nothing.

“Ok, let’s look at what you are passionate about. What’s alive in you?”

“Well,… I hate waste.”

So we googled ‘waste trash machine learning.’ Nothing too obvious, nor too exciting, came up. After some more searching, we found a Stanford project that categorized trash. The students had a trash dataset, painfully labeled. Still, this was not exactly an idea that passes the ‘eyebrow test.’

“What if instead of categorizing trash, we could build something that picks up trash?” (iterating on the idea)

“You mean something that goes around autonomously?”

“Yes, a self-driving toy car. There was one project in batch 08 of DSR that did exactly that. The car ran laps on a circuit. You would have to modify it to identify trash, get close to it, and pick it up.”

“That sounds amazing!” (eyebrow test passed)

With time, this ballpark idea morphed from general trash to picking up cigarette butts, which are terrible for the environment. Details about how to pick the cigarette butts improved with time: stabbing them, instead of trying to grab them with a robotic arm. We will talk more about this project later in this series.

Pick a problem that moves you. If nothing does, use ‘watering holes’ to listen to problems that move people

If you have the problem you want to tackle, you have a significant advantage. You understand the need. You can build the solution and know how well it works by applying it to yourself. You can tell between ‘nice to have’ and ‘pain point.’ At this point, what we are doing is no different from what startup founders and product managers do.

You have in your hands the closest thing to a shortcut: have the problem yourself. You will save time by not going into detours that don’t help. You will have a keen sense of what to build.

In 2014 I was building a company, which eventually failed, doing customer lifetime value (CLV) predictions for e-commerce stores. The product, CLV predictions, was something I could deliver myself as a consultant, though. So I became a ‘CLV consultant.’ One of my clients was hiring a data scientist, and they hired me full-time, so I magically transitioned into data science.

Many others had the same problem: they had tech skills, maybe a Ph.D., and they wanted to become data scientists, but they didn’t know how. Remember, this was 2014, before the web started boiling with advice on how to become a data scientist. I built a business around helping others solve this problem, and it’s been working well for the last five years.

I knew the problem well: too much information, unclear guidelines, interviewers who don’t know how to recognize talent. Every step of the way, I felt I knew what I was doing with this business, a feeling that is extremely valuable. Transitioning to data science is an excellent problem; five years later, people still struggle with it.

I don’t play golf. If I wanted to build a product for golfers, I’d be entirely lost. I would build features nobody needs; I would miss the pain points. Even if I ran interviews and listened to the market, I would be at a disadvantage compared to a golfer.

So my advice to pick a portfolio project: pick a problem you know well. Even better if it’s a problem that moves you. If you lost three friends to suicide, build something to prevent suicide. In this case, you don’t have the problem yourself, but you have a strong motivation to solve it.

What if you have no problems whatsoever? You have been in the same industry forever (say Oil and Gas), and all the valuable problems there have been taken care of!

I don’t believe you. There’s no industry so mature that all problems are solved. But, ok, you cannot come up with something that moves you, some problem that you have.

Then observe what problems other people have. Large groups of people. They tend to congregate in public spaces and bitch about their problems; every time you see people bitching about something… turn that into an opportunity to do a project.

Which public spaces? I call these ‘watering holes’ (HT to Amy Hoy). Online, you can find obscure forums, but Reddit and Twitter are the easiest. Just sit there and ‘listen’ to people discuss the problem. Learn every detail of it. Is it a real problem, or a ‘nice to have’?

For example, you may join a gaming subreddit to see if gamers care about having stronger AI in videogames. Or if they care for VR. These ideas are too abstract to be good project ideas, but you get where I’m going.

Collecting data

Integrate distinct data sources

Companies are often so focused on getting value out of the data they have that they forget they can increase that value by using data that is not in their company but is publicly available.

There’s plenty of open access data. And APIs for data that changes frequently. There’s no reason not to use multiple data sources. You can solve a more interesting problem (one that was not obvious before) by integrating APIs.

For a collection of APIs, check https://www.programmableweb.com/.

Problems that seemed impossible with a single source of data become solvable when you add a new data source. Boring projects come alive. Thankless tasks become a joy to work with if you manage to find a twist that shows more value.

When using multiple data sources, you have to stitch them together using a shared key (a column that is present in both datasets). You cannot combine data sources that don’t have a shared key, and that tends to be a showstopper for many ideas.
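As a minimal sketch (the file names and the shared zip_code column are hypothetical), stitching two sources together in pandas looks like this:

```python
import pandas as pd

# Two hypothetical sources that share a 'zip_code' column
sales = pd.read_csv('store_sales.csv')        # zip_code, monthly_revenue, ...
weather = pd.read_csv('weather_by_zip.csv')   # zip_code, avg_temperature, ...

# Stitch them together on the shared key
combined = sales.merge(weather, on='zip_code', how='inner')
print(combined.head())
```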

Instead of collecting or reusing data, produce your own data

You don’t need to find data, or own it. Thanks to pretrained models (see the later section on them), you don’t need all that much data, which means you can produce it yourself. Removing “I don’t have much/any data” as an obstacle opens up the space of problems you can solve.

To produce original data, I found one big hack: use hardware. Sensors are cheap, and they give you data that you own.

JD.com’s Shanghai fulfillment center uses automated warehouse robotics to organize, pick, and ship 200k orders per day. Four human workers tend the facility. JD.com grew its warehouse count and surface area 45% year over year. AI affects manufacturing too. It’s easier than ever to produce things en masse, and this means there are a lot more hardware ‘toys’ on the market. Things that you would never have considered affordable, like microscopes and spectroscopes, are reaching the mass consumer market and are eminently hackable. These are wonderful data sources!

Because of Shenzhen, Kickstarter, etc., hardware is evolving far faster than before. It’s never going to be as fast to iterate on as software, but we are getting there. Have you checked what’s available on AliExpress? There are multiple sensors you can buy for under 100 bucks. Attach one of these to a phone running your code, and you have a portable, purpose-specific machine.

For example, you can buy a microscope and use it to detect malaria without a human doctor. Deep learning running on a phone is good enough to count parasites in blood.

AliExpress is full of cheap hardware that you can attach to a phone. You can add a lot of value to someone’s life using a mixture of phones (that run the ML code) and cheap sensors.

Example: our Malaria Microscope.

Eduardo, AIscope’s founder, after reaching 1000x the first time

Malaria kills about 400k people per year, mostly children. It’s curable, but detecting it is not trivial, and it happens in parts of the world where hospitals and doctors are not very accessible. Malaria parasites are quite big, and a simple microscope can show them; the standard diagnostic method involves a doctor counting them.

It turns out you can attach a USB microscope to a mobile phone and run DL code on the phone that counts parasites with accuracy comparable to a human. Eduardo Peire, a DSR alumnus, started with a public malaria dataset. While small, it was enough to demonstrate value to people, and his crowdfunding campaign got him enough funds to fly to the Amazon and collect more samples. You can follow their progress here: http://aiscope.net.

For another example, you can buy a spectroscope that, pointed at any material, tells you its composition. It’s small enough to attach to a phone for a hand-held scanner. Can you detect traces of peanuts in food? Yes! There you go, a solution to a problem real people have. If you are allergic to peanuts, this will buy you a real improvement in quality of life.

Sensors are cheap nowadays, and they will help you get unique data. You can turn a phone into a microscope, a spectroscope, or any other tool. The built-in camera and accelerometer are excellent sources of data too.

Next on this series: what I’ve learned about working with data and how this can help you pick successful AI projects.

Understanding binary cross-entropy / log loss: a visual explanation

Photo by G. Crescoli on Unsplash

Originally posted on Towards Data Science.

Introduction

If you are training a binary classifier, chances are you are using binary cross-entropy / log loss as your loss function.

Have you ever thought about what it really means to use this loss function? The thing is, given the ease of use of today’s libraries and frameworks, it is very easy to overlook the true meaning of the loss function you use.

Motivation

I was looking for a blog post that would explain the concepts behind binary cross-entropy / log loss in a visually clear and concise manner, so I could show it to my students at Data Science Retreat. Since I could not find any that would fit my purpose, I took on the task of writing it myself :-)

A Simple Classification Problem

Let’s start with 10 random points:

x = [-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]

This is our only feature: x.

Figure 0: the feature

Now, let’s assign some colors to our points: red and green. These are our labels.

Figure 1: the data

So, our classification problem is quite straightforward: given our feature x, we need to predict its label: red or green.

Since this is a binary classification, we can also pose this problem as: “is the point green” or, even better, “what is the probability of the point being green”? Ideally, green points would have a probability of 1.0 (of being green), while red points would have a probability of 0.0 (of being green).

In this setting, green points belong to the positive class (YES, they are green), while red points belong to the negative class (NO, they are not green).

If we fit a model to perform this classification, it will predict a probability of being green to each one of our points. Given what we know about the color of the points, how can we evaluate how good (or bad) are the predicted probabilities? This is the whole purpose of the loss function! It should return high values for bad predictions and low values for good predictions.

For a binary classification like our example, the typical loss function is the binary cross-entropy / log loss.

Loss Function: Binary Cross-Entropy / Log Loss

If you look this loss function up, this is what you’ll find:

Binary Cross-Entropy / Log Loss
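In equation form (the standard formula, reconstructed here in LaTeX in place of the original image):

H_p(q) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \cdot \log\big(p(y_i)\big) + (1 - y_i) \cdot \log\big(1 - p(y_i)\big)\right]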

where y is the label (1 for green points and 0 for red points) and p(y) is the predicted probability of the point being green for all N points.

Reading the formula, it tells you that, for each green point (y=1), it adds log(p(y)) to the loss, that is, the log probability of it being green. Conversely, it adds log(1-p(y)), that is, the log probability of it being red, for each red point (y=0). Not necessarily difficult, sure, but not so intuitive either…

Besides, what does entropy have to do with all this? Why are we taking the log of probabilities in the first place? These are valid questions, and I hope to answer them in the “Show me the math” section below.

But, before going into more formulas, let me show you a visual representation of the formula above…

Computing the Loss — the visual way

First, let’s split the points according to their classes, positive or negative, like the figure below:

Figure 2: splitting the data!

Now, let’s train a Logistic Regression to classify our points. The fitted regression is a sigmoid curve representing the probability of a point being green for any given x. It looks like this:

Figure 3: fitting a Logistic Regression

Then, for all points belonging to the positive class (green), what are the predicted probabilities given by our classifier? These are the green bars under the sigmoid curve, at the x coordinates corresponding to the points.

Figure 4: probabilities of classifying points in the POSITIVE class correctly

OK, so far, so good! What about the points in the negative class? Remember, the green bars under the sigmoid curve represent the probability of a given point being green. So, what is the probability of a given point being red? The red bars ABOVE the sigmoid curve, of course :-)

Figure 5: probabilities of classifying points in the NEGATIVE class correctly

Putting it all together, we end up with something like this:

Figure 6: all probabilities put together!

The bars represent the predicted probabilities associated with the corresponding true class of each point!

OK, we have the predicted probabilities… time to evaluate them by computing the binary cross-entropy / log loss!

These probabilities are all we need, so, let’s get rid of the x axis and bring the bars next to each other:

Figure 7: probabilities of all points

Well, the hanging bars don’t make much sense anymore, so let’s reposition them:

Figure 8: probabilities of all points — much better :-)

Since we’re trying to compute a loss, we need to penalize bad predictions, right? If the probability associated with the true class is 1.0, we need its loss to be zero. Conversely, if that probability is low, say, 0.01, we need its loss to be HUGE!

It turns out, taking the (negative) log of the probability suits us well enough for this purpose (since the log of values between 0.0 and 1.0 is negative, we take the negative log to obtain a positive value for the loss).

Actually, the reason we use log for this comes from the definition of cross-entropy, please check the “Show me the math” section below for more details.

The plot below gives us a clear picture —as the predicted probability of the true class gets closer to zero, the loss increases exponentially:

Figure 9: Log Loss for different probabilities

Fair enough! Let’s take the (negative) log of the probabilities — these are the corresponding losses of each and every point.

Finally, we compute the mean of all these losses.

Figure 10: finally, the loss!

Voilà! We have successfully computed the binary cross-entropy / log loss of this toy example. It is 0.3329!

Show me the code

If you want to double check the value we found, just run the code below and see for yourself :-)

[gist id=”c8cf1c6bbbb4422d082bfd77074bb257″]
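In case the gist doesn’t render, here is a hedged sketch of the computation using scikit-learn. The label assignment (the three leftmost points being red) and the fitted logistic regression are assumptions on my part, so the resulting value may differ slightly from 0.3329.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

x = np.array([-2.2, -1.4, -0.8, 0.2, 0.4, 0.8, 1.2, 2.2, 2.9, 4.6]).reshape(-1, 1)
# Assumed labels: 3 red points (y=0) on the left, 7 green points (y=1) on the right
y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

# Fit a logistic regression (the sigmoid curve of Figure 3)
logr = LogisticRegression(solver='lbfgs')
logr.fit(x, y)
p_green = logr.predict_proba(x)[:, 1]

# Binary cross-entropy / log loss of the predicted probabilities
print(log_loss(y, p_green))
```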

Show me the math (really?!)

Jokes aside, this post is not intended to be very mathematically inclined… but for those of you, my readers, looking to understand the role of entropy and logarithms in all this, here we go :-)

If you want to go deeper into information theory, including all these concepts — entropy, cross-entropy and much, much more — check Chris Olah’s post out, it is incredibly detailed!

Distribution

Let’s start with the distribution of our points. Since y represents the classes of our points (we have 3 red points and 7 green points), this is what its distribution, let’s call it q(y), looks like:

Figure 11: q(y), the distribution of our points

Entropy

Entropy is a measure of the uncertainty associated with a given distribution q(y).

What if all our points were green? What would be the uncertainty of that distribution? ZERO, right? After all, there would be no doubt about the color of a point: it is always green! So, entropy is zero!

On the other hand, what if we knew exactly half of the points were green and the other half, red? That’s the worst case scenario, right? We would have absolutely no edge on guessing the color of a point: it is totally random! For that case, entropy is given by the formula below (we have two classes (colors) — red or green — hence, 2):

Entropy for a half-half distribution
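Reconstructed in LaTeX, the half-half case is:

H = -\sum_{c=1}^{2}\frac{1}{2}\log\left(\frac{1}{2}\right) = \log(2)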

For every other case in between, we can compute the entropy of a distribution, like our q(y), using the formula below, where C is the number of classes:

Entropy
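Written out, with q(y_c) being the probability of class c under the true distribution:

H(q) = -\sum_{c=1}^{C} q(y_c)\log\big(q(y_c)\big)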

So, if we know the true distribution of a random variable, we can compute its entropy. But, if that’s the case, why bother training a classifier in the first place? After all, we KNOW the true distribution…

But, what if we DON’T? Can we try to approximate the true distribution with some other distribution, say, p(y)? Sure we can! :-)

Cross-Entropy

Let’s assume our points follow this other distribution p(y). But we know they are actually coming from the true (unknown) distribution q(y), right?

If we compute entropy like this, we are actually computing the cross-entropy between both distributions:

Cross-Entropy
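That is, we still weight by the true distribution q, but take the log of the approximating distribution p:

H_p(q) = -\sum_{c=1}^{C} q(y_c)\log\big(p(y_c)\big)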

If we, somewhat miraculously, match p(y) to q(y) perfectly, the computed values for both cross-entropy and entropy will match as well.

Since this is likely never happening, cross-entropy will have a BIGGER value than the entropy computed on the true distribution.

Cross-Entropy minus Entropy

It turns out, this difference between cross-entropy and entropy has a name…

Kullback-Leibler Divergence

The Kullback-Leibler Divergence, or “KL Divergence” for short, is a measure of the dissimilarity between two distributions:

KL Divergence
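In formula form, the KL divergence is exactly the gap between cross-entropy and entropy:

D_{KL}(q\,\|\,p) = H_p(q) - H(q) = \sum_{c=1}^{C} q(y_c)\left[\log\big(q(y_c)\big) - \log\big(p(y_c)\big)\right]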

This means that, the closer p(y) gets to q(y), the lower the divergence and, consequently, the cross-entropy, will be.

So, we need to find a good p(y) to use… but, this is what our classifier should do, isn’t it?! And indeed it does! It looks for the best possible p(y), which is the one that minimizes the cross-entropy.

Loss Function

During its training, the classifier uses each of the N points in its training set to compute the cross-entropy loss, effectively fitting the distribution p(y)! Since the probability of each point is 1/N, cross-entropy is given by:

Cross-Entropy —point by point
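With each point weighted by 1/N, this reads:

H_p(q) = -\frac{1}{N}\sum_{i=1}^{N}\log\big(p(y_i)\big)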

Remember Figures 6 to 10 above? We need to compute the cross-entropy on top of the probabilities associated with the true class of each point. It means using the green bars for the points in the positive class (y=1) and the red hanging bars for the points in the negative class (y=0) or, mathematically speaking:

Mathematical expression corresponding to Figure 10 :-)

The final step is to compute the average of all points in both classes, positive and negative:

Binary Cross-Entropy — computed over positive and negative classes
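A plausible reconstruction of that formula, splitting the sum over the N_pos positive and N_neg negative points:

H_p(q) = -\frac{1}{N_{pos}+N_{neg}}\left[\sum_{i=1}^{N_{pos}}\log\big(p(y_i)\big) + \sum_{i=1}^{N_{neg}}\log\big(1 - p(y_i)\big)\right]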

Finally, with a little bit of manipulation, we can take any point, either from the positive or negative classes, under the same formula:

Binary Cross-Entropy — the usual formula

Voilà! We got back to the original formula for binary cross-entropy / log loss :-)

Final Thoughts

I truly hope this post was able to shine some new light on a concept that is quite often taken for granted, that of binary cross-entropy as a loss function. Moreover, I also hope it served to show you a little bit of how Machine Learning and Information Theory are linked together.

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Hyper-parameters in Action! Introducing DeepReplay

Photo by Immo Wegmann on Unsplash

Originally posted on Towards Data Science.

Introduction

In my previous post, I invited you to wonder what exactly is going on under the hood when you train a neural network. Then I investigated the role of activation functions, illustrating the effect they have on the feature space using plots and animations.

Now, I invite you to play an active role on the investigation!

It turns out these plots and animations drew quite some attention. So I decided to organize my code and structure it into a proper Python package, so you can plot and animate your own Deep Learning models!

What do they look like, you ask? Well, if you haven’t checked the original post yet, here is a quick peek at it:

This is what animating with DeepReplay looks like :-)

So, without further ado, I present you… DeepReplay!

DeepReplay

The package is called DeepReplay because this is exactly what it allows you to do: REPLAY the process of training your Deep Learning Model, plotting and animating several aspects of it.

The process is simple enough, consisting of five steps:

  1. It all starts with creating an instance of a callback!
  2. Then, business as usual: build and train your model.
  3. Next, load the collected data into Replay.
  4. Finally, create a figure and attach the visualizations to it.
  5. Plot and/or animate it!

Let’s go through each one of these steps!

1. Creating an instance of a callback

The callback should be an instance of ReplayData.

[gist id=”61394f6733e33ec72522a58614d1425a” /]

The callback takes, as arguments, the model inputs (X and y), as well as the filename and group name where you want to store the collected training data.

Two things to keep in mind:

  • For toy datasets, it is fine to use the same X and y as in your model fitting. These are the examples that will be plotted — so, if you are using a bigger dataset, you can choose a random subset of it to keep computation times reasonable.
  • The data is stored in an HDF5 file, and you can use the same file several times over, but never the same group! If you try running it twice using the same group name, you will get an error.

2. Build and train your model

Like I said, business as usual, nothing to see here… just don’t forget to add your callback instance to the list of callbacks when fitting!

[gist id=”86591c9796731c21f920e01ed2376b23″ /]
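In case the embedded gists don’t render, here is a condensed, hypothetical sketch of Steps 1 and 2 together; the ReplayData signature follows the package’s README at the time of writing, and the toy data is just a placeholder, so check the repository for the current API.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from deepreplay.callbacks import ReplayData

# Placeholder 2-D toy data (any small binary classification set will do)
X = np.random.randn(200, 2)
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Step 1: the callback that records training data into an HDF5 file
replay_data = ReplayData(X, y, filename='training_data.h5', group_name='run_01')

# Step 2: business as usual -- just remember to pass the callback to fit()
model = Sequential([
    Dense(2, input_dim=2, activation='sigmoid', name='hidden'),
    Dense(1, activation='sigmoid', name='output'),
])
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])
model.fit(X, y, epochs=50, batch_size=16, callbacks=[replay_data])
```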

3. Load collected data into Replay

So, the part that gives the whole thing its name… time to replay it!

It should be straightforward enough: create an instance of Replay, providing the filename and the group name you chose in Step 1.

[gist id=”019637d6d041fdbd269db9a78a2311b6″ /]

4. Create a figure and attach visualizations to it

This is the step where things get interesting, actually. Just use Matplotlib to create a figure, as simple as the one in the example, or as complex as subplot2grid allows you to make it, and start attaching visualizations from your Replay object to the figure.

[gist id=”ba49bdca40a2abaa68af39922e78a556″ /]

The example above builds a feature space based on the output of the layer named, suggestively, hidden.

But there are five types of visualizations available:

  • Feature Space: plot representing the twisted and turned feature space, corresponding to the output of a hidden layer (only 2-unit hidden layers supported for now), including grid lines for 2-dimensional inputs;
  • Decision Boundary: plot of a 2-D grid representing the original feature space, together with the decision boundary (only 2-dimensional inputs supported for now);
  • Probability Histogram: two histograms of the resulting classification probabilities for the inputs, one for each class, corresponding to the model output (only binary classification supported for now);
  • Loss and Metric: line plot for both the loss and a chosen metric, computed over all the inputs you passed as arguments to the callback;
  • Loss Histogram: histogram of the losses computed over all the inputs you passed as arguments to the callback (only binary cross-entropy loss supported for now).

5. Plot and/or animate it!

For this example, with a single visualization, you can use its plot and animate methods directly. These methods will return, respectively, a figure and an animation, which you can then save to a file.

[gist id=”83ef91da63de149f5a58f6e428ab37f3″ /]

If you decide to go with multiple simultaneous visualizations, there are two helper methods that return composed plots and animations, respectively: compose_plots and compose_animations.

To illustrate these methods, here is a gist that comes from the “canonical” example I used in my original post. There are four visualizations and five plots (the Probability Histogram has two plots, for negative and positive cases).

The animated GIF at the beginning of this post is actually the result of this composed animation!

[gist id=”6ad78608f5ae7ebe2c31f84f9b001625″ /]

Limitations

At this point, you probably noticed that the two coolest visualizations, Feature Space and Decision Boundary, are limited to two dimensions.

I plan on adding support for visualizations in three dimensions as well, but most datasets and models have either more inputs or hidden layers with many more units.

So, these are the options you have:

  • 2D inputs, 2-unit hidden layer: Feature Space with optional grid (check the Activation Functions example);
  • 3D+ inputs, 2-unit hidden layer: Feature Space, but no grid;
  • 2D inputs, hidden layer with 3+ units: Decision Boundary with optional grid (check the Circles example);
  • nothing is two dimensional: well… there is always a workaround, right?

Working around multidimensionality

What do we want to achieve? Since we can only do 2-dimensional plots, we want 2-dimensional outputs — simple enough.

How to get 2-dimensional outputs? Adding an extra hidden layer with two units, of course! OK, I know this is suboptimal, as it is actually modifying the model (did I mention this is a workaround?!). We can then use the outputs of this extra layer for plotting.

You can check either the Moons or the UCI Spambase notebooks, for examples on adding an extra hidden layer and plotting it.

NOTE: The following part is a bit more advanced, it delves deeper into the reasoning behind adding the extra hidden layer and what it represents. Proceed at your own risk :-)

What are we doing with the model, anyway? By adding an extra hidden layer, we can think of our model as having two components: an encoder and a decoder. Let’s dive just a bit deeper into those:

  • Encoder: the encoder goes from the inputs all the way to our extra hidden layer. Let’s consider its 2-dimensional output as features and call them f1 and f2.
  • Decoder: the decoder, in this case, is just a plain and simple logistic regression, which takes two inputs, say, f1 and f2, and outputs a classification probability.

Let me try to make it more clear with a network diagram:

Encoder / Decoder after adding an extra hidden layer

What do we have here? A 9-dimensional input, an original hidden layer with 5 units, an extra hidden layer with two units, its corresponding two outputs (features) and a single unit output layer.

So, what happens with the inputs along the way? Let’s see:

  1. Inputs (x1 through x9) are fed into the encoder part of the model.
  2. The original hidden layer twists and turns the inputs. The outputs of the hidden layer can also be thought of as features (these would be the outputs of units h1 through h5 in the diagram), but these are assumed to be n-dimensional and therefore not suited for plotting. So far, business as usual.
  3. Then comes the extra hidden layer. Its weights matrix has shape (n, 2) (in the diagram, n = 5 and we can count 10 arrows between h and e nodes). If we assume a linear activation function, this layer is actually performing an affine transformation, mapping points from an n-dimensional to a 2-dimensional feature space. These are our features, f1 and f2, the output of the encoder part.
  4. Since we assumed a linear activation function for the extra hidden layer, f1 and f2 are going to be directly fed to the decoder (output layer), that is, to a single unit with a sigmoid activation function. This is a plain and simple logistic regression.

What does it all mean? It means that our model is also learning a latent space with two latent factors (f1 and f2) now! Fancy, huh?! Don’t get intimidated by the fanciness of these terms, though… it basically means the model learned to best compress the information into only two features, given the task at hand — a binary classification.

This is the basic underlying principle of auto-encoders, the major difference being the fact that the auto-encoder’s task is to reconstruct its inputs, not classify them in any way.

Final Thoughts

I hope this post enticed you to try DeepReplay out :-)

If you come up with nice and cool visualizations for different datasets, or using different network architectures or hyper-parameters, please share them in the comments section. I am considering starting a Gallery page, if there is enough interest in it.

For more information about the DeepReplay package, like installation, documentation, examples and notebooks (which you can play with using Google Colab), please go to my GitHub repository:

Have fun animating your models! :-)

If you have any thoughts, comments or questions, please leave a comment below or contact me on Twitter.

Hyper-parameters in Action! Activation Functions

Introduction

This is the first of a series of posts aiming at presenting visually, in a clear and concise way, some of the fundamental moving parts of training a neural network: the hyper-parameters.

Originally posted on Towards Data Science.

Motivation

Deep Learning is all about hyper-parameters! Maybe this is an exaggeration, but having a sound understanding of the effects of different hyper-parameters on training a deep neural network is definitely going to make your life easier.

While studying Deep Learning, you’re likely to find lots of information on the importance of properly setting the network’s hyper-parameters: activation functions, weight initializer, optimizer, learning rate, mini-batch size, and the network architecture itself, like the number of hidden layers and the number of units in each layer.

So, you learn all the best practices, you set up your network, define the hyper-parameters (or just use their default values), start training, and monitor the progress of your model’s losses and metrics.

Perhaps the experiment doesn’t go as well as you’d expect, so you iterate over it, tweaking the network, until you find the set of values that will do the trick for your particular problem.

Looking for a deeper understanding (no pun intended!)

Have you ever wondered what exactly is going on under the hood? I did, and it turns out that some simple experiments may shed quite some light on this matter.

Take activation functions, for instance, the topic of this post. You and I know that the role of activation functions is to introduce a non-linearity, otherwise the whole neural network could be simply replaced by a corresponding affine transformation (that is, a linear transformation, such as rotating, scaling or skewing, followed by a translation), no matter how deep the network is.

A neural network having only linear activations (that is, no activation!) would have a hard time handling even a quite simple classification problem like this (each line has 1,000 points, generated for x values equally spaced between -1.0 and 1.0):

Figure 1: in this two-dimensional feature space, the blue line represents the negative cases (y = 0), while the green line represents the positive cases (y= 1).

If the only thing a network can do is to perform an affine transformation, this is likely what it would be able to come up with as a solution:

Figure 2: linear boundary — doesn’t look so good, right?

Clearly, this is not even close! Some examples of much better solutions are:

Figure 3: Non-linearities to the rescue!

These are three fine examples of what non-linear activation functions bring to the table! Can you guess which one of the images corresponds to a ReLU?

Non-linear boundaries (or are they?)

How do these non-linear boundaries come to be? Well, the actual role of the non-linearity is to twist and turn the feature space so much that the boundary ends up being… LINEAR!

OK, things are getting more interesting now (at least, I thought so the first time I laid my eyes on it in this awesome blog post by Chris Olah, from which I drew my inspiration to write this). So, let’s investigate it further!

The next step is to build the simplest possible neural network to tackle this particular classification problem. There are two dimensions in our feature space (x1 and x2), and the network has a single hidden layer with two units, so we preserve the number of dimensions when it comes to the outputs of the hidden layer (z1 and z2).

Figure 4: diagram of a simple neural network with a single 2-unit hidden layer

Up to this point, we are still in the realm of affine transformations… so, it is time for a non-linear activation function, represented by the Greek letter sigma, resulting in the activation values (a1 and a2) for the hidden layer.

These activation values represent the twisted and turned feature space I referred to in the first paragraph of this section. This is a preview of what it looks like when using a sigmoid as the activation function:

Figure 5: two-dimensional feature space: twisted and turned!

As promised, the boundary is LINEAR! By the way, the plot above corresponds to the left-most solution with a non-linear boundary on the original feature space (Figure 3).

Neural network’s basic math recap

Just to make sure you and I are on the same page, I am showing you below four representations of the very basic matrix arithmetic performed by the neural network up to the hidden layer, BEFORE applying the activation function (that is, just an affine transformation such as xW + b).

Basic matrix arithmetic: 4 ways of representing the same thing in the network

Time to apply the activation function, represented by the Greek letter sigma on the network diagram.

Activation function: applied on the results of the affine transformations
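Condensed into two equations (with x as a row vector, matching the xW + b form mentioned above):

\mathbf{z} = \mathbf{x}W + \mathbf{b} \qquad\qquad \mathbf{a} = \sigma(\mathbf{z})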

Voilà! We went from the inputs to the activation values of the hidden layer!

Implementing the network in Keras

For the implementation of this simple network, I used the Keras Sequential model API. Apart from the distinct activation functions, every model trained used the very same hyper-parameters:

  • weight initializers: Glorot (Xavier) normal (hidden layer) and random normal (output layer);
  • optimizer: Stochastic Gradient Descent (SGD);
  • learning rate: 0.05;
  • mini-batch size: 16;
  • number of hidden layers: 1;
  • number of units (in the hidden layer): 2.

Given that this is a binary classification task, the output layer has a single unit with a sigmoid activation function and the loss is given by binary cross-entropy.

[gist id=”e2536b9f45c4884f90d20d68e1b3d8c3″]

Code: simple neural network with a single 2-unit hidden layer
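In case the gist doesn’t render, here is a minimal sketch matching the hyper-parameters listed above; the stand-in dataset (two parallel horizontal lines) is an assumption, since the exact coordinates of the original lines aren’t given here.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import glorot_normal, RandomNormal

# Hypothetical stand-in for the two-lines dataset of Figure 1
x1 = np.linspace(-1, 1, 1000)
X = np.vstack([np.stack([x1, -0.5 * np.ones_like(x1)], axis=1),   # blue line (y = 0)
               np.stack([x1,  0.5 * np.ones_like(x1)], axis=1)])  # green line (y = 1)
y = np.concatenate([np.zeros(1000), np.ones(1000)])

model = Sequential()
# Hidden layer: 2 units, Glorot (Xavier) normal initializer; swap 'sigmoid' for 'tanh' or 'relu'
model.add(Dense(2, input_dim=2, kernel_initializer=glorot_normal(seed=42),
                activation='sigmoid', name='hidden'))
# Output layer: single unit, random normal initializer, sigmoid for binary classification
model.add(Dense(1, kernel_initializer=RandomNormal(seed=42),
                activation='sigmoid', name='output'))

model.compile(loss='binary_crossentropy', optimizer=SGD(lr=0.05), metrics=['acc'])
model.fit(X, y, epochs=150, batch_size=16)
```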

Activation functions in action!

Now, for the juicy part — visualizing the twisted and turned feature space as the network trains, using a different activation function each time: sigmoid, tanh and ReLU.

In addition to showing changes in the feature space, the animations also contain:

  • histograms of predicted probabilities for both negative (blue line) and positive cases (green line), with misclassified cases shown in red bars (using threshold = 0.5);
  • line plots of accuracy and average loss;
  • histogram of losses for every element in the dataset.

Sigmoid

Let’s start with the most traditional of the activation functions, the sigmoid, even though, nowadays, its usage is pretty much limited to the output layer in classification tasks.

Figure 6: sigmoid activation function and its gradient

As you can see in Figure 6, a sigmoid activation function “squashes” the input values into the range (0, 1) (the same range probabilities can take, which is the reason it is used in the output layer for classification tasks). Also, remember that the activation values of any given layer are the inputs of the following layer and, given the range of the sigmoid, the activation values are going to be centered around 0.5, instead of zero (as is usually the case for normalized inputs).

It is also possible to verify that its gradient peaks at 0.25 (for z = 0) and that it gets close to zero already as |z| reaches a value of 5.
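For reference, the sigmoid and its gradient (which peaks at 0.25) can be written as:

\sigma(z) = \frac{1}{1 + e^{-z}} \qquad\qquad \frac{d\sigma}{dz} = \sigma(z)\,\big(1 - \sigma(z)\big) \leq 0.25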

So, how does using a sigmoid activation function work for this simple network? Let’s take a look at the animation:

Sigmoid in action!

There are a couple of observations to be made:

  • epochs 15–40: the typical sigmoid “squashing” is noticeable on the horizontal axis;
  • epochs 40–65: the loss stays at a plateau, and there is a “widening” of the transformed feature space on the vertical axis;
  • epoch 65: at this point, negative cases (blue line) are all correctly classified, even though their associated probabilities are still distributed up to 0.5, while the positive cases on the edges are still misclassified;
  • epochs 65–100: the aforementioned “widening” becomes more and more intense, up to the point where pretty much all of the feature space is covered again, while the loss falls steadily;
  • epoch 103: thanks to the “widening”, all positive cases now lie within the proper boundary, although some still have probabilities barely above the 0.5 threshold;
  • epochs 100–150: there is now some “squashing” happening on the vertical axis as well, the loss falls a bit more to what seems to be a new plateau and, except for a few of the positive edge cases, the network is pretty confident in its predictions.

So, the sigmoid activation function succeeds in separating both lines, but the loss declines slowly, while staying at plateaus for a significant portion of the training time.

Can we do better with a different activation function?

Tanh

The tanh activation function was the evolution of the sigmoid, as it outputs values with a zero mean, unlike its predecessor.

Figure 7: tanh activation function and its gradient

As you can see in Figure 7, the tanh activation function “squashes” the input values into the range (-1, 1). Therefore, being centered at zero, the activation values are already (somewhat) normalized inputs for the next layer.

Regarding the gradient, it has a much bigger peak value of 1.0 (again, for z = 0), but it decreases even faster, approaching zero for values of |z| as low as 3. This is the underlying cause of what is referred to as the problem of vanishing gradients, which causes the training of the network to become progressively slower.
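For reference, the tanh and its gradient (peaking at 1.0) are:

\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \qquad\qquad \frac{d\tanh}{dz} = 1 - \tanh^{2}(z) \leq 1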

Now, for the corresponding animation, using tanh as activation function:

Tanh in action!

There are a couple of observations to be made:

  • epochs 10–40: there is a tanh “squashing” happening on the horizontal axis, though it is less pronounced, while the loss stays at a plateau;
  • epochs 40–55: there is still no improvement in the loss, but there is a “widening” of the transformed feature space on the vertical axis;
  • epoch 55: at this point, negative cases (blue line) are all correctly classified, even though their associated probabilities are still distributed up to 0.5, while the positive cases on the edges are still misclassified;
  • epochs 55–65: the aforementioned “widening” quickly reaches the point where pretty much all of the feature space is covered again, while the loss falls abruptly;
  • epoch 69: thanks to the “widening”, all positive cases now lie within the proper boundary, although some still have probabilities barely above the 0.5 threshold;
  • epochs 65–90: there is now some “squashing” happening on the vertical axis as well, the loss keeps falling until reaching a new plateau, and the network exhibits a high level of confidence for all predictions;
  • epochs 90–150: only small improvements in the predicted probabilities happen at this point.

OK, it seems a bit better… the tanh activation function reached a correct classification for all cases faster, with the loss also declining faster (when declining, that is), but it also spends a lot of time in plateaus.

What if we get rid of all the “squashing”?

ReLU

Rectified Linear Units, or ReLUs for short, are the commonplace choice of activation function these days. A ReLU addresses the problem of vanishing gradients so common in its two predecessors, while also being the fastest to compute gradients for.

Figure 8: ReLU activation function and its gradient

As you can see in Figure 8, the ReLU is a totally different beast: it does not “squash” the values into a range — it simply preserves positive values and turns all negative values into zero.

The upside of using a ReLU is that its gradient is either 1 (for positive values) or 0 (for negative values) — no more vanishing gradients! This pattern leads to a faster convergence of the network.

On the other hand, this behavior can lead to what is called a “dead neuron”, that is, a neuron whose inputs are consistently negative and, therefore, always has an activation value of zero.
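For reference, the ReLU and its gradient are:

\mathrm{ReLU}(z) = \max(0, z) \qquad\qquad \frac{d\,\mathrm{ReLU}}{dz} = \begin{cases} 1, & z > 0 \\ 0, & z < 0 \end{cases}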

Time for the last of the animations, which is quite different from the previous two, thanks to the absence of “squashing” in the ReLU activation function:

ReLU in action!

There are a couple of observations to be made:

  • epochs 0–10: the loss falls steadily from the very beginning;
  • epoch 10: at this point, negative cases (blue line) are all correctly classified, even though their associated probabilities are still distributed up to 0.5, while the positive cases on the edges are still misclassified;
  • epochs 10–60: the loss falls until reaching a plateau, all cases have been correctly classified since epoch 52, and the network already exhibits a high level of confidence for all predictions;
  • epochs 60–150: only small improvements in the predicted probabilities happen at this point.

Well, no wonder ReLUs are the de facto standard for activation functions nowadays. The loss kept falling steadily from the beginning and only plateaued at a level close to zero, reaching correct classification for all cases in about 75% of the time it took the tanh to do it.

Showdown

The animations are cool (OK, I am biased, I made them!), but not very handy for comparing the overall effect of each different activation function on the feature space. So, to make it easier for you to compare them, here they are, side by side:

Figure 9: linear boundaries on transformed feature space (top row), non-linear boundaries on original feature space (bottom row)

What about side-by-side accuracy and loss curves, so I can also compare the training speeds? Sure, here we go:

Figure 10: accuracy and loss curves for each activation function

Final Thoughts

The example I used to illustrate this post is almost as simple as it could possibly be, and the patterns depicted in the animations are intended to give you just a general idea of the underlying mechanics of each one of the activation functions.

Besides, I got “lucky” with my initialization of the weights (maybe using 42 as the seed is a good omen?!) and all three networks learned to classify all the cases correctly within 150 epochs of training. It turns out training is VERY sensitive to the initialization, but this is a topic for a future post.

Nonetheless, I truly hope this post and its animations can give you some insights and maybe even some “a-ha!” moments while learning about this fascinating topic that is Deep Learning.