Originally posted on Towards Data Science.
Yes, you got that right, through food! :-)
Imagine yourself ordering a pizza and, after a short while, getting that nice, warm and delicious pizza delivered to your home.
Have you ever wondered about the workflow behind getting such a pizza delivered to your home? I mean, the full workflow, from the sowing of tomato seeds to the bike rider buzzing at your door! It turns out, it is not so different from a Machine Learning workflow.
Really! Let’s check it out!
This post draws inspiration from a talk given by Cassie Kozyrkov, Chief Decision Scientist at Google, at the Data Natives Conference in Berlin.
1. Sowing
The farmer sows the seeds that will grow to become some of the ingredients of our pizza, like the tomatoes.
This is equivalent to the data-generating process, be it a user action or movement, heat or noise triggering a sensor, for instance.
2. Harvesting
Then comes the time for the harvest, that is, when the vegetables or fruits are ripe.
This is equivalent to the data collection, meaning the browser or sensor will translate the user action or the event that triggered the sensor into actual data.
3. Transporting
After the harvest, the products must be transported to their destination to be used as ingredients in our pizza.
This is equivalent to ingesting the data into a repository where it is going to be fetched from later, like a database or data lake.
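Here is a minimal sketch of what ingesting and later fetching data could look like. It assumes a handful of made-up sensor readings and uses an in-memory SQLite database as a stand-in for the real repository; the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical sensor readings, as they might arrive from data collection
readings = [
    ("sensor-1", "2024-01-01T10:00:00", 21.5),
    ("sensor-1", "2024-01-01T10:05:00", 21.9),
    ("sensor-2", "2024-01-01T10:00:00", 19.2),
]

# An in-memory database stands in for the real repository
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor_id TEXT, recorded_at TEXT, temperature REAL)"
)

# Ingestion: store the collected data
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)
conn.commit()

# Later on, the data is fetched from the repository to be used
rows = conn.execute(
    "SELECT recorded_at, temperature FROM readings "
    "WHERE sensor_id = ? ORDER BY recorded_at",
    ("sensor-1",),
).fetchall()
```

In a real setting the repository would be a proper database or data lake, but the idea is the same: store first, fetch later.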
4. Choosing Appliances and Utensils
For every ingredient, there is the most appropriate utensil for handling it. If you need to slice, use a knife. If you need to stir, a spoon. The same reasoning is valid for the appliances: if you need to bake, use an oven. If you need to fry, a stove. You can also use a more sophisticated appliance like a microwave, with many, many more available options for setting it up.
Sometimes, it is even better to use a simpler appliance — have you ever seen a restaurant advertise “microwaved pizzas”?! I haven’t!
In Machine Learning, utensils are techniques for preprocessing the data, while the appliances are the algorithms, like a Linear Regression or a Random Forest. You can also use a microwave, I mean, Deep Learning. The different options available are the hyper-parameters. There are only a few in simple appliances, I mean, algorithms. But there are many, many more in a sophisticated one. Besides, there is no guarantee a sophisticated algorithm will deliver a better performance (or do you like microwaved pizzas better?!). So, choose your algorithms wisely.
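To make the "microwaved pizza" point a bit more concrete, here is a small sketch comparing a simple appliance (a straight line fitted by least squares, with no hyper-parameters) to a more flexible one (k-nearest neighbours, with the hyper-parameter k). The data is made up for illustration, and the outcome holds for this toy data only; it is not a general result.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(xs, ys, x_new, k):
    """k-nearest-neighbours regression: average y of the k closest xs."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))[:k]
    return sum(ys[i] for i in nearest) / k


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up training data: y = 2x plus noise alternating between +1.5 and -1.5
train_x = list(range(10))
train_y = [2 * x + (1.5 if x % 2 == 0 else -1.5) for x in train_x]

# Made-up, noise-free held-out data
test_x = [x + 0.25 for x in range(9)]
test_y = [2 * x for x in test_x]

slope, intercept = fit_line(train_x, train_y)
line_mse = mse(test_y, [slope * x + intercept for x in test_x])

# k is a hyper-parameter: with k=1 the algorithm memorizes the noise
knn_mse = mse(test_y, [knn_predict(train_x, train_y, x, k=1) for x in test_x])

print(f"line: {line_mse:.3f}  1-NN: {knn_mse:.3f}")
```

On this particular dataset the straight line has the lower held-out error, because the flexible algorithm ends up reproducing the noise in the training data.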
5. Choosing a Recipe
It is not enough to have ingredients and appliances. You also need a recipe, which has all the steps you need to follow to prepare your dish.
This is your model. And no, your model is not the same as your algorithm. The model includes all pre- and post-processing required by your algorithm. And, talking about pre-processing…
6. Preparing the Ingredients
I bet you the first instructions in most recipes are like: “slice this”, “peel that” and so on. They don’t tell you to wash the vegetables, because that’s a given — no one wants to eat dirty vegetables, right?
Well, the same holds true for data. No one wants dirty data. You have to clean it, that is, handle missing values and outliers. And then you have to peel it and slice it, I mean, pre-process it, for example by encoding categorical variables (male or female, for instance) into numeric ones (0 or 1).
No one likes that part. Neither the data scientists nor the cooks (I guess).
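For the curious, this is roughly what washing, peeling and slicing could look like in code. It is only a sketch: the records, the plausibility threshold and the choice of median imputation are all assumptions made for this example.

```python
from statistics import median

# Hypothetical raw records: one missing age and one implausible age
raw = [
    {"age": 34, "gender": "female"},
    {"age": None, "gender": "male"},
    {"age": 29, "gender": "male"},
    {"age": 410, "gender": "female"},  # most likely a data-entry error
    {"age": 41, "gender": "female"},
]

MAX_PLAUSIBLE_AGE = 120  # assumed threshold for this sketch

# Cleaning, step 1: treat implausible values (outliers) as missing
ages = [
    r["age"] if r["age"] is not None and r["age"] <= MAX_PLAUSIBLE_AGE else None
    for r in raw
]

# Cleaning, step 2: fill missing values with the median of the known ones
fill_value = median(a for a in ages if a is not None)
ages = [fill_value if a is None else a for a in ages]

# Pre-processing: encode the categorical variable as a number
gender_code = {"male": 0, "female": 1}
clean = [
    {"age": age, "gender": gender_code[r["gender"]]}
    for age, r in zip(ages, raw)
]
```

Other choices, like dropping incomplete records or using a different encoding, may suit your data better.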
7. Special Preparations
Sometimes you can get creative with your ingredients to achieve either a better taste or a more sophisticated presentation.
You can dry-age a steak for a different flavor or carve a carrot to look like a rose and place it on top of your dish :-)
This is feature engineering! It is an important step that may substantially improve the performance of your model, if done in a clever way.
Pretty much every data scientist enjoys that part. I guess the cooks like it too.
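As a small illustration, suppose the raw data only contains the timestamp of each pizza order. A sketch of feature engineering could derive more informative features from it; the feature names and the "dinner time" window below are assumptions made for this example.

```python
from datetime import datetime


def engineer_features(timestamp):
    """Turn a raw ISO timestamp into features a model could use."""
    moment = datetime.fromisoformat(timestamp)
    return {
        "hour": moment.hour,
        # weekday() returns 5 for Saturday and 6 for Sunday
        "is_weekend": int(moment.weekday() >= 5),
        # assumed dinner window: from 18:00 up to (not including) 22:00
        "is_dinner_time": int(18 <= moment.hour < 22),
    }


# Hypothetical order timestamps
orders = ["2024-03-01T19:30:00", "2024-03-02T12:15:00"]
features = [engineer_features(ts) for ts in orders]
```

Whether such features actually help depends on the problem, so it is worth checking them during evaluation.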
8. Cooking
The fundamental step: without actually cooking, there is no dish. Obviously. You put the prepared ingredients into the appliance, adjust the heat and wait a while before checking it again.
This is the training of your model. You feed the data to your algorithm, adjust its hyper-parameters and wait a while before checking it again.
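A minimal sketch of such a training loop is shown below. It assumes a straight-line model fitted by gradient descent on made-up data; the learning rate and the number of epochs play the role of the heat and the cooking time, and their default values are arbitrary choices for this example.

```python
def train(xs, ys, learning_rate=0.05, epochs=500, check_every=100):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for epoch in range(1, epochs + 1):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Wait a while before checking it again
        if epoch % check_every == 0:
            loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
            history.append((epoch, loss))
    return w, b, history


# Made-up data following y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, history = train(xs, ys)
for epoch, loss in history:
    print(f"epoch {epoch}: loss {loss:.6f}")
```

With a learning rate that is too high, the loss can blow up instead of shrinking, which is the Machine Learning version of burning the pizza.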
9. Tasting
Even if you follow a recipe to the letter, you cannot guarantee everything is exactly right. So, how do you know if you got it right? You taste it! If it is not good, you may add more salt to try and fix it. You may also change the temperature. But you keep on cooking!
Unfortunately, sometimes your pizza is going to burn, or taste horrible no matter what you do to try to salvage it. You throw it in the garbage, learn from your mistakes and start over.
Hopefully, persistence and a bit of luck will produce a delicious pizza :-)
Tasting is evaluating. You need to evaluate your model to check if it is doing alright. If not, you may need to add more features. You may also change a hyper-parameter. But you keep on training!
Unfortunately, sometimes your model is not going to converge to a solution, or it will make horrible predictions no matter what you do to try to salvage it. You discard your model, learn from your mistakes and start over.
Hopefully, persistence and a bit of luck will result in a high-performing model :-)
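One common way of tasting a model is to hold out part of the data and compare the model against a naive baseline. The sketch below assumes made-up data and a hypothetical, already trained model; the split fraction and the seed are arbitrary choices.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and hold out a fraction of them for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data following y = 2x + 1
data = [(x, 2 * x + 1) for x in range(20)]
train_rows, test_rows = train_test_split(data)


def model(x):
    """Hypothetical trained model, slightly off from the true relationship."""
    return 2.1 * x + 0.8


# Naive baseline: always predict the average target seen during training
baseline = sum(y for _, y in train_rows) / len(train_rows)

test_y = [y for _, y in test_rows]
model_mae = mean_absolute_error(test_y, [model(x) for x, _ in test_rows])
baseline_mae = mean_absolute_error(test_y, [baseline] * len(test_rows))
```

If the model cannot beat the baseline on the held-out data, it is probably time to add features, change a hyper-parameter, or start over.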
10. Delivering
From the point of view of the cook, his/her work is done. He/she cooked a delicious pizza. Period.
But if the pizza does not get delivered nicely and in time to the customer, the pizzeria is going out of business and the cook is losing his/her job.
After the pizza is cooked, it must be promptly packaged to keep it warm and carefully handled so it does not look all squishy when it reaches the hungry customer. If the bike rider doesn’t reach his/her destination, loses the pizza along the way or shakes it beyond recognition, all cooking effort is good for nothing.
Delivering is deployment. Not pizzas, but predictions. Predictions, like pizzas, must be packaged, not in boxes, but as data products, so they can be delivered to the eager customers. If the pipeline fails, breaks along the way or modifies the predictions in any way, all model training and evaluation is good for nothing.
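As a rough sketch of packaging, predictions could be wrapped in a JSON payload together with the version of the model that produced them, and checked on arrival. The field names, the version label and the predictions themselves are made up for this example.

```python
import json


def package_predictions(model_version, predictions):
    """Box the predictions, labeled with the model that produced them."""
    return json.dumps(
        {"model_version": model_version, "predictions": predictions}
    )


def unpack_predictions(payload):
    """What the customer does when the box arrives."""
    return json.loads(payload)["predictions"]


# Hypothetical predictions coming out of a model
predictions = [
    {"customer_id": 17, "will_order_pizza": 0.91},
    {"customer_id": 42, "will_order_pizza": 0.08},
]

payload = package_predictions("v1", predictions)
delivered = unpack_predictions(payload)

# The pipeline must not modify the predictions along the way
assert delivered == predictions
```

A real deployment involves much more (serving, monitoring, retries), but a check like the last line is a cheap way to notice a shaken pizza.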
That’s it! Machine Learning is like cooking food — there are several people involved in the process and it takes a lot of effort, but the final result can be delicious!
Just a few takeaways:
- if ingredients are bad, the dish is going to be bad — no recipe can fix that and certainly, no appliance, either;
- if you are a cook, never forget that, without delivering, there is no point in cooking, as no one will ever taste your delicious food;
- if you are a restaurant owner, don’t try to impose appliances on your cook — sometimes microwaves are not the best choice — and you’ll get a very unhappy cook if he/she spends all his/her time washing and slicing ingredients…
I don’t know about you, but I feel like ordering a pizza now! :-)