I’ve mentored about 165 projects during the last five years at the companies where I work. The people who built those projects all got jobs (mostly). The more I talk to companies interviewing today, the more apparent it is: A portfolio project is decisive when making hiring judgments. Jeremy Howard recommends it. Andrew Ng recommends it. Why? It’s far better at discriminating talent than any other proxy (CVs don’t work; pedigree doesn’t work; puzzles don’t work).
And yet, the proportion of people who finish MOOCs AND have a portfolio project is tiny. What’s the catch?
Well, it could be that these are orthogonal skills. Finishing something as demanding as a deep learning MOOC (or a uni degree) shows that you have grit. You can start and finish difficult, demanding projects. But what hiring managers want to know is: “What did you do when nobody told you what to do?”, “Can you find ways to generate value (that we may have overlooked)?”, “Can you work with other people?”. Many people can prepare for, and pass, the ‘typical developer interview.’ That one with the whiteboard, and the typical questions that you never encounter in real life. However, what everyone, on both sides, hopes for is to meet someone who can work well beyond ‘the standard.’ What separates you from the rest is a substantial portfolio project. Not that many people can display the creativity and grit to get a killer portfolio project off the ground.
Because this post is about how to pick an ML project, you have to make a decision first: start your career focusing on machine learning (ML) or on data management. The second option is plenty lucrative and doesn’t require you to know much ML. But then you will be stuck with the dreaded ‘data cleaning’ (A data scientist spends 80% of her time cleaning data, and 20% of her time bitching about cleaning data). You may realize that after working extra-hard to become competent in machine learning, you are not doing all that much of it once you get a job.
There’s a trick to avoid too much data cleaning: go to a company that has a good team of data engineers. These are rare, though. So it’s safe to assume that in a ‘normal’ company you will have to do some data cleaning, even if you want to stay away from it. I have more to say about companies; the last bit of this post will cover how to choose which companies to apply for. But now let’s focus on the project.
How do you pick a good project? (Proof Of Concept)
Morten T. Hansen’s book ‘Great at Work’ shows that exceptional performers tend to ‘Do less, then obsess.’ He shows pretty convincing empirical evidence. So I’m going to recommend: Do one project, not many that are half-assed. If the hiring manager looks at your GitHub, there must be a clear ‘wow’ story there. Something she still remembers after an entire day looking at resumes.
You want to have something to show that is a Proof of Concept (POC). While you will not have resources to build an entire product, your GitHub should have something that demonstrates that such a product is feasible. The interviewer must think, ‘ok, if he came to my company and did something like this, he would be adding serious value.’
Picking the right problem is a great skill, and it takes years of practice to do it reliably. It’s a mixture of ‘Fermi-style’ estimations (‘how many piano tuners are there in a city?’), time management, and general technical know-how.
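If you have never done a Fermi-style estimation, here is a minimal sketch of the piano tuner question in Python. The function name and every number in it are made-up guesses for a hypothetical city, not data; the point is only the habit of chaining rough guesses into an order-of-magnitude answer.

```python
def estimate_piano_tuners(
    population,
    people_per_household,
    pianos_per_household,
    tunings_per_piano_per_year,
    tunings_per_tuner_per_day,
    working_days_per_year,
):
    """Chain rough guesses into an order-of-magnitude estimate."""
    households = population / people_per_household
    pianos = households * pianos_per_household
    tunings_needed = pianos * tunings_per_piano_per_year
    tunings_per_tuner = tunings_per_tuner_per_day * working_days_per_year
    return tunings_needed / tunings_per_tuner


# Every input below is a made-up guess for a hypothetical city of one million people.
tuners = estimate_piano_tuners(
    population=1_000_000,
    people_per_household=2,
    pianos_per_household=0.05,  # guess: 1 household in 20 owns a piano
    tunings_per_piano_per_year=1,
    tunings_per_tuner_per_day=4,
    working_days_per_year=250,
)
print(round(tuners))
```

With these guesses the estimate lands at roughly 25 tuners. The same chain of guesses works for a project idea: how many potential users, how often the problem occurs, how much time or money each occurrence costs.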
The good news is that in ML there’s lots of low-hanging fruit. There’s lots of code already, so many pretrained networks, and so few solved-with-ML problems! No matter how many papers per year we get (too many to track even in a reduced domain), there’s plenty of opportunity in ML right now. Applications of technology this complex have never been more accessible! If you want to practice this skill, tweet 2-3 project-worthy problems a day, and ask people for feedback. If you tweet them at me (@quesada), I’ll do my best to rate them with a boolean for plausibility (in case lots of people take up this suggestion!)
We data scientists have the luxury to pick problems. And you should practice that skill! Knowing what is possible today is the first step. Twitter is a great place to pick up what’s happening in ML. You must have heard of most of the breakthroughs, and have an internal ranking of methods and applications. For example, everybody seems to have watched David Ha’s ‘Everybody dances’ demo. If you didn’t, stop what you are doing and watch it now. Your immediate reaction should be: ‘how can I apply this in a way that generates business value?’
Over a year, at least a couple of dozen papers make a big splash in the scene. Your job is to keep a catalog and match it to business problems. Medium has several writers who are amazing at implementing the latest algos fast. If you want the most exhaustive collection of implementations, try https://paperswithcode.com. To navigate arXiv, try http://www.arxiv-sanity.com (this is good for picking up trends; I don’t recommend making paper reading a priority if you want to be a practitioner). About videos: https://nips.cc now has videos for most talks. ‘Processing’ NeurIPS is a serious job, so it’s easier to read summaries from people soon after they attended. Remember, you are doing this to pick up “what’s possible,” not to become a learned scholar. It’s tempting to get lost in the rabbit hole of awesomeness.
Which problem you pick says a lot about your maturity as a data scientist and your creativity. You are demonstrating your taste and your business acumen. Passion projects are ok if your passion is not too obscure (for example, I wouldn’t work on detecting fake Cohiba cigars, no matter how passionate you are about those). If knowing the domain can give you an unfair advantage, by all means, use it. A good idea gets people saying ‘I would use that’ or ‘How can I help?’ If you don’t get these reactions, keep looking.
Deep learning (DL) has two good advantages over ‘classical’ machine learning: 1/ it can use unstructured data (images, audio, text) which is both flashier and more abundant than tabular data, and 2/ You can use pretrained networks! If there’s a pretrained network that does something remotely similar to what you want to do, you have just saved yourself a lot of time. Because of transfer learning and lots of pretrained networks, demonstrating value should be easy. Note that taking what you build to production would be problematic if you depend on GPUs, but this is beyond the scope of your proof of concept. Just be prepared to defend your choices in an interview.
If you don’t want to do DL, there’s another useful trick: pick a topic companies care about. For example, if you want to impress companies in the EU, you can do a POC that uses the company’s proprietary datasets while protecting privacy. This was impossible until recently: Andrew Trask is building a great framework to train models on data you cannot access at once. His book ‘Grokking Deep Learning’ (great stuff!) has an entire chapter on privacy, secure aggregation and homomorphic encryption. The keyword you want is ‘federated learning.’ Google is doing quite a bit of research about it because it could save the company from a frontal attack coming from regulators.
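To make ‘federated learning’ less abstract, here is a toy sketch of federated averaging in plain Python. It is not Andrew Trask’s framework or anything Google ships, and it leaves out secure aggregation and encryption entirely. All function names, the linear model, and the synthetic client data are assumptions made up for illustration. The idea it shows: each client trains on its own data locally, and only model weights (never the raw data) travel to the server, which averages them.

```python
import random


def local_update(w, data, lr=0.1, epochs=5):
    """A few gradient-descent steps for y ~ a*x + b on one client's private data."""
    a, b = w
    n = len(data)
    for _ in range(epochs):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in data) / n
        a, b = a - lr * grad_a, b - lr * grad_b
    return (a, b)


def federated_average(updates, sizes):
    """Server step: average client weights, weighted by each client's sample count."""
    total = sum(sizes)
    return tuple(
        sum(u[i] * n for u, n in zip(updates, sizes)) / total
        for i in range(len(updates[0]))
    )


def run_fedavg(clients, rounds=100):
    """Alternate local training and server averaging; raw data never leaves a client."""
    w = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(w, data) for data in clients]
        w = federated_average(updates, [len(data) for data in clients])
    return w


def make_client(n, slope, intercept, rng):
    """Synthetic, noise-free stand-in for one company's private dataset."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, slope * x + intercept) for x in xs]


rng = random.Random(0)
clients = [make_client(n, slope=3.0, intercept=-1.0, rng=rng) for n in (20, 50, 30)]
slope, intercept = run_fedavg(clients)
print(round(slope, 3), round(intercept, 3))
```

In a real POC the ‘clients’ would be separate machines or organizations, and you would want a proper library plus the privacy machinery this sketch skips. Still, being able to explain this loop in an interview already sets you apart.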
How long should you work on a ‘good problem’? At least 1.5 months. This length seems to work fine at the companies I work for. Anything shorter can look like a ‘weekend project’; you want to avoid that at all costs.
Rather than having one of the popular datasets, for example, ‘mammograms,’ or worse ‘Titanic,’ pick some original data and problem. Remember you are trying to impress a hiring manager that sees hundreds of CVs and many have the same old datasets. For example, MOOCs and bootcamps that ask everybody to do a project on the same topic produce reams of CVs that all look the same. Now imagine you are the hiring manager, and you have seen dozens in the last month; will you be impressed?
You are going to do all that alone (or in a small team of two). In a company, accepting a project is something that goes through a more substantial group:
- Product defines the problem and creates a user story that describes the need.
- The research team reads relevant articles and looks for similar problems which other data science teams have faced.
- The data team collects relevant data, and someone (internal or external) needs to label it.
- The team tries some algorithms and approaches and returns with some kind of baseline.
Because you are working on a project to showcase your skills, you would have to do this without a large team. But is it impossible? No. Here are some examples:
Malaria kills about 400k people per year, mostly children. It’s curable, but detecting it is not trivial, and it happens in parts of the world where hospitals and doctors are not very accessible. Malaria parasites are quite big and a simple microscope can show them; the standard diagnostic method involves a doctor counting them.
It turns out you can attach a USB microscope to a mobile phone, and run DL code on the phone that counts parasites with accuracy comparable to a human. Eduardo Peire, a DSR alumnus, started with a public malaria dataset. While small, this was enough to demonstrate value to people, and his crowdfunding campaign got him enough funds to fly to the Amazon and collect more samples. This is a story for another day; you can follow their progress here: http://aiscope.net.
Information about pedestrian road surface type and roughness is essential for vehicles that need smoothness to operate. Especially for wheelchair users, scooters, or bicycle riders, road surface roughness significantly affects the comfort of the ride. A large number of street images are available on the web, but for pedestrian roads, they typically do not have roughness information or surface labels.
Masanori Kanazu and Dmitry Efimenko classify road surface type and predict roughness value using only a smartphone camera image. They mount a smartphone to a wheeled device and record video and accelerometer data simultaneously. From the accelerometer data, they calculate a roughness index and train a classifier to predict the roughness of the road and surface type.
Their system can be utilized to annotate crowdsourcing street maps like OpenStreetMap with roughness values and surface types. Also, their approach demonstrates that one can map the road surface with an off-the-shelf device such as a smartphone camera and accelerometer.
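I don’t know exactly how Masanori and Dmitry computed their roughness index, so treat the following as a stand-in rather than their method. A simple and common proxy is the RMS of vertical acceleration around its mean within a fixed window: a smooth surface barely shakes the phone, a rough one shakes it a lot. The function names, window size, and sample readings below are all hypothetical.

```python
import math


def roughness_index(vertical_accel):
    """RMS deviation of vertical acceleration from its mean (removes the gravity offset)."""
    mean = sum(vertical_accel) / len(vertical_accel)
    return math.sqrt(sum((a - mean) ** 2 for a in vertical_accel) / len(vertical_accel))


def windowed_roughness(vertical_accel, window):
    """One roughness value per non-overlapping window of samples."""
    return [
        roughness_index(vertical_accel[i:i + window])
        for i in range(0, len(vertical_accel) - window + 1, window)
    ]


# Made-up accelerometer readings in m/s^2: a smooth stretch, then a bumpy one.
smooth = [9.8, 9.8, 9.8, 9.8]
rough = [8.8, 10.8, 8.8, 10.8]
print(windowed_roughness(smooth + rough, window=4))
```

Each window’s value can then serve as the label for the video frames recorded during that window, which is how you get training labels without annotating anything by hand.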
These two projects are original. They demonstrate what a small team can do with modern DL and … well, cheap sensors that already come with a phone or you can buy online. If I were a hiring manager, these projects would make me pay attention.
A project can be original, but useless. Example: translating Klingon to Dothraki (languages used in TV series that don’t exist in reality). You may laugh, but after mentoring more than 165 projects, I’ve seen people generate ideas that fail the relevance criterion.
How do you know what is relevant?
If you are part of an industry, you should know of pain points. What is something trivial that looks automatable? What is something everyone in your sector loves to complain about?
Imagine you are not part of any industry. If you have a hobby, you can do a portfolio project on it IF:
1/ It’s damn interesting and memorable even if way outside the range of interest of the industry. Example: “recommender system for draft beer.” Many people will pay attention to beer.
2/ It showcases your skills: You found or created an exceptionally good dataset. You integrated multiple data sources. You made something that, at first thought, most people wouldn’t believe possible. You showed serious technical chops.
You may complain that there’s not that much data available out there, or that the massive, nice datasets are all inside companies.
The good news is that you don’t need big data to deliver value with ML, no matter how much hype there is around the importance of large datasets. Take it from Andrew Ng:
Ok, so where do you get data?
Generate it yourself (see the previous examples; both teams generated their own data), or beg anyone who has data to contribute it (for example, hospitals that deal with infectious diseases, in the case of the malaria microscope).
Some hiring managers may not care much about how you fitted which model; they have seen it before. But if the dataset you generated or found is new to them, that may impress them.
Have a clean project on GitHub, something the interviewer can run on his computer. If you are lucky enough to be dealing with an interviewer who checks code, the last thing you want is to disappoint her with something that is hard to run or doesn’t run at all.
Most people have ‘academic quality’ code on their GitHubs, i.e., something that works just enough to get you a plot. It’s damn hard to replicate, even putting in some effort (something a hiring manager will not do).
If you have clean code, it will feel refreshing. A good readme, appropriate comments, Python modules, good commit messages, a requirements.txt file, and … a simple way to replicate your results, such as ‘python train.py’. That’s all it takes to be above 90% of the code most people learning data science have on their GitHub. If you can defend your library or model choices in an interview, that’s moving you up the ranking too.
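What does a ‘simple way to replicate your results’ look like? Here is a minimal sketch of a train.py, assuming a toy least-squares model and synthetic data as placeholders for your real pipeline. The flag names and the metrics it prints are my own choices, not a standard. What matters is the shape: a fixed seed, one command, and metrics printed where the reader can see them.

```python
"""train.py: hypothetical single-command entry point for a portfolio repo."""
import argparse
import json
import random


def make_data(n, rng):
    """Synthetic stand-in for a real dataset: y = 2x + 1 plus a little noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.01)) for x in xs]


def train(seed=0, n_samples=100):
    """Fit a line by ordinary least squares and return the metrics as a dict."""
    rng = random.Random(seed)  # a fixed seed makes every run give the same numbers
    data = make_data(n_samples, rng)
    mean_x = sum(x for x, _ in data) / len(data)
    mean_y = sum(y for _, y in data) / len(data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    mse = sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)
    return {"seed": seed, "slope": slope, "intercept": intercept, "mse": mse}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Reproduce the headline result.")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--n-samples", type=int, default=100)
    args = parser.parse_args(argv)
    metrics = train(seed=args.seed, n_samples=args.n_samples)
    print(json.dumps(metrics, indent=2))
    return metrics


if __name__ == "__main__":
    main()
```

Running ‘python train.py’ twice should print identical numbers. If it doesn’t, the interviewer cannot verify the results in your readme either.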
Do not obsess with performance
I call this the “Kaggle Mirage.” You have seen Kagglers making real money and raking in the prestige just by moving a decimal point. Then you may think “I need to get to 90% accuracy because my manager (not a data scientist) set it as a goal.” Don’t let that happen to you without asking some important questions. What would happen if you only got to 85%? Are there nonalgorithmic ways to compensate for the missing 5%? What is the baseline? What is the algorithm? What is the current accuracy? Is 90% a reasonable goal? Is it worth the effort if it’s significantly harder than getting to 85%?
In all but a few real projects, moving performance a decimal point may not bring that much business value.
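The ‘What is the baseline?’ question is easy to answer in code. Below is a minimal sketch, with hypothetical function names and made-up labels, that compares a model against the dumbest possible predictor: always guess the most common class. If 80% of your examples belong to one class, a 90% accuracy model is only ten points better than doing nothing.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))


# Made-up labels with an 80/20 class imbalance.
y_train = [0] * 80 + [1] * 20
y_test = [0] * 8 + [1] * 2
model_predictions = [0] * 8 + [1, 0]  # hypothetical model: 9 of 10 correct

baseline = majority_baseline_accuracy(y_train, y_test)
model = accuracy(y_test, model_predictions)
print(f"baseline={baseline:.2f} model={model:.2f} lift={model - baseline:.2f}")
```

Report the lift next to the raw accuracy in your readme. It tells the hiring manager you know what the number means.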
Now, if you have been working on a problem for a week and performance didn’t change… is it time to stop? Maybe your peers on a standup notice the pattern and are encouraging you to drop that project and join another one (coincidentally, theirs). Expectation management for data scientists would be a great post.
In data science, progress is not linear. That is, you could spend days working on something that doesn’t move the needle, and then, one day, ‘boom,’ you have gained 5 points in accuracy. Companies not used to working with data science will have a hard time understanding that.
I recommend not doing Kaggle (or not only Kaggle!) if your goal is to build a portfolio project to get hired.
Deciding whether to apply and preparing for an interview
Even if you do have a great project to show off, you still have to (1) find good companies to apply to, and (2) prepare. I estimate about 10% of companies are a slam dunk: They have a team that writes blog posts, they are using reasonable algos for their problems, they have a good team already, and AI is a core component of their product. Startups with funding and big N companies land in this box. For the remaining 90% of companies, it’s far harder to estimate whether they are a good match for someone building an AI career. I have quite a few heuristics, but this is worth a different post.
What doesn’t discriminate between excellent and bad AI-career-building companies? The job ad.
It is difficult to write up a job description for a data scientist role; more so if there isn’t already a DS team in the company to write it! Even when there is one, HR could have mangled the job ad by adding criteria.
As a candidate, you can use this rule of thumb: ignore requirements and apply. If you only meet 50% of the bullet points, that might still be better than most other applicants. Knowing this adds some companies to your list, ones for which you might otherwise have disqualified yourself.
When companies don’t have an easy way to tell if a candidate is any good, they revert to ‘correlates’: pedigree, paper certifications, titles. Terrible choice, because so many great data scientists today are self-taught. Those companies asking for a Ph.D. or Masters are missing out on great talent that cut their teeth on fast.ai and Kaggle.
Before you start carpet bombing LinkedIn, … Have you prepared for interviews? Because it’s not a walk in the park! Don’t try to go interviewing without preparing; your competitors will.
The standard software developer interview in the Valley takes about three months of preparation. If you don’t believe me, go to this Reddit. That goes equally for fresh CS graduates (who should still remember all those ‘algorithms and data structures’ classes) and for more senior people (who have every right in the world to have forgotten them).
In less competitive places, it may not be that important to know all these CS tricks. In fact, preparing for a DS and CS interview at the same time may be overkill. Some positions (machine learning engineer) may require you to be a senior engineer on top of knowing machine learning.
For data science positions today, you don’t necessarily need to prepare ‘algorithms and data structures’ questions. But the trend is that companies ask for more and more engineering from their data scientists. You can know in advance by checking the profiles of the company’s data scientists on LinkedIn. Are they very engineer-y? Then get the book ‘Cracking the Coding Interview,’ and do a few exercises.
Work on one project. Spend 1.5 months on it. Make sure it’s impressive to a hiring manager that sees hundreds of CVs for each position. Make sure you pick a ‘good problem’, one that is original, relevant, and uses good data.
Make sure you produce clean code, pick decent companies before you mass-apply, and prepare for interviews.