Here I cover hacks and tricks concerning working with data when selecting an AI project.
Your minimum viable dataset is smaller than you think
How can we double food production by 2050 to feed 9 billion people? Could the solution start with two people walking around taking pictures with a smartphone?
For an example of a bootstrapped dataset, take Blue River Technology, a precision agriculture company. Its premise: herbicide usage can be cut by 90% by spraying only the spots that need it.
See & Spray machines use deep learning to identify a greater variety of plants with better accuracy and then make crop management decisions on the spot. Custom nozzle designs enable <1-inch spray resolution. They focus on cotton and soybeans.
In September 2017, John Deere acquired Blue River for about $300 million.
What was the dataset Blue River started with? Pictures collected by a handful of guys with phones, walking down an entire crop field.
Often the amount of data you need for proof of concept is ‘not much’ (thanks to pretrained models!).
You can collect and label the data yourself, with tiny, tiny resources. Maybe 50% of the portfolio projects I have directed started with data that didn’t exist, so the team generated it. Jeremy Howard, the founder of fast.ai, is also adamant about destroying the myth that ‘you need Google-sized datasets’ to get value out of AI today.
What a great opportunity; it’s exciting to be alive now, with so many problems that are solvable with off-the-shelf tech. A farming robot that can crunch mobile camera data opens a new path. Next time you think up AI projects, use Blue River as a reference of what’s possible with data you create.
Knowing “your minimum viable dataset is smaller than you think” opens up the range of projects you can tackle tremendously.
Because of pretrained models, you don’t need as much data
Model zoos are collections of pretrained networks. Each network there saves a tremendous amount of time AND data for anyone solving a problem similar to one already solved.
Curators at model zoos make your life far easier. With the recent success of ML, researchers in academia are getting bigger grants and publishing models that have enjoyed powerful machines and weeks of computation. Industry leaders often publish their models too, in the hope of attracting talent or nullifying a competitor’s advantage.
80% of the time of a data scientist is cleaning data; the other 20% is bitching about cleaning data
It’s a joke, but it’s not far from the truth. Don’t be surprised if you spend most of your time on data preparation.
Data augmentation works well on images
Don’t do it ‘by hand’. Today there are libraries for the most common data augmentation tasks (horizontal and vertical shift, horizontal and vertical flip, random rotation, etc.). It’s a pretty standard preprocessing step.
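To make those transform names concrete, here is a minimal sketch in plain Python of what a flip and a shift do to a tiny image stored as a list of rows. The function names (`horizontal_flip`, `vertical_flip`, `horizontal_shift`, `augment`) are made up for this illustration; in a real project you would call a library's versions of these transforms rather than write your own.

```python
import random


def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Reverse the row order, so the top row becomes the bottom row."""
    return [row[:] for row in image[::-1]]


def horizontal_shift(image, pixels, fill=0):
    """Shift columns right by `pixels` (left if negative), padding with `fill`."""
    width = len(image[0])
    pixels = max(-width, min(width, pixels))  # clamp so rows keep their length
    shifted = []
    for row in image:
        if pixels >= 0:
            shifted.append([fill] * pixels + row[:width - pixels])
        else:
            shifted.append(row[-pixels:] + [fill] * (-pixels))
    return shifted


def augment(image, rng):
    """Return a randomly flipped and shifted copy; the input is left untouched."""
    out = [row[:] for row in image]
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    if rng.random() < 0.5:
        out = vertical_flip(out)
    return horizontal_shift(out, rng.randint(-1, 1))


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6]]
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(augment(image, rng))
```

Each call to `augment` yields a slightly different variant of the same picture, which is how one labeled image becomes several training examples.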
Get unique data by asking companies/people
At Data Science Retreat, one team once needed bird songs. It turns out there’s an association of bird watchers that has a giant dataset of songs. They gave it to our team, no questions asked.
Companies may have data they don’t care much about. Other times, they do care about it but would still give it to you if you offer something of value in exchange, such as a predictive model they can use. Governments often have to give you data if you request it.
If you run out of ideas to get data by more traditional means, try asking for it. Sending a few emails may be a good use of your time. Academics should share their data if they published a paper about it and you ask. It doesn’t always work, but it’s worth trying. Small companies looking for any advantage may want to partner with you as long as they get some benefit.
Compute on the edge (federated learning), avoid privacy problems
This is our life in 2019: Huawei’s new ‘superzoom’ P30 Pro smartphone camera can identify people from far away and apply neural networks to lip reading. Add the progress in computer vision systems that can re-identify the same person when they change location (for example, emerging from a subway), and it all indicates that mass surveillance is growing in technical sophistication. Privacy is ‘top of mind.’
Corporations get access to more and more of citizens’ private data; regulators try to protect our privacy and limit access to such data. Privacy protection doesn’t come without side effects: scientific progress often suffers. For example, GDPR in Europe seems to be a severe obstacle for both researchers and companies that want to apply machine learning to lots of interesting problems. At the same time, datasets such as health data benefit from privacy; imagine you had a severe illness and an employer would not hire you because of it.
A solution: What if, instead of bringing the corpus of training data to one place to train a model, you could bring the model to the data wherever it’s generated? That is called federated learning, first presented by Google in 2017.
This way, even if you don’t have access to an entire dataset at once, you can still learn from it.
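Here is a minimal sketch of the idea, assuming the simplest possible setting: each client fits a one-variable linear model on its own data, and a server averages the resulting weights, weighted by how many examples each client holds (a scheme usually called federated averaging). The function names, learning rate, and toy data are all assumptions made for this illustration, not anyone's production setup.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Gradient descent on one client's private data; only weights leave."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of examples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def train(clients, rounds=100):
    """Server loop: send the model out, collect updates, average them."""
    global_weights = (0.0, 0.0)
    sizes = [len(data) for data in clients]
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights


if __name__ == "__main__":
    # Two clients hold disjoint samples of the same relation, y = 2x + 1.
    clients = [
        [(0.0, 1.0), (1.0, 3.0)],
        [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    ]
    w, b = train(clients)
    print(f"w={w:.3f}, b={b:.3f}")
```

Note what the server sees: a pair of numbers per client per round, never an `(x, y)` example. Real systems add more on top of this (secure aggregation, for instance), which this sketch leaves out.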
If you are interested in this topic, follow Andrew Trask. He has a Coursera course on federated learning and a handy Jupyter notebook with a worked-out example.
Why is this important in a conversation about picking AI projects? Because if your project uses federated learning, you may have a much easier time getting people to give you data and use your product. It opens the door to a different class of projects.
Data reuse and partnerships
Data has excellent secondary value.
That is, often, you can find uses for data that the entity collecting it didn’t think about. Often through partnerships, new uses of data can produce a secondary revenue stream. For example, you can integrate:
– data on fraud
– data from credit scoring
– data from churn
– data about purchases (from different sources)
The organisation that published these data (containing personal data) might use a license that restricts your usage. You need to check whether they explicitly license the data for reuse; otherwise, it is best to contact them and agree on a license for your project.
Beware of problems though:
1. If your product depends on data that only one single partner can produce, you are in their hands. The moment they decide to end the partnership, they end your business.
2. Data integration will be difficult, more so if the only shared variable in different datasets is a person. Data that would help identify a person is subject to regulation in some parts of the world.
Even if you can legally integrate these different data sources, remember there are entire teams dedicated to integration in big companies. Never assume this will go smoothly.
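To see why, here is a small sketch of the happy path: merging records from several sources that share only a person identifier. Everything in it is hypothetical (the `customer_id` key, the source names, the fields); it also reports which people are missing from which source, which is usually the first surprise.

```python
def integrate(sources, key="customer_id"):
    """Merge records from several sources into one row per key value.

    `sources` maps a source name to a list of dict records. Fields are
    prefixed with the source name so equal field names cannot collide.
    """
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            row = merged.setdefault(record[key], {key: record[key]})
            for field, value in record.items():
                if field != key:
                    row[f"{source_name}.{field}"] = value
    return merged


def missing_sources(merged, source_names):
    """For each key value, list the sources that had no record for it."""
    gaps = {}
    for key_value, row in merged.items():
        present = {field.split(".", 1)[0] for field in row if "." in field}
        absent = [name for name in source_names if name not in present]
        if absent:
            gaps[key_value] = absent
    return gaps


if __name__ == "__main__":
    # Hypothetical toy data: three sources that only share a customer id.
    sources = {
        "fraud": [{"customer_id": "c1", "flagged": True}],
        "credit": [
            {"customer_id": "c1", "score": 710},
            {"customer_id": "c2", "score": 640},
        ],
        "churn": [{"customer_id": "c2", "churned": False}],
    }
    merged = integrate(sources)
    print(merged)
    print(missing_sources(merged, list(sources)))
```

Even in this toy case no customer appears in all three sources. With real data you also get mismatched identifiers, duplicates, and conflicting values, and the shared key is exactly the kind of personal data that regulation covers.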
Using public data is not the ‘only option.’ Some people will complain that for every dataset out in the open, there are plenty of projects already. It’s hard to stand out. Maybe it’s worth it to email people in industry to get some data nobody else has (data partnerships). If you offer the result of your models in exchange for access, some companies may be persuaded.
How about doing Kaggle? Not a great idea for a portfolio project, because 1/ the hard part of finding the problem and data is already done and 2/ it’s hard to stand out from the crowd of competitors who probably spent more time than you fitting models and have better performance.
Finding secondary use in data is a fantastic skill to have. Coming up with project ideas trains this skill.
Use unstructured data
For decades, all data ML could consume was in the form of tables. Those Excel files flying around as attachments, those SQL databases… tabular data was the only thing that could benefit from ML.
Since the 2010s, that changed.
Unstructured data are:
- words written by real people that don’t follow a pre-defined model, using language riddled with nuances.
- audio (including speech)
- sometimes, sensor data (streams)
Data that is defined as unstructured is growing at 55-65 percent each year. Emails, social media posts, call center transcripts… all are excellent examples of unstructured datasets that can provide value to a business.
You may think that ‘everyone knows this, so why mention it’… but in my experience, there are large companies left and right that didn’t receive the memo. If you work for one of these and happen to find a use case for unstructured data that they may have, you are onto something that could be a career changer.
Take, for example, banks. For them, data used to mean numerical information from markets and security prices. Now they also use satellite images: night light intensity, oil tank shadows, and the number of cars in parking lots can all be used to estimate economic activity.
In my experience at Data Science Retreat, most people chose unstructured data for portfolio projects. And it’s easy to pass ‘the eyebrow test’ with these. Plus, they are abundant in the wild. Everyone has access to pictures, text… in contrast to tabular data such as money transactions.
One downside: unstructured data can trigger compliance issues. You never know what is lurking in giant piles of text. Is there confidential information in these emails? Is our users’ personal data leaking, even when we tried to anonymize it?
You may remember the AOL fiasco. On August 4, 2006, AOL Research released a compressed text file on one of its websites, intended for research purposes. It contained twenty million search keywords from over 650,000 users, collected over a 3-month period. AOL deleted the search data from its site by August 7, but not before it had been mirrored and distributed on the Internet. AOL did not identify users in the report; however, personally identifiable information was present in many of the queries, creating a privacy nightmare for the users in the dataset.
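A first, rough check for this kind of leak can be sketched with a couple of regular expressions that flag and mask obvious identifiers in free text. The patterns and function names below are assumptions chosen for illustration; they catch only emails and phone-like numbers and will miss a lot, including names, addresses, and anything revealed by the meaning of the text, which is how the AOL queries exposed people.

```python
import re

# Deliberately simple patterns: a sketch, not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

PATTERNS = {"email": EMAIL, "phone": PHONE}


def scan(text):
    """Return every match of each pattern found in `text`."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


def redact(text):
    """Replace each match with a placeholder such as [EMAIL] or [PHONE]."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}]", text)
    return text


if __name__ == "__main__":
    sample = "contact jane.doe@example.com or call +49 30 1234567 about order 42"
    print(scan(sample))
    print(redact(sample))
```

A clean scan is not proof of anonymity. Treat it as a cheap filter to run before anything leaves the building, and expect to need a proper review on top.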
There you go; things I’ve learned about picking a good project that have to do with collecting and cleaning data.
In the last part of this series, I’ll cover what I’ve learned on model building that affects how you pick an AI project.