Alessandro Pedori joined Yahoo! in 2000, when it was still hip. Later, he worked on accessing the web with phones that weren't particularly smart, proceeded to spend years in the localization industry, and fell in love with Python (it just made sense). He entered the world of data engineering through the back door.
Alessandro is an independent consultant focusing on distributed systems, natural language processing (NLP), and systems that move data around and, where possible, learn from it.
Alessandro holds a Master of Science in Computer Engineering from the Università di Bologna.
Data in the wild
- get hands-on experience accessing data in real-world situations
- understand the challenges and trade-offs in getting data
- create a small pipeline project, from rough to usable data
In many cases, the data we need is nice and clean, ready for us to work with: to extract information or knowledge, or at least to produce a neat visualization. However, when we start working on real projects, we generally find data that is messy, scattered, and difficult to use, if we can find it at all. Getting the data into a usable state is often labor-intensive and rarely very sexy.
We'll see what to expect, and what we can do to make it, if not sexy, at least interesting.
ASSUMPTIONS ABOUT DATA (and how they fail)
- it exists
- it is complete
- and available
- and unchanging
- scraping: how, when, why, when not
- APIs, APIs, APIs all around
- the good thing about formats: there are so many of them! (mail, PDF, CSV, XLS...)
- missing data
- choosing a structure (then choosing again)
PUTTING IT ALL TOGETHER
- packaging the steps
- quick checks/visualizations
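As an illustration of the "rough to usable" idea, here is a minimal sketch of such a pipeline using only the Python standard library. The sample data, the column names (`city`, `temperature`), and the function names are all made up for this example; real sources will be messier.

```python
import csv
import io

# Hypothetical raw input: stray whitespace, an empty value, an unparseable value.
RAW = """city, temperature
Bologna, 21.5
 Berlin ,
Rome, n/a
Milan, 19
"""

def read_rows(text):
    """Parse CSV text into dicts, stripping whitespace from keys and values."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        yield {k.strip(): (v or "").strip() for k, v in row.items()}

def clean(rows):
    """Convert temperature to float; use None when it is missing or unparseable."""
    for row in rows:
        try:
            temp = float(row["temperature"])
        except ValueError:
            temp = None
        yield {"city": row["city"], "temperature": temp}

def summarize(records):
    """Quick check: number of records and how many lack a temperature."""
    missing = sum(1 for r in records if r["temperature"] is None)
    return {"total": len(records), "missing": missing}

# Packaging the steps: each stage consumes the output of the previous one.
records = list(clean(read_rows(RAW)))
summary = summarize(records)
print(summary)
```

Keeping each step as a small function makes it easier to swap one out when the data source or its format changes.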
Having an idea of what data we need will be helpful in the practical part.
Making sense of language: a (gentle) introduction to NLP
- be able to understand the possibilities and challenges of working with language
- know some of the tools, their strengths, and their limitations
- be able to build an application dealing with natural language
Language is one of the main ways humans communicate, and a lot of information is in some sort of linguistic form.
This has traditionally posed a big challenge for computers: while humans automatically make sense of meaning, and naturally use contextual clues, computers cannot really understand language and its meaning. What machines can do is extract some data, units of information, and correlations.
Modern libraries and techniques offer tools that perform quite sophisticated high-level tasks on language. It helps to understand how they work under the hood, what they can do, what they cannot (yet) do, and the grey area where the interesting stuff happens.
INTRO TO LANGUAGE AND ITS PROCESSING
- language: it's complex
- language from the point of view of a computer
- basic applications
- advanced applications
- meet our main tool: NLTK
- morphology, stemming, syntax
- grammar, or something like that
- bag of words
- more tools: spaCy, textacy, scikit-learn
- statistical analysis
- language models
- machine learning
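To give a flavor of the "bag of words" idea listed above, here is a minimal sketch in plain Python. It deliberately does not use NLTK or the other libraries named in the outline; the regular-expression tokenizer and the function names are simplifying assumptions for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Count how often each token occurs, ignoring word order entirely."""
    return Counter(tokenize(text))

bag = bag_of_words("The cat sat on the mat. The mat was flat.")
print(bag.most_common(2))  # [('the', 3), ('mat', 2)]
```

The representation throws away word order and grammar, which is exactly why it is simple to compute and also why it misses much of what a human reader would understand.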
(we will cover the basics, and concentrate on a couple)
- basic spam filter
- sentiment analysis
- basic text classification
- text summarization
- text production
- information extraction and text mining
- a basic chat bot: let's rebuild Siri
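As a taste of the first application in the list, a basic spam filter could be sketched as a tiny naive Bayes classifier with add-one smoothing. This is one common approach, not necessarily the one used in the workshop; the class name, the tokenizer, and the four training messages are invented for this example and far too small for real use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of words seen with that label
        self.doc_counts = Counter()  # label -> number of training messages
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Start from the log prior, then add the log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocabulary)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

model = NaiveBayes()
model.train("win money now", "spam")
model.train("cheap money offer", "spam")
model.train("meeting at noon", "ham")
model.train("lunch at noon tomorrow", "ham")

print(model.predict("win cheap money"))        # spam
print(model.predict("lunch meeting at noon"))  # ham
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen word from zeroing out a whole class.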
Participants should be comfortable with at least one programming language. We will use Python and mostly cover NLTK, with some excursions into other tools. Fluency in English is assumed; acquaintance with at least one other language for comparison (English is often a corner case) is helpful but not necessary.