A possible mentor-learner conversation (the method in action)


Mentor: Predict telecom customer churn based on information about their account. The data consist of 200 predictors related to the customer account, such as the number of customer service calls, the area code, and the number of minutes. We have data from 2008-2015, and want to predict churn in 2015.

Learner: "Before I start, I need to make a decision that is not trivial: how do I split the data into holdout, test, and training sets?" (proposes two options)


Mentor: Two options: (1) Take all the available data from 2008-2015, reserve some of it for a test set, and use resampling on the remaining samples to tune the various models. (2) Create models using the data before 2015, but tune them based on how well they fit the 2015 data. The second strategy may overfit to the 2015 data. And what if 2015 is very different from 2008-2014?
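
As a rough illustration of the two options, here is a minimal Python sketch using only the standard library. The record layout (`year`, `customer_id`, `churned`), the toy data, and the function names are assumptions made for the example; they are not part of the dataset described above.

```python
import random


def random_split(rows, test_fraction=0.2, seed=0):
    """Option 1: pool all years, then hold out a random test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest is for training/resampling.
    return shuffled[n_test:], shuffled[:n_test]


def split_by_year(rows, holdout_year=2015):
    """Option 2: build models on earlier years, evaluate on the holdout year."""
    train = [r for r in rows if r["year"] < holdout_year]
    holdout = [r for r in rows if r["year"] == holdout_year]
    return train, holdout


# Toy records (hypothetical): 10 customers per year, 2008 through 2015.
rows = [
    {"year": year, "customer_id": i, "churned": (i + year) % 3 == 0}
    for year in range(2008, 2016)
    for i in range(10)
]

train_random, test_random = random_split(rows, test_fraction=0.2, seed=0)
train_past, holdout_2015 = split_by_year(rows, holdout_year=2015)
```

With option 1 the test set mixes all years; with option 2 every holdout row comes from 2015, which is closer to the stated goal but leaves less room for tuning without reusing the same 2015 rows.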

Learner: (Finds a compromise between the two methods on his own): "Tried a regression model with the 200 variables, didn't work great. Also tried a random forest, slightly better, not ideal. I'm stuck. Are there more specific techniques I should be using?"


Mentor: Are the predictors good for this task? Maybe this dataset doesn't contain predictors that are good enough. Try dropping predictors with zero variance. Run a PCA on the remaining continuous predictors. Rerun the existing models. If that fails to improve results, try to find better predictors outside the data you have (web APIs). You can also try models X, Y, and Z.
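
The first step the mentor suggests, dropping zero-variance predictors, can be sketched with the standard library alone. The column names and values below are invented for illustration, and the function name is an assumption. The PCA step would normally be handed to a numerical library and is not shown here.

```python
from statistics import pvariance


def drop_zero_variance(columns):
    """Keep only predictors whose values actually vary across customers."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > 0
    }


# Hypothetical predictors, one list of values per column.
columns = {
    "service_calls": [1, 4, 0, 2],
    "total_minutes": [120.5, 98.0, 310.2, 45.7],
    "plan_version": [1, 1, 1, 1],  # constant, so it cannot help any model
}

kept = drop_zero_variance(columns)
```

A constant column cannot separate churners from non-churners, so removing it costs nothing and avoids numerical trouble in models that scale or invert the predictors.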

Learner: (Goes off for two days and does that, finds some improvements by transforming features; also gets 20 more predictors from web APIs). "It could be that the way I'm measuring model fit (percentage correct) is not ideal. What other methods are there? And I cannot find an implementation of model Z that converges..."


Mentor: We can use ROC curves. If you need a good implementation of model Z, check this paper: "Implementing model Z the way God intended". You can try these Python libraries: Zmod, blahmod, and argmod.
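
To make the ROC suggestion concrete, here is a minimal pure-Python sketch that builds the curve from predicted churn scores and computes the area under it with the trapezoidal rule. The labels and scores are made up, and the function names are assumptions for the example; in practice a library routine would usually be used instead.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Sweep thresholds from the highest score down, so the curve moves from (0, 0) to (1, 1).
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: 1 = churned, 0 = stayed; scores are predicted churn probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
area = auc(curve)
```

Unlike percentage correct, the curve does not depend on a single cutoff, which matters when churners are a small share of customers and a model can score high accuracy by predicting that nobody leaves.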


Interested in becoming better with data using the Meerkat method and Chief Data Scientist / CTO mentors?