Ready4R (2024/04/08): Before You Model and Predict
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
Before You Predict
Predictive modeling is often treated by non-data scientists as a black box: you put in data and an answer comes out. This has become especially true of AI predictive models.
People prefer not to think about what data is used to train the model or what data is used as predictive inputs. They do this at their own peril.
Let's start with a step before even looking at the data: understanding the business problem you've been brought in to solve as a data scientist.
Defining the Business Problem
In this step, before you look at data, before you start cleaning it, and before you build a predictive model, you need to understand what the stakeholders want to use your model for.
For example, our churn model will probably be used to identify groups of customers who are more likely to churn and offer them deals. In our case, we might reach out to those customers with special offers to keep them from leaving our telecom company. And certain kinds of errors are more costly than others.
Good questions to ask at this point: what business issues are we trying to solve? What is the impact of being successful? What is the impact of doing nothing?
Errors Are Costly
For example, what is the cost of falsely predicting that a customer will churn (a false positive)? Is it greater than the cost of missing a customer who does churn (a false negative)? A false positive means we might offer a slightly better deal to a customer who wasn't going to leave. That may well be preferable to a false negative, which could cost us months of billing once the customer churns.
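To make this concrete, here's a minimal sketch in R of how you might total up the two error costs. All of the numbers are hypothetical: say a retention offer costs about $20 per false positive, and a missed churner costs roughly six months of a $70/month bill.

```r
# Hypothetical costs per error (numbers made up for illustration)
cost_false_positive <- 20      # discount offered to a customer who wasn't leaving
cost_false_negative <- 6 * 70  # ~6 months of lost billing at $70/month

# Suppose our model made these errors on a validation set
n_false_positives <- 150
n_false_negatives <- 40

n_false_positives * cost_false_positive +
  n_false_negatives * cost_false_negative
#> [1] 19800
```

Even with made-up numbers, a back-of-the-envelope calculation like this makes the asymmetry between the two error types much easier to discuss with stakeholders.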
The discussion changes depending on what the predictive model will be used for. Say we trained a model to predict sepsis from patient data, with the goal of flagging hospital patients who might need an extra dose of antibiotics. The cost of a false positive (perhaps giving a patient antibiotics they don't need) is much less than the cost of missing a septic patient, which could be death.
Having this discussion with your stakeholders is important because your predictive model can be tuned based on the costs associated with false positives and false negatives. It is also part of setting expectations for what you and your model can do for them.
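One concrete way that tuning can happen is by moving the probability threshold for calling a customer a churner away from the default 0.5. Here's a hedged sketch in base R, assuming you already have predicted churn probabilities (`pred_prob`) and true outcomes (`actual`) from a held-out set; `cost_fp` and `cost_fn` are the same hypothetical costs as above.

```r
# Assumes pred_prob (predicted churn probabilities) and actual
# (1 = churned, 0 = stayed) come from a held-out validation set
expected_cost <- function(threshold, pred_prob, actual,
                          cost_fp = 20, cost_fn = 420) {
  predicted <- as.integer(pred_prob >= threshold)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  fp * cost_fp + fn * cost_fn
}

# Try a grid of thresholds and keep the cheapest one
thresholds <- seq(0.05, 0.95, by = 0.05)
costs <- sapply(thresholds, expected_cost,
                pred_prob = pred_prob, actual = actual)
thresholds[which.min(costs)]
```

Because false negatives are the expensive error in this scenario, the cheapest threshold will usually land below 0.5, flagging more customers as at-risk.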
Discussion Is Preparation for EDA
Coming in with these priorities spelled out, we can start to think about what approaches we want to take, and what to look for (and worry about) in our data.
These priorities inform every aspect of our analysis, from assessing the quality of the data and deciding which parts need to be cleaned, to choosing transformations, to building the predictive model itself.
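For instance, a first EDA pass guided by these priorities might check how imbalanced the churn outcome is and where values are missing, since both affect the error trade-offs above. A minimal sketch with dplyr, assuming a hypothetical `telco` data frame with a `churn` column:

```r
library(dplyr)

# Assumes a hypothetical data frame `telco` with a `churn` column

# How imbalanced is the outcome? Rare churners change the error trade-offs.
telco |>
  count(churn) |>
  mutate(prop = n / sum(n))

# Which columns have missing values that will need cleaning?
telco |>
  summarize(across(everything(), ~ sum(is.na(.x))))
```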
In future newsletters, I hope to go through building a predictive model and show some pitfalls that modelers fall into.