Ready4R (2024/04/17): Building a Predictive Model
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
Apologies for the lateness of this week’s newsletter. Trying to condense building an entire model into a single newsletter was very difficult, and life happened.
Last time, we talked about how our model might be applied and got a handle on the cost of errors in our predictive model. We decided that we were going to use our predictive model to target customers who were likely to churn, so we could offer them a better deal to retain them.
We also talked about whether false positives or false negatives were more costly in our model. In the case of the customer churn model, we decided that false negatives (customers who churn but whom we fail to flag) were the more costly mistake, the one we want to minimize, compared to false positives (customers who aren't going to churn but to whom we offer a deal anyway). Let's apply these thoughts to building our model.
In logistic regression, the output is a logit, which is the log-odds, ln(p / (1 - p)), of our predicted state (in our case, the probability p that a customer will churn). We can adjust the balance of false negatives and false positives by adjusting the prediction threshold. In our case, we want to move our threshold in the direction that accepts more false positives so that we can lower our false negatives.
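As a quick aside (a minimal sketch I'm adding here, not code from the original newsletter): base R's qlogis() and plogis() convert between probabilities and logits, which is all the threshold machinery below really relies on.

# the logit is the log-odds of a probability
p <- 0.25
qlogis(p)           # log(0.25 / 0.75), about -1.10
plogis(qlogis(p))   # the inverse logit recovers the probability, 0.25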
But first, we want to build our predictive model.
Initial Prep
library(tidyverse)

# read in the churn data and clean up the column names
customer_churn <- readr::read_csv("https://bit.ly/churn_ready4r") |>
  janitor::clean_names()

# convert everything except the id and numeric columns to factors,
# and drop the rows that are missing total_charges
customer_churn2 <- customer_churn |>
  mutate(across(-c(customer_id, tenure,
                   monthly_charges, total_charges), as.factor)) |>
  tidyr::drop_na(total_charges)

head(customer_churn2) |>
  select(churn, tenure, monthly_charges, dependents, multiple_lines) |>
  knitr::kable()
churn | tenure | monthly_charges | dependents | multiple_lines |
---|---|---|---|---|
No | 1 | 29.85 | No | No phone service |
No | 34 | 56.95 | No | No |
Yes | 2 | 53.85 | No | No |
No | 45 | 42.30 | No | No phone service |
Yes | 2 | 70.70 | No | No |
Yes | 8 | 99.65 | No | Yes |
I’m going to go a little further and recode the multiple_lines column into a binary variable, with 1 = no phone service and 0 = has phone service. I suspect that those who don’t have phone service are more likely to churn.
# recode multiple_lines into a 0/1 indicator for having no phone service
customer_churn3 <- customer_churn2 |>
  mutate(multiple_lines = as.character(multiple_lines)) |>
  mutate(no_phone = ifelse(multiple_lines == "No phone service", 1, 0))

head(customer_churn3) |>
  select(churn, tenure, monthly_charges, dependents, no_phone) |>
  knitr::kable()
churn | tenure | monthly_charges | dependents | no_phone |
---|---|---|---|---|
No | 1 | 29.85 | No | 1 |
No | 34 | 56.95 | No | 0 |
Yes | 2 | 53.85 | No | 0 |
No | 45 | 42.30 | No | 1 |
Yes | 2 | 70.70 | No | 0 |
Yes | 8 | 99.65 | No | 0 |
Divide the Data
We want to make sure that we build the best model we can, and one of the only ways to do that is to split the data into a training set and a testing set. We hold out some percentage of the data (here, 10%) so we can check whether we are overfitting the model. We'll use the functions in {rsample} to make our training data and our test data.
library(tidymodels)

# 90% training / 10% testing split, stratified by churn
data_split <- rsample::initial_split(customer_churn3,
                                     prop = 0.9, strata = "churn")
train_data <- rsample::training(data_split)
test_data <- rsample::testing(data_split)
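One practical note I'll add: initial_split() samples rows at random, so if you want to reproduce the exact same split every time you run the code, set a seed before calling it, for example:

set.seed(2024)  # any fixed number works; this value is just an example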
The sizes of the two datasets are what we expect (training first, then testing):
## [1] 6328 22
## [1] 704 22
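The chunk that printed those sizes is hidden in the newsletter; it was presumably just something like:

dim(train_data)  # rows and columns of the training set
dim(test_data)   # rows and columns of the test set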
Let’s confirm that the stratified split kept the churn proportions similar. Here they are for the full dataset and then for the training set:
## churn n percent
## No 5163 0.734215
## Yes 1869 0.265785
## churn n percent
## No 4646 0.7341972
## Yes 1682 0.2658028
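Again, the code for these tables is hidden, but the n/percent layout looks like {janitor}'s tabyl(), so they likely came from something along these lines (the exact chunk may differ):

janitor::tabyl(customer_churn3, churn)  # churn proportions in the full data
janitor::tabyl(train_data, churn)       # churn proportions in the training set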
We have now confirmed that the churn proportions in the training set match the full dataset, so the split preserved the class balance.
Now we can train our logistic regression model and fit its coefficients on the training data.
Build our model by adding our covariates
We saw by exploring the boxplots a couple of weeks ago that two numeric covariates were predictive: monthly_charges and tenure. I'll also add a couple of factor covariates into the model.
One idea is that families with dependents might be more likely to churn if they find a better deal, because they have more family members to pay for, so a discount with another carrier could be a really attractive reason to churn. We might also include covariates in our model that represent inertia, or the pain of switching carriers. I created the covariate no_phone above because it was a category in multiple_lines, and I thought that not having phone service would make you more likely to churn.
# logistic regression: churn as a function of charges, tenure,
# phone service, and dependents
model1 <- glm(churn ~ monthly_charges + tenure +
                no_phone + dependents,
              family = "binomial", data = train_data)

tidy(model1)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.06 0.110 -18.6 1.27e- 77
## 2 monthly_charges 0.0372 0.00157 23.8 5.88e-125
## 3 tenure -0.0562 0.00184 -30.5 1.83e-204
## 4 no_phone 0.932 0.122 7.66 1.82e- 14
## 5 dependentsYes -0.403 0.0811 -4.97 6.82e- 7
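A quick interpretive aside (my addition, not part of the original analysis): these estimates are on the log-odds scale, so it can be easier to read them as odds ratios by exponentiating:

tidy(model1, exponentiate = TRUE, conf.int = TRUE)  # coefficients as odds ratios

For example, the negative tenure coefficient means each additional month of tenure multiplies the odds of churning by a number slightly less than 1.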
How Well Did We Do?
Ok, let’s see how we did on the test set. How many customers did we correctly predict churn for? As noted above, what logistic regression returns is the log-odds (the logit), so we need to transform it back to a probability. We'll use the augment() function from {broom} to add the fitted logit to our test set, and then apply the inverse logit to get a probability.
# add the fitted logit (.fitted) to the test set, then invert it to a probability
test_logit <- augment(model1, newdata = test_data) |>
  dplyr::mutate(prob = gtools::inv.logit(.fitted))

head(test_logit) |>
  dplyr::select(monthly_charges, dependents, tenure, no_phone, churn, prob)
## # A tibble: 6 × 6
## monthly_charges dependents tenure no_phone churn prob
## <dbl> <fct> <dbl> <dbl> <fct> <dbl>
## 1 50.0 Yes 13 0 No 0.209
## 2 55.2 Yes 10 0 Yes 0.275
## 3 20.2 No 1 0 Yes 0.203
## 4 55.3 No 30 0 No 0.156
## 5 69.5 No 25 0 No 0.294
## 6 94.4 No 9 0 Yes 0.721
We can visualize the distribution of predicted probabilities to see where to draw our threshold.
ggplot(test_logit, aes(x = prob, fill = churn)) +
  geom_histogram() +
  geom_vline(xintercept = 0.5) +
  geom_vline(xintercept = 0.25) +
  annotate(geom = "text", x = 0.57, y = 30, label = "0.5 cutoff") +
  annotate(geom = "text", x = 0.32, y = 30, label = "0.25 cutoff")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice in the histogram that there seems to be a natural cutoff at 0.25, so let’s compare that cutoff to the 0.5 probability cutoff.
What about adjusting our threshold to focus on false negatives? In our case, we want to move the prediction threshold down from 0.5 to 0.25. We'll use the ifelse() function to transform the probabilities into our churn predictions.
# classify each customer at the 0.5 and 0.25 thresholds
predictions <- test_logit |>
  mutate(prob05 = ifelse(prob <= 0.5, "No", "Yes")) |>
  mutate(prob025 = ifelse(prob <= 0.25, "No", "Yes")) |>
  mutate(across(c(churn, prob05, prob025), as.factor))
If we look at the 2x2 table of our prediction (prob05) versus the truth (churn) for the 0.5 threshold, we get the following:
## churn No Yes
## No 457 60
## Yes 111 76
If we look at the 2x2 table of our adjusted prediction (prob025) versus the truth (churn) for the 0.25 threshold, we get the following:
## churn No Yes
## No 352 165
## Yes 49 138
Notice that we’ve reduced our false negatives (we predict no, but the truth is yes) from 111 to 49. But there is a cost to doing so: the false positives went up from 60 to 165.
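The 2x2 tables above were produced by a hidden chunk; here's a base-R sketch that would give the same layout (rows are the true churn value, columns are the predicted class), though the original code may have differed:

table(churn = predictions$churn, predictions$prob05)   # 0.5 threshold
table(churn = predictions$churn, predictions$prob025)  # 0.25 threshold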
Metrics
Our best metric for understanding whether we're reducing false negatives is specificity, which here measures how well we detect the customers who actually churn. (That sounds backwards, but these metrics treat the first factor level, "No", as the event class, which is {yardstick}'s default, so "specificity" ends up being our hit rate on the churners.) Here's the initial specificity at our 0.5 cutoff:
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 specificity binary 0.406
At this cutoff, we correctly flag only about 40.6% of the customers who actually churn. Here's the specificity for the adjusted threshold:
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 specificity binary 0.738
So as we lower the probability threshold, our specificity increases: we catch more of the true churners, which means fewer false negatives.
However, there's a tradeoff: our sensitivity, which here measures how well we identify the customers who don't churn, decreases, so we generate more false positives.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sensitivity binary 0.884
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sensitivity binary 0.681
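The metric tibbles above have {yardstick}'s shape, so the hidden chunks were presumably something like the following (again, remember that {yardstick} treats the first factor level, "No", as the event by default; the actual code may have differed):

yardstick::specificity(predictions, truth = churn, estimate = prob05)
yardstick::specificity(predictions, truth = churn, estimate = prob025)
yardstick::sensitivity(predictions, truth = churn, estimate = prob05)
yardstick::sensitivity(predictions, truth = churn, estimate = prob025)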
Yeah, I know, draw an ROC Curve
More informed readers will ask why I didn’t just plot an ROC curve, which shows the tradeoff between sensitivity and specificity across all possible thresholds. I wanted to make a particular point about adjusting the threshold to reduce false negatives.
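If you do want one, here's a minimal sketch of how it could be drawn with {yardstick} (my addition, not something run for this newsletter; prob is the predicted probability of churning, so "Yes" is the event level):

# ROC curve for the churn probabilities on the test set
test_logit |>
  yardstick::roc_curve(truth = churn, prob, event_level = "second") |>
  autoplot()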
How would we report this to our stakeholders?
We need to think about the message of our model and how it will be used. Remember, in the last newsletter we discussed using it to target customers with the potential to churn.
Given the numbers of errors we observed with the adjusted cutoff, we'll talk about the model in terms of cost. If we say that the average increase in tenure from a successful retention offer is 4 months, then targeting these customers will bring in $x per retained customer; equivalently, each churner we miss costs us those 4 months of payments. We can quantify the discount applied to an average monthly payment as the cost associated with a false positive.
I’d write out the calculations, but this newsletter is already late.
Next Time
We took a very frequentist statistics approach to building our model. There is a reason why logistic regression is commonly used in predictive modeling: it's easy to interpret, and in most cases it's easy to adjust (in our case, by moving the probability threshold). Next time, we'll approach the problem from a more machine learning perspective, using decision tree models.
Thanks for your suggestions!
As you can see, predictive modeling was the most popular topic chosen. I’ll wrap up this series, and then move on to clustering.