Ready for R Mailing List

April 19, 2024

Ready4R (2024/04/17): Building a Predictive Model

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.


Apologies for the lateness of this week’s newsletter. Trying to condense building an entire model into a single newsletter was very difficult, and life happened.

Last time, we talked about how our model might be applied and got a handle on the cost of errors in our predictive model. We decided that we were going to use our predictive model to target customers who are likely to churn, so we could offer them a better deal to retain them.

We also talked about whether false positives or false negatives were more costly in our model. In the case of the customer churn model, we decided that false negatives (customers who do churn but whom we miss) were the more costly mistake to minimize, compared to false positives (customers who aren't going to churn but to whom we offer a deal anyway). Let's apply these thoughts to building our model.

In logistic regression, the model's output is a logit, the log-odds of our predicted state: ln(p / (1 - p)), where p is, in our case, the probability that a customer will churn. We can adjust the balance of false negatives and false positives by adjusting the prediction threshold. In our case, we want to lower the threshold, accepting more false positives in order to reduce false negatives.
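
As a quick side note (not part of the model below), base R's qlogis() and plogis() convert between probabilities and log-odds; plogis() is the same inverse-logit transform we'll apply later via gtools::inv.logit():

# The logit (log-odds) of a probability, and the inverse transform back
p <- 0.25
qlogis(p)           # log(p / (1 - p)), roughly -1.1
plogis(qlogis(p))   # back to 0.25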

But first, we want to build our predictive model.

Initial Prep

library(tidyverse)

customer_churn <- readr::read_csv("https://bit.ly/churn_ready4r") |>
  janitor::clean_names()

customer_churn2 <- customer_churn |>
  mutate(across(-c(customer_id, tenure, 
                   monthly_charges, total_charges), as.factor)) |>
  tidyr::drop_na(total_charges)

head(customer_churn2) |>
  select(churn, tenure, monthly_charges, dependents, multiple_lines) |>
  knitr::kable()
churn   tenure   monthly_charges   dependents   multiple_lines
No           1             29.85   No           No phone service
No          34             56.95   No           No
Yes          2             53.85   No           No
No          45             42.30   No           No phone service
Yes          2             70.70   No           No
Yes          8             99.65   No           Yes

I'm going to go a little further and recode the multiple_lines column into a binary variable, no_phone, with 1 = no phone service and 0 = has phone service. I suspect that those who don't have phone service are more likely to churn.

customer_churn3 <- customer_churn2 |>
  mutate(multiple_lines = as.character(multiple_lines)) |>
  mutate(no_phone = ifelse(multiple_lines == "No phone service", 1, 0))

head(customer_churn3) |>
  select(churn, tenure, monthly_charges, dependents, no_phone) |>
  knitr::kable()
churn   tenure   monthly_charges   dependents   no_phone
No           1             29.85   No                  1
No          34             56.95   No                  0
Yes          2             53.85   No                  0
No          45             42.30   No                  1
Yes          2             70.70   No                  0
Yes          8             99.65   No                  0

Divide the Data

We want to make sure that we build the best model we can, and one of the best ways to check that is to split the data into a training set and a testing set. We hold out a percentage of the data (here, 10%) so we can tell whether we are overfitting the model. We'll also stratify the split on churn so the churn proportions stay similar in both sets. We'll use the functions in {rsample} to make our training data and our test data.

library(tidymodels)

data_split <- rsample::initial_split(customer_churn3, 
                                     prop = 0.9, strata = "churn")
train_data <- rsample::training(data_split)
test_data <- rsample::testing(data_split)

The sizes of the two datasets are what we expect:

dim(train_data)
## [1] 6328   22
dim(test_data)
## [1] 704  22

Let's confirm that the churn proportions in the training set mirror the full dataset:

customer_churn2 |>
  janitor::tabyl(churn)
##  churn    n  percent
##     No 5163 0.734215
##    Yes 1869 0.265785
train_data |>
  janitor::tabyl(churn)
##  churn    n   percent
##     No 4646 0.7341972
##    Yes 1682 0.2658028

The churn proportions in the training set match the full data, so the stratified split worked and the held-out test set should be representative as well.

Now we can fit our logistic regression model and estimate its coefficients.

Build our model by adding our covariates

We saw from exploring the boxplots a couple of weeks ago that two numeric covariates are predictive: monthly_charges and tenure.

I'll add a couple of factor covariates into the model. One idea is that families with dependents might be more likely to churn if they find a better deal, because they have more family members to pay for; a discount from another carrier could be an attractive reason to switch. We might also include covariates that represent inertia, or the pain of switching carriers. I created the no_phone covariate above from the "No phone service" category of multiple_lines because I suspected that not having phone service would make a customer more likely to churn.

model1 <- glm(churn ~ monthly_charges + tenure + 
                no_phone + dependents,
              family = "binomial", data = train_data)

tidy(model1)
## # A tibble: 5 × 5
##   term            estimate std.error statistic   p.value
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      -2.06     0.110      -18.6  1.27e- 77
## 2 monthly_charges   0.0372   0.00157     23.8  5.88e-125
## 3 tenure           -0.0562   0.00184    -30.5  1.83e-204
## 4 no_phone          0.932    0.122        7.66 1.82e- 14
## 5 dependentsYes    -0.403    0.0811      -4.97 6.82e-  7
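
These estimates are on the log-odds scale. Although the newsletter doesn't dwell on interpretation, a quick way to read them is to exponentiate the coefficients into odds ratios, which {broom} can do for us:

# Exponentiate the log-odds coefficients to read them as odds ratios
tidy(model1, exponentiate = TRUE)

For example, exp(-0.0562) is roughly 0.945, so each additional month of tenure multiplies the odds of churning by about 0.945, holding the other covariates fixed.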

How Well Did We Do?

Ok, let's see how we did on the test set: how many customers did we correctly predict churn for? As noted, what logistic regression returns is the log-odds (the logit), so we need to transform it back to a probability. We'll use the augment() function from {broom} to add the fitted logit (.fitted) to our test set, and then apply the inverse logit to get a probability.

test_logit <- augment(model1, newdata = test_data) |>
  dplyr::mutate(prob = gtools::inv.logit(.fitted))

head(test_logit) |>
  dplyr::select(monthly_charges, dependents, tenure, no_phone, churn, prob)
## # A tibble: 6 × 6
##   monthly_charges dependents tenure no_phone churn  prob
##             <dbl> <fct>       <dbl>    <dbl> <fct> <dbl>
## 1            50.0 Yes            13        0 No    0.209
## 2            55.2 Yes            10        0 Yes   0.275
## 3            20.2 No              1        0 Yes   0.203
## 4            55.3 No             30        0 No    0.156
## 5            69.5 No             25        0 No    0.294
## 6            94.4 No              9        0 Yes   0.721

We can visualize the distribution of the predicted probabilities to understand them and to see where to draw our threshold.

ggplot(test_logit, aes(x = prob, fill = churn)) +
  geom_histogram() +
  geom_vline(xintercept = 0.5) +
  geom_vline(xintercept = 0.25) +
  annotate(geom = "text", x = 0.57, y = 30, label = "0.5 cutoff") +
  annotate(geom = "text", x = 0.32, y = 30, label = "0.25 cutoff")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice in the histogram that there seems to be a natural cutoff at 0.25, so let's compare that cutoff to the standard 0.5 probability cutoff.

What about adjusting our threshold to reduce false negatives? In our case, that means moving the prediction threshold down from 0.5 to 0.25. We'll use the ifelse() function to transform the probabilities into churn predictions at each cutoff.

predictions <- test_logit |>
  mutate(prob05 = ifelse(prob <= 0.5, "No", "Yes")) |>
  mutate(prob025 = ifelse(prob <= 0.25, "No", "Yes")) |>
  mutate(across(c(churn, prob05, prob025), as.factor))

If we look at the 2x2 table of our prediction (prob05) versus the truth (churn) for the 0.5 threshold, we get the following:

predictions |>
  janitor::tabyl(churn, prob05)
##  churn  No Yes
##     No 457  60
##    Yes 111  76

If we look at the 2x2 table of our adjusted threshold (prob025) versus the truth (churn) for the 0.25 threshold, we get the following.

predictions |>
  janitor::tabyl(churn, prob025)
##  churn  No Yes
##     No 352 165
##    Yes  49 138

Notice that we've reduced our false negatives (cases where we predict No but the truth is Yes) from 111 to 49. But there is a cost to doing so.
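
To see that reduction directly, we can count the false negatives at each cutoff straight from the predictions data frame; this should give 111 and 49, matching the tables above:

# Count false negatives (truth is "Yes" but we predicted "No") at each cutoff
predictions |>
  summarize(
    fn_05  = sum(churn == "Yes" & prob05  == "No"),
    fn_025 = sum(churn == "Yes" & prob025 == "No")
  )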

Metrics

Our best metric for understanding whether we're reducing false negatives is, in this setup, specificity. Because {yardstick} treats the first factor level ("No") as the event, specificity here measures the fraction of customers who actually churn that we correctly flag as churning, and every false negative drags it down. Here's the specificity at our 0.5 cutoff:

predictions |>
  specificity(churn, prob05)
## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 specificity binary         0.406

We correctly flag churners only about 40.6% of the time at the 0.5 cutoff. Here's the specificity for the adjusted threshold:

predictions |>
  specificity(churn, prob025)
## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 specificity binary         0.738

So as we lower the probability threshold, our specificity increases: we catch more of the churners and make fewer false negative mistakes.

However, there's a tradeoff: sensitivity, which here measures how well we correctly identify the customers who don't churn, decreases, meaning we make more false positive mistakes.

predictions |>
  sensitivity(churn, prob05)
## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 sensitivity binary         0.884
predictions |>
  sensitivity(churn, prob025)
## # A tibble: 1 × 3
##   .metric     .estimator .estimate
##   <chr>       <chr>          <dbl>
## 1 sensitivity binary         0.681
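
One note on naming: because {yardstick} uses the first level of churn ("No") as the event, "specificity" here ends up tracking how well we catch churners. If that feels backwards, an equivalent way to get the same number is to ask for sensitivity of the "Yes" class directly by flipping the event level; this should reproduce the 0.738 above:

# Recall for the "Yes" (churn) class; equivalent to the specificity reported above
predictions |>
  sensitivity(churn, prob025, event_level = "second")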

Yeah, I know, draw an ROC Curve

More informed readers will ask why I didn't just plot an ROC curve, which shows the tradeoff between sensitivity and specificity across all possible thresholds. I wanted to make a particular point about adjusting the threshold to reduce false negatives.
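
For completeness, here's a minimal sketch of what that ROC curve could look like with {yardstick}, using the fitted probabilities in test_logit; note event_level = "second", since prob is the probability of the "Yes" level:

# ROC curve and AUC on the test set, with "Yes" (churn) treated as the event
test_logit |>
  roc_curve(churn, prob, event_level = "second") |>
  autoplot()

test_logit |>
  roc_auc(churn, prob, event_level = "second")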

How would we report this to our stakeholders?

We need to think about the message of our model and how it will be used. Remember, in the last newsletter we discussed using it to target customers with the potential to churn.

Given the errors we observed with the adjusted cutoff, we can frame the results for stakeholders in terms of cost. If we assume that retaining a customer increases their tenure by an average of 4 months, then each churner we successfully target and retain brings in roughly 4 months of payments, and each churner we miss (a false negative) costs us that same amount. The cost associated with a false positive is the discount we offer on the average monthly payment to a customer who wasn't going to leave anyway.

I’d write out the calculations, but this newsletter is already late.
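
For the curious, a rough back-of-the-envelope version might look something like this. The dollar amount, the discount, and the assumption that the discount runs for the same four months are hypothetical placeholders, not numbers from the data:

# Rough cost framing at the 0.25 cutoff, using the confusion-matrix counts above.
# avg_monthly and discount are hypothetical placeholder values.
avg_monthly <- 65     # hypothetical average monthly payment ($)
discount    <- 0.20   # hypothetical retention discount
tenure_gain <- 4      # assumed average tenure increase (months) if retained

tp <- 138   # churners we correctly target (from the 0.25 table)
fn <- 49    # churners we miss (false negatives)
fp <- 165   # non-churners we offer a deal to anyway (false positives)

revenue_retained <- tp * tenure_gain * avg_monthly * (1 - discount)
revenue_lost     <- fn * tenure_gain * avg_monthly
discount_cost    <- fp * tenure_gain * avg_monthly * discount

c(retained = revenue_retained, lost = revenue_lost, fp_cost = discount_cost)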

Next Time

We took a very frequentist statistics approach to building our model. There's a reason logistic regression is so commonly used in predictive modeling: it's easy to interpret, and it's easy to adjust (in our case, by moving the probability threshold). Next time, we'll approach the problem from a more machine learning perspective, using decision tree models.

Thanks for your suggestions!

As you can see, predictive modeling was the most popular topic chosen. I’ll wrap up this series, and then move on to clustering.
