Ready4R (2024/04/17): Building a Predictive Model
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
Apologies for the lateness of this week’s newsletter. Trying to condense building an entire model into a single newsletter was very difficult, and life happened.
Last time, we talked about how our model might be applied and got a handle on the cost of errors in our predictive model. We decided that we were going to use our predictive model to target customers who were likely to churn, so we could offer them a better deal to retain them.
We also talked about whether false positives or false negatives were more costly in our model. In the case of the customer churn model, we decided that false negatives (customers who churn but whom we fail to flag) were the more costly mistake, the one we want to minimize, compared to false positives (customers who aren't going to churn but to whom we offer a deal anyway). Let's apply these thoughts to building our model.
In logistic regression, the output is a logit, which is the log-odds, ln(p / (1 - p)), of our predicted state (in our case, the probability p that a customer will churn). We can adjust the balance of false negatives and false positives by adjusting the prediction threshold. In our case, we want to move our threshold in the direction that accepts more false positives so that we can lower our false negatives.
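As a quick aside (a minimal sketch I'm adding here, not code from the original newsletter): base R's qlogis() and plogis() convert between probabilities and logits, which is all the threshold machinery below really relies on.

# the logit is the log-odds of a probability
p <- 0.25
qlogis(p)           # log(0.25 / 0.75), about -1.10
plogis(qlogis(p))   # the inverse logit recovers the probability, 0.25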
But first, we want to build our predictive model.
Initial Prep
library(tidyverse)

# read in the churn data and clean up the column names
customer_churn <- readr::read_csv("https://bit.ly/churn_ready4r") |>
  janitor::clean_names()

# convert everything except the id and numeric columns to factors,
# and drop the rows that are missing total_charges
customer_churn2 <- customer_churn |>
  mutate(across(-c(customer_id, tenure,
                   monthly_charges, total_charges), as.factor)) |>
  tidyr::drop_na(total_charges)

head(customer_churn2) |>
  select(churn, tenure, monthly_charges, dependents, multiple_lines) |>
  knitr::kable()
churn | tenure | monthly_charges | dependents | multiple_lines |
---|---|---|---|---|
No | 1 | 29.85 | No | No phone service |
No | 34 | 56.95 | No | No |
Yes | 2 | 53.85 | No | No |
No | 45 | 42.30 | No | No phone service |
Yes | 2 | 70.70 | No | No |
Yes | 8 | 99.65 | No | Yes |
I’m going to go a little further and recode the multiple_lines column into a binary variable, with 1 = no phone service and 0 = has phone service. I suspect that those who don’t have phone service are more likely to churn.
# recode multiple_lines into a 0/1 indicator for having no phone service
customer_churn3 <- customer_churn2 |>
  mutate(multiple_lines = as.character(multiple_lines)) |>
  mutate(no_phone = ifelse(multiple_lines == "No phone service", 1, 0))

head(customer_churn3) |>
  select(churn, tenure, monthly_charges, dependents, no_phone) |>
  knitr::kable()
churn | tenure | monthly_charges | dependents | no_phone |
---|---|---|---|---|
No | 1 | 29.85 | No | 1 |
No | 34 | 56.95 | No | 0 |
Yes | 2 | 53.85 | No | 0 |
No | 45 | 42.30 | No | 1 |
Yes | 2 | 70.70 | No | 0 |
Yes | 8 | 99.65 | No | 0 |
Divide the Data
We want to make sure that we build the best model we can, and one of the only ways to do that is to split the data into a training set and a testing set. We hold out some percentage of the data (here, 10%) so we can check whether we are overfitting the model. We'll use the functions in {rsample} to make our training data and our test data.
library(tidymodels)

# 90% training / 10% testing split, stratified by churn
data_split <- rsample::initial_split(customer_churn3,
                                     prop = 0.9, strata = "churn")
train_data <- rsample::training(data_split)
test_data <- rsample::testing(data_split)
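One practical note I'll add: initial_split() samples rows at random, so if you want to reproduce the exact same split every time you run the code, set a seed before calling it, for example:

set.seed(2024)  # any fixed number works; this value is just an example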
The sizes of the two datasets are what we expect (training first, then testing):
## [1] 6328 22
## [1] 704 22
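The chunk that printed those sizes is hidden in the newsletter; it was presumably just something like:

dim(train_data)  # rows and columns of the training set
dim(test_data)   # rows and columns of the test set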
Let’s confirm that the stratified split kept the churn proportions similar. Here they are for the full dataset and then for the training set:
## churn n percent
## No 5163 0.734215
## Yes 1869 0.265785
## churn n percent
## No 4646 0.7341972
## Yes 1682 0.2658028
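Again, the code for these tables is hidden, but the n/percent layout looks like {janitor}'s tabyl(), so they likely came from something along these lines (the exact chunk may differ):

janitor::tabyl(customer_churn3, churn)  # churn proportions in the full data
janitor::tabyl(train_data, churn)       # churn proportions in the training set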
We have now confirmed that the churn proportions in the training set match the full dataset, so the split preserved the class balance.
Now we can train our logistic regression model and fit its coefficients on the training data.
Build our model by adding our covariates
We saw by exploring the boxplots a couple of weeks ago that two numeric covariates were predictive: monthly_charges and tenure. I'll also add a couple of factor covariates into the model.
One idea is that families with dependents might be more likely to churn if they find a better deal, because they have more family members to pay for, so a discount with another carrier could be a really attractive reason to churn. We might also include covariates in our model that represent inertia, or the pain of switching carriers. I created the covariate no_phone above because it was a category in multiple_lines, and I thought that not having phone service would make you more likely to churn.
# logistic regression: churn as a function of charges, tenure,
# phone service, and dependents
model1 <- glm(churn ~ monthly_charges + tenure +
                no_phone + dependents,
              family = "binomial", data = train_data)

tidy(model1)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -2.06 0.110 -18.6 1.27e- 77
## 2 monthly_charges 0.0372 0.00157 23.8 5.88e-125
## 3 tenure -0.0562 0.00184 -30.5 1.83e-204
## 4 no_phone 0.932 0.122 7.66 1.82e- 14
## 5 dependentsYes -0.403 0.0811 -4.97 6.82e- 7
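A quick interpretive aside (my addition, not part of the original analysis): these estimates are on the log-odds scale, so it can be easier to read them as odds ratios by exponentiating:

tidy(model1, exponentiate = TRUE, conf.int = TRUE)  # coefficients as odds ratios

For example, the negative tenure coefficient means each additional month of tenure multiplies the odds of churning by a number slightly less than 1.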
How Well Did We Do?
Ok, let’s see how we did on the test set. How many customers did we correctly predict churn for? As noted above, what logistic regression returns is the log-odds (the logit), so we need to transform it back to a probability. We'll use the augment() function from {broom} to add the fitted logit to our test set, and then apply the inverse logit to get a probability.
# add the fitted logit (.fitted) to the test set, then invert it to a probability
test_logit <- augment(model1, newdata = test_data) |>
  dplyr::mutate(prob = gtools::inv.logit(.fitted))

head(test_logit) |>
  dplyr::select(monthly_charges, dependents, tenure, no_phone, churn, prob)
## # A tibble: 6 × 6
## monthly_charges dependents tenure no_phone churn prob
## <dbl> <fct> <dbl> <dbl> <fct> <dbl>
## 1 50.0 Yes 13 0 No 0.209
## 2 55.2 Yes 10 0 Yes 0.275
## 3 20.2 No 1 0 Yes 0.203
## 4 55.3 No 30 0 No 0.156
## 5 69.5 No 25 0 No 0.294
## 6 94.4 No 9 0 Yes 0.721
We can visualize the distribution of predicted probabilities to see where to draw our threshold.
ggplot(test_logit, aes(x = prob, fill = churn)) +
  geom_histogram() +
  geom_vline(xintercept = 0.5) +
  geom_vline(xintercept = 0.25) +
  annotate(geom = "text", x = 0.57, y = 30, label = "0.5 cutoff") +
  annotate(geom = "text", x = 0.32, y = 30, label = "0.25 cutoff")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice in the histogram that there seems to be a natural cutoff at 0.25, so let’s compare that cutoff to the 0.5 probability cutoff.
What about adjusting our threshold to focus on false negatives? In our case, we want to move the prediction threshold down from 0.5 to 0.25. We'll use the ifelse() function to transform the probabilities into our churn predictions.
# classify each customer at the 0.5 and 0.25 thresholds
predictions <- test_logit |>
  mutate(prob05 = ifelse(prob <= 0.5, "No", "Yes")) |>
  mutate(prob025 = ifelse(prob <= 0.25, "No", "Yes")) |>
  mutate(across(c(churn, prob05, prob025), as.factor))
If we look at the 2x2 table of our prediction (prob05) versus the truth (churn) for the 0.5 threshold, we get the following:
## churn No Yes
## No 457 60
## Yes 111 76
If we look at the 2x2 table of our adjusted prediction (prob025) versus the truth (churn) for the 0.25 threshold, we get the following:
## churn No Yes
## No 352 165
## Yes 49 138
Notice that we’ve reduced our false negatives (we predict no, but the truth is yes) from 111 to 49. But there is a cost to doing so: the false positives went up from 60 to 165.
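The 2x2 tables above were produced by a hidden chunk; here's a base-R sketch that would give the same layout (rows are the true churn value, columns are the predicted class), though the original code may have differed:

table(churn = predictions$churn, predictions$prob05)   # 0.5 threshold
table(churn = predictions$churn, predictions$prob025)  # 0.25 threshold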
Metrics
Our best metric for understanding whether we're reducing false negatives is specificity, which here measures how well we detect the customers who actually churn. (That sounds backwards, but these metrics treat the first factor level, "No", as the event class, which is {yardstick}'s default, so "specificity" ends up being our hit rate on the churners.) Here's the initial specificity at our 0.5 cutoff:
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 specificity binary 0.406
At this cutoff, we correctly flag only about 40.6% of the customers who actually churn. Here's the specificity for the adjusted threshold:
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 specificity binary 0.738
So as we lower the probability threshold, our specificity increases: we catch more of the true churners, which means fewer false negatives.
However, there's a tradeoff: our sensitivity, which here measures how well we identify the customers who don't churn, decreases, so we generate more false positives.
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sensitivity binary 0.884
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sensitivity binary 0.681
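The metric tibbles above have {yardstick}'s shape, so the hidden chunks were presumably something like the following (again, remember that {yardstick} treats the first factor level, "No", as the event by default; the actual code may have differed):

yardstick::specificity(predictions, truth = churn, estimate = prob05)
yardstick::specificity(predictions, truth = churn, estimate = prob025)
yardstick::sensitivity(predictions, truth = churn, estimate = prob05)
yardstick::sensitivity(predictions, truth = churn, estimate = prob025)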
Yeah, I know, draw an ROC Curve
More informed readers will ask why I didn’t just plot an ROC curve, which shows the tradeoff between sensitivity and specificity across all possible thresholds. I wanted to make a particular point about adjusting the threshold to reduce false negatives.
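If you do want one, here's a minimal sketch of how it could be drawn with {yardstick} (my addition, not something run for this newsletter; prob is the predicted probability of churning, so "Yes" is the event level):

# ROC curve for the churn probabilities on the test set
test_logit |>
  yardstick::roc_curve(truth = churn, prob, event_level = "second") |>
  autoplot()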
How would we report this to our stakeholders?
We need to think about the message of our model and how it will be used. Remember, in the last newsletter we discussed using it to target customers with the potential to churn.
Given the numbers of errors we observed with the adjusted cutoff, we'll talk about the model in terms of cost. If we say that the average increase in tenure from a successful retention offer is 4 months, then targeting these customers will bring in $x per retained customer; equivalently, each churner we miss costs us those 4 months of payments. We can quantify the discount applied to an average monthly payment as the cost associated with a false positive.
I’d write out the calculations, but this newsletter is already late.
Next Time
We took a very frequentist statistics approach to building our model. There is a reason why logistic regression is commonly used in predictive modeling: it's easy to interpret, and in most cases it's easy to adjust (in our case, by moving the probability threshold). Next time, we'll approach the problem from a more machine learning perspective, using decision tree models.
Thanks for your suggestions!
As you can see, predictive modeling was the most popular topic chosen. I’ll wrap up this series, and then move on to clustering.