Ready for R Mailing List

March 26, 2024

Ready4R (2024/03/25): Boxplots

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

On Boxplots

I’d like to talk about one of the most important tools in your repertoire - the humble boxplot and the box at its core. There’s a lot you can understand about your data with just a boxplot.

Boxplots let us check whether our assumptions about the data are met. I rarely graph just one boxplot with my data. It’s a tool that I use to explore the data, especially for understanding possible associations and predictors in the dataset. In particular, I want to talk about boxplots as a useful way to understand the predictive power of covariates.

In short, the boxplot gives you a sense of the distributions and skew of your continuous variable conditioned on a categorical variable.

Medians and IQRs

(figure gratefully adapted from here)

The box of a boxplot shows the IQR, or interquartile range. The box contains the middle 50% of your data, with the median marked inside it; the box edges sit at the 25th and 75th percentiles (the first and third quartiles).

Why is this useful? Boxplots give us an idea of the skew in the distributions in our data. As a first pass, they give us a clue to what’s going on in continuous data, especially when conditioned on a categorical variable.

The whiskers of the boxplot cover the data below and above the box (by default, ggplot2 extends them to the most extreme values within 1.5 times the IQR of the box, and draws anything beyond that as individual outlier points). I use the whiskers to further understand the skewness of the variables.
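If you want to see exactly which numbers a boxplot draws, here's a minimal sketch in base R (x is a made-up example vector, not part of any dataset we use below):

set.seed(42)
x <- rexp(1000)  # a made-up, right-skewed example vector

quantile(x, c(0.25, 0.5, 0.75)) # the box edges and the median line
IQR(x)                          # the height of the box

# boxplot.stats() returns the five numbers a Tukey-style boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
boxplot.stats(x)$stats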

Boxplots for Assessing Covariates

The first case I’d like to talk about is using boxplots to understand the role of covariates in a classification model. Say we want to predict whether a customer will leave a telecom company, also known as churn.

This is a pretty famous sample dataset from IBM for building and testing classification models that’s been done a little bit to death. I’ve put a copy of the data here if you want to try this out yourself.

library(tidyverse)

customer_churn <- readr::read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv") |>
  janitor::clean_names()

# Convert everything except the ID and the numeric columns to factors,
# and drop rows with missing total_charges
customer_churn2 <- customer_churn |>
  mutate(across(-c(customer_id, tenure, monthly_charges, total_charges), as.factor)) |>
  tidyr::drop_na(total_charges)

knitr::kable(head(customer_churn2 |> select(tenure, monthly_charges, total_charges, churn)))
| tenure| monthly_charges| total_charges|churn |
|------:|---------------:|-------------:|:-----|
|      1|           29.85|         29.85|No    |
|     34|           56.95|       1889.50|No    |
|      2|           53.85|        108.15|Yes   |
|     45|           42.30|       1840.75|No    |
|      2|           70.70|        151.65|Yes   |
|      8|           99.65|        820.50|Yes   |

Let’s use a boxplot to understand the predictive power of adding monthly_charges to a predictive model of churn. We’ll put monthly_charges on the y-axis as our continuous variable and churn on the x-axis as our discrete variable:

ggplot(customer_churn2, aes(x=churn, y=monthly_charges)) + geom_boxplot() + 
  labs(title="How are monthly charges associated with churn?", subtitle="Customers with high monthly charges are more likely to churn")

We can see that customers with higher monthly charges are more likely to churn. This makes sense: if you have high charges, you might be more likely to look for a better deal. So this is one clue that we might want to add monthly_charges as a variable in our predictive model.

Similarly, a boxplot of tenure by churn shows us that tenure separates the two groups nicely as well. This makes some sense because of the “lock-in” effect: the longer you stay a customer, the less likely you are to leave.

ggplot(customer_churn2, aes(x=churn, y=tenure)) + geom_boxplot() + 
  labs(title = "Do customers with longer tenure churn?", subtitle="The longer the tenure, the less likely they are to churn")

By plotting these two boxplots, we can already see that we have two predictive variables of churn: monthly_charges and tenure. If we were building a predictive model of churn, these are two covariates I’d include. (How would we figure out which categorical variables to include in our model? We can do that with crosstabs.)
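As a quick sketch of that crosstab idea, janitor (which we already used above for clean_names()) has tabyl(); contract is one of the factor columns in this dataset:

# Crosstab of contract type against churn, with row percentages
customer_churn2 |>
  janitor::tabyl(contract, churn) |>
  janitor::adorn_percentages("row") |>
  janitor::adorn_pct_formatting()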

Proof in the Pudding: Logistic Regression

Let’s test that intuition by building a logistic regression model using both tenure and monthly_charges. We’ll use broom::tidy() to get a nicely formatted summary of the results.

library(broom)

churn_model <- glm(churn ~ tenure + monthly_charges,
                   family=binomial(link='logit'),
                   data=customer_churn2)

broom::tidy(churn_model)
## # A tibble: 3 × 5
##   term            estimate std.error statistic   p.value
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      -1.79     0.0866      -20.7 5.95e- 95
## 2 tenure           -0.0550   0.00169     -32.5 3.79e-232
## 3 monthly_charges   0.0329   0.00130      25.3 1.56e-141

Notice that the p-values associated with the two variables are very low, indicating that both are statistically significant predictors in the model.
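As a side note, if you want the coefficients on a more interpretable scale, broom::tidy() can exponentiate them into odds ratios (a small sketch):

# Exponentiated coefficients are odds ratios: for example,
# exp(-0.0550) is about 0.95, so each additional month of tenure
# multiplies the odds of churning by roughly 0.95
broom::tidy(churn_model, exponentiate = TRUE, conf.int = TRUE)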

What if we added total_charges to the model? Well, one thing to keep in mind is that this is not independent information from monthly_charges. We can see they are correlated if we plot total_charges versus monthly_charges on a scatterplot:

ggplot(customer_churn2, aes(x=total_charges, y=monthly_charges)) + geom_point()

The correlation isn’t perfect, but the two variables carry similar information: total_charges is roughly tenure times monthly_charges, so much of it is already captured by the other two covariates.
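We can put a rough number on that overlap with base R's cor() (a quick sketch):

# Pearson correlation between the two charge variables; it should be
# high, since total_charges is roughly tenure * monthly_charges
cor(customer_churn2$total_charges, customer_churn2$monthly_charges)

Let's see what happens if we include total_charges in the model: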

library(broom)

churn_model <- glm(churn ~ tenure + monthly_charges + total_charges,
                   family=binomial(link='logit'),
                   data=customer_churn2)

broom::tidy(churn_model)
## # A tibble: 4 × 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     -1.60     0.117        -13.6  2.74e-42
## 2 tenure          -0.0671   0.00546      -12.3  9.40e-35
## 3 monthly_charges  0.0302   0.00172       17.6  3.23e-69
## 4 total_charges    0.000145 0.0000614      2.36 1.82e- 2

Notice that our p-values have increased overall, and that total_charges has a larger p-value than the others. This points to multicollinearity between our variables, so adding total_charges is probably not a good idea here.
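One way to check that suspicion is with variance inflation factors; this sketch assumes you have the car package installed (install.packages("car") if not):

# Variance inflation factors: values well above ~5 suggest a
# covariate is largely explained by the other covariates
car::vif(churn_model)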

I’ll leave it to you to add the factor variables to the predictive model. Notice I didn’t split the data into testing and training sets; I’m mostly showing the thought process behind using boxplots in building a predictive model.

For further information

  • A Love Letter to the Boxplot by Melissa Santos - a wonderful talk about boxplots, and one of the inspirations for this newsletter.