Ready for R Mailing List

March 26, 2024

Ready4R (2024/03/25): Boxplots

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

On Boxplots

I’d like to talk about one of the most important tools in your repertoire - the humble boxplot and the box at its core. There’s a lot you can understand about your data with just a boxplot.

Boxplots let us check whether our assumptions about the data are met. I rarely graph just one boxplot with my data. It’s a tool that I use to explore the data, especially for understanding possible associations and predictors in the dataset. In particular, I want to talk about boxplots as a useful way to understand the predictive power of covariates.

In short, the boxplot gives you a sense of the distributions and skew of your continuous variable conditioned on a categorical variable.

Medians and IQRs

(figure gratefully adapted from here)

The box of a boxplot shows the IQR, or interquartile range. The box contains the middle 50% of your data, with the median marked inside it; the box edges sit at the 25th and 75th percentiles (the first and third quartiles).

Why is this useful? Boxplots give us an idea of the skew in the distributions in our data. As a first pass, they give us a clue to what’s going on in continuous data, especially when conditioned on a categorical variable.

The whiskers of the boxplot cover the data below and above the box (by default, ggplot2 extends them to the most extreme values within 1.5 times the IQR of the box, and draws anything beyond that as individual outlier points). I use the whiskers to further understand the skewness of the variables.
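If you want to see exactly which numbers a boxplot draws, here's a minimal sketch in base R (x is a made-up example vector, not part of any dataset we use below):

set.seed(42)
x <- rexp(1000)  # a made-up, right-skewed example vector

quantile(x, c(0.25, 0.5, 0.75)) # the box edges and the median line
IQR(x)                          # the height of the box

# boxplot.stats() returns the five numbers a Tukey-style boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
boxplot.stats(x)$stats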

Boxplots for Assessing Covariates

The first case I’d like to talk about is using boxplots to understand the role of covariates in a classification model. Say we want to predict whether a customer will leave a telecom company, also known as churn.

This is a pretty famous sample dataset from IBM for building and testing classification models that’s been done a little bit to death. I’ve put a copy of the data here if you want to try this out yourself.

library(tidyverse)

customer_churn <- readr::read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv") |>
  janitor::clean_names()

# Convert everything except the ID and the numeric columns to factors,
# and drop rows with missing total_charges
customer_churn2 <- customer_churn |>
  mutate(across(-c(customer_id, tenure, monthly_charges, total_charges), as.factor)) |>
  tidyr::drop_na(total_charges)

knitr::kable(head(customer_churn2 |> select(tenure, monthly_charges, total_charges, churn)))
| tenure| monthly_charges| total_charges|churn |
|------:|---------------:|-------------:|:-----|
|      1|           29.85|         29.85|No    |
|     34|           56.95|       1889.50|No    |
|      2|           53.85|        108.15|Yes   |
|     45|           42.30|       1840.75|No    |
|      2|           70.70|        151.65|Yes   |
|      8|           99.65|        820.50|Yes   |

Let’s use a boxplot to understand the predictive power of adding monthly_charges to a predictive model of churn. We’ll put monthly_charges on the y-axis as our continuous variable and churn on the x-axis as our discrete variable:

ggplot(customer_churn2, aes(x=churn, y=monthly_charges)) + geom_boxplot() + 
  labs(title="How are monthly charges associated with churn?", subtitle="Customers with high monthly charges are more likely to churn")

We can see that customers with higher monthly charges are more likely to churn. This makes sense: if you have high charges, you might be more likely to look for a better deal. So this is one clue that we might want to add monthly_charges as a variable in our predictive model.

Similarly, a boxplot of tenure by churn shows us that tenure separates the two groups nicely as well. This makes some sense because of the “lock-in” effect: the longer you stay a customer, the less likely you are to leave.

ggplot(customer_churn2, aes(x=churn, y=tenure)) + geom_boxplot() + 
  labs(title = "Do customers with longer tenure churn?", subtitle="The longer the tenure, the less likely they are to churn")

By plotting these two boxplots, we can already see that we have two predictive variables of churn: monthly_charges and tenure. If we were building a predictive model of churn, these are two covariates I’d include. (How would we figure out which categorical variables to include in our model? We can do that with crosstabs.)
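As a quick sketch of that crosstab idea, janitor (which we already used above for clean_names()) has tabyl(); contract is one of the factor columns in this dataset:

# Crosstab of contract type against churn, with row percentages
customer_churn2 |>
  janitor::tabyl(contract, churn) |>
  janitor::adorn_percentages("row") |>
  janitor::adorn_pct_formatting()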

Proof in the Pudding: Logistic Regression

Let’s test that intuition by building a logistic regression model using both tenure and monthly_charges. We’ll use broom::tidy() to get a nicely formatted summary of the results.

library(broom)

churn_model <- glm(churn ~ tenure + monthly_charges,
                   family=binomial(link='logit'),
                   data=customer_churn2)

broom::tidy(churn_model)
## # A tibble: 3 × 5
##   term            estimate std.error statistic   p.value
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      -1.79     0.0866      -20.7 5.95e- 95
## 2 tenure           -0.0550   0.00169     -32.5 3.79e-232
## 3 monthly_charges   0.0329   0.00130      25.3 1.56e-141

Notice that the p-values associated with the two variables are very low, indicating that both are statistically significant predictors in the model.
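As a side note, if you want the coefficients on a more interpretable scale, broom::tidy() can exponentiate them into odds ratios (a small sketch):

# Exponentiated coefficients are odds ratios: for example,
# exp(-0.0550) is about 0.95, so each additional month of tenure
# multiplies the odds of churning by roughly 0.95
broom::tidy(churn_model, exponentiate = TRUE, conf.int = TRUE)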

What if we added total_charges to the model? Well, one thing to keep in mind is that this is not independent information from monthly_charges. We can see they are correlated if we plot total_charges versus monthly_charges on a scatterplot:

ggplot(customer_churn2, aes(x=total_charges, y=monthly_charges)) + geom_point()

The correlation isn’t perfect, but the two variables carry similar information: total_charges is roughly tenure times monthly_charges, so much of it is already captured by the other two covariates.
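We can put a rough number on that overlap with base R's cor() (a quick sketch):

# Pearson correlation between the two charge variables; it should be
# high, since total_charges is roughly tenure * monthly_charges
cor(customer_churn2$total_charges, customer_churn2$monthly_charges)

Let's see what happens if we include total_charges in the model: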

library(broom)

churn_model <- glm(churn ~ tenure + monthly_charges + total_charges,
                   family=binomial(link='logit'),
                   data=customer_churn2)

broom::tidy(churn_model)
## # A tibble: 4 × 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     -1.60     0.117        -13.6  2.74e-42
## 2 tenure          -0.0671   0.00546      -12.3  9.40e-35
## 3 monthly_charges  0.0302   0.00172       17.6  3.23e-69
## 4 total_charges    0.000145 0.0000614      2.36 1.82e- 2

Notice that our p-values have increased overall, and that total_charges has a larger p-value than the others. This points to multicollinearity between our variables, so adding total_charges is probably not a good idea here.
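One way to check that suspicion is with variance inflation factors; this sketch assumes you have the car package installed (install.packages("car") if not):

# Variance inflation factors: values well above ~5 suggest a
# covariate is largely explained by the other covariates
car::vif(churn_model)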

I’ll leave it to you to add the factor variables to the predictive model. Notice I didn’t split the data into testing and training sets; I’m mostly showing the thought process behind using boxplots in building a predictive model.

For further information

  • A Love Letter to the Boxplot by Melissa Santos - a wonderful talk about boxplots, and one of the inspirations for this newsletter.