Ready4R (2024/03/25): Boxplots
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
On Boxplots
I'd like to talk about one of the most important tools in your repertoire: the humble boxplot. There's a lot you can understand about your data with just a boxplot.
Boxplots let us understand whether our assumptions are met in the data. I rarely graph just one boxplot with my data; it's a tool that I use to explore the data, especially for understanding possible associations and predictors in the dataset. In particular, I want to talk about boxplots as a tool for understanding the predictive power of covariates.
In short, the boxplot gives you a sense of the distributions and skew of your continuous variable conditioned on a categorical variable.
Medians and IQRs
(figure gratefully adapted from here)
The box of a boxplot shows the IQR, or interquartile range. The box runs from the 25th percentile (the first quartile) to the 75th percentile (the third quartile), so it contains the middle 50% of your data, with a line drawn at the median.
Why is this useful? Boxplots give us an idea of the skew in the distributions in our data. As a first pass, they give us a clue to what’s going on in continuous data, especially when conditioned on a categorical variable.
The whiskers of the boxplot show the spread of the bottom and top 25% of the data (by default, ggplot2 extends each whisker to the most extreme point within 1.5 times the IQR of the box, and draws anything beyond that as an outlier point). I use the whiskers to further understand the skewness of the variables.
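If you want to see the numbers behind the box and whiskers, you can compute them directly in base R. Here's a minimal sketch using a simulated right-skewed variable (the data here is made up purely for illustration):

set.seed(42)
x <- rexp(1000, rate = 0.1)  # a right-skewed variable

quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3: the box
IQR(x)                                   # the width of the box
boxplot.stats(x)$stats  # the five numbers a boxplot draws: whisker ends, hinges, median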
Boxplots for Assessing Covariates
The first case I'd like to talk about is using boxplots to understand the role of covariates in a classification model. Say we want to predict whether a customer leaves a telecom company, also known as churn.
This is a pretty famous sample dataset from IBM for building and testing classification models that’s been done a little bit to death. I’ve put a copy of the data here if you want to try this out yourself.
library(tidyverse)
customer_churn <- readr::read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv") |>
janitor::clean_names()
customer_churn2 <- customer_churn |>
  mutate(across(-c(customer_id, tenure, monthly_charges, total_charges), as.factor)) |>
  tidyr::drop_na(total_charges)
knitr::kable(head(customer_churn2 |> select(tenure, monthly_charges, total_charges, churn)))
| tenure | monthly_charges | total_charges | churn |
|---|---|---|---|
| 1 | 29.85 | 29.85 | No |
| 34 | 56.95 | 1889.50 | No |
| 2 | 53.85 | 108.15 | Yes |
| 45 | 42.30 | 1840.75 | No |
| 2 | 70.70 | 151.65 | Yes |
| 8 | 99.65 | 820.50 | Yes |
Let's use a boxplot to understand the predictive power of adding `monthly_charges` to a predictive model of `churn`. We make a boxplot with `monthly_charges` as our continuous variable and `churn` as our discrete variable:
ggplot(customer_churn2, aes(x=churn, y=monthly_charges)) + geom_boxplot() +
labs(title="How are monthly charges associated with churn?", subtitle="Customers with high monthly charges are more likely to churn")
We can see that if the monthly charges are high, churn is more likely in those customers. This makes sense: if you have high charges, you might be more likely to look for a better deal. So this is one clue that we might want to add `monthly_charges` as a variable in our predictive model.
Similarly, making a boxplot with `tenure` and `churn` shows us that `tenure` separates out the two groups nicely as well. This makes some sense because of the "lock-in" effect: you are much less likely to leave the longer you stay a customer.
ggplot(customer_churn2, aes(x=churn, y=tenure)) + geom_boxplot() +
labs(title = "Do customers with longer tenure churn?", subtitle="The longer the tenure, the less likely they are to churn")
By plotting these two boxplots, we can already see that we have two predictive variables for `churn`: `monthly_charges` and `tenure`. If we were building a predictive model of `churn`, these are two covariates I'd include. (How would we figure out which categorical variables to include in our model? We can do that with crosstabs.)
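As a quick sketch of that crosstab idea, here's one way to do it with janitor (which we already used above for clean_names()), using the contract column from this dataset as the example categorical variable:

customer_churn2 |>
  janitor::tabyl(contract, churn) |>   # counts of churn within each contract type
  janitor::adorn_percentages("row")    # convert counts to row proportions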
Proof in the Pudding: Logistic Regression
Let's do that by building a logistic regression model using both `tenure` and `monthly_charges`. We'll use `broom::tidy()` to get a nicely formatted summary of the results.
library(broom)
churn_model <- glm(churn ~ tenure + monthly_charges,
family=binomial(link='logit'),
data=customer_churn2)
broom::tidy(churn_model)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.79 0.0866 -20.7 5.95e- 95
## 2 tenure -0.0550 0.00169 -32.5 3.79e-232
## 3 monthly_charges 0.0329 0.00130 25.3 1.56e-141
Notice that the p-values associated with the two variables are very low, suggesting that both are strongly associated with `churn` in this model.
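If you find odds ratios easier to read than log-odds coefficients, broom::tidy() can exponentiate the estimates for you. This is a side note rather than part of the main walkthrough:

# exp(-0.0550) is about 0.95, so each additional month of tenure
# multiplies the odds of churn by roughly 0.95 (a ~5% drop)
broom::tidy(churn_model, exponentiate = TRUE)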
What if we added `total_charges` to the model? Well, one thing to keep in mind is that this is not independent information from `monthly_charges`. We can see they are correlated if we plot `total_charges` versus `monthly_charges` on a scatterplot:
ggplot(customer_churn2, aes(x=total_charges, y=monthly_charges)) + geom_point()
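We can also put a number on that relationship with a quick check using cor() (a small addition, not shown in the plots above):

# total_charges is roughly tenure * monthly_charges, so a strong
# positive correlation is expected
cor(customer_churn2$total_charges, customer_churn2$monthly_charges)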
The correlation isn't perfect, but they are contributing similar information. Let's see what happens if we include `total_charges`:
library(broom)
churn_model <- glm(churn ~ tenure + monthly_charges + total_charges,
family=binomial(link='logit'),
data=customer_churn2)
broom::tidy(churn_model)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.60 0.117 -13.6 2.74e-42
## 2 tenure -0.0671 0.00546 -12.3 9.40e-35
## 3 monthly_charges 0.0302 0.00172 17.6 3.23e-69
## 4 total_charges 0.000145 0.0000614 2.36 1.82e- 2
Notice that our p-values have increased overall (the standard errors for `tenure` and `monthly_charges` are noticeably larger than before), and that `total_charges` has a much larger p-value than the others. This suggests multicollinearity among our variables, so adding `total_charges` is probably not a good idea here.
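A more formal check for multicollinearity is the variance inflation factor. Here's a minimal sketch, assuming you have the car package installed:

# VIF values well above 5-10 are a common rule of thumb for
# problematic collinearity between predictors
car::vif(churn_model)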
I'll leave it to you to add the `factor` variables to the predictive model. Notice I didn't split the data into testing and training sets; I'm mostly showing the thought process behind using boxplots in building a predictive model.
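If you do want to try a proper split, here's a minimal sketch assuming the rsample package (base R's sample() works, too):

library(rsample)

set.seed(123)
churn_split <- initial_split(customer_churn2, prop = 0.8, strata = churn)
churn_train <- training(churn_split)  # fit the model on this
churn_test  <- testing(churn_split)   # evaluate it on this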
For further information
- A Love Letter to the Boxplot by Melissa Santos: a wonderful talk about boxplots, and one of the inspirations for this newsletter.