Ready4R (2024-03-04): The Chronicles of `{naniar}`
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
As promised, the topic this week is about missing values.
I would personally like to give a thumbs down to whoever is responsible for naming Missing Completely at Random (MCAR) (values that actually are randomly distributed) and Missing at Random (MAR) (values that are not randomly distributed, because their missingness is conditioned on another variable). If you don’t know the distinction, here’s a YT Video explaining the differences and a nice page detailing more information. This week, we’re specifically looking for patterns we don’t expect, such as structural zeroes or whole swaths of missing values.
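To make the naming a little less confusing, here is a minimal simulated sketch of the two mechanisms in base R. The variables `age` and `income`, the 20% missingness rate, and the logistic curve are all made up for illustration; they are assumptions, not anything from the cereal data.

```r
set.seed(42)
n <- 1000

# Hypothetical data: income loosely depends on age
age <- runif(n, min = 20, max = 80)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every value has the same 20% chance of going missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: the chance of going missing depends on another observed variable (age)
p_miss <- plogis((age - 50) / 10)
income_mar <- ifelse(runif(n) < p_miss, NA, income)

# Mean age of rows without (FALSE) and with (TRUE) a missing income
age_by_mcar <- tapply(age, is.na(income_mcar), mean)
age_by_mar <- tapply(age, is.na(income_mar), mean)
age_by_mcar
age_by_mar
```

Under MCAR the two group means should look about the same, while under MAR the rows with missing income should skew older, because age drives the missingness.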
Why Should We Care?
As a data scientist, you need to be aware of missing values and how they impact your analysis. There are methods of dealing with missing values, such as imputation, that are highly dependent on the kinds of missingness in your data. Some modeling methods, like zero-inflated models, come with their own assumptions that need to hold before you can use them properly.
Visualizing Missingness: vis_miss()
My favorite way to look for these patterns is a package called {naniar} written by my friend Nick Tierney. naniar/visdat visualizes rows of data as lines in a rectangle. Columns are represented by line sections. Here’s an example with the cereal data we’ve been looking at the past few weeks:
library(naniar)
library(tidyverse)
cereals <- readr::read_csv("data/mascot_count.csv") |>
  dplyr::select(-X)
naniar::vis_miss(cereals)
What I like about this visual representation is that it lets you see the association of missing values as holes in the visualization, as well as percent missing values in each variable. In this example, you can see that cereals are missing mascots; in our case, we expect this because we merged in a list of cereal mascots and not every cereal has a mascot.
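If you want the numbers behind the picture, a small sketch like the following may help. It uses the built-in `airquality` data as a stand-in, since the cereal file isn't bundled with R; `miss_var_summary()` and the `sort_miss` argument are part of {naniar}/{visdat}.

```r
library(naniar)

# One row per variable: count and percent of missing values,
# ordered from most to least missing
miss_summary <- miss_var_summary(airquality)
miss_summary

# Same visualization as above, with the columns ordered by missingness
vis_miss(airquality, sort_miss = TRUE)
```

The table is handy when a dataset has too many columns to read the percentages off the plot.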
Don’t get UpSet About Combinations: gg_miss_upset()
What about the association of missing values between variables? {naniar} has some clever visualizations showing these associations. For example, {naniar} has the gg_miss_upset() plot, which shows the combinations of variables that are missing together in a dataset. An UpSet plot shows which variables have missing values in combination, and how many rows have each of those combinations. Let’s look at the penguins data for more insight:
library(palmerpenguins)
naniar::gg_miss_upset(penguins)
In this example, reading the combinations from left to right, we can see:
- 9 penguins had missing values for sex
- 2 penguins had missing values in bill_length, bill_depth, flipper_length, body_mass, and sex.
Visualizing the combinations of missing values helps us discover patterns of association in missingness that we don’t expect.
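As a rough cross-check on the UpSet plot, the combinations can also be tabulated by hand. This is a base R sketch, not how {naniar} computes the plot; the helper names `miss_cols`, `pattern`, and `combo_counts` are made up for illustration.

```r
library(palmerpenguins)

# Columns that contain at least one missing value
miss_cols <- names(penguins)[colSums(is.na(penguins)) > 0]

# For each row, paste together the names of the columns that are missing
pattern <- apply(
  is.na(penguins[miss_cols]),
  1,
  function(row) paste(miss_cols[row], collapse = " + ")
)
pattern[pattern == ""] <- "(complete)"

# How many rows share each missingness pattern
combo_counts <- sort(table(pattern), decreasing = TRUE)
combo_counts
```

The counts for the non-complete patterns should line up with the bars in the plot, using the full column names from the dataset (for example `bill_length_mm`).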
Note: UpSet plots are very interesting and will probably warrant a further exploration in a future Ready for R newsletter.
Continuous Values and Missingness: geom_miss_point()
Most of these visualizations use a shadow matrix representation of missing values. This shadow matrix lets you do clever things such as visualize two continuous variables on a plot but include those missing values to assess whether those missing values are MNAR, MAR, or MCAR.
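To see what a shadow matrix looks like, here is a short sketch using {naniar}'s `as_shadow()` and `bind_shadow()` on the built-in `airquality` data. The object names `aq_shadow` and `aq_nabular` are just illustrative.

```r
library(naniar)

# The shadow matrix: same shape as the data, with each column renamed
# <variable>_NA and each cell marked "NA" (missing) or "!NA" (present)
aq_shadow <- as_shadow(airquality)
head(aq_shadow)

# Bind the shadow columns next to the original data
aq_nabular <- bind_shadow(airquality)

# Example use: mean Solar.R, split by whether Ozone is missing
solar_by_ozone_na <- tapply(
  aq_nabular$Solar.R,
  aq_nabular$Ozone_NA,
  mean,
  na.rm = TRUE
)
solar_by_ozone_na
```

Because the shadow columns sit alongside the values, you can group, color, or facet by missingness like any other variable.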
When you are plotting two continuous variables, you need to be curious about whether there are biases in the missingness. geom_miss_point() gives us a way to visualize the missing values when we plot.
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  naniar::geom_miss_point() +
  # everything past this point is just
  # to explain the visualization
  theme_minimal() +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  annotate("text", x = -5, y = 150, label = "missing ozone", angle = 90) +
  annotate("text", y = -15, x = 75, label = "missing Solar.R") +
  annotate("text", y = -20, x = -20, label = "missing\nboth") +
  annotate("text", y = 150, x = 75, label = "no missing data")
In this plot, the missing values are represented by red points placed below the zero line of the axis for the variable that is missing (they are jittered so they don’t all occupy the same line). Specifically, the points on the left side have values for Solar.R but are missing values for Ozone. In this case, the points are distributed across the entire range of Solar.R. Note that this isn’t the case for missing values of Solar.R, which are represented in the lower right of the plot. These missing values are not distributed evenly across Ozone, showing a bias towards lower values of Ozone.
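A quick numeric companion to the plot, as a rough base R sketch: split each variable by whether the other one is missing and compare the observed values. The object names here are illustrative, and with only a handful of missing Solar.R rows, any apparent pattern deserves some caution.

```r
# Flag rows by which variable is missing
ozone_missing <- is.na(airquality$Ozone)
solar_missing <- is.na(airquality$Solar.R)

# Observed Solar.R values, split by whether Ozone is missing
solar_by_ozone_missing <- tapply(
  airquality$Solar.R, ozone_missing, quantile,
  probs = c(0, 0.5, 1), na.rm = TRUE
)

# Observed Ozone values, split by whether Solar.R is missing
ozone_by_solar_missing <- tapply(
  airquality$Ozone, solar_missing, quantile,
  probs = c(0, 0.5, 1), na.rm = TRUE
)

solar_by_ozone_missing
ozone_by_solar_missing

# How many rows back each comparison
table(ozone_missing)
table(solar_missing)
```

Comparing the minimum, median, and maximum in each group gives a sense of whether the missing rows cover the full range of the other variable or cluster in one part of it.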
geom_miss_point() becomes especially powerful when you facet on a categorical variable, to look for missingness that is conditioned on that variable (MAR/MNAR).
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  naniar::geom_miss_point() +
  facet_wrap(~Month)
Here we can see a possible bias in missing values by the month (compare month=6 to month=9).
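To back up the faceted plot with counts, here is a small sketch that tallies missing Ozone values per month in base R, followed by {naniar}'s `gg_miss_fct()`, which draws a heatmap of percent missing for every variable by a factor. The name `ozone_missing_by_month` is illustrative.

```r
library(naniar)

# Number of missing Ozone values in each month
ozone_missing_by_month <- tapply(
  is.na(airquality$Ozone),
  airquality$Month,
  sum
)
ozone_missing_by_month

# Percent missing for each variable, broken out by Month
gg_miss_fct(x = airquality, fct = Month)
```

If one month carries most of the missing values, that is a hint the missingness is conditioned on month rather than completely random.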
In Conclusion: We Miss You, Missing Values
I’ve barely scratched the surface of all you can do with {naniar}. Nick has come up with all sorts of visualizations to address issues with missing values. I especially like the visualizations he’s added around imputations, which is one way to address missing values. Check his package out!
What’s next?
I’m going to write about a very special package called {nullabor} from Di Cook that asks the question of you and your friends, “Is this pattern in the data real?” Honestly, it made me rethink the role of exploratory data analysis.