Ready4R (2024-03-04): The Chronicles of `{naniar}`

March 5, 2024

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

As promised, this week’s topic is missing values.
I would personally like to thumbs down the person responsible for
naming Missing Completely at Random (MCAR) (where the missing values
actually are randomly distributed) and Missing at Random (MAR) (where
they are not randomly distributed, but are conditioned on another
variable). When I explore missingness, I’m specifically looking for
patterns we don’t expect, such as structural zeroes or whole swaths of
missing values. If you don’t know the distinction between MCAR and MAR,
here’s a YT Video explaining the differences and a nice page detailing
more information.
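To make the naming a little less confusing, here is a small simulation sketch. Everything in it is made up for illustration (the variables `x` and `y`, the 20% and 40% missingness rates, the seed); it only shows the two mechanisms side by side, not a formal test for either.

```r
set.seed(42)
n <- 1000
x <- rnorm(n)            # fully observed variable
y <- 2 * x + rnorm(n)    # variable that will receive missing values

# MCAR: every y value has the same 20% chance of going missing,
# regardless of x or of y itself
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: y can only go missing when the observed variable x is above
# its median, so the missingness is conditioned on x
y_mar <- ifelse(x > median(x) & runif(n) < 0.4, NA, y)

# Compare the mean of x where y is observed (FALSE) vs missing (TRUE)
mcar_means <- tapply(x, is.na(y_mcar), mean)
mar_means  <- tapply(x, is.na(y_mar), mean)

print(mcar_means)
print(mar_means)
```

Under the MCAR mechanism the two group means of `x` should look similar. Under the MAR mechanism the "missing" group sits clearly higher, because `y` only goes missing for large `x`.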
Why Should We Care?
As a data scientist, you need to be aware of missing values and how they
impact your analysis. There are methods of dealing with missing values,
such as imputation, that are highly dependent on the kinds of
missingness in your data. Some modeling methods, like zero-inflated
models, make their own assumptions about the data that you need to meet
to use them properly.
Visualizing Missingness: vis_miss()
My favorite way to look for these patterns is a package called
{naniar} written by my
friend Nick Tierney. naniar/visdat visualizes rows of data as lines in a
rectangle. Columns are represented by line sections. Here’s an example
with the cereal data we’ve been looking at the past few weeks:
library(naniar)
library(tidyverse)

cereals <- readr::read_csv("data/mascot_count.csv") |>
  dplyr::select(-X)

naniar::vis_miss(cereals)

What I like about this visual representation is that it lets you see the
association of missing values as holes in the visualization, as well as
percent missing values in each variable. In this example, you can see
that cereals are missing mascots; in our case, we expect this because we
merged in a list of cereal mascots and not every cereal has a mascot.
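If you want the percentages as numbers rather than reading them off the plot, a per-variable summary is easy to sketch in base R. The cereal file isn’t bundled with anything, so this sketch assumes the built-in `airquality` data instead; `miss_summary` is just a name I picked. I believe {naniar} offers a similar table through `miss_var_summary()`, but the version below doesn’t depend on it.

```r
# Count missing values in each column of the built-in airquality data
n_missing <- colSums(is.na(airquality))

# Assemble a per-variable summary: count and percent missing
miss_summary <- data.frame(
  variable = names(n_missing),
  n_miss   = as.vector(n_missing),
  pct_miss = round(100 * as.vector(n_missing) / nrow(airquality), 1)
)

# Put the variables with the most missing values first
miss_summary <- miss_summary[order(-miss_summary$n_miss), ]
print(miss_summary)
```

Variables that float to the top of this table are the ones whose columns show the most holes in `vis_miss()`.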
Don’t get UpSet About Combinations: gg_miss_upset()
What about the association of missing values between variables?
{naniar} has some clever visualizations showing these associations. For
example, the gg_miss_upset() plot shows which variables are missing
together, and how many rows have each of those combinations. Let’s look
at the penguins data for more insight:
library(palmerpenguins)

naniar::gg_miss_upset(penguins)

In this example, reading the combinations from left to right, we can
see:

9 penguins had missing values for sex only
2 penguins had missing values in bill_length, bill_depth,
  flipper_length, body_mass, and sex.

Visualizing the combinations of missing values helps us discover
patterns of association in missingness that we don’t expect.

Note: UpSet plots are very interesting and will probably warrant a
further exploration in a future Ready for R newsletter.
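The counts behind an UpSet plot can also be tallied by hand, which helps when you want to check what the bars mean. This is a base R sketch of that tally, not how gg_miss_upset() is implemented, and it assumes the built-in `airquality` data rather than penguins so it runs without extra packages. The names `is_missing`, `pattern`, and `combo_counts` are mine.

```r
# Logical matrix: TRUE wherever a value is missing
is_missing <- is.na(airquality)

# Keep only the variables that have at least one missing value
is_missing <- is_missing[, colSums(is_missing) > 0, drop = FALSE]

# Label each row with the combination of variables that are missing in it
pattern <- apply(is_missing, 1, function(row) {
  if (any(row)) paste(names(row)[row], collapse = " + ") else "none missing"
})

# Count how many rows share each combination, largest first
combo_counts <- sort(table(pattern), decreasing = TRUE)
print(combo_counts)
```

Each entry in `combo_counts` corresponds to one bar in an UpSet plot: a set of variables that are missing together, and the number of rows where that happens.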

Continuous Values and Missingness: geom_miss_point()
Most of these visualizations use a shadow matrix representation of
missing values: a copy of the data in which each cell only records
whether the value is missing. This shadow matrix lets you do clever
things such as visualize two continuous variables on a plot but include
the missing values, to assess whether they are Missing Not at Random
(MNAR), MAR, or MCAR. When you are plotting two continuous variables,
you need to be curious about whether there are biases in the
missingness. geom_miss_point() gives us a way to visualize the missing
values when we plot.
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  naniar::geom_miss_point() +

  # everything past this point is just
  # to explain the visualization
  theme_minimal() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept = 0) +
  annotate("text",x=-5 ,y=150, label= "missing ozone", angle=90) +
  annotate("text", y=-15, x=75, label="missing Solar.R") +
  annotate("text", y=-20, x=-20, label="missing\nboth") +
  annotate("text", y=150, x=75, label="no missing data")

In this plot, the missing values are represented by red points that are
below the zero line for both axes (they are jittered so they don’t all
occupy the same
line). Specifically, the points on the left side have values for
Solar.R but are missing values for Ozone. In this case, the points
are distributed across the entire range of Solar.R. Note that this
isn’t the case for missing values of Solar.R, which are represented in
the lower right of the plot. These missing values are not distributed
evenly across Ozone, showing a bias towards lower values of Ozone.
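The same comparison can be made numerically with a hand-rolled shadow matrix. This is a base R sketch of the idea under my own assumptions: the `_NA` suffix and the `"NA"` / `"!NA"` labels mirror the convention I understand {naniar} to use, but the code does not call {naniar} and is not its implementation.

```r
# Shadow matrix: same shape as the data, each cell flagged "NA" or "!NA"
shadow <- as.data.frame(
  lapply(airquality, function(col) ifelse(is.na(col), "NA", "!NA"))
)
names(shadow) <- paste0(names(airquality), "_NA")

# Bind the shadow columns to the data so that missingness
# can be used as a grouping variable
aq_shadow <- cbind(airquality, shadow)

# Distribution of Solar.R, split by whether Ozone is missing
solar_by_ozone <- tapply(aq_shadow$Solar.R, aq_shadow$Ozone_NA, summary)

# Distribution of Ozone, split by whether Solar.R is missing
ozone_by_solar <- tapply(aq_shadow$Ozone, aq_shadow$Solar.R_NA, summary)

print(solar_by_ozone)
print(ozone_by_solar)
```

If the summaries for the `"NA"` and `"!NA"` groups look alike, that is consistent with the missing values being spread across the other variable’s range; a clear shift between them hints at a bias like the one visible in the plot.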
geom_miss_point() becomes especially powerful when you facet on a
categorical variable, to look for missingness that is conditioned on
that variable (MAR/MNAR).
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  naniar::geom_miss_point() +
  facet_wrap(~Month)

Here we can see a possible bias in missing values by the month (compare
month=6 to month=9).
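To back up what the facets suggest, you can count the missing values per month. This is a small base R sketch; `n_missing()` and `missing_by_month` are names I made up for it.

```r
# Helper: number of missing values in a vector
n_missing <- function(x) sum(is.na(x))

# Missing Ozone and Solar.R counts within each month
missing_by_month <- data.frame(
  Month           = sort(unique(airquality$Month)),
  Ozone_missing   = as.vector(tapply(airquality$Ozone, airquality$Month, n_missing)),
  Solar.R_missing = as.vector(tapply(airquality$Solar.R, airquality$Month, n_missing)),
  n_rows          = as.vector(table(airquality$Month))
)

# Percent of each month's rows that are missing Ozone
missing_by_month$Ozone_pct_missing <-
  round(100 * missing_by_month$Ozone_missing / missing_by_month$n_rows, 1)

print(missing_by_month)
```

A month whose share of missing Ozone values is far above the others is a sign that the missingness depends on month, which is exactly the kind of pattern the faceted plot is meant to surface.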
In Conclusion: We Miss You, Missing Values
I’ve barely scratched the surface of all you can do with {naniar}.
Nick has come up with all sorts of visualizations to address issues with
missing values. I especially like the visualizations he’s added around
imputations, which is one way to address missing values. Check his
package out!
What’s next?
I’m going to write about a very special package called {nullabor} from Di Cook that
asks the question of you and your friends, “Is this pattern in the data real?” Honestly, it made me rethink the role of exploratory data analysis.
