Ready4R (2024/3/12): One of these things is not like the others: `{nullabor}`
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
One of the questions that should haunt you when you find a pattern through exploratory data analysis is: is it real? Is the trend you noticed over the past few months of your data real, or is it just noise? Is there a real difference between the means of your two groups, or not?
One way to think about this is to ask your friends: is it real, or am I wrong? If you’ve collected the data, or otherwise have a stake in the results, you may be biased about your result. Asking people who don’t have a stake in the outcome can be a really good reality check.
The {nullabor} package comes from Di Cook’s group at Monash University[^1], and it’s an example of a package that stimulates thinking about data. The paper (with Hadley Wickham as first author) talks about the idea of visual inference.
What is Visual Inference?
Imagine you have a group of ten friends and you show them a number of plots (the lineup). You ask them: can you pick the real data out from the other plots, which are permutations of the real data?
What proportion of the people you ask can find your real data? That is the idea behind visual inference: you can do statistics by having a group of people look at a lineup and counting how many of them pick out the real data.
Say 4 of your 10 friends can correctly pick out the real data. You can calculate a p-value from this: under the null hypothesis, each viewer picks the right plot out of a lineup of 20 at random, with probability 1/20 = 0.05. The p-value then comes from a binomial distribution.
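Here’s a minimal sketch of that calculation in base R, assuming each viewer guesses independently: the p-value is the probability that 4 or more of 10 viewers pick the right plot by chance alone.

# P(X >= 4), where X ~ Binomial(n = 10, p = 1/20):
# each of 10 viewers has a 1-in-20 chance of guessing the real plot
1 - pbinom(3, size = 10, prob = 1/20)
## [1] 0.001028498

This matches the binom column that pvisual() reports below.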
Let’s try it out. This is a dataset with three different ways of measuring body fat percentage, and we want to see if the measurements agree in general. Here, we’re comparing two of the methods: dxa and br.
library(tidyverse)

# read in the body composition data, reshape it to long format,
# and keep only the dxa and br measurement methods
body_fat <- readr::read_csv(here::here("data/body_composition.csv")) |>
  janitor::clean_names() |>
  dplyr::select(-gender) |>
  tidyr::pivot_longer(c(dxa, st, br), names_to = "method", values_to = "value") |>
  dplyr::filter(method %in% c("dxa", "br")) |>
  dplyr::mutate(method = as.factor(method))
ggplot(body_fat) +
  aes(x = method, y = value, color = method) +
  geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
Could you pick this dataset out of a lineup? Here, we’re permuting across the method variable:
library(nullabor)

# build a lineup: the real data hidden among 19 permutations of method
lineup_method <- nullabor::lineup(nullabor::null_permute("method"), body_fat)
## decrypt("clZx bKhK oL 3OHohoOL 0d")
ggplot(data = lineup_method, aes(x = method, y = value, color = method)) +
  geom_boxplot() +
  facet_wrap(~ .sample) +
  labs(title = "One of these datasets is not like the others",
       subtitle = "Calculate a p-value with you and your friends!")
## Warning: Removed 20 rows containing non-finite values (`stat_boxplot()`).
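By the way, {nullabor} doesn’t tell you outright which panel holds the real data: lineup() prints the encrypted decrypt() call you see above, and running it in your console reveals the answer once you’ve made your guess.

# paste the decrypt() call that lineup() printed to reveal
# which panel in the lineup holds the real data
decrypt("clZx bKhK oL 3OHohoOL 0d")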
Not so easy, is it? Say we have 4 out of 10 friends who find the correct data. pvisual() will calculate the visual p-value using both simulation and the binomial distribution.
# m = number of plots in the lineup, K = number of viewers,
# x = number of viewers who correctly picked the real data
pvisual(m = 20, K = 10, x = 4)
## x simulated binom
## [1,] 4 0.0019 0.001028498
Surprisingly, with just 4 friends out of 10 who are correct, we get a p-value of around 0.001. To me, this shows the power of exploratory data analysis.
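If you’re curious what the simulated column is doing, here is a naive Monte Carlo sketch of the same idea, assuming each viewer guesses uniformly at random among the 20 plots (pvisual()’s own simulation is more sophisticated, which is why its number differs a little from the pure binomial value):

# simulate 10,000 lineups: in each one, 10 viewers guess uniformly
# among 20 plots (the real data is arbitrarily in position 1), and
# we count how often 4 or more viewers land on the real plot
set.seed(42)
guesses <- replicate(10000, sum(sample(20, size = 10, replace = TRUE) == 1))
mean(guesses >= 4)

The result should land near the binomial p-value above.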
Looking at Data Does Matter
Thinking about visual inference is important these days because we, as data scientists, build exploratory dashboards, and someone (such as a stakeholder) discovers something using them. How do we know that we can trust this discovery? By framing the discovery as a test, we can put bounds on the likelihood that it is real given our observed data.
Before we get too esoteric, I think it’s important to understand that exploratory visualization helps us identify patterns, and statistical frameworks give us a handle on how distinct a pattern is from the surrounding noise. This is the connection between exploratory and confirmatory data analysis that we need to make explicit as non-statisticians make more discoveries using the tools we build.
Further Reading
I’ve just scratched the surface of visual inference. To me, it is an extremely interesting idea, and it points to a statistical basis for the patterns we identify in data through EDA.
- There is still only one test - another blog post that broke my brain. It talks about how a lot of statistics can arise from permuting the data, like we do with {nullabor}.
- Graphical Inference for Infovis - the paper (with Hadley Wickham as first author) that outlines the ideas behind {nullabor}.
- What is Graphical Inference? - a nice demonstration of {nullabor}.
- Designing for Interactive Analysis Requires a Theory of Graphical Inference - a really interesting paper about how we need to think about inference when we give people dashboards.
Next Week
Thanks for reading this far! Next week we’ll explore boxplots for understanding data and distributions.