Ready4R (2024/3/12): One of these things is not like the others: `{nullabor}`
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
One of the questions that should haunt you when you find a pattern through exploratory data analysis is: is it real? Is the trend you noticed over the past few months of your data real, or is it just noise? Is there a real difference between the means of your two groups, or not?
One way to think about this is to ask your friends: is it real, or am I wrong? If you’ve collected the data, or otherwise have a stake in the results, you may be biased about your result. Asking people who don’t have a stake in the outcome can be a really good reality check.
The {nullabor} package comes from Di Cook’s group at Monash University[^1], and it’s an example of a package that stimulates thinking about data. The paper (with Hadley Wickham as first author) talks about the idea of visual inference.
What is Visual Inference?
Imagine you have a group of ten friends and you show them a number of plots (the lineup). You ask them: can you pick the real data out from the other plots, which are permutations of the real data?
What proportion of the people you ask can find your real data? That is the idea behind visual inference: you can do statistics by having a group of people look at a lineup and counting how many of them pick out the real data.
Say 4 of your 10 friends can correctly pick out the real data. You can calculate a p-value from this: under the null hypothesis, each viewer picks the right plot out of a lineup of 20 at random, with probability 1/20 = 0.05. The p-value then comes from a binomial distribution.
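Here’s a minimal sketch of that calculation in base R, assuming each viewer guesses independently: the p-value is the probability that 4 or more of 10 viewers pick the right plot by chance alone.

# P(X >= 4), where X ~ Binomial(n = 10, p = 1/20):
# each of 10 viewers has a 1-in-20 chance of guessing the real plot
1 - pbinom(3, size = 10, prob = 1/20)
## [1] 0.001028498

This matches the binom column that pvisual() reports below.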
Let’s try it out. This is a dataset with three different ways of measuring body fat percentage, and we want to see if the measurements agree in general. Here, we’re comparing two of the methods: dxa and br.
library(tidyverse)

# read in the body composition data, reshape it to long format,
# and keep only the dxa and br measurement methods
body_fat <- readr::read_csv(here::here("data/body_composition.csv")) |>
  janitor::clean_names() |>
  dplyr::select(-gender) |>
  tidyr::pivot_longer(c(dxa, st, br), names_to = "method", values_to = "value") |>
  dplyr::filter(method %in% c("dxa", "br")) |>
  dplyr::mutate(method = as.factor(method))
ggplot(body_fat) +
  aes(x = method, y = value, color = method) +
  geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
Could you pick this dataset out of a lineup? Here, we’re permuting across the method variable:
library(nullabor)

# build a lineup: the real data hidden among 19 permutations of method
lineup_method <- nullabor::lineup(nullabor::null_permute("method"), body_fat)
## decrypt("clZx bKhK oL 3OHohoOL 0d")
ggplot(data = lineup_method, aes(x = method, y = value, color = method)) +
  geom_boxplot() +
  facet_wrap(~ .sample) +
  labs(title = "One of these datasets is not like the others",
       subtitle = "Calculate a p-value with you and your friends!")
## Warning: Removed 20 rows containing non-finite values (`stat_boxplot()`).
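By the way, {nullabor} doesn’t tell you outright which panel holds the real data: lineup() prints the encrypted decrypt() call you see above, and running it in your console reveals the answer once you’ve made your guess.

# paste the decrypt() call that lineup() printed to reveal
# which panel in the lineup holds the real data
decrypt("clZx bKhK oL 3OHohoOL 0d")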
Not so easy, is it? Say we have 4 out of 10 friends who find the correct data. pvisual() will calculate the visual p-value using both simulation and the binomial distribution.
# m = number of plots in the lineup, K = number of viewers,
# x = number of viewers who correctly picked the real data
pvisual(m = 20, K = 10, x = 4)
## x simulated binom
## [1,] 4 0.0019 0.001028498
Surprisingly, with just 4 friends out of 10 who are correct, we get a p-value of around 0.001. To me, this shows the power of exploratory data analysis.
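If you’re curious what the simulated column is doing, here is a naive Monte Carlo sketch of the same idea, assuming each viewer guesses uniformly at random among the 20 plots (pvisual()’s own simulation is more sophisticated, which is why its number differs a little from the pure binomial value):

# simulate 10,000 lineups: in each one, 10 viewers guess uniformly
# among 20 plots (the real data is arbitrarily in position 1), and
# we count how often 4 or more viewers land on the real plot
set.seed(42)
guesses <- replicate(10000, sum(sample(20, size = 10, replace = TRUE) == 1))
mean(guesses >= 4)

The result should land near the binomial p-value above.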
Looking at Data Does Matter
Thinking about visual inference is important these days because we, as data scientists, build exploratory dashboards, and someone (such as a stakeholder) discovers something using them. How do we know that we can trust this discovery? By framing the discovery as a test, we can put bounds on the likelihood that it is real given our observed data.
Before we get too esoteric, I think it’s important to understand that exploratory visualization helps us identify patterns, and statistical frameworks give us a handle on how distinct a pattern is from the surrounding noise. This is the connection between exploratory and confirmatory data analysis that we need to make explicit as non-statisticians make more discoveries using the tools we build.
Further Reading
I’ve just scratched the surface of visual inference. To me, it is an extremely interesting idea, and it points to a statistical basis for the patterns we identify in data through EDA.
- There is still only one test - another blog post that broke my brain. It talks about how a lot of statistics can arise from permuting the data, like we do with {nullabor}.
- Graphical Inference for Infovis - the paper (with Hadley Wickham as first author) that outlines the ideas behind {nullabor}.
- What is Graphical Inference? - a nice demonstration of {nullabor}.
- Designing for Interactive Analysis Requires a Theory of Graphical Inference - a really interesting paper about how we need to think about inference when we give people dashboards.
Next Week
Thanks for reading this far! Next week we’ll explore boxplots for understanding data and distributions.