Ready4R (2024-02-19): {vtree} for you and me
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
{vtree}: exploring variable partitions
Thank you to Andey Nunes-Brewster for introducing me to vtrees. A great tool to wonder at data with!
Last week we talked about crosstables for understanding how our dataset is partitioned in terms of categorical variables. I'd like to talk about an extension of crosstables called vtrees (or variable trees), which let us visualize partitions of more than two categorical variables in the data.
Let's continue on with the cereals dataset, which contains nutrition information for 77 cereals, along with their manufacturer, cereal type, and shelf. To make things a little easier to discuss, I'm recoding the shelf levels as Top, Middle, and Bottom.
If you'd like to play around with this dataset, I've put a CSV version (with mascots) here.
library(tidyverse)
library(vtree)
# named vector for recoding shelf: new label = old value
shelf_code <- c("Top" = "3", "Middle" = "2", "Bottom" = "1")

mascot_count <- read_csv("data/mascot_count.csv") |>
  dplyr::mutate(across(c("shelf", "type", "manufacturer"), as.factor)) |>
  dplyr::mutate(shelf = forcats::fct_recode(shelf, !!!shelf_code))
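As a quick sanity check (a small sketch, assuming the pipeline above ran cleanly), you can confirm the recode took by peeking at the factor levels:
# shelf should now show the labels Bottom, Middle, Top
# (the original 1/2/3 level order, renamed)
levels(mascot_count$shelf)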
Just as a quick reminder, here's the first few rows and columns of the data:
mascot_count |>
  dplyr::select(name, shelf, manufacturer, type, has_mascot) |>
  head() |>
  knitr::kable()
| name | shelf | manufacturer | type | has_mascot |
|---|---|---|---|---|
| 100% Bran | Top | Nabisco | C | No |
| 100% Natural Bran | Top | Quaker Oats | C | No |
| All-Bran | Top | Kelloggs | C | No |
| All-Bran with Extra Fiber | Top | Kelloggs | C | No |
| Almond Delight | Top | Ralston Purina | C | No |
| Apple Cinnamon Cheerios | Bottom | General Mills | C | No |
One thing I like about vtrees is that they can highlight potential structural zeros in the data. Structural zeros are combinations of categories that never occur because they are impossible combinations. Being aware of these is important because they can affect model predictions when our assumptions aren't met (there are models, called zero-inflated models, that address these extra zeroes).
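If you want to hunt for these directly, here's one way to surface zero-count cells (candidates for structural zeros) with {dplyr} and {tidyr}, a minimal sketch assuming the mascot_count data frame from above:
# count every observed shelf/has_mascot combination, fill in the
# missing combinations with n = 0, and keep only the empty cells
mascot_count |>
  dplyr::count(shelf, has_mascot) |>
  tidyr::complete(shelf, has_mascot, fill = list(n = 0)) |>
  dplyr::filter(n == 0)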
Let's look at shelf and has_mascot in the cereals dataset. It is immediately apparent that shelf = Top has no mascots. Note that vtree expects a character argument, with the names of your variables in order, separated by spaces.
vtree(mascot_count, "shelf has_mascot", pxwidth = 500)
The strength of vtrees is that you can view partitions of more than two different variables. For example, we've added the cereal type here (H for hot cereal and C for cold cereal).
vtree(mascot_count, "shelf has_mascot type", pxwidth = 500)
I think vtrees are also good at highlighting disparities and injustice in populations. Let's look at an example from the Titanic data. One question we might have is what percentage of children survived based on which class (1st, 2nd, 3rd) they were traveling in. To show the disparity in child survival across classes, I've removed the adults from this vtree using the prune argument to drop the Adult values:
library(datasets)
data("Titanic")

# convert the Titanic crosstab into one row per passenger
titanic <- crosstabToCases(Titanic)

vtree(titanic, "Class Age",
      summary = "Survived=Yes \n%pct% survived",
      sameline = TRUE,
      prune = list(Age = c("Adult")),
      pxwidth = 500)
Unfortunately, most of the children who died on the Titanic were traveling in third class.
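If you'd like to check the numbers behind the vtree, a quick {dplyr} summary reproduces the percentages (a sketch, assuming the titanic data frame created above):
# survival rate among children, by passenger class
titanic |>
  dplyr::filter(Age == "Child") |>
  dplyr::group_by(Class) |>
  dplyr::summarize(
    n_children = dplyr::n(),
    pct_survived = 100 * mean(Survived == "Yes")
  )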
And that's the idea behind {vtree}: you can visualize partitions of your data beyond a single two-way crosstable, which is especially useful for highlighting disparities and structural zeroes in your data.
Question For You: What Should I Write About Next?
Let's do a choose your own adventure for next week! Click one of the links below to cast your vote.
You can also leave a comment by clicking here (scroll down, it's down at the bottom).