Ready for R Mailing List

February 19, 2024

Ready4R (2024-02-19): {vtree} for you and me

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

{vtree}: exploring variable partitions

Thank you to Andey Nunes-Brewster for introducing me to vtrees. A great tool to wonder at data with!

Last week we talked about crosstables as a way to understand how our dataset is partitioned by categorical variables. This week I'd like to talk about an extension of crosstables called vtrees (or variable trees), which let us visualize partitions across more than two categorical variables in the data.

Let's continue with the cereals dataset, which contains nutrition information for 77 cereals, along with their manufacturer, cereal type, and shelf. To make things a little easier to discuss, I'm recoding the shelf levels (3, 2, and 1) as Top, Middle, and Bottom.

If you'd like to play around with this dataset, I've put a CSV version (with mascots) here.

library(tidyverse)
library(vtree)

shelf_code <- c("Top"="3", "Middle"="2", "Bottom"="1")

mascot_count <- read_csv("data/mascot_count.csv") |>
  dplyr::mutate(across(c("shelf", "type", "manufacturer"), as.factor)) |>
  dplyr::mutate(shelf=forcats::fct_recode(shelf, !!!shelf_code))
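If the `!!!` in that last step looks unfamiliar, here is a minimal sketch of what it does, using a made-up vector of shelf numbers rather than the real data (the `shelf_raw` name is just for illustration). `fct_recode()` takes `new = old` pairs, and `!!!` splices a named character vector into those pairs.

```r
library(forcats)

# Named vector: names are the NEW labels, values are the OLD levels
shelf_code <- c("Top" = "3", "Middle" = "2", "Bottom" = "1")

# Hypothetical shelf numbers, converted to a factor first
shelf_raw <- factor(c(3, 1, 2, 3))

# !!! unpacks shelf_code into Top = "3", Middle = "2", Bottom = "1"
shelf_labeled <- fct_recode(shelf_raw, !!!shelf_code)

print(levels(shelf_raw))
print(as.character(shelf_labeled))
```

The levels keep their original order (1, 2, 3) and only their labels change, so no rows move between categories.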

Just as a quick reminder, here are the first few rows and columns of the data:

mascot_count |> 
  dplyr::select(name, shelf, manufacturer, type, has_mascot) |> 
  head() |> knitr::kable()

|name                      |shelf  |manufacturer   |type |has_mascot |
|:-------------------------|:------|:--------------|:----|:----------|
|100% Bran                 |Top    |Nabisco        |C    |No         |
|100% Natural Bran         |Top    |Quaker Oats    |C    |No         |
|All-Bran                  |Top    |Kelloggs       |C    |No         |
|All-Bran with Extra Fiber |Top    |Kelloggs       |C    |No         |
|Almond Delight            |Top    |Ralston Purina |C    |No         |
|Apple Cinnamon Cheerios   |Bottom |General Mills  |C    |No         |

One thing I like about vtrees is that they can highlight potential structural zeros in the data. Structural zeros are combinations of categories that never occur because the combination is impossible. Being aware of them is important: they can throw off model predictions when a model's assumptions are not met (zero-inflated models are one family of models designed to handle these extra zeros).
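As a rough sketch of the idea, here is how empty cells show up in a plain crosstab, using a tiny made-up data frame (`toy` and its values are invented for illustration, not taken from the cereals data). Keep in mind that an observed zero is only a candidate: whether it is truly structural depends on what you know about the data.

```r
# Hypothetical toy data: six cereals, two categorical variables
toy <- data.frame(
  shelf      = factor(c("Top", "Top", "Middle", "Middle", "Bottom", "Bottom"),
                      levels = c("Top", "Middle", "Bottom")),
  has_mascot = factor(c("No", "No", "Yes", "No", "Yes", "No"),
                      levels = c("No", "Yes"))
)

# Two-way crosstab of counts
tab <- table(shelf = toy$shelf, has_mascot = toy$has_mascot)
print(tab)

# List the combinations that never occur
cells <- as.data.frame(tab, stringsAsFactors = FALSE)
zero_cells <- cells[cells$Freq == 0, ]
print(zero_cells)
```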

Let's look at shelf and has_mascot in the cereals dataset. It is immediately apparent that the Top shelf has no cereals with mascots. Note that vtree expects a character argument, with the names of your variables in order, separated by a space.

vtree(mascot_count, "shelf has_mascot", pxwidth=500)

two way vtree showing that shelf="top" contains cereals that have 0 mascots

The strength of vtrees is that you can view partitions of more than two different variables. For example, we've added the cereal type here (H for hot cereal, and C for cold cereal).

vtree(mascot_count, "shelf has_mascot type", pxwidth = 500)

Three way vtree starting with shelf, has_mascot, and type
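To see what the tree is counting, here is a small base R sketch on made-up data (`toy` is hypothetical). Each level of a vtree corresponds to a crosstab with one more variable added, and every level partitions the same set of rows.

```r
# Hypothetical toy data with three categorical variables
toy <- data.frame(
  shelf      = c("Top", "Top", "Middle", "Middle", "Middle", "Bottom"),
  has_mascot = c("No", "No", "Yes", "Yes", "No", "Yes"),
  type       = c("C", "H", "C", "C", "C", "C")
)

# Counts at each depth of the tree
level1 <- xtabs(~ shelf, data = toy)                      # first split
level2 <- xtabs(~ shelf + has_mascot, data = toy)         # second split
level3 <- xtabs(~ shelf + has_mascot + type, data = toy)  # third split

# ftable() flattens the three-way table into a readable layout
print(ftable(level3))
```

Summing any deeper level over its last variable gives back the level above it, which is why the child nodes in a vtree always add up to their parent.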

I think vtrees are also good at highlighting disparities and injustice in populations. Let's look at an example from the Titanic data. One question we might have is what percentage of children survived in each class (1st, 2nd, 3rd) they were traveling in. To focus on the disparity between classes in child survival, I've used the prune argument to remove the Adult values from the vtree:

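library(datasets)

# Titanic ships with R as a contingency table; expand it to one row per passenger
data("Titanic")
titanic <- crosstabToCases(Titanic)

# Split by Class, then Age; each node reports the percentage with Survived == "Yes"
vtree(titanic,"Class Age",summary="Survived=Yes \n%pct% survived",
      sameline=TRUE, 
      prune=list(Age=c("Adult")),  # drop the Adult nodes
      pxwidth=500)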

vtree comparing classes and then split by age - only about a third of the 3rd class children survived

Unfortunately, the children who died on the Titanic were overwhelmingly traveling in third class.
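If you'd like to cross-check the percentages in the tree, here is one way to do it in base R straight from the built-in Titanic table. This is a sketch that sits alongside the vtree, not something vtree does internally; the object names are my own.

```r
library(datasets)

# Titanic is a 4-way table: Class x Sex x Age x Survived
child <- Titanic[, , "Child", ]            # keep children only

# Collapse over Sex to get Class x Survived counts
by_class <- apply(child, c(1, 3), sum)

# The Crew row has no children, so drop empty rows before dividing
by_class <- by_class[rowSums(by_class) > 0, ]

# Percentage of children who survived in each class
pct_survived <- round(100 * by_class[, "Yes"] / rowSums(by_class), 1)

print(by_class)
print(pct_survived)
```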

And that's the idea behind {vtree}: you can visualize more partitions of your data than a single two-way crosstable allows, and it can be especially useful for highlighting disparities and structural zeros in your data.

Question For You: What Should I Write About Next?

Let's do a choose your own adventure for next week! Click one of the links below to cast your vote.

You can also leave a comment by clicking here (scroll down, it's down at the bottom).
