Ready4R (2024-02-19): {vtree} for you and me
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
{vtree}: exploring variable partitions
Thank you to Andey Nunes-Brewster for introducing me to vtrees. A great tool to wonder at data with!
Last week we talked about crosstables for understanding how our dataset is partitioned in terms of categorical variables. I'd like to talk about an extension of crosstables called vtrees (or variable trees), which let us visualize partitions of more than two categorical variables in the data.
Let's continue on with the cereals dataset, which contains nutrition information for 77 cereals, along with their manufacturer, cereal type, and shelf. To make things a little easier to discuss, I'm recoding the shelf levels as Top, Middle, and Bottom.
If you'd like to play around with this dataset, I've put a CSV version (with mascots) here.
library(tidyverse)
library(vtree)
# named vector for recoding shelf: new label = old value
shelf_code <- c("Top" = "3", "Middle" = "2", "Bottom" = "1")

mascot_count <- read_csv("data/mascot_count.csv") |>
  dplyr::mutate(across(c("shelf", "type", "manufacturer"), as.factor)) |>
  dplyr::mutate(shelf = forcats::fct_recode(shelf, !!!shelf_code))
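As a quick sanity check (a small sketch, assuming the pipeline above ran cleanly), you can confirm the recode took by peeking at the factor levels:
# shelf should now show the labels Bottom, Middle, Top
# (the original 1/2/3 level order, renamed)
levels(mascot_count$shelf)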
Just as a quick reminder, here's the first few rows and columns of the data:
mascot_count |>
  dplyr::select(name, shelf, manufacturer, type, has_mascot) |>
  head() |>
  knitr::kable()
| name | shelf | manufacturer | type | has_mascot |
|---|---|---|---|---|
| 100% Bran | Top | Nabisco | C | No |
| 100% Natural Bran | Top | Quaker Oats | C | No |
| All-Bran | Top | Kelloggs | C | No |
| All-Bran with Extra Fiber | Top | Kelloggs | C | No |
| Almond Delight | Top | Ralston Purina | C | No |
| Apple Cinnamon Cheerios | Bottom | General Mills | C | No |
One thing I like about vtrees is that they can highlight potential structural zeros in the data. Structural zeros are combinations of categories that never occur because they are impossible combinations. Being aware of these is important because they can affect model predictions when our assumptions aren't met (there are models, called zero-inflated models, that address these extra zeroes).
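If you want to hunt for these directly, here's one way to surface zero-count cells (candidates for structural zeros) with {dplyr} and {tidyr}, a minimal sketch assuming the mascot_count data frame from above:
# count every observed shelf/has_mascot combination, fill in the
# missing combinations with n = 0, and keep only the empty cells
mascot_count |>
  dplyr::count(shelf, has_mascot) |>
  tidyr::complete(shelf, has_mascot, fill = list(n = 0)) |>
  dplyr::filter(n == 0)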
Let's look at shelf and has_mascot in the cereals dataset. It is immediately apparent that shelf = Top has no mascots. Note that vtree expects a character argument, with the names of your variables in order, separated by spaces.
vtree(mascot_count, "shelf has_mascot", pxwidth = 500)
The strength of vtrees is that you can view partitions of more than two different variables. For example, we've added the cereal type here (H for hot cereal and C for cold cereal).
vtree(mascot_count, "shelf has_mascot type", pxwidth = 500)
I think vtrees are also good at highlighting disparities and injustice in populations. Let's look at an example from the Titanic data. One question we might have is what percentage of children survived based on which class (1st, 2nd, 3rd) they were traveling in. To show the disparity in child survival across classes, I've removed the adults from this vtree using the prune argument to drop the Adult values:
library(datasets)
data("Titanic")

# convert the Titanic crosstab into one row per passenger
titanic <- crosstabToCases(Titanic)

vtree(titanic, "Class Age",
      summary = "Survived=Yes \n%pct% survived",
      sameline = TRUE,
      prune = list(Age = c("Adult")),
      pxwidth = 500)
Unfortunately, most of the children who died on the Titanic were traveling in third class.
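If you'd like to check the numbers behind the vtree, a quick {dplyr} summary reproduces the percentages (a sketch, assuming the titanic data frame created above):
# survival rate among children, by passenger class
titanic |>
  dplyr::filter(Age == "Child") |>
  dplyr::group_by(Class) |>
  dplyr::summarize(
    n_children = dplyr::n(),
    pct_survived = 100 * mean(Survived == "Yes")
  )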
And that's the idea behind {vtree}: you can visualize partitions of your data beyond a single two-way crosstable, which is especially useful for highlighting disparities and structural zeroes in your data.
Question For You: What Should I Write About Next?
Let's do a choose your own adventure for next week! Click one of the links below to cast your vote.
You can also leave a comment by clicking here (scroll down, it's down at the bottom).