Ready for R Mailing List

Subscribe
Archives
February 12, 2024

Ready4R (02-12-2024): The Power of Crosstables

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

Ready4R (02-12-2024): The Power of Crosstables

One of my favorite tools: {janitor} and Crosstables

Please note that I am working with an experimental API package for my mailing list, and I am debugging visibility issues. Please let me know if you can’t read any commands or output. You can see the web version of this by clicking on the Ready4R Banner above. If you need Ready for R course info, it’s here. Past newsletters are available here.

One of my most used tools in Exploratory Data Analysis is the humble crosstable. Crosstables can tell us many things, including whether variables are evenly distributed among two variables, or highlight structural zeros (combinations of variables that just don’t exist), and highlight the strength of association of one variable with another.

Let’s start out with the 80 cereals dataset (available here). Looking at the description, we know there should be at least 3 categorical variables (manufacturer, type, and shelf). Let’s keep that in mind when start looking at the data.

library(tidyverse)

manu_labels <- c("American Home"="A",
                   "General Mills"="G",
                   "Kelloggs"="K",
                   "Nabisco" = "N",
                   "Post" = "P",
                   "Quaker Oats" = "Q", 
                   "Ralston Purina" = "R")

cereals <- readr::read_csv("data/cereal.csv", ) |> 
  janitor::clean_names() |>
  mutate(shelf = factor(shelf, ordered=TRUE)) |>
  mutate(across(c("manufacturer", "type"), as.factor)) |>
  mutate(manufacturer = forcats::fct_recode(manufacturer, !!!manu_labels))

Crosstabs with janitor::tabyl()

There is a lot of power in counting things, especially in the crosstable, where we ask questions of association between variables. I want to highlight the power of crosstables.

You might be familiar with table(x, y) as a way to build your crosstables. I’d like to point one of the wonders of the {janitor} package: janitor::tabyl(). This function can be directly plugged into a tidy workflow.

For example, instead of table(cereals$shelf) we can write

cereals |>
  janitor::tabyl(shelf) 
##  shelf  n   percent
##      1 20 0.2597403
##      2 21 0.2727273
##      3 36 0.4675325

A couple of things that tabyl() does well: for single variables, it gives the counts and percentages, which can be extremely useful.

From reading the description, we know that manufacturer (if you are using the data from kaggle, I recoded mfr to be manufacturer) and type are two categorical variables in the data. We may want to know whether the manufacturers are evenly distributed in terms of cereal type.

I’ll also pipe the out of tabyl() into knitr::kable() to give a nicer looking table.

cereals |>
  janitor::tabyl(manufacturer, type) |>
  knitr::kable()
manufacturer C H
American Home 0 1
General Mills 22 0
Kelloggs 23 0
Nabisco 5 1
Post 9 0
Quaker Oats 7 1
Ralston Purina 8 0

It’s harder to see the picture because of the counts, though we do notice some things (General Mills and Kelloggs don’t have any hot cereals). We can look at proportions by using janitor::adorn_percentages, which will show us the row percentages in our cross table.

cereals |>
  janitor::tabyl(manufacturer, type) |>
  janitor::adorn_percentages(denominator = "row") |>
  janitor::adorn_pct_formatting() |>
  janitor::adorn_ns() |>
  knitr::kable()
manufacturer C H
American Home 0.0% (0) 100.0% (1)
General Mills 100.0% (22) 0.0% (0)
Kelloggs 100.0% (23) 0.0% (0)
Nabisco 83.3% (5) 16.7% (1)
Post 100.0% (9) 0.0% (0)
Quaker Oats 87.5% (7) 12.5% (1)
Ralston Purina 100.0% (8) 0.0% (0)

Using adorn_totals(denominator="row"), we can see the percentage of hot and cold cereals for each manufacturer. Note that by using adorn_percentages() and adorn_n() we can get both the counts and percentages.

A couple things become more obvious from the crosstab. There is only one manufacturer (A) that makes only hot cereals. The majority make cold cereals other than Nabisco and Quaker Oats.

Shelf Height and Marketing

What about the distribution of cereals among the shelf height (1=lowest shelf and 3=highest)? According to this article:

In a two-part study, researchers confirmed that the cereals targeting children are placed about 23 inches off the ground and those aimed at adults 48 inches high. After studying 65 cereals and 86 “spokes-characters” they found the cereals on the top shelves have characters staring straight ahead or slightly up to make eye contact with adults. For the lower boxes with cartoon characters with large inviting eyes, the gaze is focused slightly downward, to create eye contact with children.

We know that a lot of cereals that are marketed towards kids are on the second shelf, and a lot of value cereals are on the first shelf, and the more “adult cereals” are on the top shelf. Can we use crosstables to visualize the distribution of cereals?

cereals |>
  janitor::tabyl(manufacturer, shelf) |>
  janitor::adorn_percentages(denominator="row") |>
  janitor::adorn_ns() |>
  knitr::kable()
manufacturer 1 2 3
American Home 0.0000000 (0) 1.0000000 (1) 0.0000000 (0)
General Mills 0.2727273 (6) 0.3181818 (7) 0.4090909 (9)
Kelloggs 0.1739130 (4) 0.3043478 (7) 0.5217391 (12)
Nabisco 0.5000000 (3) 0.3333333 (2) 0.1666667 (1)
Post 0.2222222 (2) 0.1111111 (1) 0.6666667 (6)
Quaker Oats 0.1250000 (1) 0.3750000 (3) 0.5000000 (4)
Ralston Purina 0.5000000 (4) 0.0000000 (0) 0.5000000 (4)

So now we have an idea that kid-marketed cereals are on the second shelf. Can we check this? One idea to check this is to see which cereals that have breakfast cereal mascots. Then we can do a crosstab of cereals that have mascots versus shelf level.

Which shelves have cereal mascots?

We can do this by merging a CSV file I built from scraping a whiteclouds article about cereal mascots. If you’re interested in how I made this, the recipe is here. You can also download the mascot.csv file there as well. I didn’t clean it too much (note that some of the cereals have some notes in parentheses), but I’m mostly interested to see if there’s a match in our cereals dataset.

I changed cereal names to match where necessary in the mascots data.frame so that merging the two frames will produce correct results.

m_c <- read.csv(here::here("data/mascots.csv"), row.names = NULL)

m_c <- m_c |>
mutate(name=stringr::str_replace(name,"Cap’n Crunch cereals", "Cap'n'Crunch")) |>
  mutate(name=stringr::str_replace(name, "Count Chocula cereal", "Count Chocula")) |>
  mutate(name=stringr::str_replace(name, "Honey Smacks", "Smacks")) |>
  mutate(name=stringr::str_replace(name, "Mini-Wheats", "Frosted Mini-Wheats")) 

mascot_count <- cereals |>
  left_join(y=m_c, by="name") |>
  mutate(has_mascot = ifelse(is.na(mascot), "No", "Yes")) 

dim(mascot_count)
## [1] 80 19

As a quick check, let’s take a look at the 2nd shelf to see if we missed any potential cereal mascots.

mascot_count |>
  dplyr::filter(shelf==2) |>
  select(name, manufacturer, mascot, has_mascot) |>
  arrange(mascot) |>
  knitr::kable()
name manufacturer mascot has_mascot
Cocoa Puffs General Mills Buzz the Bee Yes
Cap’n’Crunch Quaker Oats Cap’n Crunch (Horatio Magellan Crunch) Yes
Cocoa Puffs General Mills Cocoa Puffs’ Sheik of Shake Yes
Corn Pops Kelloggs Cornelius (Corny) the Corn Yes
Count Chocula General Mills Count Chocula Yes
Smacks Kelloggs Dig’em Frog Yes
Fruity Pebbles Post Fred Flintstone Yes
Lucky Charms General Mills Lucky the Leprechaun Yes
Frosted Mini-Wheats Kelloggs Mr. Mini-Wheats Yes
Cocoa Puffs General Mills Sonny the Cuckoo Bird Yes
Raisin Bran Kelloggs The Raisin Bran Sun Yes
Froot Loops Kelloggs Toucan Sam Yes
Trix General Mills Trix Rabbit Yes
Cinnamon Toast Crunch General Mills Wendell Yes
Apple Jacks Kelloggs NA No
Cream of Wheat (Quick) Nabisco NA No
Golden Grahams General Mills NA No
Honey Graham Ohs Quaker Oats NA No
Kix General Mills NA No
Life Quaker Oats NA No
Maypo American Home NA No
Nut&Honey Crunch Kelloggs NA No
Strawberry Fruit Wheats Nabisco NA No

Looks ok. Note that there’s duplicate rows for cereals, because mascots can change over time (some cereals have multiple mascots).

Ok, now that we’re satisfied, we can start to answer our question about cereal mascots and shelves. In order to produce our cross tab, we’ll need to remove duplicate cereals (because they can have multiple mascots) before we make the cross-tab. I’ll do that using distinct() after I remove the mascot column:

mascot_count |>
  select(name, shelf, has_mascot) |>
  distinct() |>
  janitor::tabyl(shelf, has_mascot) |>
  janitor::adorn_percentages() |>
  janitor::adorn_pct_formatting() |>
  janitor::adorn_ns() |>
  knitr::kable()
shelf No Yes
1 60.0% (12) 40.0% (8)
2 42.9% (9) 57.1% (12)
3 100.0% (36) 0.0% (0)

And wow! Summarizing the data in this way makes one thing very obvious: there are no cereals that have mascots on the 3rd shelf. So yes, our quick EDA shows that cereal manufacturers are targeting children on the 1st and 2nd shelves.

If we dive deeper, we can see that General Mills and Kellogg’s are the brands that have the most cereal mascots (40% and 40% of all cereal mascots in the dataset). Note we’re using column totals this time for the dataset.

mascot_count |>
  select(name, manufacturer, has_mascot) |>
  distinct() |>
  janitor::tabyl(manufacturer, has_mascot) |>
  janitor::adorn_percentages(denominator = "col") |>
  janitor::adorn_pct_formatting() |>
  janitor::adorn_ns() |>
  knitr::kable()
manufacturer No Yes
American Home 1.8% (1) 0.0% (0)
General Mills 24.6% (14) 40.0% (8)
Kelloggs 26.3% (15) 40.0% (8)
Nabisco 10.5% (6) 0.0% (0)
Post 12.3% (7) 10.0% (2)
Quaker Oats 12.3% (7) 5.0% (1)
Ralston Purina 12.3% (7) 5.0% (1)

What did we learn?

We learned about the power of crosstables to highlight patterns of association in the data. By merging in a data.frame of cereal mascots, we were able to show that the marketing to childen using these mascots at their eye level was real.

Celebrate Black History Month with W.E.B. Dubois’ Visualizations

I’d just like to wrap up this newsletter encouraging you to read up more about W.E.B. DuBois’ amazing visualizations he put together for the 1900 Paris Exhibition of “The Exhibition of American Negroes”. I find his visualizations to be emotionally affecting. This page is a wonderful read about the history. Check out Chimdi Nwosu’s interactive W.E.B. Dubois Portrait Gallery for examples of these powerful visualizations.

Let me know your thoughts

If you’ve gotten this far, thank you for reading.

You can add your thoughts and comments to this post by clicking on the banner above to access the web version, which has comments enabled. Love to hear them. Share any insights you have gotten through crosstabs on the data. Does janitor::tabyl() help you do your work better?

Don't miss what's next. Subscribe to Ready for R Mailing List:
Start the conversation:
This email brought to you by Buttondown, the easiest way to start and grow your newsletter.