Ready4R (02-12-2024): The Power of Crosstables

                February 12, 2024

            Ready4R (02-12-2024): The Power of Crosstables

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

Ready4R (02-12-2024): The Power of Crosstables

One of my favorite tools: {janitor} and
Crosstables

Please note that I am working with an experimental API package for my
mailing list, and I am debugging visibility issues. Please let me know
if you can’t read any commands or output. You can see the web version of
this by clicking on the Ready4R Banner above. If you need Ready for R
course info, it’s
here. Past newsletters are available here.

One of my most used tools in Exploratory Data Analysis is the humble
crosstable. Crosstables can tell us many things, including whether
variables are evenly distributed among two variables, or highlight
structural zeros (combinations of variables that just don’t exist), and
highlight the strength of association of one variable with another.
Let’s start out with the 80 cereals dataset (available here).
Looking at the description, we know there should be at least 3
categorical variables (manufacturer, type, and
shelf). Let’s keep that in mind when start looking at the
data.
library(tidyverse)

manu_labels <- c("American Home"="A",
                   "General Mills"="G",
                   "Kelloggs"="K",
                   "Nabisco" = "N",
                   "Post" = "P",
                   "Quaker Oats" = "Q", 
                   "Ralston Purina" = "R")

cereals <- readr::read_csv("data/cereal.csv", ) |> 
  janitor::clean_names() |>
  mutate(shelf = factor(shelf, ordered=TRUE)) |>
  mutate(across(c("manufacturer", "type"), as.factor)) |>
  mutate(manufacturer = forcats::fct_recode(manufacturer, !!!manu_labels))

Crosstabs with janitor::tabyl()
There is a lot of power in counting things, especially in the
crosstable, where we ask questions of association between variables. I
want to highlight the power of crosstables.
You might be familiar with table(x, y) as a way to build
your crosstables. I’d like to point one of the wonders of the
{janitor} package: janitor::tabyl(). This
function can be directly plugged into a tidy workflow.
For example, instead of table(cereals$shelf) we can
write
cereals |>
  janitor::tabyl(shelf) 
##  shelf  n   percent
##      1 20 0.2597403
##      2 21 0.2727273
##      3 36 0.4675325
A couple of things that tabyl() does well: for single
variables, it gives the counts and percentages, which can be extremely
useful.
From reading the
description, we know that manufacturer (if you are
using the data from kaggle, I recoded mfr to be
manufacturer) and type are two categorical
variables in the data. We may want to know whether the manufacturers are
evenly distributed in terms of cereal type.
I’ll also pipe the out of tabyl() into
knitr::kable() to give a nicer looking table.
cereals |>
  janitor::tabyl(manufacturer, type) |>
  knitr::kable()

manufacturer
C
H

American Home
0
1

General Mills
22
0

Kelloggs
23
0

Nabisco
5
1

Post
9
0

Quaker Oats
7
1

Ralston Purina
8
0

It’s harder to see the picture because of the counts, though we do
notice some things (General Mills and Kelloggs
don’t have any hot cereals). We can look at proportions by using
janitor::adorn_percentages, which will show us the row
percentages in our cross table.
cereals |>
  janitor::tabyl(manufacturer, type) |>
  janitor::adorn_percentages(denominator = "row") |>
  janitor::adorn_pct_formatting() |>
  janitor::adorn_ns() |>
  knitr::kable()

manufacturer
C
H

American Home
0.0% (0)
100.0% (1)

General Mills
100.0% (22)
0.0% (0)

Kelloggs
100.0% (23)
0.0% (0)

Nabisco
83.3% (5)
16.7% (1)

Post
100.0% (9)
0.0% (0)

Quaker Oats
87.5% (7)
12.5% (1)

Ralston Purina
100.0% (8)
0.0% (0)

Using adorn_totals(denominator="row"), we can see the
percentage of hot and cold cereals for each manufacturer. Note that by
using adorn_percentages() and adorn_n() we can
get both the counts and percentages.
A couple things become more obvious from the crosstab. There is only
one manufacturer (A) that makes only hot cereals. The
majority make cold cereals other than Nabisco and
Quaker Oats.

Shelf Height and Marketing
What about the distribution of cereals among the shelf height
(1=lowest shelf and 3=highest)? According to
this
article:

In a two-part study, researchers confirmed that the cereals
targeting children are placed about 23 inches
off the ground and those aimed at adults 48 inches high. After studying
65 cereals and 86 “spokes-characters” they found the cereals on the top
shelves have characters staring straight ahead or slightly up to make
eye contact with adults. For the lower boxes with cartoon characters
with large inviting eyes, the gaze is focused slightly downward, to
create eye contact with children.

We know that a lot of cereals that are marketed towards kids are on
the second shelf, and a lot of value cereals are on the first shelf, and
the more “adult cereals” are on the top shelf. Can we use crosstables to
visualize the distribution of cereals?
cereals |>
  janitor::tabyl(manufacturer, shelf) |>
  janitor::adorn_percentages(denominator="row") |>
  janitor::adorn_ns() |>
  knitr::kable()

manufacturer
1
2
3

American Home
0.0000000 (0)
1.0000000 (1)
0.0000000 (0)

General Mills
0.2727273 (6)
0.3181818 (7)
0.4090909 (9)

Kelloggs
0.1739130 (4)
0.3043478 (7)
0.5217391 (12)

Nabisco
0.5000000 (3)
0.3333333 (2)
0.1666667 (1)

Post
0.2222222 (2)
0.1111111 (1)
0.6666667 (6)

Quaker Oats
0.1250000 (1)
0.3750000 (3)
0.5000000 (4)

Ralston Purina
0.5000000 (4)
0.0000000 (0)
0.5000000 (4)

So now we have an idea that kid-marketed cereals are on the second
shelf. Can we check this? One idea to check this is to see which cereals
that have breakfast cereal mascots. Then we can do a crosstab of cereals
that have mascots versus shelf level.

Which shelves have cereal mascots?
We can do this by merging a CSV file I built from scraping a whiteclouds
article about cereal mascots. If you’re interested in how I made
this, the
recipe is here. You can also download the mascot.csv
file there as well. I didn’t clean it too much (note that some of the
cereals have some notes in parentheses), but I’m mostly interested to
see if there’s a match in our cereals dataset.
I changed cereal names to match where necessary in the mascots
data.frame so that merging the two frames will produce correct
results.
m_c <- read.csv(here::here("data/mascots.csv"), row.names = NULL)

m_c <- m_c |>
mutate(name=stringr::str_replace(name,"Cap’n Crunch cereals", "Cap'n'Crunch")) |>
  mutate(name=stringr::str_replace(name, "Count Chocula cereal", "Count Chocula")) |>
  mutate(name=stringr::str_replace(name, "Honey Smacks", "Smacks")) |>
  mutate(name=stringr::str_replace(name, "Mini-Wheats", "Frosted Mini-Wheats")) 

mascot_count <- cereals |>
  left_join(y=m_c, by="name") |>
  mutate(has_mascot = ifelse(is.na(mascot), "No", "Yes")) 

dim(mascot_count)
## [1] 80 19
As a quick check, let’s take a look at the 2nd shelf to see if we
missed any potential cereal mascots.
mascot_count |>
  dplyr::filter(shelf==2) |>
  select(name, manufacturer, mascot, has_mascot) |>
  arrange(mascot) |>
  knitr::kable()

name
manufacturer
mascot
has_mascot

Cocoa Puffs
General Mills
Buzz the Bee
Yes

Cap’n’Crunch
Quaker Oats
Cap’n Crunch (Horatio Magellan Crunch)
Yes

Cocoa Puffs
General Mills
Cocoa Puffs’ Sheik of Shake
Yes

Corn Pops
Kelloggs
Cornelius (Corny) the Corn
Yes

Count Chocula
General Mills
Count Chocula
Yes

Smacks
Kelloggs
Dig’em Frog
Yes

Fruity Pebbles
Post
Fred Flintstone
Yes

Lucky Charms
General Mills
Lucky the Leprechaun
Yes

Frosted Mini-Wheats
Kelloggs
Mr. Mini-Wheats
Yes

Cocoa Puffs
General Mills
Sonny the Cuckoo Bird
Yes

Raisin Bran
Kelloggs
The Raisin Bran Sun
Yes

Froot Loops
Kelloggs
Toucan Sam
Yes

Trix
General Mills
Trix Rabbit
Yes

Cinnamon Toast Crunch
General Mills
Wendell
Yes

Apple Jacks
Kelloggs
NA
No

Cream of Wheat (Quick)
Nabisco
NA
No

Golden Grahams
General Mills
NA
No

Honey Graham Ohs
Quaker Oats
NA
No

Kix
General Mills
NA
No

Life
Quaker Oats
NA
No

Maypo
American Home
NA
No

Nut&Honey Crunch
Kelloggs
NA
No

Strawberry Fruit Wheats
Nabisco
NA
No

Looks ok. Note that there’s duplicate rows for cereals, because
mascots can change over time (some cereals have multiple mascots).
Ok, now that we’re satisfied, we can start to answer our question
about cereal mascots and shelves. In order to produce our cross tab,
we’ll need to remove duplicate cereals (because they can have multiple
mascots) before we make the cross-tab. I’ll do that using
distinct() after I remove the mascot
column:
mascot_count |>
  select(name, shelf, has_mascot) |>
  distinct() |>
  janitor::tabyl(shelf, has_mascot) |>
  janitor::adorn_percentages() |>
  janitor::adorn_pct_formatting() |>
  janitor::adorn_ns() |>
  knitr::kable()

shelf
No
Yes

1
60.0% (12)
40.0% (8)

2
42.9% (9)
57.1% (12)

3
100.0% (36)
0.0% (0)

And wow! Summarizing the data in this way makes one thing very
obvious: there are no cereals that have mascots on the 3rd
shelf. So yes, our quick EDA shows that cereal manufacturers
are targeting children on the 1st and 2nd shelves.
If we dive deeper, we can see that General Mills and Kellogg’s are
the brands that have the most cereal mascots (40% and 40% of all cereal
mascots in the dataset). Note we’re using column totals this time for
the dataset.
mascot_count |>
  select(name, manufacturer, has_mascot) |>
  distinct() |>
  janitor::tabyl(manufacturer, has_mascot) |>
  janitor::adorn_percentages(denominator = "col") |>
  janitor::adorn_pct_formatting() |>
  janitor::adorn_ns() |>
  knitr::kable()

manufacturer
No
Yes

American Home
1.8% (1)
0.0% (0)

General Mills
24.6% (14)
40.0% (8)

Kelloggs
26.3% (15)
40.0% (8)

Nabisco
10.5% (6)
0.0% (0)

Post
12.3% (7)
10.0% (2)

Quaker Oats
12.3% (7)
5.0% (1)

Ralston Purina
12.3% (7)
5.0% (1)

What did we learn?
We learned about the power of crosstables to highlight patterns of
association in the data. By merging in a data.frame of cereal mascots,
we were able to show that the marketing to childen using these mascots
at their eye level was real.

Celebrate Black History Month with W.E.B. Dubois’
Visualizations
I’d just like to wrap up this newsletter encouraging you to read up
more about W.E.B. DuBois’ amazing visualizations he put together for the
1900 Paris Exhibition of “The Exhibition of American Negroes”. I find
his visualizations to be emotionally affecting. This
page is a wonderful read about the history. Check out Chimdi Nwosu’s
interactive W.E.B.
Dubois Portrait Gallery for examples of these powerful
visualizations.

Let me know your thoughts
If you’ve gotten this far, thank you for reading.
You can add your thoughts and comments to this post by clicking on
the banner above to access the web version, which has comments enabled.
Love to hear them. Share any insights you have gotten through crosstabs
on the data. Does janitor::tabyl() help you do your work
better?

Don't miss what's next. Subscribe to Ready for R Mailing List:

Start the conversation:

Ready for R Mailing List

Ready4R (02-12-2024): The Power of Crosstables

One of my favorite tools: `{janitor}` and Crosstables

Crosstabs with `janitor::tabyl()`

Shelf Height and Marketing

Which shelves have cereal mascots?

What did we learn?

Celebrate Black History Month with W.E.B. Dubois’ Visualizations

Let me know your thoughts

Ready for R Mailing List

Ready4R (02-12-2024): The Power of Crosstables

One of my favorite tools: {janitor} and Crosstables

Crosstabs with janitor::tabyl()

Shelf Height and Marketing

Which shelves have cereal mascots?

What did we learn?

Celebrate Black History Month with W.E.B. Dubois’ Visualizations

Let me know your thoughts

One of my favorite tools: `{janitor}` and Crosstables

Crosstabs with `janitor::tabyl()`