Crosstabs with janitor::tabyl()
There is a lot of power in counting things, especially in the
crosstable, where we ask questions of association between variables. I
want to highlight the power of crosstables.
You might be familiar with table(x, y) as a way to build
your crosstables. I’d like to point one of the wonders of the
{janitor} package: janitor::tabyl() . This
function can be directly plugged into a tidy workflow.
For example, instead of table(cereals$shelf) we can
write
cereals |>
janitor::tabyl(shelf)
## shelf n percent
## 1 20 0.2597403
## 2 21 0.2727273
## 3 36 0.4675325
A couple of things that tabyl() does well: for single
variables, it gives the counts and percentages, which can be extremely
useful.
From reading the
description, we know that manufacturer (if you are
using the data from kaggle, I recoded mfr to be
manufacturer ) and type are two categorical
variables in the data. We may want to know whether the manufacturers are
evenly distributed in terms of cereal type.
I’ll also pipe the out of tabyl() into
knitr::kable() to give a nicer looking table.
cereals |>
janitor::tabyl(manufacturer, type) |>
knitr::kable()
American Home |
0 |
1 |
General Mills |
22 |
0 |
Kelloggs |
23 |
0 |
Nabisco |
5 |
1 |
Post |
9 |
0 |
Quaker Oats |
7 |
1 |
Ralston Purina |
8 |
0 |
It’s harder to see the picture because of the counts, though we do
notice some things (General Mills and Kelloggs
don’t have any hot cereals). We can look at proportions by using
janitor::adorn_percentages , which will show us the row
percentages in our cross table.
cereals |>
janitor::tabyl(manufacturer, type) |>
janitor::adorn_percentages(denominator = "row") |>
janitor::adorn_pct_formatting() |>
janitor::adorn_ns() |>
knitr::kable()
American Home |
0.0% (0) |
100.0% (1) |
General Mills |
100.0% (22) |
0.0% (0) |
Kelloggs |
100.0% (23) |
0.0% (0) |
Nabisco |
83.3% (5) |
16.7% (1) |
Post |
100.0% (9) |
0.0% (0) |
Quaker Oats |
87.5% (7) |
12.5% (1) |
Ralston Purina |
100.0% (8) |
0.0% (0) |
Using adorn_totals(denominator="row") , we can see the
percentage of hot and cold cereals for each manufacturer. Note that by
using adorn_percentages() and adorn_n() we can
get both the counts and percentages.
A couple things become more obvious from the crosstab. There is only
one manufacturer (A ) that makes only hot cereals. The
majority make cold cereals other than Nabisco and
Quaker Oats .
Shelf Height and Marketing
What about the distribution of cereals among the shelf height
(1 =lowest shelf and 3 =highest)? According to
this
article:
In a two-part study, researchers confirmed that the cereals
targeting children are placed about 23 inches
off the ground and those aimed at adults 48 inches high. After studying
65 cereals and 86 “spokes-characters” they found the cereals on the top
shelves have characters staring straight ahead or slightly up to make
eye contact with adults. For the lower boxes with cartoon characters
with large inviting eyes, the gaze is focused slightly downward, to
create eye contact with children.
We know that a lot of cereals that are marketed towards kids are on
the second shelf, and a lot of value cereals are on the first shelf, and
the more “adult cereals” are on the top shelf. Can we use crosstables to
visualize the distribution of cereals?
cereals |>
janitor::tabyl(manufacturer, shelf) |>
janitor::adorn_percentages(denominator="row") |>
janitor::adorn_ns() |>
knitr::kable()
American Home |
0.0000000 (0) |
1.0000000 (1) |
0.0000000 (0) |
General Mills |
0.2727273 (6) |
0.3181818 (7) |
0.4090909 (9) |
Kelloggs |
0.1739130 (4) |
0.3043478 (7) |
0.5217391 (12) |
Nabisco |
0.5000000 (3) |
0.3333333 (2) |
0.1666667 (1) |
Post |
0.2222222 (2) |
0.1111111 (1) |
0.6666667 (6) |
Quaker Oats |
0.1250000 (1) |
0.3750000 (3) |
0.5000000 (4) |
Ralston Purina |
0.5000000 (4) |
0.0000000 (0) |
0.5000000 (4) |
So now we have an idea that kid-marketed cereals are on the second
shelf. Can we check this? One idea to check this is to see which cereals
that have breakfast cereal mascots. Then we can do a crosstab of cereals
that have mascots versus shelf level.
Which shelves have cereal mascots?
We can do this by merging a CSV file I built from scraping a whiteclouds
article about cereal mascots. If you’re interested in how I made
this, the
recipe is here. You can also download the mascot.csv
file there as well. I didn’t clean it too much (note that some of the
cereals have some notes in parentheses), but I’m mostly interested to
see if there’s a match in our cereals dataset.
I changed cereal names to match where necessary in the mascots
data.frame so that merging the two frames will produce correct
results.
m_c <- read.csv(here::here("data/mascots.csv"), row.names = NULL)
m_c <- m_c |>
mutate(name=stringr::str_replace(name,"Cap’n Crunch cereals", "Cap'n'Crunch")) |>
mutate(name=stringr::str_replace(name, "Count Chocula cereal", "Count Chocula")) |>
mutate(name=stringr::str_replace(name, "Honey Smacks", "Smacks")) |>
mutate(name=stringr::str_replace(name, "Mini-Wheats", "Frosted Mini-Wheats"))
mascot_count <- cereals |>
left_join(y=m_c, by="name") |>
mutate(has_mascot = ifelse(is.na(mascot), "No", "Yes"))
dim(mascot_count)
## [1] 80 19
As a quick check, let’s take a look at the 2nd shelf to see if we
missed any potential cereal mascots.
mascot_count |>
dplyr::filter(shelf==2) |>
select(name, manufacturer, mascot, has_mascot) |>
arrange(mascot) |>
knitr::kable()
Cocoa Puffs |
General Mills |
Buzz the Bee |
Yes |
Cap’n’Crunch |
Quaker Oats |
Cap’n Crunch (Horatio Magellan Crunch) |
Yes |
Cocoa Puffs |
General Mills |
Cocoa Puffs’ Sheik of Shake |
Yes |
Corn Pops |
Kelloggs |
Cornelius (Corny) the Corn |
Yes |
Count Chocula |
General Mills |
Count Chocula |
Yes |
Smacks |
Kelloggs |
Dig’em Frog |
Yes |
Fruity Pebbles |
Post |
Fred Flintstone |
Yes |
Lucky Charms |
General Mills |
Lucky the Leprechaun |
Yes |
Frosted Mini-Wheats |
Kelloggs |
Mr. Mini-Wheats |
Yes |
Cocoa Puffs |
General Mills |
Sonny the Cuckoo Bird |
Yes |
Raisin Bran |
Kelloggs |
The Raisin Bran Sun |
Yes |
Froot Loops |
Kelloggs |
Toucan Sam |
Yes |
Trix |
General Mills |
Trix Rabbit |
Yes |
Cinnamon Toast Crunch |
General Mills |
Wendell |
Yes |
Apple Jacks |
Kelloggs |
NA |
No |
Cream of Wheat (Quick) |
Nabisco |
NA |
No |
Golden Grahams |
General Mills |
NA |
No |
Honey Graham Ohs |
Quaker Oats |
NA |
No |
Kix |
General Mills |
NA |
No |
Life |
Quaker Oats |
NA |
No |
Maypo |
American Home |
NA |
No |
Nut&Honey Crunch |
Kelloggs |
NA |
No |
Strawberry Fruit Wheats |
Nabisco |
NA |
No |
Looks ok. Note that there’s duplicate rows for cereals, because
mascots can change over time (some cereals have multiple mascots).
Ok, now that we’re satisfied, we can start to answer our question
about cereal mascots and shelves. In order to produce our cross tab,
we’ll need to remove duplicate cereals (because they can have multiple
mascots) before we make the cross-tab. I’ll do that using
distinct() after I remove the mascot
column:
mascot_count |>
select(name, shelf, has_mascot) |>
distinct() |>
janitor::tabyl(shelf, has_mascot) |>
janitor::adorn_percentages() |>
janitor::adorn_pct_formatting() |>
janitor::adorn_ns() |>
knitr::kable()
1 |
60.0% (12) |
40.0% (8) |
2 |
42.9% (9) |
57.1% (12) |
3 |
100.0% (36) |
0.0% (0) |
And wow! Summarizing the data in this way makes one thing very
obvious: there are no cereals that have mascots on the 3rd
shelf. So yes, our quick EDA shows that cereal manufacturers
are targeting children on the 1st and 2nd shelves.
If we dive deeper, we can see that General Mills and Kellogg’s are
the brands that have the most cereal mascots (40% and 40% of all cereal
mascots in the dataset). Note we’re using column totals this time for
the dataset.
mascot_count |>
select(name, manufacturer, has_mascot) |>
distinct() |>
janitor::tabyl(manufacturer, has_mascot) |>
janitor::adorn_percentages(denominator = "col") |>
janitor::adorn_pct_formatting() |>
janitor::adorn_ns() |>
knitr::kable()
American Home |
1.8% (1) |
0.0% (0) |
General Mills |
24.6% (14) |
40.0% (8) |
Kelloggs |
26.3% (15) |
40.0% (8) |
Nabisco |
10.5% (6) |
0.0% (0) |
Post |
12.3% (7) |
10.0% (2) |
Quaker Oats |
12.3% (7) |
5.0% (1) |
Ralston Purina |
12.3% (7) |
5.0% (1) |
What did we learn?
We learned about the power of crosstables to highlight patterns of
association in the data. By merging in a data.frame of cereal mascots,
we were able to show that the marketing to childen using these mascots
at their eye level was real.
Celebrate Black History Month with W.E.B. Dubois’
Visualizations
I’d just like to wrap up this newsletter encouraging you to read up
more about W.E.B. DuBois’ amazing visualizations he put together for the
1900 Paris Exhibition of “The Exhibition of American Negroes”. I find
his visualizations to be emotionally affecting. This
page is a wonderful read about the history. Check out Chimdi Nwosu’s
interactive W.E.B.
Dubois Portrait Gallery for examples of these powerful
visualizations.
Let me know your thoughts
If you’ve gotten this far, thank you for reading.
You can add your thoughts and comments to this post by clicking on
the banner above to access the web version, which has comments enabled.
Love to hear them. Share any insights you have gotten through crosstabs
on the data. Does janitor::tabyl() help you do your work
better?
|
|