Ready for R (2024-02-05): Skim Your Data

February 5, 2024


Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.


One of my favorite tools: {skimr}

Please note that I am working with an experimental API package for my
mailing list, and I am debugging visibility issues. Please let me know
if you can’t read any commands or output. You can see the web version of
this by clicking on the Ready4R Banner above.

I’m going to highlight the {skimr} package, which I try
to use whenever I encounter new data. It’s invaluable for seeing the big
picture of the data. You can install it using:
install.packages("skimr")
Let’s start out with the 80 cereals dataset (available here).
Looking at the description, we know there should be at least 3
categorical variables (manufacturer, type, and
shelf). Let’s keep that in mind when we start looking at the
data. I’m going to change the data types of all three to factors. In
the case of shelf, we’ll cast the variable as an
ordered factor type.
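library(tidyverse)

# Read the data, standardize the column names, then set the categorical types:
# shelf becomes an ordered factor; manufacturer and type become plain factors
cereals <- readr::read_csv("data/cereal.csv") |>
  janitor::clean_names() |>
  mutate(shelf = factor(shelf, ordered = TRUE)) |>
  mutate(across(c("manufacturer", "type"), as.factor))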
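If the difference between a plain factor and an ordered factor is unfamiliar, here is a minimal base R sketch. The shelf_raw values are made up for illustration and are not taken from the cereal data.

```r
# Hypothetical shelf values, not the real cereal data
shelf_raw <- c(1, 3, 2, 3, 1)

shelf_unordered <- factor(shelf_raw)
shelf_ordered <- factor(shelf_raw, ordered = TRUE)

# Only the second version knows that 1 < 2 < 3
is.ordered(shelf_unordered)
is.ordered(shelf_ordered)

# Comparisons and max() are defined for ordered factors
above_bottom <- shelf_ordered >= "2"
top_shelf <- max(shelf_ordered)
above_bottom
top_shelf
```

An ordered factor keeps the level order, so comparisons like these are meaningful. With a plain factor, the same comparison returns NA with a warning.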
Now that we have the data loaded correctly, we can take a look at
it. You may be used to using summary() on your data.
skimr::skim() is similar, but it gives you variable
summaries based on variable type (character,
factor, and numeric in this dataset).
You’ll usually call it directly, like this:
skimr::skim(cereals). But to make it easier to explain,
I’ll separate the output into the different summary types.
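To see the "summaries by type" idea on something small, here is a sketch using a made-up data frame called toy (an assumption for illustration, not the cereal data). It assumes {skimr} is installed.

```r
library(skimr)

# Hypothetical toy data with one column of each type
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  type = factor(c("C", "C", "H", "C")),
  calories = c(70, 120, 110, 100),
  stringsAsFactors = FALSE
)

toy_skim <- skim(toy)

# One row per variable; skim_type records which summary family was used
toy_skim$skim_variable
toy_skim$skim_type

# partition() splits the result into one table per variable type
toy_parts <- partition(toy_skim)
names(toy_parts)
```

partition() returns every type-specific table at once, while yank() pulls out a single type by name.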

Overall Summary
The first one is the overall data.frame summary:
skim_output <- skimr::skim(cereals)
summary(skim_output)

Data summary

Name               cereals
Number of rows     77
Number of columns  16

Column type frequency:
  character        1
  factor           3
  numeric          12

Group variables    None

There’s lots of important information here: the dimensions of the data and
the number of variables of each type. If the variable counts by type
aren’t what I expect, that’s usually a signal that I’ll need to do some
variable transformation.
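The same "do the type counts match what I expect?" check can be sketched in base R. The toy data frame and the simple_type() helper below are hypothetical names made up for this example.

```r
# Hypothetical toy data: shelf arrives as a plain number
toy <- data.frame(
  name = c("A", "B", "C"),
  shelf = c(1, 3, 2),
  calories = c(70, 120, 110),
  stringsAsFactors = FALSE
)

# Helper (an assumption for this sketch): report one simple type per column
simple_type <- function(col) {
  if (is.factor(col)) "factor" else class(col)[1]
}

col_types <- vapply(toy, simple_type, character(1))
table(col_types)

# The description says shelf is categorical, so cast it and recount
toy$shelf <- factor(toy$shelf, ordered = TRUE)
col_types_fixed <- vapply(toy, simple_type, character(1))
table(col_types_fixed)
```

Before the cast, the tally shows two numeric columns where the description promises only one. That mismatch is the signal to transform a variable.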

Character Summary
Let’s take a look at the character variable summary.
skimr::yank(skim_output, "character")
Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name                   0              1    3   38      0        77           0

There’s only one character variable, name. We expect
it to be unique, since the dataset is defined as one cereal per row.
The n_unique column confirms this (77 rows, 77 unique values).
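If n_unique ever comes out smaller than the number of rows, a quick base R follow-up can show which values repeat. The toy_names vector below is invented for illustration.

```r
# Hypothetical names with one accidental duplicate
toy_names <- c("Bran Flakes", "Corn Pops", "Kix", "Kix")

# Same comparison skim() lets you make by eye: unique values vs. rows
all_unique <- length(unique(toy_names)) == length(toy_names)
all_unique

# Which values appear more than once?
repeated <- unique(toy_names[duplicated(toy_names)])
repeated
```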

Factor Summary
When we take a look at the factor level summary, things start to get
interesting:
skimr::yank(skim_output, "factor")
Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
manufacturer           0              1  FALSE           7  K: 23, G: 22, P: 9, Q: 8
type                   0              1  FALSE           2  C: 74, H: 3
shelf                  0              1  TRUE            3  3: 36, 2: 21, 1: 20

You can see that the categories for each variable, such as
manufacturer, are sorted by frequency in top_counts. There’s also
information about whether the variable is missing any values, which can
be helpful.
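Note that top_counts lists only the most frequent levels: manufacturer has 7 unique values, but only four appear. To see every level, a frequency table sorted the same way is easy to build in base R. The manufacturer vector here is a made-up stand-in for the real column.

```r
# Hypothetical manufacturer codes, not the real cereal data
manufacturer <- factor(c("K", "G", "K", "P", "G", "K", "Q", "N"))

# Full frequency table, most common level first
counts <- sort(table(manufacturer), decreasing = TRUE)
counts

# Roughly what top_counts displays: the first few levels only
top4 <- head(counts, 4)
top4
```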

Numeric Summary
I get really excited about the numeric summary.
skimr::yank(skim_output, "numeric")
Variable type: numeric

skim_variable    n_missing  complete_rate    mean     sd     p0     p25     p50     p75   p100  hist
calories                 0              1  106.88  19.48  50.00  100.00  110.00  110.00  160.0  ▁▂▇▂▁
protein                  0              1    2.55   1.09   1.00    2.00    3.00    3.00    6.0  ▇▆▂▁▁
fat                      0              1    1.01   1.01   0.00    0.00    1.00    2.00    5.0  ▇▂▁▁▁
sodium_mg                0              1  159.68  83.83   0.00  130.00  180.00  210.00  320.0  ▃▂▇▇▂
fiber_content_g          0              1    2.15   2.38   0.00    1.00    2.00    3.00   14.0  ▇▃▁▁▁
carbo_g                  0              1   14.60   4.28  -1.00   12.00   14.00   17.00   23.0  ▁▁▆▇▃
sugars_g                 0              1    6.92   4.44  -1.00    3.00    7.00   11.00   15.0  ▅▇▇▆▇
potass                   0              1   96.08  71.29  -1.00   40.00   90.00  120.00  330.0  ▇▇▂▁▁
vitamins                 0              1   28.25  22.34   0.00   25.00   25.00   25.00  100.0  ▁▇▁▁▁
weight                   0              1    1.03   0.15   0.50    1.00    1.00    1.00    1.5  ▁▁▇▁▁
cups                     0              1    0.82   0.23   0.25    0.67    0.75    1.00    1.5  ▂▇▇▁▁
rating                   0              1   42.67  14.05  18.04   33.17   40.40   50.83   93.7  ▅▇▅▁▁

This table contains most of the usual values from
summary() (quantiles and mean) and adds the standard deviation. But I love the
hist column because it shows a small histogram of the
distribution of numeric values, which is really helpful for seeing whether the
distributions are skewed.
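The little histograms can also be produced for a single vector with skimr::inline_hist(). The sketch below uses an invented, right-skewed vector called fiber_like (not the real fiber column) and compares the picture with a rough numeric check: a mean well above the median is one common hint of right skew.

```r
library(skimr)

# Hypothetical right-skewed values, loosely shaped like a fiber column
fiber_like <- c(0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 14)

# Five bins, one block character per bin
spark <- inline_hist(fiber_like, n_bins = 5)
spark

# Rough numeric counterpart: the long right tail pulls the mean above the median
skew_hint <- mean(fiber_like) > median(fiber_like)
skew_hint
```

This is only a heuristic. The histogram is still the quicker way to spot odd shapes such as a single spike or two separate peaks.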

{skimr} helps you get to know your data
skimr::skim() is usually one of the first things I do
when I load data into R. It helps me confirm facts from the dataset
description, such as uniqueness of columns, but also gives me a very
helpful overview.
{skimr} is maintained by
rOpenSci. They’re a really great group that develops very handy R
packages, and I suggest you check them out!

That’s it for today
That’s it for this week’s newsletter. I hope to talk more about handy
Exploratory Data Analysis tools and the mindset of Exploratory Data
Analysis. I believe that getting good at Exploratory Data Analysis helps
build your confidence as an analyst.
