Overall Summary
The first one is the overall data.frame summary:
skim_output <- skimr::skim(cereals)
summary(skim_output)
Data summary
| | |
|---|---|
| Name | cereals |
| Number of rows | 77 |
| Number of columns | 16 |
| Column type frequency: | |
| character | 1 |
| factor | 3 |
| numeric | 12 |
| Group variables | None |
You get a lot of important information here: the dimensions of the
data and the number of variables of each type. If the counts by type
aren’t what I expect, that’s usually a signal that I’ll need to
transform some variables.
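As a rough illustration of that type check, here is a minimal base R sketch. The `toy` data frame and the `col_type()` helper are made up for this example (they are not the real cereals data or part of {skimr}); the idea is just to count columns by type, convert the ones that arrived as the wrong type, and count again.

```r
# Toy stand-in for the cereals data (hypothetical values, not the real dataset)
toy <- data.frame(
  name = c("Bran Flakes", "Corn Pops", "Oat Rings"),
  manufacturer = c("K", "K", "G"),
  shelf = c(3, 2, 1),
  calories = c(90, 110, 110),
  stringsAsFactors = FALSE
)

# Hypothetical helper: report "factor" for any factor, else the first class
col_type <- function(col) {
  if (is.factor(col)) "factor" else class(col)[1]
}

# Count columns by type, similar in spirit to the "Column type frequency" block
type_counts <- table(vapply(toy, col_type, character(1)))
print(type_counts)

# manufacturer and shelf are really categories, so convert them explicitly
toy$manufacturer <- factor(toy$manufacturer)
toy$shelf <- factor(toy$shelf, levels = c(1, 2, 3), ordered = TRUE)

type_counts_after <- table(vapply(toy, col_type, character(1)))
print(type_counts_after)
```

After the conversion, the counts by type should line up with what you expect from the dataset description.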
Character Summary
Let’s take a look at the character variable summary.
skimr::yank(skim_output, "character")
Variable type: character
There’s only one character variable, name. We expect
it to be unique, since the data is defined as one cereal per row.
The n_unique column confirms this (77 rows, 77 unique values).
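If you want to double-check uniqueness outside of the skim output, a small base R sketch like the following works. The `cereal_names` vector is a hypothetical stand-in for the name column.

```r
# Hypothetical stand-in for the name column of the cereals data
cereal_names <- c("Bran Flakes", "Corn Pops", "Oat Rings")

# Number of distinct values, comparable to skimr's n_unique
n_unique <- length(unique(cereal_names))

# TRUE when every row has its own name
is_unique <- n_unique == length(cereal_names)

# anyDuplicated() returns 0 when nothing repeats,
# otherwise the index of the first duplicated element
first_dup <- anyDuplicated(cereal_names)

print(is_unique)
print(first_dup)
```

A non-zero result from `anyDuplicated()` points you straight at the first offending row.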
Factor Summary
When we take a look at the factor level summary, things start to get
interesting:
skimr::yank(skim_output, "factor")
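Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| manufacturer | 0 | 1 | FALSE | 7 | K: 23, G: 22, P: 9, Q: 8 |
| type | 0 | 1 | FALSE | 2 | C: 74, H: 3 |
| shelf | 0 | 1 | TRUE | 3 | 3: 36, 2: 21, 1: 20 |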
You can see that the categories for each variable, such as
manufacturer, are sorted by frequency. There’s also
information about whether each variable is missing any values, which can
be helpful.
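To make those pieces concrete, here is a minimal base R sketch that computes the same kind of quantities by hand. The `manufacturer` vector is invented for illustration (including the deliberate `NA`), so the numbers will not match the real table.

```r
# Hypothetical factor with one missing value (not the real cereals column)
manufacturer <- factor(c("K", "G", "K", NA, "P", "K", "G"))

# Level counts sorted from most to least frequent; table() drops NA by default
level_counts <- sort(table(manufacturer), decreasing = TRUE)

# Missingness, in the spirit of the n_missing and complete_rate columns
n_missing <- sum(is.na(manufacturer))
complete_rate <- mean(!is.na(manufacturer))

# Number of levels actually present in the data
n_unique <- nlevels(droplevels(manufacturer))

print(level_counts)
print(c(n_missing = n_missing, complete_rate = complete_rate))
```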
Numeric Summary
I get really excited about the numeric summary.
skimr::yank(skim_output, "numeric")
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| calories | 0 | 1 | 106.88 | 19.48 | 50.00 | 100.00 | 110.00 | 110.00 | 160.0 | ▁▂▇▂▁ |
| protein | 0 | 1 | 2.55 | 1.09 | 1.00 | 2.00 | 3.00 | 3.00 | 6.0 | ▇▆▂▁▁ |
| fat | 0 | 1 | 1.01 | 1.01 | 0.00 | 0.00 | 1.00 | 2.00 | 5.0 | ▇▂▁▁▁ |
| sodium_mg | 0 | 1 | 159.68 | 83.83 | 0.00 | 130.00 | 180.00 | 210.00 | 320.0 | ▃▂▇▇▂ |
| fiber_content_g | 0 | 1 | 2.15 | 2.38 | 0.00 | 1.00 | 2.00 | 3.00 | 14.0 | ▇▃▁▁▁ |
| carbo_g | 0 | 1 | 14.60 | 4.28 | -1.00 | 12.00 | 14.00 | 17.00 | 23.0 | ▁▁▆▇▃ |
| sugars_g | 0 | 1 | 6.92 | 4.44 | -1.00 | 3.00 | 7.00 | 11.00 | 15.0 | ▅▇▇▆▇ |
| potass | 0 | 1 | 96.08 | 71.29 | -1.00 | 40.00 | 90.00 | 120.00 | 330.0 | ▇▇▂▁▁ |
| vitamins | 0 | 1 | 28.25 | 22.34 | 0.00 | 25.00 | 25.00 | 25.00 | 100.0 | ▁▇▁▁▁ |
| weight | 0 | 1 | 1.03 | 0.15 | 0.50 | 1.00 | 1.00 | 1.00 | 1.5 | ▁▁▇▁▁ |
| cups | 0 | 1 | 0.82 | 0.23 | 0.25 | 0.67 | 0.75 | 1.00 | 1.5 | ▂▇▇▁▁ |
| rating | 0 | 1 | 42.67 | 14.05 | 18.04 | 33.17 | 40.40 | 50.83 | 93.7 | ▅▇▅▁▁ |
This table contains the usual values from
summary() (quantiles and mean), plus the standard deviation. But I love the
hist column because it shows a small histogram of the
distribution of each numeric variable, which makes it easy to see whether a
distribution is skewed.
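To get a feel for what such an inline histogram encodes, here is a rough base R sketch. It is not how {skimr} builds its hist column; the vector `x`, the choice of five equal-width bins, and the scaling rule are all assumptions made for illustration.

```r
# Hypothetical right-skewed values (not taken from the cereals data)
x <- c(0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 9, 14)

# Split the range into 5 equal-width bins and count the values in each
breaks <- seq(min(x), max(x), length.out = 6)
bin_counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))

# Map each count to one of seven block characters, tallest bin = tallest block;
# empty bins get the lowest block so the line keeps its width
bars <- c("\u2581", "\u2582", "\u2583", "\u2584", "\u2585", "\u2586", "\u2587")
level <- pmax(1, ceiling(as.integer(bin_counts) / max(bin_counts) * length(bars)))
spark <- paste(bars[level], collapse = "")

print(bin_counts)
cat(spark, "\n")

# A mean well above the median is another hint of right skew
print(c(mean = mean(x), median = median(x)))
```

Most of the mass sits in the first bins with a long tail to the right, which is the same shape you would read off a hist cell that starts tall and trails off.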
{skimr} helps you get to know your data
skimr::skim() is usually one of the first things I do
when I load data into R. It helps me confirm facts from the dataset
description, such as uniqueness of columns, but also gives me a very
helpful overview.
{skimr} is maintained by
rOpenSci. They’re a really great group that develops very handy R
packages, and I suggest you check them out!
That’s it for today
That’s it for this week’s newsletter. I hope to talk more about handy
Exploratory Data Analysis tools and the mindset of Exploratory Data
Analysis. I believe that getting good at Exploratory Data Analysis helps
build your confidence as an analyst.