Ready for R Mailing List

February 5, 2024

Ready for R (2024-02-05): Skim Your Data

Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.

One of my favorite tools: {skimr}

Please note that I am working with an experimental API package for my mailing list, and I am debugging visibility issues. Please let me know if you can’t read any commands or output. You can see the web version of this by clicking on the Ready4R Banner above.

I’m going to highlight the {skimr} package, which I try to always use when I encounter new data. It’s invaluable to see the big picture of the data. You can install it using:

install.packages("skimr")

Let’s start out with the 80 cereals dataset (available here). Looking at the description, we know there should be at least 3 categorical variables (manufacturer, type, and shelf). Let’s keep that in mind when we start looking at the data. I’m going to change the data types of all of these to factors. In the case of shelf, we’ll cast the variable as an ordered factor.

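library(tidyverse)

# Read the data, standardize the column names, and fix the variable types
cereals <- readr::read_csv("data/cereal.csv") |>
  janitor::clean_names() |>
  # shelf has a natural order, so make it an ordered factor
  mutate(shelf = factor(shelf, ordered = TRUE)) |>
  # manufacturer and type are unordered categories
  mutate(across(c("manufacturer", "type"), as.factor))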
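
If you haven't worked with ordered factors before, here is a minimal sketch of what ordered = TRUE changes. The shelf values below are made up for illustration and are not read from the cereal data.

```r
# Hypothetical shelf values, for illustration only
shelf_raw <- c(1, 3, 2, 3, 1)

shelf_plain   <- factor(shelf_raw)
shelf_ordered <- factor(shelf_raw, ordered = TRUE)

is.ordered(shelf_plain)    # FALSE
is.ordered(shelf_ordered)  # TRUE

# The levels are the same, but only the ordered factor supports comparisons
levels(shelf_ordered)      # "1" "2" "3"
shelf_ordered < "3"        # TRUE FALSE TRUE FALSE TRUE
```

Trying the same comparison on shelf_plain returns NA with a warning, which is why the ordered version is the better fit for a variable like shelf.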

Now that we have the data loaded correctly, we can take a look at it. You may be used to using summary() on your data. skimr::skim() is like that, but it gives you variable summaries based on variable type (character, factor, and numeric in this dataset).

When you use it, you’ll usually call it like the following: skimr::skim(cereals). But to make it easier to explain, I’ll separate it out into the different summary types.
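
As a rough sketch of both calling styles, here they are on the built-in iris data. I'm using iris only as a stand-in because it ships with R; swap in cereals once you have it loaded. skimr::partition() splits the skim output into one table per variable type, and skimr::yank() pulls out a single one.

```r
# Assumes skimr is installed; iris is a stand-in for cereals
skim_iris <- skimr::skim(iris)

# The full report, all variable types at once
skim_iris

# One summary table per variable type, as a named list
by_type <- skimr::partition(skim_iris)
names(by_type)

# A single variable type
skimr::yank(skim_iris, "numeric")
```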

Overall Summary

The first one is the overall data.frame summary:

skim_output <- skimr::skim(cereals)
summary(skim_output)
Data summary
Name cereals
Number of rows 77
Number of columns 16
_______________________
Column type frequency:
character 1
factor 3
numeric 12
________________________
Group variables None

There’s a lot of useful information here: the dimensions of the data and the number of variables of each type. If the counts by type aren’t what I expect, that’s usually a signal that I’ll need to do some variable transformation.
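
If you want to run the same sanity check without skimr, a minimal base R sketch looks like this. The tiny data frame is hypothetical and only stands in for a freshly loaded dataset.

```r
# Hypothetical stand-in for a freshly loaded data frame
df <- data.frame(
  name  = c("a", "b", "c"),
  type  = factor(c("C", "H", "C")),
  sugar = c(6, 8, 5),
  stringsAsFactors = FALSE
)

dim(df)  # number of rows, then number of columns

# Count columns by their first class, similar to "Column type frequency"
col_types <- vapply(df, function(x) class(x)[1], character(1))
table(col_types)
```

If a column you expected to be a factor shows up under character, that is the cue to go back and convert it.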

Character Summary

Let’s take a look at the character variable summary.

skimr::yank(skim_output, "character")

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name                   0              1    3   38      0        77           0

There’s only one character variable, name. We expect this to be unique, since the dataset has one cereal per row. The n_unique column confirms this (77 rows, 77 unique values).
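
The same uniqueness check can be sketched in base R. The names below are made up, and the last one deliberately repeats the first so the check fails.

```r
# Hypothetical names; the last one repeats the first
name <- c("Cereal A", "Cereal B", "Cereal A")

# TRUE only when every row has its own value
length(unique(name)) == length(name)  # FALSE

# Which values appear more than once
name[duplicated(name)]                # "Cereal A"
```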

Factor Summary

When we take a look at the factor level summary, things start to get interesting:

skimr::yank(skim_output, "factor")

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
manufacturer           0              1  FALSE           7  K: 23, G: 22, P: 9, Q: 8
type                   0              1  FALSE           2  C: 74, H: 3
shelf                  0              1  TRUE            3  3: 36, 2: 21, 1: 20

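You can see that the categories for each variable, such as manufacturer, are sorted by frequency. Note that top_counts lists only the most frequent levels: manufacturer has 7 unique values, but only 4 appear. The n_missing and complete_rate columns also tell you whether a variable is missing any values, which can be helpful.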
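
To make those columns less mysterious, here is a base R sketch that computes roughly the same quantities for a single factor. The manufacturer codes are hypothetical, and this is my approximation of what each column means, not skimr's internal code.

```r
# Hypothetical manufacturer codes
mfr <- factor(c("K", "G", "K", "P", "G", "K"))

# Level counts, most frequent first (like top_counts)
counts <- sort(table(mfr), decreasing = TRUE)
counts

sum(is.na(mfr))                  # like n_missing
mean(!is.na(mfr))                # like complete_rate
length(unique(mfr[!is.na(mfr)])) # like n_unique
```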

Numeric Summary

I get really excited about the numeric summary.

skimr::yank(skim_output, "numeric")

Variable type: numeric

skim_variable    n_missing  complete_rate    mean     sd     p0     p25     p50     p75   p100  hist
calories                 0              1  106.88  19.48  50.00  100.00  110.00  110.00  160.0  ▁▂▇▂▁
protein                  0              1    2.55   1.09   1.00    2.00    3.00    3.00    6.0  ▇▆▂▁▁
fat                      0              1    1.01   1.01   0.00    0.00    1.00    2.00    5.0  ▇▂▁▁▁
sodium_mg                0              1  159.68  83.83   0.00  130.00  180.00  210.00  320.0  ▃▂▇▇▂
fiber_content_g          0              1    2.15   2.38   0.00    1.00    2.00    3.00   14.0  ▇▃▁▁▁
carbo_g                  0              1   14.60   4.28  -1.00   12.00   14.00   17.00   23.0  ▁▁▆▇▃
sugars_g                 0              1    6.92   4.44  -1.00    3.00    7.00   11.00   15.0  ▅▇▇▆▇
potass                   0              1   96.08  71.29  -1.00   40.00   90.00  120.00  330.0  ▇▇▂▁▁
vitamins                 0              1   28.25  22.34   0.00   25.00   25.00   25.00  100.0  ▁▇▁▁▁
weight                   0              1    1.03   0.15   0.50    1.00    1.00    1.00    1.5  ▁▁▇▁▁
cups                     0              1    0.82   0.23   0.25    0.67    0.75    1.00    1.5  ▂▇▇▁▁
rating                   0              1   42.67  14.05  18.04   33.17   40.40   50.83   93.7  ▅▇▅▁▁

This table contains most of the usual values from summary() (the quantiles and the mean) and adds the standard deviation. But I love the hist column because it shows a small histogram of each variable’s distribution, which makes skewed distributions easy to spot.
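
For reference, the p0 through p100 columns line up with what base R's quantile() gives you. Here is a small sketch on made-up, right-skewed values (not taken from the cereal data).

```r
# Hypothetical right-skewed values
x <- c(0, 1, 1, 2, 2, 3, 14)

mean(x)
sd(x)

# p0, p25, p50, p75, p100
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# A mean well above the median hints at a right skew
mean(x) > median(x)  # TRUE
```

The histogram in the hist column tells you the same story at a glance, which is why I reach for it first.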

{skimr} helps you get to know your data

skimr::skim() is usually one of the first things I do when I load data into R. It helps me confirm facts from the dataset description, such as uniqueness of columns, but also gives me a very helpful overview.

{skimr} is maintained by rOpenSci. They’re a really great group that develops very handy R packages, and I suggest you check them out!

That’s it for today

That’s it for this week’s newsletter. I hope to talk more about handy Exploratory Data Analysis tools and the mindset of Exploratory Data Analysis. I believe that getting good at Exploratory Data Analysis helps build your confidence as an analyst.
