Ready4R (2025-03-03): SQL, Unit Tests, and Learning APIs
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
Intro to SQL course
I have been teaching an Intro to SQL class for Fred Hutch learners. One of the things we've been working on as a training group is offering more courses that appeal beyond just research and bioinformatics. This Intro to SQL course is meant to do that, using synthetic data stored in an important format for clinical data: the OMOP common data model (OMOP CDM).
SQL (Structured Query Language) has a long history as the language used to interface with database systems. Its paradigms are so useful that it is still used to query database systems built on cutting-edge technology, such as Snowflake and DuckDB. So honestly, writing SQL queries is still an extremely valuable skill to have, and it will remain one in the future.
Unlike most database courses, which spend a lot of time on database design, this course focuses on writing useful SQL queries from the start. There are a lot of subtleties to SQL, and I intersperse information about design throughout the notes rather than forcing it up front. There is also a nice "Database Concepts" section that establishes the necessary concepts before you get started.
Course notes are here: https://hutchdatascience.org/intro_to_sql. If there is interest, I will make the Posit Cloud workspace available for others. Please let me know in the comments section.
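To give a taste of what running SQL from R looks like, here's a minimal sketch using DuckDB (mentioned above). It assumes the DBI and duckdb packages are installed, and the person table here is a toy stand-in — the columns are illustrative, not the official OMOP CDM schema:

```r
# A minimal sketch: running a SQL query from R via DuckDB.
# Assumes the DBI and duckdb packages are installed; the `person`
# table is a toy stand-in, not the real OMOP CDM schema.
library(DBI)

con <- dbConnect(duckdb::duckdb())

# toy data standing in for a clinical `person` table
dbWriteTable(con, "person", data.frame(
  person_id     = 1:4,
  gender        = c("F", "M", "F", "F"),
  year_of_birth = c(1970, 1985, 1990, 1970)
))

# a basic SQL query: count people by birth year
result <- dbGetQuery(con, "
  SELECT year_of_birth, COUNT(*) AS n
  FROM person
  GROUP BY year_of_birth
  ORDER BY year_of_birth
")
print(result)

dbDisconnect(con, shutdown = TRUE)
```

The nice part of this setup is that the same query text works unchanged against a much larger database, which is what makes SQL worth learning on small examples.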
Unit Testing Part 3
Last time we talked about setting up a package for testing. This week, I'm going to show you some specific examples of tests I've written.
A Testing Example
I mentioned I would show some real-world examples of tests. I wrote a package to transform data values from a DNAnexus system that used Apache Spark and returned cohort data in data.frame format.
The issue was that the categorical data was encoded as integers, which needed to be looked up in a separate table to recover the actual values. The retrieved data was very large (hundreds of thousands of rows, with over 7,000 covariates), and decoding it into the actual values was a long and difficult process. That's the main use case for xvhelper.
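The decoding step itself boils down to a lookup-table join. Here's a minimal, self-contained sketch of the idea in base R — the column and object names are made up for illustration and are not the real xvhelper internals:

```r
# Sketch of decoding integer-coded categorical data via a lookup table.
# All names here are illustrative, not the actual xvhelper API.
coding <- data.frame(
  code    = c(1L, 2L, 3L),
  meaning = c("Never", "Sometimes", "Often")
)

cohort <- data.frame(
  participant_id = c(101L, 102L, 103L),
  smoking_status = c(1L, 3L, 2L)  # integer codes, not values
)

# replace each integer code with its decoded meaning
decoded <- cohort
decoded$smoking_status <- coding$meaning[
  match(cohort$smoking_status, coding$code)
]
print(decoded)
```

At xvhelper's scale this has to happen across thousands of coded columns, each with its own dictionary, which is why it's worth wrapping in tested functions rather than doing by hand.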
Here's one of the tests I wrote to make sure that the function decode_single() works:
test_that("decode single", {
  load("data_dict.rda")   #1
  load("coding_dict.rda") #2
  load("cohort.rda")      #3

  coded_col_df <- merge_coding_data_dict(coding_dict, data_dict) #4

  cohort2 <- cohort |>
    dplyr::select(participant.p100240_i0, participant.p1508_i0)  #5

  out_frame <- decode_single(cohort2, coded_col_df)
  expect_equal(colnames(cohort2), colnames(out_frame))
})
The first few lines of this test load sample data using the load() function (#1, #2, #3). Of particular importance is cohort.rda, which is what we test against. These files contain the data.frames with the dictionaries for decoding (coding_dict, data_dict) and the sample data to decode (cohort). We first build our decoding table by merging the two dictionaries (#4). Then we select() two columns from the data (#5), and check equivalence between the output, out_frame, and the input, cohort2.
When I ran these tests a year ago, the output from Spark changed, and my tests broke. That was enough for me to dig in and fix the functions.
Tests in ggplot2
Let's take a look at some of the tests in ggplot2. Here is a set of tests for the grouping aesthetic:
# Test data
df <- data_frame(
  x = 1:4,
  a = c("a", "a", "b", "b"),
  b = c("a", "b", "a", "b")
)

# Functions to extract groups from ggplot objects
group <- function(x) as.vector(get_layer_data(x, 1)$group)
groups <- function(x) vec_unique_count(group(x))

test_that("one group per combination of discrete vars", {
  plot <- ggplot(df, aes(x, x)) + geom_point()
  expect_equal(group(plot), rep(NO_GROUP, 4)) #1

  plot <- ggplot(df, aes(x, a)) + geom_point()
  expect_equal(group(plot), c(1, 1, 2, 2)) #2

  ...
})
Here we see two expectations within a single unit test, but notice that they use the same test data, df, and the same functions to extract the groupings, group() and groups(). These functions and data are defined outside of the test_that() call, so they can be shared across tests.

Notice the expectations in the test: they use the group() function to check equivalence with expect_equal(). The groups are encoded as character variables, but the grouping returns them as factors.
Package dependencies - what happens when they change?
One of the big headaches developers have is keeping up with updates in dependencies. For example, if you write a wrapper function for a platform API, such as Buttondown's, there may be code-breaking changes to the API that cause bugs in your downstream code.

You will probably want unit tests that call the API or package directly to keep you aware of any breaking changes, much like my example above.
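As a sketch of what such a canary test could look like with {testthat} and {httr2}: the endpoint URL and the expected field name below are placeholders standing in for whatever API you wrap, so treat them as assumptions, not a recipe:

```r
# Sketch: a testthat test that calls an API directly, so breaking
# changes in the endpoint surface as test failures. The URL and the
# expected field name are placeholders for the API you actually wrap.
library(testthat)
library(httr2)

test_that("API still returns the expected shape", {
  skip_if_offline()  # don't fail the suite on machines with no network

  resp <- request("https://catfact.ninja/fact") |>
    req_perform()

  expect_equal(resp_status(resp), 200)

  body <- resp_body_json(resp)
  # if the API renames or drops this field, the test breaks loudly
  expect_true("fact" %in% names(body))
})
```

The skip_if_offline() guard keeps the test from failing spuriously, while still alerting you the next time you run the suite with a connection.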
Reminder: API workshop for Ukraine on 3/13
On March 13, I will be giving a short workshop on using the {httr2} package for requesting data from APIs. This is a benefit to support Ukrainians. You can participate by donating 20 euros or more at the link below.
Do the words “Web API” sound intimidating to you? This talk is a gentle introduction to what Web APIs are and how to get data out of them using the {httr2}, {jsonlite}, and {tidyjson} packages. You'll learn how to request data from an endpoint and get the data out. We'll do this using an API that gives us facts about cats. By the end of this talk, web APIs will seem much less intimidating, and you will be empowered to access data from them.
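For a small preview of the workshop's flavor, a minimal {httr2} request can look like this. I'm assuming the catfact.ninja endpoint here (a public cat-facts API) stays live and returns JSON with a fact field — both are assumptions about a third-party service, not guarantees:

```r
# Sketch: a basic GET request with {httr2}, parsed as JSON.
# Assumes the public catfact.ninja API is reachable and returns
# a JSON object with a `fact` field.
library(httr2)

# build and perform the request
resp <- request("https://catfact.ninja/fact") |>
  req_perform()

# parse the JSON body into an R list and pull out the fact
body <- resp_body_json(resp)
cat(body$fact, "\n")
```

That's the whole loop the workshop walks through: build a request, perform it, and extract the data from the response.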
More information: https://sites.google.com/view/dariia-mykhailyshyna/main/r-workshops-for-ukraine?authuser=0#h.hngu50v1j9mb
Thanks for Reading!
Stay strong, everyone. We need to make our voices heard and get pissed off at our politicians if we want to effect change. The higher-ups would rather we forget that we are the ones who truly have power.
To that end, please consider joining a Stand Up for Science event near you: https://standupforscience2025.org/.
Best, Ted