UI Statistical Program Summer Newsletter
Hello Subscribers and happy 4th of July!
A few months ago, I came across OpenGeoHub on GitHub, an organization in the Netherlands that provides many programming resources for working with spatial data, particularly soil data. They have many data sets and tools useful to scientists working with climate and landscape-level data. A resource I found particularly useful is a tutorial on using ensemble learning for spatial interpolation. There are many different approaches to interpolating across a larger area from point data. Since, with real data, we do not know in advance which interpolation method is correct for a given variable and landmass, combining the results from several methods is one way to handle the uncertainty in prediction. This guide will show you how to do that.
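To make the idea of combining interpolation methods concrete, here is a minimal Python sketch using only NumPy. It is not the tutorial's code: it simply averages two basic interpolators (inverse distance weighting and nearest neighbor) rather than using the ensemble machine learning approach the tutorial covers. The function names and the toy coordinates and values are made up for illustration.

```python
import numpy as np


def _distances(xy_obs, xy_new):
    """Euclidean distance from every new location to every observed point."""
    return np.linalg.norm(xy_new[:, None, :] - xy_obs[None, :, :], axis=2)


def idw_predict(xy_obs, z_obs, xy_new, power=2.0):
    """Inverse distance weighting: closer observations get larger weights."""
    d = np.maximum(_distances(xy_obs, xy_new), 1e-12)  # avoid dividing by zero
    w = 1.0 / d**power
    return (w * z_obs).sum(axis=1) / w.sum(axis=1)


def nearest_predict(xy_obs, z_obs, xy_new):
    """Nearest neighbor: copy the value of the closest observation."""
    return z_obs[_distances(xy_obs, xy_new).argmin(axis=1)]


def ensemble_predict(xy_obs, z_obs, xy_new):
    """Average the two methods; the spread shows where they disagree."""
    preds = np.stack([
        idw_predict(xy_obs, z_obs, xy_new),
        nearest_predict(xy_obs, z_obs, xy_new),
    ])
    return preds.mean(axis=0), preds.std(axis=0)


# Hypothetical toy data: four sampled points on the corners of a unit square.
xy_obs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z_obs = np.array([10.0, 20.0, 30.0, 40.0])

# Predict at one sampled location and one unsampled location.
xy_new = np.array([[0.0, 0.0], [0.25, 0.25]])
mean, spread = ensemble_predict(xy_obs, z_obs, xy_new)
print(mean, spread)
```

The spread here is only a rough indicator of how much the two methods disagree at each location, not a formal prediction uncertainty. A fuller ensemble would typically weight the methods by their cross-validated performance instead of averaging them equally.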
New or Updated Resources
DuckDB just released a new package, duckplyr, that combines the full power of dplyr functionality and syntax with the efficiency of DuckDB databases. duckplyr works with standard R objects loaded into memory and with specialized file types like parquet that are not loaded into memory during an R session. You can use duckplyr functions just like any dplyr function (e.g. filter(), full_join()), but under the hood they use DuckDB processes that, simply put, are very computationally efficient. It’s a complicated process using SQL that people most certainly don’t need to understand in order to use it.
A new CRAN Task View was released for compositional data. These are data that must all add up to a given total, making the individual data points interdependent. In their words:
Compositional data are common in diverse scientific fields, including the chemical, biological, and environmental sciences; where they typically represent portions of a total sample weight or volume and are expressed in units such as percent, parts per million, mg/l, mmol/mol, or similar. Some examples include chemical compositions of soil, water, or air, food compositions, behavioral or time-use profiles, and relative abundances of species. They are also common in socio-economical sciences; for example when dealing with market shares, investment portfolios, or household budgets. —Task View Authors
Microbiome data are another example of compositional data that require specialized analytical approaches. This task view provides an organized list of R packages that can help in the management, visualization and analysis of compositional data sets.
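As a small illustration of why compositional data need special handling, the Python sketch below (NumPy only) rescales raw amounts to proportions and then applies the centered log-ratio (CLR) transform, one commonly used way to move compositions into a space where standard statistical methods behave better. This is an assumption-laden sketch, not code from the task view or any package it lists; the function names and the sample values are invented for the example.

```python
import numpy as np


def closure(x, total=1.0):
    """Rescale each row so its parts sum to `total` (e.g. 1 or 100)."""
    x = np.asarray(x, dtype=float)
    return total * x / x.sum(axis=-1, keepdims=True)


def clr(x):
    """Centered log-ratio: log of each part minus the row's mean log."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("CLR requires strictly positive parts")
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)


# Hypothetical samples: the second row is the first measured at 1/10 the scale.
counts = np.array([[120.0, 30.0, 50.0],
                   [12.0, 3.0, 5.0]])

props = closure(counts)      # each row now sums to 1
transformed = clr(counts)    # each row now sums to 0

print(props)
print(transformed)
```

Because only the relative sizes of the parts carry information, both rows produce the same proportions and the same CLR values even though their raw totals differ. Note that CLR cannot handle zeros directly, so zero counts (common in microbiome data) need a replacement strategy first.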
Here is a really nice extended tutorial (it accompanies a class taught at the University of Tennessee) on working with geospatial data in Python. It was created by Dr. Qiusheng Wu, who, as readers of this newsletter may recall, is a very prolific creator of online resources for working with geospatial data. I’m always impressed with the quality of his educational resources.
Upcoming Meetings
The 2025 Posit Conf is fast approaching. This 2-day conference, held September 16-18 in Atlanta, provides an enriching and informative mixture of talks on scientific applications and pure R development. Listening to these talks can help make you a more confident and competent R programmer. The conference offers free registration for academic attendees who attend virtually. The schedule includes talks on Positron (their new IDE), R/Shiny, integrating large language models (LLMs) into your R workflow, and several meta sessions on how to make R work reproducible and how to build collaborative R workflows across a large organization. I attend this each year virtually and always come away with new skills and insights. All conference videos are posted on YouTube 3 months after the conference ends, but who wants to miss out on the incomparable thrill of seeing it all live?
Thanks for reading!
Julia & Harpreet