2024 Summer Newsletter
If we are very careful and try very hard, we might not completely mislead ourselves.
–Richard McElreath
Hello Subscribers!
Lately, I have been reading works by Richard McElreath, quantitative anthropologist and author of Statistical Rethinking. His position is that we as scientists frequently misuse statistics, deploying statistical procedures like powerful creatures that gobble up data and spit out p-values regardless of model suitability. He’s not entirely wrong. While this view may sound bleak, he does have a vision for improving the state of statistical analysis, starting with approaching research questions and downstream analysis more thoughtfully and with clear goals and ending on the Bayesian alphabet. His video series and online book (all free) are engaging and informative. Even if you never become a full-fledged Bayesian, his materials cover topics like how to construct a directed acyclic graph to clearly outline your hypotheses, how to properly identify confounds, and how you might use these tools for causal inference. But these materials are long and take time to digest. If you’re not quite ready for that, he has also summarized his thoughts on his blog. In particular, the series “Regression, Fire, and Dangerous Things” (parts I, II and III) is worth a read.
Large Language Models (LLMs)
Large language models (e.g. ChatGPT, Gemini, Llama) continue to be very energy-intensive, and they confidently provide answers regardless of whether those answers are correct. I’ve seen them provide comically incorrect answers to (admittedly) complex R coding and modeling questions I have asked. I hope this is obvious, but no scientist should ever rely on an LLM for statistical analysis.
However, these tools are not without utility in our work lives. They can sometimes solve common coding errors and help with hard-to-Google topics, such as anything related to LaTeX typesetting. And like Wikipedia, they can be a starting point for figuring out a complex topic. Forget how to code “~”? ChatGPT can easily provide the right answer: \sim. Not sure how to fill in the area of a density plot? ChatGPT is likely to provide the correct answer. If you are new to a programming language, LLMs are a great place to get your questions answered. Just remember to be cautious with the results. They may or may not be correct, and sometimes a wrong answer can be catastrophic. If the answer to your plotting question is wrong and you fail to make the exact plot you want, that is a low-stakes error we can all live with. But if an LLM misspecifies a statistical model and produces an incorrect p-value that is never identified as erroneous, that is an unacceptable error.
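For the tilde example above, here is a minimal sketch of \sim in use. The normal-distribution statement is just our own illustration (the symbols y, \mu and \sigma are placeholders), and it assumes you are working in LaTeX math mode:

```latex
% \sim typesets the tilde relation, commonly read as "is distributed as".
% It only works inside math mode, hence the surrounding dollar signs.
$y \sim \mathcal{N}(\mu, \sigma^2)$
```

Outside of math mode, \sim will throw an error, which is exactly the kind of detail worth double-checking in any answer an LLM gives you.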
If you want to know more about how to get the most out of daily LLM usage, check out this presentation by Jeremy Howard. It’s one year old, but still relevant.
New or Updated Resources
We updated our website (agstats.io)! It is largely the same content as before, but it is all a bit easier to find (we hope). We did remove the ‘ANOVA in R’ post while we build a mixed model resource.
We updated our introductory R course. All content can be accessed here.
We recently created a set of data standards and an example file (including an example codebook). This is intended for our University of Idaho colleagues, but everyone can benefit from well organized data.
The AI Institute for Food Systems (a UC Davis project) has a series of Python tutorials for agriculturally relevant computer vision tasks (e.g. classification, segmentation).
NASS (the National Agricultural Statistics Service) has recently released a series of data sets “Crop Sequence Boundaries”. These are “estimates of field boundaries, crop acreage, and crop rotations across the contiguous United States. It uses satellite imagery with other public data and is open source allowing users to conduct area and statistical analysis of planted U.S. commodities and provides insight on farmer cropping decisions.” (quote from their website)
We recently read a useful blog post on RStudio tips and tricks. Regular users of this GUI know how complex and powerful it is, but have you ever spent time learning any of those handy shortcuts? They can help with your workflow.
Useful resources that are under ongoing development: the Agriculture CRAN Task View and the Mixed Model CRAN Task View, which provide curated lists of R packages relevant to their respective topics.
Upcoming Meetings
It’s not too late to register for Posit Conf! This is the annual Posit Conference, which is held August 12-14 in Seattle. It is an R-focused conference put on by the makers of RStudio that attendees rave about for weeks afterwards. The conference usually has an option for virtual registration/attendance with reduced costs for academic attendees.
Thanks for reading!
Julia & Harpreet