Ready4R (2024-05-16): The Friction of EDA
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
I’ve avoided talking about Large Language Models (LLMs) because I thought I didn’t have much new to offer about them. If you’re wondering, my opinion of LLMs is this: they can be an incredibly useful tool, but they aren’t a replacement for humans. Companies that let people go because of LLMs are going to regret it. They will lose a huge amount of institutional memory and expertise with the data, and it will be more costly for them in the long run.
I’m spurred to write about this (and sorry, {marginaleffects} will have to wait another week) because Google unveiled their LLM-based Exploratory Data Analysis and data cleaning system called Data Science Agent. It purports to replace data scientists and automate the data cleaning process. I think this is pretty misguided.
Making EDA “easier” misses the point. The friction of the EDA process forces you to confront issues in the data. EDA is fundamentally a creative process.
That said, I want to avoid gatekeeping. You shouldn’t have to know all of the code to generate the necessary tables and EDA plots (and this is where Data Science Agent can be super helpful), but you need to engage with the process to be successful with EDA. This is where I think Data Science Agent is failing.
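To make that concrete, here is the kind of boilerplate a tool like Data Science Agent could happily generate for you, and that you shouldn’t need to have memorized. This is just a minimal sketch using {dplyr} and {ggplot2}, with a made-up data frame `my_data` and hypothetical columns `group` and `value`:

```r
# A quick first pass at EDA: a summary table and a distribution plot.
# `my_data`, `group`, and `value` are hypothetical stand-ins for your own data.
library(dplyr)
library(ggplot2)

my_data <- tibble::tibble(
  group = sample(c("a", "b", "c"), 200, replace = TRUE),
  value = rnorm(200)
)

# Summary table: counts, means, and missingness per group
my_data |>
  group_by(group) |>
  summarize(
    n = n(),
    mean_value = mean(value, na.rm = TRUE),
    n_missing = sum(is.na(value))
  )

# Distribution plot: histogram of `value`, faceted by group
ggplot(my_data, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~group)
```

Generating this is the easy part; the value comes from looking at the output and asking why the counts, means, and missing values look the way they do.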
You hire data scientists because: 1) they are not scared of engaging with the data, 2) they seek to understand how the data is collected, and 3) given any issues, they will propose fixes to the pipeline. They have a big-picture, systems view of how data is generated and used among the users of your system.
Buyer Beware
So caveat emptor (buyer beware): if you are seeking to use data in novel and creative ways, LLMs will only propose EDA based on what’s been done before. Every dataset used for analysis requires someone to engage with it and absorb its uncertainty. Without that engagement, you are up a creek without a paddle.
The Google team seems to be aware of multiple issues, which they address in their known limitations section:
> The agent can be verbose in its output which can be overwhelming for a human to review. We are working on finding the right balance between showing the inner workings of the agent and getting the user to the output they desire.
To this I say, most people want to have a meaningful summary that is relevant to their work, not just verbose output.
> The generated notebooks may require further fine-tuning and adjustments depending on the complexity of your data and analysis goals.
I’m afraid the demo impresses the wrong audience: managers who want to save money. These are the ones who get cartoon-like dollar signs in their eyes when you show them something like this. “Just think of the money we’ll save!” By letting experienced people go, you lose institutional memory: the people who have been around your organization. These are the ones who remember that you tried to build an ML model with limited data, and can tell you why it didn’t work.
I do think asking Data Science Agent focused questions such as “find me the 5 highest correlations between my predictive variables” can be super helpful, as can using it to visualize distributions and calculate summaries. You still need to put some thought behind how you use it.
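For that kind of focused question, the code itself is short. Here’s a minimal sketch, done by hand, of the “5 highest correlations” question using {dplyr} and {tidyr}; `mtcars` stands in for your own table of numeric predictors:

```r
# Find the 5 strongest pairwise correlations among numeric predictors.
# `mtcars` is a stand-in; swap in your own data frame of predictors.
library(dplyr)
library(tidyr)
library(tibble)

predictors <- mtcars |> select(where(is.numeric))

cor(predictors, use = "pairwise.complete.obs") |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") |>
  filter(var1 < var2) |>                      # drop self-pairs and duplicates
  slice_max(abs(correlation), n = 5)
```

The point isn’t that this is hard to write; it’s that deciding which correlations matter, and what to do about them, still requires you to engage with the data.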
The reason we do EDA is not to just check things off a list; it’s to truly engage with the data. Let’s hope they develop Data Science Agent in that direction.
WebR Talk is Live
I recently gave a talk on WebR for the Portland R User Group. WebR is a technology that lets you run R in your browser. I’m extremely excited about how this will democratize learning data science and make it more accessible to everyone. Thanks to everyone who came and learned something. Here’s the link to the video: https://www.youtube.com/watch?v=r9LW03H6Ev8&t=8s and here are the slides: https://laderast.github.io/webr-demo/#/title-slide.
Vicki Boykis: Don’t Worry about LLMs
Speaking of LLMs, Vicki Boykis gave an [excellent talk about thoughtfully using LLMs](https://vickiboykis.com/2024/05/20/dont-worry-about-llms/) with Machine Learning (ML) approaches. To have success at something you know very little about, choose a smaller problem that is still attractive to your stakeholders. It turns out that ML approaches such as gradient descent, where you iteratively explore a problem space, are still very useful in the age of LLMs. She frames this with some historical background on how cathedrals are built (I won’t give away the metaphor; it’s too good to spoil). Thanks to Bob Rudis for sharing.
Thanks for reading!
Thanks again for reading this far. I’ve had a very busy month working two jobs, which is why newsletters have been less frequent as of late.