Ready4R (2024-05-16): The Friction of EDA
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
I’ve avoided talking about Large Language Models (LLMs) because I thought I didn’t have much new to offer about them. If you’re wondering, my opinion of LLMs is this: they can be an incredibly useful tool, but they aren’t a replacement for humans. Companies that let people go because of LLMs are going to regret it. They will lose a huge amount of institutional memory and expertise with the data, and it will be more costly for them in the long run.
I’m spurred to write about this (and sorry, {marginaleffects} will have to wait another week) because Google unveiled their LLM-based Exploratory Data Analysis and data cleaning system called Data Science Agent. It purports to replace data scientists and automate the data cleaning process. I think this is pretty misguided.
Making EDA “easier” misses the point. The friction of the EDA process forces you to confront issues in the data. EDA is fundamentally a creative process.
That said, I want to avoid gatekeeping. You shouldn’t have to know all of the code to generate the necessary tables and EDA plots (and this is where Data Science Agent can be super helpful), but you need to engage with the process to be successful with EDA. This is where I think Data Science Agent is failing.
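To make that concrete, here is the kind of boilerplate a tool like Data Science Agent could happily generate for you, and that you shouldn’t need to have memorized. This is just a minimal sketch using {dplyr} and {ggplot2}, with a made-up data frame `my_data` and hypothetical columns `group` and `value`:

```r
# A quick first pass at EDA: a summary table and a distribution plot.
# `my_data`, `group`, and `value` are hypothetical stand-ins for your own data.
library(dplyr)
library(ggplot2)

my_data <- tibble::tibble(
  group = sample(c("a", "b", "c"), 200, replace = TRUE),
  value = rnorm(200)
)

# Summary table: counts, means, and missingness per group
my_data |>
  group_by(group) |>
  summarize(
    n = n(),
    mean_value = mean(value, na.rm = TRUE),
    n_missing = sum(is.na(value))
  )

# Distribution plot: histogram of `value`, faceted by group
ggplot(my_data, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~group)
```

Generating this is the easy part; the value comes from looking at the output and asking why the counts, means, and missing values look the way they do.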
You hire data scientists because: 1) they are not scared of engaging with the data, 2) they seek to understand how the data is collected, and 3) given any issues, they will propose fixes to the pipeline. They have a big-picture, systems view of how data is generated and used among the users of your system.
Buyer Beware
So caveat emptor (buyer beware): if you are seeking to use data in novel and creative ways, LLMs will only propose EDA based on what’s been done before. Every dataset used for analysis requires someone to engage with it and absorb its uncertainty. Without that engagement, you are up a creek without a paddle.
The Google team seems to be aware of multiple issues, which they address in their known limitations section:
> The agent can be verbose in its output which can be overwhelming for a human to review. We are working on finding the right balance between showing the inner workings of the agent and getting the user to the output they desire.
To this I say, most people want to have a meaningful summary that is relevant to their work, not just verbose output.
> The generated notebooks may require further fine-tuning and adjustments depending on the complexity of your data and analysis goals.
I’m afraid the demo impresses the wrong audience: managers who want to save money. These are the ones who get cartoon-like dollar signs in their eyes when you show them something like this. “Just think of the money we’ll save!” By letting experienced people go, you lose institutional memory: the people who have been around your organization. These are the ones who remember that you tried to build an ML model with limited data, and can tell you why it didn’t work.
I do think asking Data Science Agent focused questions such as “find me the 5 highest correlations between my predictive variables” can be super helpful, as can using it to visualize distributions and calculate summaries. You still need to put some thought behind how you use it.
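For that kind of focused question, the code itself is short. Here’s a minimal sketch, done by hand, of the “5 highest correlations” question using {dplyr} and {tidyr}; `mtcars` stands in for your own table of numeric predictors:

```r
# Find the 5 strongest pairwise correlations among numeric predictors.
# `mtcars` is a stand-in; swap in your own data frame of predictors.
library(dplyr)
library(tidyr)
library(tibble)

predictors <- mtcars |> select(where(is.numeric))

cor(predictors, use = "pairwise.complete.obs") |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "correlation") |>
  filter(var1 < var2) |>                      # drop self-pairs and duplicates
  slice_max(abs(correlation), n = 5)
```

The point isn’t that this is hard to write; it’s that deciding which correlations matter, and what to do about them, still requires you to engage with the data.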
The reason we do EDA is not to just check things off a list; it’s to truly engage with the data. Let’s hope they develop Data Science Agent in that direction.
WebR Talk is Live
I recently gave a talk on WebR for the Portland R User Group. WebR is a technology that lets you run R in your browser. I’m extremely excited about how this will democratize learning data science and make it more accessible to everyone. Thanks to everyone who came and learned something. Here’s the link to the video: https://www.youtube.com/watch?v=r9LW03H6Ev8&t=8s and here are the slides: https://laderast.github.io/webr-demo/#/title-slide.
Vicki Boykis: Don’t Worry about LLMs
Speaking of LLMs, Vicki Boykis gave an [excellent talk about thoughtfully using LLMs](https://vickiboykis.com/2024/05/20/dont-worry-about-llms/) with Machine Learning (ML) approaches. To have success at something you know very little about, choose a smaller problem that is still attractive to your stakeholders. It turns out that ML approaches such as gradient descent, where you iteratively explore a problem space, are still very useful in the age of LLMs. She frames this with some historical background on how cathedrals are built (I won’t give away the metaphor; it’s too good to spoil). Thanks to Bob Rudis for sharing.
Thanks for reading!
Thanks again for reading this far. I’ve had a very busy month working two jobs, which is why newsletters have been less frequent as of late.