Ready4R (2024-02-25): Philosophy of EDA
Welcome to the Weekly Ready for R mailing list! If you need Ready for R course info, it's here. Past newsletters are available here.
Thanks everyone for your votes in the survey. This week's vote was for the Philosophy of EDA.
Is there a Philosophy of Exploratory Data Analysis?
Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone – John Tukey
I think a lot of new analysts and data scientists rush through Exploratory Data Analysis. They treat it like a checklist, something to get out of the way before building a model. I think that's a mistake. Exploratory Data Analysis is fundamentally a creative act, and it is where you start building your value as an analyst. It is also the start of a conversation: you build your domain knowledge by talking about the data with your stakeholders.
John Tukey, the godfather of EDA, calls it detective work on your data. I prefer to think of it as building a mental model of the data. By asking focused questions, you come to understand the issues within the data and figure out the best approach to analyzing it.
Start With a Template
When I first start EDA on a dataset, I use a markdown template. It helps me focus on one aspect of the data by asking an overarching question I want to answer.
- I start with a question that I'm interested in within the dataset, and I plan a way to answer it.
- Then I list the variables that can help me answer this question, based on a data dictionary if it's available.
- Next I look at single-variable summaries and find things that stick out, including distributions that are not what I expected. Are there outliers that greatly influence the mean or standard deviation? Why are they so large?
- Based on the variable types and my outcome variable, I generate plots to assess relationships in the data, usually with the outcome first, but between covariates as well.
- Sometimes the answer is obvious, sometimes it isn't. Sometimes the outcome can be easily predicted from my variables of interest, but usually not. Either way, I'm now prepared to show my preliminary findings to my collaborator and understand their perspective.
Often this conversation generates new questions, which sends me back to the first step.
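The single-variable step in this process can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the template itself; the `summarize` helper, the `lengths` values, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
import statistics


def summarize(values):
    """Return a small single-variable summary for a list of numbers."""
    # Quartiles of the variable (statistics.quantiles sorts the data itself).
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Assumed rule of thumb: flag points more than 1.5 * IQR beyond the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": median,
        "iqr": iqr,
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical measurements; 250 is a deliberately extreme value.
lengths = [12, 14, 15, 15, 16, 17, 18, 19, 21, 250]

full = summarize(lengths)
# Recompute without the flagged points to see how much they move the summary.
trimmed = summarize([v for v in lengths if v not in full["outliers"]])

print("flagged:", full["outliers"])
print("mean with / without:", full["mean"], trimmed["mean"])
print("sd with / without:", full["sd"], trimmed["sd"])
```

Comparing the summary with and without the flagged points is one way to see how strongly they influence the mean and standard deviation. Whether those points are errors or real observations is still a question for the data dictionary or your collaborator.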
The EDA process helps me approach the data with a clear mind. I need to convince myself that an effect is real before I start down the path of building a model, because model building is an ever-deepening rabbit hole.
Meta Question: Do You Think an Effect is Real?
When you find that an outcome is associated with a variable, always ask yourself: do you think this effect is real? Sometimes you need to ask your collaborator. If you do, think about which plot shows the effect clearly and will resonate with your collaborator in terms of visual language.
For example, if you're predicting a categorical variable from continuous covariates, then a boxplot may be appropriate. If it's a non-standard plot, be prepared to educate your collaborator on how to read it.
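As a rough sketch of what a boxplot would show in that situation, the snippet below groups a continuous covariate by a categorical outcome and computes each group's five-number summary. The `five_number_by_group` helper and the records are hypothetical, and Python's standard library is used only for illustration.

```python
import statistics
from collections import defaultdict


def five_number_by_group(rows):
    """Map each outcome category to (min, Q1, median, Q3, max) of the covariate."""
    groups = defaultdict(list)
    for outcome, value in rows:
        groups[outcome].append(value)

    summary = {}
    for outcome, values in groups.items():
        q1, median, q3 = statistics.quantiles(values, n=4)
        summary[outcome] = (min(values), q1, median, q3, max(values))
    return summary


# Hypothetical records: (categorical outcome, continuous covariate).
rows = [
    ("responder", 4.1), ("responder", 4.8), ("responder", 5.0),
    ("responder", 5.3), ("responder", 5.9),
    ("non_responder", 2.2), ("non_responder", 2.9), ("non_responder", 3.1),
    ("non_responder", 3.4), ("non_responder", 3.8),
]

summary = five_number_by_group(rows)
for outcome, numbers in summary.items():
    print(outcome, numbers)
```

If the boxes (Q1 to Q3) of two groups barely overlap, as in this made-up data, the plot will probably make the effect easy to see. Heavy overlap is a hint that the covariate alone may not separate the outcome.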
EDA also helps you quantify how difficult your problem will be to solve. How many variables are needed to predict your outcome? If you need to explain your model to your collaborators, how complex would that explanation be?
EDA is a Creative Act
I think the creative side of Exploratory Data Analysis is daydreaming about data: wondering about variables and coming up with creative hypotheses to check with the EDA process.
In a future newsletter, I'll run through a specific example of EDA with this process, and talk about establishing your toolbox of visualizations and diagnostics.
Further Reading
Remembrances of Things EDA is a great article from the Nightingale about Tukey and the history of Exploratory Data Analysis.
What's Next Week?
Thanks for reading this far! Following up on the survey, next week will be about visualizing missing values in a dataset using some of my favorite tools in R. But we'll return to more EDA philosophy after that. Let me know what you think by clicking here and adding a comment.