One Shot Learning #1: Your data analysis is software too.
Hi there! I'm Alejandro Companioni, a data scientist at iHeartRadio with 9.166 years of experience working on software and data stuff. One Shot Learning is a weekly newsletter analyzing technical and strategic applications of machine learning and AI in the tech industry.
The Jupyter notebook is a popular interactive environment for data analysis in Python. It sits alongside pandas, scikit-learn, and Venn diagrams as a core tool of many data scientists. This summer, a debate about Jupyter notebooks broke out. It started with Joel Grus's talk at JupyterCon in late August, titled "I Don't Like Notebooks" (slides), which motivated data people across the Internet to share why they don't like Jupyter notebooks:
- Ian Goodfellow, who has written a popular book on deep learning entitled "Deep Learning," doesn't like them either;
- Hilary Parker, who co-hosts an excellent podcast titled "Not So Standard Deviations" which covers more than standard deviation, dislikes them for different reasons;
- Yihui Xie, a software engineer at RStudio, thinks Jupyter is not great and RMarkdown is not that bad; and
- Jeremy Kun at Google tweeted the words "Jupyter notebook" and "production" in the same sentence, eliciting sympathy from data people everywhere. 😒
(Around the time of Joel's talk, Netflix also publicly shared their work to improve Jupyter. I have other thoughts about their work, but will save those for a future edition of this newsletter.)
I cannot ignore a chance to wade into an Internet discussion on data science, especially when I think I can make a great point! The first part of my point is not especially controversial, though: interactive notebooks are useful tools for exploratory analysis and terrible tools for software engineering.
Look, tools that empower data analysts to iterate quickly on exploratory work are valuable. It should not surprise you that analysts apply their preferred tools to adjacent domains. Interactive notebooks are not excluded from this totally-natural behavior! Brian Granger - co-founder of Project Jupyter - admits as much in a recent episode of DataFramed:
Certainly I think there is a bit of an effect where the notebook is a hammer, so everything starts to look like a nail.
In the case of an analyst, the domain of "software engineering" lies close to their own domain. Projects in both areas require code which (ideally) exhibits clarity and reproducibility. Obfuscated software is bad - unless it's intentionally bad and thus hilarious - and idempotency is good.
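To make "idempotency is good" concrete, here is a toy sketch: an idempotent transformation gives the same result whether you apply it once or many times, which is exactly the property that makes a pipeline step safe to re-run. The function name here is illustrative, not from any particular library.

```python
def dedupe(rows):
    """Remove duplicates while preserving order.

    Idempotent: dedupe(dedupe(x)) == dedupe(x), so re-running this
    step in a pipeline never changes the result.
    """
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

rows = ["a", "b", "a", "c"]
once = dedupe(rows)
assert once == ["a", "b", "c"]
assert dedupe(once) == once  # applying it again changes nothing
```

Contrast this with an in-place transform like `df = df * 2`: run that cell twice and your data is silently quadrupled.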
The problem, then, is when the analyst takes a core tool from their domain and applies it to a slightly different domain like software engineering. Things go south fast: your notebook has not-quite-imperative code that is untested and unmonitored. It is, in other words, bad software. On the out-of-order execution of notebook cells, Yihui writes:
[This] is sometimes a benefit (you can work on a small part of the notebook at a time). Allowing users to run cells in an arbitrary order doesn’t necessarily mean giving them enough rope to hang themselves.
"This bad software practice is not always bad analytical practice" is not very reassuring. Best engineering practices have evolved so far beyond GOTO! Why not adopt decades of experience from software engineering into analytical code?
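To see why out-of-order execution bites, here is a minimal sketch of the hidden-state hazard. Three "cells" share one namespace, just as notebook cells share a kernel; the cell names and data are hypothetical. Run top-to-bottom and the result is reproducible; re-run one cell out of order and the "same" notebook reports a different answer.

```python
# Shared namespace, standing in for a notebook kernel's state.
state = {}

def cell_1():
    state["df"] = [1, 2, 3]                      # load data

def cell_2():
    state["df"] = [x * 2 for x in state["df"]]   # transform *in place*

def cell_3():
    state["total"] = sum(state["df"])            # report a result

# Top-to-bottom execution: reproducible.
cell_1(); cell_2(); cell_3()
assert state["total"] == 12

# Re-run cell_2 out of order, as interactive work encourages,
# and the reported total silently doubles.
cell_2(); cell_3()
assert state["total"] == 24
```

Nothing in the notebook UI flags that the second total came from a different execution history, which is precisely the reproducibility problem the critics are pointing at.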
My controversial second point is that if you are analyzing data, you are writing software. I've gone on the record in the past (and accidentally deleted that record along with the rest of my Twitter history 😬) on this. You may apply your code to any domain you choose, whether science, advertising, or tractor firmware. You may even claim your code is not software! But there will be a point, in the course of your code-writing activities, that your code-that-isn't-software needs to graduate to ok-now-it's-software.
Many folks have noted the blurry distinction between analysis and software in the past. Robert Chang distinguishes between analyst-type and engineer-type data scientists to parse out the significant overlap. Yihui states:
Notebooks don’t encourage users to follow good software engineering rules. I tend to agree with the person (whose name was blacked out on slide #46) that data science is not about creating software. However, Joel’s “data science” might involve more “creating software” than others. After all, who knows what “data science” really means...
And Brian Granger:
Right now there's a very steep incline between working interactively in a notebook and software engineering. As someone moves across that transition, the right thing for them to do is, stop using Jupyter and open up their favorite IDE.
I assert that this failure to delineate work leads analysts to incorrectly classify their software as "analysis code" and forgo best practices in software. Are you prototyping a new data-driven feature? Unit tests will make sure it works as expected. Are you analyzing data to motivate key business decisions? Then you need to test your code to build confidence in your results. If you are deploying your new model somewhere, Ansible or Chef can help. Etc.
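In practice, "test your analysis code" can be as small as this sketch: pull a transformation out of the notebook into a plain function, then pin its behavior with assertions (runnable directly, or collected by a test runner like pytest). The function and its expected behavior are illustrative assumptions, not anyone's actual pipeline.

```python
def normalize(values):
    """Scale values linearly to [0, 1]; a constant series maps to all zeros.

    Illustrative example of analysis code extracted from a notebook
    so it can be tested.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Tests pinning the behavior, including the constant-series edge case
# a notebook cell would quietly mishandle with a divide-by-zero.
assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]
assert normalize([7, 7, 7]) == [0.0, 0.0, 0.0]
assert normalize([2, 4]) == [0.0, 1.0]
```

The edge-case test is the point: it is exactly the kind of failure that stays invisible in an interactive session until the data changes.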
Ultimately, the discussion about Jupyter is a red herring. Tooling conversations have a low barrier to entry, which means they can distract from harder conversations about improving best practices for data science. Instead of claiming scientific exceptionalism, analysts should consider how to expand their process beyond interactive computing environments.
As always, I am happy to hear your feedback on issue #1 of One Shot Learning.