One Shot Learning #2: Data people need the right tools.
Hi there! I'm Alejandro Companioni, a data scientist at iHeartRadio with 9.25 years of experience working on software and data stuff. One Shot Learning is a weekly newsletter analyzing technical and strategic applications of machine learning and AI in the tech industry.
Thank you all for the feedback on the first issue of One Shot Learning, and welcome to the new subscribers. There are now dozens of us in the OSL community - dozens!
This week brings an article in Nature covering Jupyter. People have thoughts about painting that notebook shed! Read it (and the comments on Hacker News) if you want more takes on Jupyter's shortcomings.
Last week I wrote:
Around the time of Joel's talk, Netflix also publicly shared their work to improve Jupyter. I have other thoughts about their work, but will save those for a future edition of this newsletter.
This is a future edition of One Shot Learning! I'll spare you any more heavy discussion of notebooks, though. Instead, let's focus on how Netflix's productionalization of Jupyter notebooks reflects the evolving responsibilities of people working with data, and the lack of tooling available to meet those responsibilities.
Where were you when the Data Science Venn Diagram graced us with its presence? I remember the moment fondly. Caramel leaves filtered the sunlight as fall arrived in Washington, DC. I wore a v-neck sweater and sipped a warm cup of coffee - a charming flavor named "Pumpkin Spice", roasted lovingly by Keurig - when the post appeared, like a swaddled babe resting in a basket on the flowing river, in my Google Reader.
Anyway, eight years on, we now have all of these variations on the theme.
These diagrams exist because people have adapted their expectations for data science. Years after Drew Conway's post, data science managers no longer seek unicorns. Instead, they build teams with complementary skill sets to deliver models and data infrastructure.
Netflix shares examples of these diverse responsibilities before discussing their Jupyter-based tooling:
For example, a data engineer might create a new aggregate of a dataset containing trillions of streaming events — using Scala in IntelliJ. An analytics engineer might use that aggregate in a new report on global streaming quality — using SQL and Tableau. And that report might lead to a data scientist building a new streaming compression model — using R and RStudio. On the surface, these seem like disparate, albeit complementary, workflows. But if we delve deeper, we see that each of these workflows has multiple overlapping tasks.
All of these require exploration, validation, and productionalization to varying degrees. Netflix believes it can unify this work with a single tool:
No single tool could span all of these tasks; what’s more, a single task often requires multiple tools. When we add another layer of abstraction, however, a common pattern emerges across tools and languages: run code, explore data, present results.
They chose Jupyter for its language-agnostic messaging protocol, mixed-media file format, and interactive UI. More detail (emphasis mine):
The Jupyter protocol provides a standard messaging API to communicate with kernels that act as computational engines. The protocol enables a composable architecture that separates where content is written (the UI) and where code is executed (the kernel).
[...]
Backing all this is a file format that stores both code and results together. This means results can be accessed later without needing to rerun the code. In addition, the notebook stores rich prose to give context to what’s happening within the notebook. This makes it an ideal format for communicating business context, documenting assumptions, annotating code, describing conclusions, and more.
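Netflix's post stays at the architecture level, but the separation between "where content is written" and "where code is executed" is easy to see in code. Here is a minimal sketch using the jupyter_client library that ships with Jupyter; the kernel name and the trivial payload are my own illustrative choices, not anything from their post:

```python
from queue import Empty
from jupyter_client.manager import start_new_kernel

# Boot a kernel process and get a client that speaks the Jupyter
# messaging protocol to it. The client plays the role of the UI;
# the kernel is the computational engine.
km, kc = start_new_kernel(kernel_name="python3")

kc.execute("print(21 * 2)")

# Results stream back as messages on the IOPub channel, so any
# front end that speaks the protocol can render them.
while True:
    try:
        msg = kc.get_iopub_msg(timeout=10)
    except Empty:
        break
    if msg["msg_type"] == "stream":
        print(msg["content"]["text"], end="")  # -> 42
    elif (msg["msg_type"] == "status"
          and msg["content"]["execution_state"] == "idle"):
        break

kc.stop_channels()
km.shutdown_kernel()
```

Nothing in that loop cares whether the kernel runs Python, R, or Scala, which is exactly why the protocol can sit underneath so many different workflows.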
These characteristics map back to the responsibilities outlined above for engineers and modelers on data teams. Instead of only committing stream aggregations to version control, the data engineer can enrich their code with their thought process and iterative work. Meanwhile, analysts and modelers can now bundle software and notes with their visualizations instead of creating unannotated dashboards or imperative notebooks.
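To sketch what that bundling looks like on disk: the nbformat library writes prose and code into a single .ipynb document, with outputs stored alongside once the notebook is executed. The cell contents below are invented for illustration:

```python
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

# Context and code live together in one JSON document, so a reader
# can review assumptions and conclusions without rerunning anything.
nb = new_notebook()
nb.cells = [
    new_markdown_cell(
        "## Streaming quality aggregate\n"
        "Assumes upstream events are already deduplicated."
    ),
    new_code_cell("agg = events.groupby('country')['bitrate'].mean()"),
]

nbformat.write(nb, "streaming_quality.ipynb")
```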
This tooling discussion exists because of a larger issue: context, code, and visual artifacts are key components of data work, but no tools emphasize all three. Reporting tools like Tableau hide code and provide little in the way of documentation; IDEs like PyCharm bring code to the fore while relegating charts and context to docstrings or files on disk. The right tool could empower data people to deliver projects with context, code, and artifacts as first-class citizens.
Are notebooks a solution to this problem? I don't know. I was skeptical going into Netflix's blog series, but the fact remains that unifying tools for data teams do not exist as open-source software. Apache Spark is a dramatic improvement over Hadoop Streaming, but it is still extremely tedious to use. We have already discussed the downsides of Jupyter. To paraphrase Henry Ford, RStudio provides excellent free-as-in-beer tooling for any language out there — as long as that language is R. Meanwhile, libraries exist for pipelining and modeling and visualization, like Apache Airflow and TensorFlow and Altair, but these address specific tasks and are not easily unified.
But: we have to start somewhere! Netflix built theirs atop Jupyter. Uber has built Michelangelo (and scaled it too) atop HDFS-based technologies. According to Robert Chang, Airbnb is building something similar. RStudio has built an IDE that integrates with the Hadleyverse. Databricks, AWS, Microsoft, and others offer paid machine learning platforms that include visualization tools and resource provisioning.
These examples were all built by unicorn-valued tech companies, either for internal use or for external use at tremendous cost. The larger community, meanwhile, lacks an open-source tool that treats context, code, and visual artifacts equally.
I have no solutions to this problem. I would love to unveil CompaLab, revolutionizing your data work by providing an IDE that integrates charts, documentation, and software while plugging into any data lake. But companies are incentivized to keep their data tooling internal as an advantage over competitors, or to charge significant amounts of money for it to keep the lights on. Data teams can thus benefit greatly from strategizing how to overcome this tooling deficit - either because solving it makes their companies more money, or because it lets them outlast their competitors.