One Shot Learning #5: On Prefect, Airflow, and squeezing your citrus.
Last week I wrote about how Airflow - a little ETL framework you may have heard of - fits into Airbnb’s strategy for scaling data science at a rapidly-growing company. Right after sending that out, a few folks asked me about Prefect. Here’s Colin’s tweet, for example (thanks for writing, Colin!):
As a followup to the latest @OneShotLearning, have you come across @PrefectIO? They’ve extended/improved upon @ApacheAirflow in a lot of useful ways.
Some of you who have met me IRL know that I enjoy a good mixed drink. I love spots with menus where “orgeat” and “aquavit” would have low tf-idf scores, like Existing Conditions and Beaker and Gray. I’ll sometimes make drinks and bitters at home, and I really enjoy drinks with citrus in them.
In my experience, the best citrus cocktails are made from freshly-squeezed juice. Even JaNee from Mahalo (seriously, watch this) can deliver a great Paloma if you hand her a ripe grapefruit. Unfortunately this means buying lots of citrus fruit, which can get pricey really fast. I try to assuage my penny-pinching side, though, by squeezing as much juice out of each grapefruit as possible.
This week, a blog post comparing Prefect and Airflow made the rounds at work. Prefect is promising: inter-task data flows without blowing up your database, small and dynamic tasks, composed functions, lightweight scheduling, a functional-style API, and 🎺 workflow-level integration testing 🎺. With such a roster of features, only Snapchat levels of poor design in the upcoming UI could set the project back.
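To make "inter-task data flows" and "functional-style API" a little more concrete, here is a toy sketch in plain Python. It is not Prefect's API (or Airflow's); the task names `extract`, `transform`, `load` and the `run_pipeline` helper are invented for illustration. The point is only that when tasks are ordinary functions that hand their results to one another, the whole workflow can be exercised in a single test.

```python
# Toy illustration only: plain Python, NOT Prefect's or Airflow's API.
# All function names here are hypothetical.
from typing import Dict, List


def extract() -> List[int]:
    """Pretend to pull a few rows from a source system."""
    return [3, 1, 2]


def transform(rows: List[int]) -> List[int]:
    """Scale and sort the rows; the output feeds the next task directly."""
    return sorted(r * 10 for r in rows)


def load(rows: List[int], sink: Dict[str, List[int]]) -> int:
    """Write rows to an in-memory 'warehouse' and report the row count."""
    sink["table"] = rows
    return len(rows)


def run_pipeline(sink: Dict[str, List[int]]) -> int:
    """Compose the tasks: each return value is the next task's input."""
    return load(transform(extract()), sink)


def test_pipeline_end_to_end() -> None:
    """A workflow-level test: run every task together, check the result."""
    sink: Dict[str, List[int]] = {}
    assert run_pipeline(sink) == 3
    assert sink["table"] == [10, 20, 30]


if __name__ == "__main__":
    test_pipeline_end_to_end()
    print("pipeline ok")
```

Nothing here schedules, retries, or distributes work, which is the hard part a real framework handles. It just shows why passing data between small composed tasks makes integration testing feel natural.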
It’s tempting to migrate to Prefect! But we’ve still got a lot of juice left in Airflow. We have operationalized many pipelines as DAGs. A second team is deploying Airflow to orchestrate business-critical updates with our integration partners. Instead of migrating, we could develop other reporting tools, tune our warehouse and pipelines, or explore frameworks for machine learning. If we’re materializing tables and moving databases in a precise choreography, a collection of cocktail concocters in perfect rhythm, then why stop now?
Bryan and Tom made clear decisions here to look ridiculous and lose tons of money during the night rush!
On the other hand, using Airflow may yield pipeline jungles full of glue code - two forms of technical debt often observed in machine learning systems. “Debt” is apt here if you believe a Prefect migration is inevitable, since (1) migration costs increase in proportion to a system’s complexity and (2) we’re not going to stop writing software next week. Assuming inevitability might be right, too: we will build ML systems soon, and training ML models in Airflow is rough.
Why all the hand-wringing about our ETL framework? Because all of your technical decisions are investments made on behalf of your company. We are all paid to deliver value, and data scientists often juggle competing priorities, each with high potential impact. Therefore, your decision to work on a project has an explicit cost (your salary and benefits during the project’s lifetime) and an opportunity cost (the possible returns of all projects you don’t work on).
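Here is a back-of-the-envelope sketch of that arithmetic. Every number is made up, the project list is hypothetical, and I am simplifying opportunity cost to the expected return of the single best project you pass up (the usual economics shorthand) rather than the returns of all of them.

```python
# Back-of-the-envelope sketch; all figures and project names are hypothetical.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Project:
    name: str
    weeks: float            # how long the project ties you up
    expected_return: float  # guessed value to the company, in dollars


def explicit_cost(annual_comp: float, weeks: float) -> float:
    """Salary plus benefits paid out over the project's lifetime."""
    return annual_comp * weeks / 52


def opportunity_cost(chosen: Project, candidates: List[Project]) -> float:
    """Simplifying assumption: the return of the best project NOT chosen."""
    forgone = [p.expected_return for p in candidates if p.name != chosen.name]
    return max(forgone, default=0.0)


candidates = [
    Project("migrate to Prefect", weeks=8, expected_return=40_000),
    Project("tune the warehouse", weeks=8, expected_return=55_000),
    Project("new reporting tool", weeks=8, expected_return=30_000),
]
chosen = candidates[0]
annual_comp = 156_000  # hypothetical salary plus benefits

print(f"explicit cost:    ${explicit_cost(annual_comp, chosen.weeks):,.0f}")
print(f"opportunity cost: ${opportunity_cost(chosen, candidates):,.0f}")
```

With these invented numbers, the migration costs less in salary than it gives up in forgone warehouse tuning, which is the kind of comparison the explicit-versus-opportunity framing is meant to surface.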
Even microscopic technical decisions are investment decisions. While working on a query this week, I refactored a common table expression instead of tweaking its join. This decision, in retrospect, had clear financial implications: I took the slower path hoping to produce more maintainable code while refining my understanding of our data model. Essentially, I incurred greater cost now to increase my future productivity.
So, after 600 or so words, I haven’t answered Colin’s initial question. Yes, I have heard of Prefect; yes, it clearly improves on Airflow! I would absolutely consider developing future ML workflows using Prefect. But I won’t vote to pivot from Airflow yet - not while we’re still shaking up a few more Palomas.