John's Data and Analytics Weekly

December 30, 2021

Data stacks, writing tests for data science, PCA vs SVD

  • An interview with former Snowflake CEO Bob Muglia on how the data stack will evolve over the next 5 years. Of particular interest to our team are his comments on domain-oriented governance for data within large organizations.

    “The idea of data sharing was to build the mechanisms required to enable a domain-oriented governance model. And so this idea of domain-oriented governance can very much apply in the modern data stack.... It allows a company to set up different domains of data expertise, and then share that data with other organizations.”

  • Snowflake and Databricks are competing head to head to establish themselves as the leader in the “data stack” space. In our case, I think Databricks Unity Catalog may be part of our solution.

  • Notes from Peter Baumgartner on writing tests as part of the data science workflow, including uses of Hypothesis (a property-based testing library I didn’t know about), Great Expectations, and pytest.

  • One of the most common questions when teaching advanced linear algebra (and, for example, embeddings) concerns the relationship between SVD and PCA. I have repeatedly returned to this Stack Exchange post to get my story straight.

  • Miller is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, and JSON.

  • PyData Triangle is happening on January 5, with a talk on Druid, what looks like a wide-ranging talk on how data and AI fit into web3, and then a bunch of lightning talks.
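
On the SVD and PCA item above, the relationship can be checked numerically. What follows is a minimal sketch in NumPy, not code from the linked post: the data is randomly generated and every variable name is an assumption made for illustration. It shows that, for a centered data matrix, the right singular vectors are the principal directions (up to sign) and the squared singular values divided by n - 1 are the covariance eigenvalues.

```python
import numpy as np

# Illustrative data only: 200 samples of 5 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# PCA is defined on centered data, so subtract the column means first.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1 (PCA): eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2 (SVD): decompose the centered data matrix directly.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values, scaled by n - 1, equal the covariance eigenvalues.
variances_match = np.allclose(s**2 / (n - 1), eigvals)

# Rows of Vt equal the eigenvectors up to a sign flip per component,
# so the absolute value of their inner products is the identity.
directions_match = np.allclose(np.abs(Vt @ eigvecs), np.eye(5), atol=1e-6)

# Projecting the data onto the principal directions gives U * s.
scores_match = np.allclose(U * s, Xc @ Vt.T)

print(variances_match, directions_match, scores_match)
```

The sign ambiguity is why the direction check compares absolute values: each route may return a given component multiplied by -1.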
