Data stacks, writing tests for data science, PCA vs SVD
-
An interview with former Snowflake CEO Bob Muglia on how the data stack will evolve over the next five years. Of particular interest to our team are his comments on domain-oriented governance for data within large organizations.
“The idea of data sharing was to build the mechanisms required to enable a domain-oriented governance model. And so this idea of domain-oriented governance can very much apply in the modern data stack.... It allows a company to set up different domains of data expertise, and then share that data with other organizations.”
-
Snowflake and Databricks are battling head to head to establish themselves as the leader in the “data stack” space. In our case, I think Databricks Unity Catalog may be part of our solution.
-
Notes from Peter Baumgartner on writing tests as part of the data science workflow, including uses of Hypothesis (which I didn’t know about), Great Expectations, and pytest.
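
For readers who, like me, had not met Hypothesis before: it generates many inputs for a test and checks that a stated property holds for all of them. Below is a minimal sketch of that idea, not an example taken from Baumgartner's notes. The helper `min_max_scale` is a hypothetical function invented for illustration, and the bounds on the generated floats are an arbitrary assumption.

```python
from hypothesis import given, strategies as st


def min_max_scale(values):
    """Hypothetical helper: rescale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Property-based test: Hypothesis draws many non-empty lists of finite floats
# and checks the property for each one.
@given(
    st.lists(
        st.floats(min_value=-1e6, max_value=1e6, allow_nan=False),
        min_size=1,
    )
)
def test_scaled_values_stay_in_unit_interval(values):
    scaled = min_max_scale(values)
    assert len(scaled) == len(values)
    assert all(0.0 <= s <= 1.0 for s in scaled)
```

Saved in a file such as `test_scaling.py` (the filename is also just an assumption), this would be collected and run by pytest like any other test function.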
-
One of the most common questions when teaching advanced linear algebra (and, for example, embeddings) concerns the relationship between SVD and PCA. I myself have repeatedly returned to this Stack Exchange post to get my story straight.
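
The short version of that story can be checked numerically. The sketch below is my own illustration rather than code from the linked post; the synthetic data, the mixing matrix, and the variable names are all assumptions. It computes PCA two ways on mean-centered data: through an eigendecomposition of the covariance matrix, and through the SVD of the data matrix itself.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# PCA assumes mean-centered columns.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix, Xc = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Covariance eigenvalues equal squared singular values divided by (n - 1).
assert np.allclose(eigvals, s**2 / (n - 1))

# Principal directions are the right singular vectors, up to a sign per column.
assert np.allclose(np.abs(eigvecs), np.abs(Vt.T))

# Principal component scores Xc @ V equal U * s, again up to sign.
assert np.allclose(np.abs(Xc @ eigvecs), np.abs(U * s))
```

The sign ambiguity is why the comparisons use absolute values: each eigenvector or singular vector is only defined up to multiplication by -1.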
-
Miller is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, and JSON.
-
PyData Triangle is happening on January 5, with a talk on Druid, what looks like a wide-ranging talk covering how data and AI fit into web3, and then a bunch of lightning talks.