Data stacks, writing tests for data science, PCA vs SVD

        December 30, 2021


An interview with former Snowflake CEO Bob Muglia on how the data stack will evolve in the next 5 years. Of particular interest to our team are his comments on domain-oriented governance for data within large organizations.
“The idea of data sharing was to build the mechanisms required to enable a domain-oriented governance model. And so this idea of domain-oriented governance can very much apply in the modern data stack.... It allows a company to set up different domains of data expertise, and then share that data with other organizations.”

Snowflake and Databricks are battling head to head to establish themselves as the leader in the “data stack” space. In our case, I think Databricks Unity Catalog may be part of our solution.

Notes from Peter Baumgartner on writing tests as part of the data science workflow, including uses of Hypothesis (which I didn’t know about), Great Expectations, and pytest.
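To give a flavor of what a property-based test looks like, here is a minimal sketch using Hypothesis in a pytest-style test. It is not taken from Peter's notes: the `min_max_scale` helper and the chosen value bounds are hypothetical, made up purely for illustration. Rather than checking a few hand-picked inputs, Hypothesis generates many lists of floats and checks that a property holds for all of them.

```python
from hypothesis import given, strategies as st


def min_max_scale(values):
    """Hypothetical helper (illustration only): rescale floats to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Bounded floats exclude NaN and infinity; the bounds are an arbitrary choice.
@given(st.lists(st.floats(min_value=-1e6, max_value=1e6), min_size=1))
def test_min_max_scale_stays_in_unit_interval(values):
    scaled = min_max_scale(values)
    # Property 1: the output has one value per input value.
    assert len(scaled) == len(values)
    # Property 2: every scaled value lies in [0, 1].
    assert all(0.0 <= s <= 1.0 for s in scaled)
```

Saved in a file named something like `test_scaling.py`, this should be picked up by pytest like any other test function; on a failure, Hypothesis reports a shrunken counterexample.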

One of the most common questions that comes up when teaching advanced linear algebra and (for example) embeddings is the relationship between SVD and PCA. I myself have repeatedly returned to this Stack Exchange post to get my story straight.
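The short version: if X is a column-centered data matrix with n rows and SVD X = U S Vᵀ, then the covariance matrix XᵀX / (n − 1) has eigenvectors V and eigenvalues S² / (n − 1), and the principal component scores are XV = US. Here is a small NumPy sketch that checks this numerically. The data is synthetic and the variable names are my own; the comparisons use absolute values because each principal axis is only defined up to sign.

```python
import numpy as np

# Synthetic data: 200 samples, 4 features with clearly different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# PCA assumes column-centered data.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: PCA as an eigendecomposition of the sample covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix itself.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance are the squared singular values over (n - 1).
variances_match = np.allclose(eigvals, s**2 / (n - 1))

# Principal axes are the right singular vectors, up to a sign per column.
axes_match = np.allclose(np.abs(eigvecs), np.abs(Vt.T))

# Scores (projections onto the axes) are X V = U S, again up to sign.
scores_match = np.allclose(np.abs(Xc @ eigvecs), np.abs(U * s))

print(variances_match, axes_match, scores_match)
```

In practice the SVD route is usually preferred, since it avoids forming XᵀX explicitly and tends to be better behaved numerically.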

Miller is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, and JSON.

PyData Triangle is happening on January 5, with a talk on Druid and what looks like a wide-ranging talk covering how data and AI fit into web3, and then a bunch of lightning talks.
