Observations from last week’s Software Engineering course and a pile of new Open Source tools for you
Further below are 7 jobs including Senior/Lead/Head Data Science and Data Engineering positions at companies like Number 10 (Downing Street), JustEat, CataWiki, ECMWF, MadeWithIntent and Coefficient Systems
Thoughts from my recent Software Engineering for Data Scientists course
Last week I ran a public iteration of my Software Engineering for Data Scientists course - it focuses on people who use Notebooks and don't yet write tests, moving everyone towards a collaborative critique mindset to diagnose "bad" Notebooks, add unit tests and DataFrame tests, extract useful code into a folder structure and finally to review a bunch of "weird" Pandas outcomes. As usual this led to spirited conversation, feedback on good process and lots of new tools and techniques to take back to the office. I don't tend to write up the discussion that occurs in these sessions, so I figured I'd try that experiment. If you'd like to attend the next one, or to have a private iteration in your group, just reply to this.
I like to start the session with a code review of a "bad Notebook" - you know the sort - written out of order, awkward variable names, no documentation, bad charts - code that just looks like you shouldn't trust it. We did a code review and figured out the highest value problems to fix (and in the next session the Notebook is updated and ready for harder questions). Next we dug into unit testing - why we might write tests, how many to write (more than 0! focus on areas of highest value for bug finding and embarrassment reduction) and how to write them. I like to use pytest plus the numpy testing routines to diagnose pure Python, numeric and Pandas issues. We also use coverage to check that our tests exercise critical parts of our code and to uncover where testing is missing.
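To make that concrete, here's a minimal sketch of the kind of test we write in class - the add_discount function and the features module are hypothetical stand-ins for your own code, and pandas.testing sits alongside the numpy routines for DataFrame comparisons:

    # test_features.py - a minimal sketch; run with `pytest`
    import numpy as np
    import pandas as pd
    from pandas.testing import assert_frame_equal

    from features import add_discount  # hypothetical module under test


    def test_add_discount_applies_ten_percent():
        df = pd.DataFrame({"price": [100.0, 50.0]})
        result = add_discount(df, rate=0.1)
        expected = pd.DataFrame({"price": [100.0, 50.0],
                                 "discounted": [90.0, 45.0]})
        assert_frame_equal(result, expected)


    def test_numeric_code_with_tolerance():
        # numpy's testing routines handle floating point tolerance for us
        np.testing.assert_allclose(0.1 + 0.2, 0.3, rtol=1e-9)

Running coverage run -m pytest followed by coverage report then shows which parts of the loading and transformation code these tests actually exercise.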
One annoyance this iteration is that the usually-excellent nbdime package - which lets you do visual diffs between versions of Notebooks - is currently broken for recent versions of Jupyter and pending a new release. I demo'd this using a separate environment, but had to advise students to hold off experimenting with it. Usually this tool is very helpful when code-reviewing Notebooks as both the code and the output can be diff'd, without all the messy JSON internal gubbins being exposed (as nbdime understands how to parse that metadata away).
There's always a good discussion around "how many tests to write? how focused should they be?". I generally advise - start with at least 1 unit test in your code so you're structuring to enable testing, then add tests to anything that does data loading or transformation (we want to trust this stuff and you almost certainly know what it should be doing before you write it). Adding tests around research code is hard as we're almost certainly evolving our ideas as we go. Having a think about "what's costly if it goes wrong?" is a nice way to prioritise where to test - reducing wasted time or deployment issues by having some automated tests is a winner, having these run in a CI system is even better to raise the trust-bar in a team.
nbqa is always a happy surprise in the class as you can run module-only tools like flake8 on a Notebook to spot issues. Along with plug-ins like bugbear and pandas-vet you can add a bunch of useful additional weirdness-checkers which can help you avoid getting into sticky situations. Do you have any code review tools that you'd like to share? I'd happily share them here if they look useful.
We also use pandera to add data-quality tests to our DataFrames so we can trust our data when it is loaded and manipulated. Unit tests are probably run prior to deployment whilst data-quality checks are probably run during deployment as data passes through the system. I think there's so much value to unlock by having checks in place that fail at the point of the issue - like when loading a broken datafile - rather than discovering further down the pipeline that things aren't working and then having to laboriously work back through the pipeline to find what went wrong. Pandera helps with that.
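As a flavour of what that looks like, here's a minimal pandera sketch - the column names, bounds and input file are made up, the point is that the schema fails loudly at load time rather than letting a bad file flow down the pipeline:

    import pandas as pd
    import pandera as pa

    # hypothetical columns and ranges - adapt to your own data
    schema = pa.DataFrameSchema({
        "user_id": pa.Column(int, pa.Check.ge(0)),
        "amount": pa.Column(float, pa.Check.in_range(0, 10_000)),
        "country": pa.Column(str, pa.Check.isin(["GB", "NL", "DE"])),
    })

    df = pd.read_csv("transactions.csv")  # hypothetical input file
    validated = schema.validate(df)  # raises a SchemaError at the point of failure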
We had a discussion about great expectations which feels like a heavier-weight Pandera - it has a nice UI but requires more lines of code - do you have an opinion on one over the other? I'm curious as I've never tried GX in anger, but I have benefited from Pandera repeatedly.
We had a surprise meander into dunder naming (__name__ and __main__) and what happens when a module is run through pytest vs using python or when it is imported. I try to scatter interesting tidbits across my code examples to flush out uncertainties as these always lead to interesting conversations. If you don't know about this - try print(__name__) in a module, then run it with python or import that module from somewhere else to see what happens.
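A tiny sketch of that experiment (my_module.py is just a throwaway name):

    # my_module.py
    print(f"__name__ is {__name__}")

    if __name__ == "__main__":
        # this branch only runs via `python my_module.py`,
        # not when the module is imported or collected by pytest
        print("running as a script")

Run it with python my_module.py and you'll see __main__; import it from another module (or let pytest collect it) and you'll see the module's own name instead.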
I also asked about the use of ChatGPT or GitHub CoPilot for code assistance and half the class were experimenting. I related my anecdote from a few issues back about ChatGPT giving a confident lie about how to write a bash command - these tools definitely have value (I have others telling me that they're great for fleshing out a matplotlib or seaborn call), but we need to be mindful of errors blindly creeping through.
We also spoke about "what slows down your deployments to production?" and both organisational issues (getting sign-off from managers) and practical issues (no automation in the deployment pipeline!) came up frequently. I can see how adding more unit tests and online data quality tests can reduce a manager's fear about sign-off, and moving to a more automated deployment process (backed by unit-tests, CI and a sensible project structure) is going to be key to solve these issues.
I'm always pressing in my classes for students to get involved in collaborating on open source projects - if you file a bug or make a doc update to a project like Pandas or sklearn then you get to learn about a whole new project release process for free, and it is a process that is battle-hardened, asynchronous and robust. I was happy to hear that one of the attendees was involved in releasing popmon, a data population-monitoring tool - having an organisation release a tool to the public is a great way to learn about open source involvement:
popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets. popmon creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference.
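If you want to try it, here's a minimal sketch using popmon's pandas accessor - the filename and date column are hypothetical, and I'm assuming the accessor API as shown in the project's README:

    import pandas as pd
    import popmon  # noqa: F401 - registers the pm_stability_report accessor

    df = pd.read_csv("my_data.csv", parse_dates=["date"])  # hypothetical file
    report = df.pm_stability_report(time_axis="date", time_width="1w")
    report.to_file("stability_report.html")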
Many thanks to my students for the great discussion topics and hopefully some of you reading this will find a couple of the points above useful. If so, please reply and let me know, maybe I can write some more on these points. If you or your team might want to attend a future version of the class either use the date-notification GForm at the top of my training page or just reply to this email, I'd be happy to let you know.
PyDataLondon 2023 runs this June 2-4
Tickets are available for our conference which runs this June 2-4 in central London and the schedule will be online soon. Tickets sell out every year so if you need corporate purchase authorisation - do chat to your boss now. Here are the videos from 2022 in case you've not been before - everything is accepted through a double-blind review committee so the quality-bar is very high (and yeah, I've been rejected myself in the past - and whilst that sucked for me, it does make the point about the double-blind review committee being impartial despite me being a founder!).
I plan to run another of my Executives at PyData discussion sessions on all the issues that occur around leadership. I may also run a discussion session on "getting to higher performance" for a general discussion about the state of higher performance tools. Personally I'm very much looking forward to attending. I'll even bring my infant to take advantage of the childcare this year.
Open source
Atharva Ladd introduces some recent data science library updates (thanks Atharva for the contribution and your great support at our PyDataLondon meetups!):
- Sktime: Python library designed for time series analysis with a unified interface for learning tasks such as classification, regression, clustering, annotation, and forecasting (see the forecasting sketch after this list). It also provides time series algorithms and scikit-learn compatible tools for creating, fine-tuning, and validating time series models
- Feature Engine: Python library to create and select features for machine learning models. The transformers use a fit() method to learn parameters from data and a transform() method to apply the transformation
- Feature Tools: Python library that allows users to automatically generate features from relational and temporal datasets for machine learning models.
- Dask: A flexible parallel computing library for analytics that handles large datasets and provides efficient ways to scale up computations on multi-core CPUs, distributed clusters, and cloud computing environments.
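As a quick taster of the sktime interface mentioned above, here's a minimal forecasting sketch following the pattern from the sktime documentation - NaiveForecaster is just the simplest example:

    from sktime.datasets import load_airline
    from sktime.forecasting.naive import NaiveForecaster

    y = load_airline()  # a small univariate monthly series bundled with sktime
    forecaster = NaiveForecaster(strategy="last")
    forecaster.fit(y)
    y_pred = forecaster.predict(fh=[1, 2, 3])  # forecast the next three periods
    print(y_pred)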
Sktime 0.17.0
Sktime 0.17.0 now fully supports Python 3.11, unlike its previous version 0.16.0.
The heavy dependency on tensorflow_probability has been removed and replaced with a BaseObject-based interface for calculating probabilities, maintaining compatibility with multiple programs and allowing additional methods to be added. This will make it easier to perform calculations and use different types of data within the code. The distribution forecast metrics are designed to handle both exact and approximate calculations. There is a deprecation mechanism that will eventually switch the argument to False and remove the sktime distribution after two cycles, i.e. in 0.19.0. The latest PR updates ensure that the return types and losses used for probabilistic calculations work well with evaluation & tuning, and also aim to fix averaging issues and test interval/quantile metrics.
A conditional transformer, TransformIf, was chosen over a parameter plugin estimator: it parses a condition, takes a fittable parameter estimator and uses if/else transformers to direct a forecasting or transformation workflow.
The Hodrick-Prescott filter transformer applies a filter to a one-dimensional array and outputs two components, a cycle and a trend, using the statsmodels time series filter. The Christiano-Fitzgerald filter also outputs a cycle and a trend but uses the statsmodels time series filter cffilter.
The new ForecastKnownValues forecaster produces forecasts from a set of known values and has tons of applications: as a dummy or naive forecaster with a known baseline expectation, as a forecaster with expert forecasts, as a counterfactual in benchmarking experiments, or for post-processing with other forecasters.
A direct interface for the MrSQM algorithm, based on the mrsqm package, has been added; it serves as a proof-of-concept for estimators with Cython dependencies, e.g. MrSEQL.
Feature Engine 1.6.0
The latest release of feature-engine makes the transformers compatible with the set_output API from scikit-learn, removes the inplace functionality from transformers and introduces three new transformers for discretization of continuous variables (GeometricWidthDiscretiser), feature selection (ProbeFeatureSelection), and operations between datetime variables (DatetimeSubtraction). Big news - most categorical encoders can now encode variables with missing data. The release also includes new modules to load specific datasets and functions to automatically select numerical, categorical or datetime variables (the variable_handling module), alongside minor bug fixes and documentation updates. Specifically, the PRatioEncoder class was removed from the API and the performance of DropDuplicateFeatures has been significantly improved.
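Here's a minimal sketch of the new set_output compatibility using one of the new transformers - I'm assuming the GeometricWidthDiscretiser import path and its bins/variables parameters match the other feature-engine discretisers:

    import pandas as pd
    from feature_engine.discretisation import GeometricWidthDiscretiser  # assumed path

    df = pd.DataFrame({"age": [21, 35, 48, 62, 80]})
    disc = GeometricWidthDiscretiser(bins=3, variables=["age"])
    disc.set_output(transform="pandas")  # the scikit-learn set_output API
    df_t = disc.fit_transform(df)
    print(df_t)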
Feature Tools 1.24
A series of classes and numerical methods were added in this latest release to determine the average count across all unique values, eliminate noise in a signal and improve the smoothness of a signal trend (Savitzky-Golay filter), perform time interval calculations, compute the Pearson correlation between a series of values and its shifted version, compute kurtosis and find the number of local maxima. Some generic functions to determine the extension of a filepath, get the first/last name from a full name, etc. were also implemented. A separate makefile command was set up to segregate the core requirements, test requirements and dev requirements.
Dask (2023.2.0 - 2023.3.1)
- 2023.3.1 - Some of the notable enhancements include improved support for pyarrow strings for working with large datasets, extended complete extras to provide additional functionality for specific use cases, and initial support for converting pandas extension dtypes to arrays. The bug fixes address issues such as a flaky RuntimeWarning that appeared inconsistently when using Dask to perform calculations on arrays of data, Parquet overwrite behaviour, and numpy scalar handling to avoid Dask crashing when working with certain types of data.
- 2023.3.0 - The default shuffle algorithm in Bag was changed from 'p2p' to a more reliable one to avoid issues in certain scenarios. The documentation has been updated to reflect this change. A minimum version requirement for the jinja2 library was added to ensure that users don't encounter any issues related to incompatible versions of jinja2.
- 2023.2.1 - Enabled P2P shuffling and rechunking that allows data to be distributed, split and merged into chunks in order to optimise performance. Added robust support for efficient string conversion during the reading of Parquet files, for sorting data by multiple columns and for creating generator-based random number implementations. Some other major fixes included converting string data to use pyarrow strings, improving the performance of grouping operations by adding a "numeric_only" parameter in the groupby function (see the sketch after this list) and aligning the profiler plot when the context manager is entered.
- 2023.2.0 - There are some important enhancements that update the default behaviour of parameters such as numeric_only when calculating quantile values, datetime_is_numeric when including datetime columns during summary statistics calculations and the value_counts method in Pandas 2.0 to return the correct name. This version also includes a number of fixes for the creation of numeric meta_nonempty indices (used to represent the index of a Dask DataFrame or Series), for the info method of a Pandas DataFrame (used to display a summary of its columns, index, and data types) and for outdated information and typos in the development guide. Other maintenance updates aim to ensure the smooth functioning of the software and improve its overall stability - these include fixing tests for compatibility with pandas 2.0, replacing deprecated code, avoiding certain imports, broadening exception catching, and updating dependencies such as isort.
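As a small illustration of the numeric_only behaviour mentioned above, here's a sketch with hypothetical file paths and column names - I'm assuming the groupby aggregation accepts numeric_only as described in these release notes:

    import dask.dataframe as dd

    df = dd.read_parquet("events/*.parquet")  # hypothetical dataset
    # numeric_only mirrors the pandas behaviour and skips non-numeric columns
    per_user_mean = df.groupby("user_id").mean(numeric_only=True)
    print(per_user_mean.compute().head())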
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it'll go to all 1,500 subscribers 3 times over 6 weeks, subsequent posts are charged.
Lead Data Scientist - Experimentation and Causal Inference at Just-Eat Takeaway.com, Permanent, Amsterdam or London
We’re searching for opportunities to perfect our processes and implement methodologies that help our teams to thrive locally and globally. That’s where you come in! As a Lead Data Scientist in our Experimentation and Causal Inference team, you will work at the intersection of data science, statistics, and analytics to help build the best customer experience possible.
Enabling data-driven decision making through fast, trustworthy and reliable experimentation at scale, you will work alongside data scientists, analysts and engineers to define our experimentation pipeline which enables causal inference and automates insight extraction from hundreds of experiments. Your work will be used across every product and business function, and enable a standard way of working for causal inference across tech and non-tech functions to evaluate the impact of changes made throughout our business. As lead data scientist, it is expected that you will teach your team new tricks from a scientific perspective as well as way of working. Most importantly, your work will directly contribute to building the world’s most customer-friendly food delivery app.
- Rate:
- Location: Amsterdam or London (central)
- Contact: Please apply through the links below (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Analyst (Commercial Insights) at Catawiki, Permanent, Amsterdam
We’re looking for a Senior Data Analyst / Senior Data Scientist Insights to become the Data Business Partner of our Categories (Luxury Goods, Art, Interiors, and Collectables), in close collaboration with their dedicated Finance Business Partners.
Your role will be to analyse current business trends and to derive actionable insights in order to support our categories in shifting their commercial strategy when and where needed (supply, demand, curation etc). You will be a trusted business partner for them, who is actively participating in the decision-making, and regularly identifying new business opportunities.
- Rate:
- Location: Amsterdam, The Netherlands
- Contact: j.den.hamer@catawiki.nl (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist / Data Engineer at 10 Downing Street
The No10 data science team, 10DS, offers an unparalleled opportunity to develop your career personally through these demanding and intellectually stimulating roles. Formed in mid-2020, 10DS has a remit to radically improve the way in which key decisions are informed by data, analysis and evidence.
We are looking for exceptional candidates with great mathematical reasoning. In return you will be provided an unparalleled opportunity to develop your technical skills and support advice to help improve your country.
- Rate:
- Location: London
- Contact: avarotsis@no10.gov.uk (please mention this list when you get in touch)
- Side reading: link
Scientist/Senior Scientist for Machine Learning at ECMWF
The European Centre for Medium-range Weather Forecasts (ECMWF) is looking to hire multiple (senior) scientists for machine learning. We're now in 3 locations across Europe (with adjusted salaries), working on improving weather forecasts for Europe.
If you have experience on the traditional HPC + ML or ML + Earth Systems Science side of things, or you're more of an ML coordinator, you may be a great fit. Especially if you're part of an under-represented minority, please consider that the job post is written for both senior and regular scientists and you may not have to match every single bullet point to be a great fit. Our machine learning ecosystem: a lot of deep learning from CNNs to transformers and GNNs on lots of GPUs.
The ECMWF itself is an intergovernmental organisation created in 1975 by a group of European nations and is today supported by 35 Member and Co-operating States, mostly in Europe, with the world's largest archive of meteorological data. So it is definitely a unique place to work and push state-of-the-art machine learning these days.
- Rate: £68,374 GBP to €103,517 EUR NET of tax annual basic salary + other benefits
- Location: Reading, UK / Bonn, Germany / Bologna, Italy
- Contact: jesper.dramsch@ecmwf.int (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist at Made With Intent, Permanent, Remote
We’re a new, revolutionary platform that helps online retailers show a bit more care for their customers. We give retailers the ability to understand their customer intent by listening to hundreds of micro-behaviours on site, modelling them together, and creating a predictor metric between 0 and 1. We predict which customers will buy, and where they are in their journey, and serve them appropriate content that nudges their intent in a caring, nurturing and educating way.
The ideal data scientist candidate will have at least 2 years of hands-on programming experience in AI, a postgraduate degree, experience in NLP, Tensorflow/Pytorch, MLOps, AWS and advanced SQL skills. They must work with unstructured data, have excellent communication skills, and collaborate with stakeholders. We value those who learn new tools quickly and hustle when required.
- Rate: £70k-£90k
- Location: Remote
- Contact: tom@madewithintent.ai (please mention this list when you get in touch)
- Side reading: link, link
Data Scientist at M&G plc, London
M&G plc is an international savings and investments business; as at 30 June 2022, we had £348.9 billion of assets under management and administration.
Analytics – Data Science Team is looking for a Data Scientist to work on projects ranging from Quantitative Finance to NLP. Some recent projects include:
- ML applications in ESG data
- Topic modelling and sentiment analysis
- Portfolio Optimization
The work will revolve around the following:
- Build data ingestion pipelines (with data sourced from SFTP & third party APIs)
- Explore data and extract insights using Machine Learning models like Random Forest, XGBoost and (sometimes) Neural Networks
- Productionize the solution (build CI/CD pipelines with the help of a friendly DevOps engineer)
- Rate:
- Location: London
- Contact: sarunas.girdenas@mandg.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist at Coefficient Systems Ltd
We are looking for an enthusiastic and pragmatic Senior Data Scientist with 5+ years’ experience to join the Coefficient team full-time. We are a “full-stack” data consultancy delivering end-to-end data science, engineering and ML solutions. We’re passionate about open source, open data, agile delivery and building a culture of excellence.
This is our first Senior Data Science role, so you can expect to work closely with the CEO and to take a tech lead role for some of our projects. You'll be at the heart of project delivery including hands-on coding, code reviews, delivering Python workshops, and mentoring others. You'll be working on projects with multiple clients across different industries, including clients in the UK public sector, financial services, healthcare, app startups and beyond.
Our goal is to promote a diverse, inclusive and empowering culture at Coefficient with people who enjoy sharing their knowledge and passion with others. We aim to be best-in-class at what we do, and we want to work with people who share that same attitude.