New Rebel AI leadership group, PyDataLondon conference and SW Engineering tools for you
Further below are 5 jobs including Senior/Lead/Head Data Science and Data Engineering positions at companies like JustEat, CataWiki and MadeWithIntent
I'll talk below on a new endeavour called "Rebel AI" that I'm building for excellent data science leaders, give you some reasons to attend our PyDataLondon conference, list some more tools for the software eng side of data science and share some updates on useful open source tools.
Rebel AI leadership group (a forthcoming endeavour)
During my strategic consulting with teams I've repeatedly noted that higher level leadership questions keep coming up. Whilst I've got "Ian's answers" I can't help but think that a broader leadership group would have even better answers. To that end I'm building Rebel AI - a private leadership group for the top data science leaders in my network. I'm curating this group and it is a paid thing for those who want to deliver impactful data science results faster.
I'm looking for leaders who are having trouble shipping high value data products to their organisation. If that's you, you're probably also a bit frustrated. You've already tried various approaches and things aren't getting easier. You want the advice of a trusted set of peers - "fresh eyes" and some validation of your strategies. This is less about "which ML model to choose" and far more about "am I tackling the right strategic problem?".
This newsletter will continue (and remains free - nothing changes). My courses continue, as will my conference talks and open sessions like the Executives at PyData discussion groups at our PyData conferences. Rebel AI is an addition to my professional services to help a wider group of leaders get valuable systems shipping and making a difference.
Reply to this email if you'd like to hear more and tell me what you want help with and we can have a call. I've got a one-pager description and I'd be keen to talk about your current frustrations.
I've mostly wrapped up with 2 clients this year with great success with unblocking processes, figuring out where the value really is and doubling-down to get value unlocked. It has been a lot of fun (and worth a few million between the clients) and I'm keen to dig more into this. There's also a reflection in the software engineering tools below on this.
PyDataLondon 2023 conference and getting your questions answered in the open PyDataUK slack
In a month on June 2-4 our PyDataLondon 2023 conference will run in central London near Tower Bridge. The schedule is up with tutorials on TensorFlow, Active Learning, MLFlow and plenty more followed by talks on Code Smells, Large Scale Agent Based Simulations (Virgin for fibre routing), Multi-armed Bandits, Green-optimised compute, Polars vs Pandas, dbt, time series and loads more (with a few slow-to-confirm speakers to be filled in shortly).
Tickets are selling and sponsorship slots are still available if you'd like to reach 400+ excellent data science and engineering folk. If you're in our meetup group you'll have seen my conference announcement.
We would love it if you could share details about the conference with your network - post-Covid a lot of folk are travelling less, so we need a bit more help reaching those who'd love to attend. Can you help us please? Our PyDataLondon organiser crew are all volunteers (9 years+ for me as a founder). Here's a tweet and toot to reshare, or please post into LinkedIn and elsewhere - all help appreciated.
Personally I'll be running another of my Executives at PyData discussion sessions for leaders. This is related to the Rebel AI leadership group I noted above, but the session at the conference is open to all attendees and will be written-up and shared after.
If you'd like to be in the company of 1k+ PyData UK meetup members (I figure many of you here are also in PyData) then join here, the slack is free and friendly.
Polars vs Pandas 2
My talk on Polars vs Pandas 2 vs Dask made it through the double-blind review committee (everyone - including organisers - goes through the double-blind process). If you have experience of tools beyond Pandas I'd love to hear your thoughts in that session.
Harald Carlens shared a wonderful State of Competitive ML 2022 write-up that he's authored analysing the tools used in competitive ML competitions (i.e. Kaggle and elsewhere). Along with the rise and fall of various ML tools he looks at DataFrame libraries.
He notes:
...we found 87% of winners using Pandas and… none using any of the alternatives.
This was surprising! Each of these alternatives have their own niches (scale, speed, distribution), and seem well-established enough to see some use. It’s unclear exactly why this has happened. A few possible reasons:
- The competitive machine learning community often shares code during competitions, with competitors building on others’ solutions. It’s possible that diverging from the commonly used stack of libraries makes it harder to integrate code shared by others.
- Pandas has existed for a long time, and is backed by NumPy arrays. NumPy generally interfaces well with other libraries in the PyData ecosystem, including scikit-learn, which is often used for calculating metrics as well as data preprocessing/transformation.
The reason I'm working on this talk for the conference is to get a hands-on view of Polars (which so far I've only viewed from the sidelines). I agree with Harald that interoperability with the wider ecosystem is an important point - does Polars work seamlessly with matplotlib, scikit-learn and other tools? I'm going to dig into this and I'll share more here in the future.
More thoughts on software engineering tools for data scientists
I'm running another of my Software Engineering for Data Scientists classes, this time in a hedge fund. I've got a mixed cohort of people-new-to-Python through to experienced quants who rarely test and want to learn if they should (and I tend to believe "yes, probably"). I try to make sure we have good q&a in these discussions and we've had a good opportunity to dig into where tests of various kinds can be useful. This builds on the notes from last issue.
Whilst the tools I'll note below can be useful (and build a solid base), the more interesting discussion that I've had over several courses is "which practices hurt the speed of my research development"? Frequently the problem isn't tool based but process based (and this is on my mind given "Rebel AI" noted above - how do we get our teams to reliably go faster?). Common issues include:
- realising that the lifetime of "quick and dirty code" is much longer than expected (and then it is inflexible)
- a desire to quickly get "the answer" without questioning "is it right" leading to subsequent uncertainty and delay
- avoiding code reviews as a waste of time, later realising that fresh eyes help uncover flaws
- saving time by not writing tests, when those tests would later help someone discover how inherited-and-undocumented code is actually meant to work (skipping them makes it "somebody else's problem")
Some folk with a pure research process (and little or no desire to productionise results) perhaps have no testing and maybe even no code review with peers. Engineering teams who write libraries will have a strong culture of testing, code review and automated deployment processes. How do you balance up the needs of all of this?
I think one interesting observation is that if you a) trust your data (because someone else is in charge of keeping it clean) and b) you don't deploy (that's someone else's job) you can get away without testing. I don't think this is wise, but practically-speaking I see it a lot.
If you're doing a similar job but you don't trust your data then you probably do at least want data-quality checks. I teach Pandera and I use it on my own code, particularly the feature that lets you auto-generate a script that "just works" based on an example DataFrame. TDDA is a similar project, more aimed at discovering what's in the data plus enforcement rules.
Generally I argue that unit tests using pytest are a brilliant idea if you're building code you want to depend upon - particularly if you're going to build up a library that's shared with others or used even just by you over many projects. Having just 1 test means you're "building with testing in mind". It is much easier to add a 2nd test than to add a 1st test to a late-stage project, as by then the architecture often doesn't fit a testing mindset.
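As a minimal sketch of what "just 1 test" can look like, here is a small function with a single pytest-style test next to it. The function name `normalise_weights` and its behaviour are made up for illustration and are not from any project mentioned here.

```python
def normalise_weights(weights):
    """Scale a list of non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


def test_normalise_weights_sums_to_one():
    # pytest collects any function named test_* and treats a failed assert as a failure.
    result = normalise_weights([1, 1, 2])
    assert result == [0.25, 0.25, 0.5]
    assert sum(result) == 1.0
```

Saved in a file such as `test_weights.py` (a hypothetical name), this would be picked up by running `pytest` in the same directory. Once that first test exists, adding a second for the error case is a few more lines.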
Unit tests and end-to-end tests act as documentation - they can't lie (else they should fail) and they can be complex. doctests have a place for simpler examples and they keep library documentation correct. This sort of "living code documentation" means that when the code is received by someone else, there's at least a chance it can be understood, verified and improved without a huge effort being painfully invested.
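To illustrate the doctest idea, here is a small sketch using the standard library's `doctest` module. The `rolling_mean` function is a made-up example. The point is that the examples in the docstring are executed, so they cannot drift out of date without someone noticing.

```python
import doctest


def rolling_mean(values, window):
    """Return the mean of each consecutive `window`-sized slice of `values`.

    >>> rolling_mean([1, 2, 3, 4], window=2)
    [1.5, 2.5, 3.5]
    >>> rolling_mean([10, 20], window=5)
    []
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Re-runs every example found in the docstrings and reports any mismatch.
    doctest.testmod()
```

pytest can also collect these examples if you run it with the `--doctest-modules` flag, so the documentation checks run alongside the unit tests.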
I've surveyed some colleagues about which tools they use to check their data science workflow (both for R&D and production). The short list includes:
- black (Notebook compatible) to avoid having a variety of writing styles
- flake8 as a strong default linter with plugins for:
- sqlfluff SQL linter
- interrogate missing-docstring-checker
- bugbear checks for Python weirdness including mutables-as-default-argument (which will only hurt you)
- variable-names helps you to write less-bad variable names
- built-ins check that you're not shadowing a built-in function
- bandit security checker
- isort import order-er
- pre-commit to make linting checks run during a commit with plug-ins:
  - check-added-large-files to avoid pushing big files to git and getting stuck
  - detect-private-keys to avoid pushing security keys
- nbqa to run linters on a Notebook
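To make the bugbear item above concrete, here is a minimal sketch of the mutables-as-default-argument problem and the usual fix. The function names are invented for illustration.

```python
def add_tag_buggy(tag, tags=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits `tags`.
    tags.append(tag)
    return tags


def add_tag(tag, tags=None):
    # Using None as a sentinel gives each call its own fresh list.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags


first = add_tag_buggy("train")
second = add_tag_buggy("test")
print(second)           # ['train', 'test'] - the first call's tag leaked in
print(first is second)  # True - both names point at the shared default list

print(add_tag("train"), add_tag("test"))  # ['train'] ['test']
```

The buggy version raises no error, it just quietly accumulates state between calls, which is why a linter that flags it at commit time is cheaper than finding it in a results table later.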
Thanks mostly to Laszlo and John for these tools (I use a subset). One sensible reason to try a set of these is that we're aiming to make sure young projects don't get out of shape, such that they're costly to fix-up later. If you have a sensible process early-on (and it isn't a tax on you), you're likely to be faster at adapting to changing circumstances. None of these tools fix the fact that "real world changes happen unexpectedly" but they can help even out the cost of making change happen without pain.
If thinking about the processes that make our teams go faster is interesting to you - do check the Rebel AI announcement further up please and reply to this newsletter.
Open source
In this issue, we’ll be talking about some upcoming/existing projects (thanks to Atharva Ladd):
Ruff: A fast (10-100x faster than existing linters) Python linter written in Rust, which can lint the CPython codebase from scratch. It can be installed via pip and has built-in caching, over 500 built-in rules, and supports autofix. Compatible with Python 3.11, Ruff can also be integrated in VS Code. It aims to replace multiple Python development tools and is actively used in major open-source projects. It is backed by Astral and has received positive testimonials for its speed and efficiency.
Plotly: A Python library for creating interactive and browser-based graphs. It’s built on top of plotly.js and offers a variety of chart types including scientific charts, 3D graphs, statistical charts, financial charts which can be viewed in Jupyter notebooks, standalone HTML files or integrated into Dash applications. Plotly also offers consulting services for dashboard development, application integration and feature additions.
Lifelines: a pure Python implementation of survival analysis - a statistical technique originally developed by the actuarial and medical community to measure lifetimes, and to understand why events occur at different times under uncertainty. Some applications in other domains include measuring subscriber lifetimes for SaaS providers, determining inventory stock-outs for goods, measuring the lifetimes of political parties and relationships for sociologists, and conducting A/B tests.
Hypothesis: a Python testing library that lets you write tests that are parameterized by a source of examples; it then generates simple, clear examples that make your tests fail, helping you identify bugs in your code. It is practical, easy to use, stable and powerful.
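For readers new to the survival analysis that Lifelines covers, here is a hand-rolled sketch of the Kaplan-Meier estimator in plain Python. This is not the Lifelines API (Lifelines provides a `KaplanMeierFitter` class for real work) and the subscriber data is invented. It only shows the core idea: customers who haven't churned yet ("censored") still count as at risk until they leave the data.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with at least one event.

    durations: how long each subject was followed.
    observed:  True if the event (e.g. churn) was seen, False if censored.
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk subjects who survived time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Months subscribed for 5 made-up customers; False means "still subscribed".
months = [2, 3, 3, 5, 8]
churned = [True, True, False, True, False]
for t, s in kaplan_meier(months, churned):
    print(f"month {t}: estimated survival {s:.2f}")
```

With this toy data the estimate steps down to 0.80, 0.60 and 0.30 at months 2, 3 and 5. Simply dropping the two censored customers would understate how long subscribers last.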
Ruff 0.0.263
To help people write better code, Ruff now raises a lint error when a test uses pytest.raises(Exception). Expecting such a broad exception means the test passes whatever goes wrong, so flagging it helps us catch potential issues in the code.
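A small sketch of why the broad form is risky. To keep it runnable with only the standard library it uses `unittest`'s `assertRaises` as a stand-in; `pytest.raises` accepts or rejects exceptions by type in the same way. The `parse_age` function and its bug are invented for illustration.

```python
import unittest


def parse_age(text):
    value = int(text)
    if value < 0:
        # Deliberate bug: str + int raises TypeError, not the intended ValueError.
        raise ValueError("age must be non-negative, got " + value)
    return value


check = unittest.TestCase()

# Too broad: the accidental TypeError satisfies the check, so the bug goes unnoticed.
with check.assertRaises(Exception):
    parse_age("-1")

# Specific: only ValueError is accepted, so the TypeError escapes and exposes the bug.
try:
    with check.assertRaises(ValueError):
        parse_age("-1")
except TypeError as exc:
    print("bug exposed:", exc)
```

Naming the exact exception you expect (and, in pytest, optionally using the `match=` argument to check the message) turns the test back into something that can fail for the right reasons.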
A new rule called ICN003 prevents people from using from ... import … in cases TYT03 alongside TYT01 and TYT02. A tool called flake8-import-conventions is further extended to flag cases like from pandas import DataFrame and suggest using a different way of importing. This helps ensure that code follows certain conventions and best practices.
Ruff now implements Pylint’s PLE0302 rule, which flags unexpected signatures on special methods (methods with reserved names, such as __init__) so that these methods follow the patterns Python expects. The N815 rule from the pep8-naming tool, which checks for violations of naming conventions in code, is relaxed for TypedDict fields. This helps developers work more efficiently and avoids unnecessary workarounds or using more complex libraries.
A number of bug fixes were also implemented: flake8-pyi no longer displays an error when it sees a valid default value for a variable that doesn't have a type annotation, false positives and auto-fixes in SIM222 and SIM223 were addressed, and the check for missing arguments in docstrings was corrected in how it handles positional and keyword arguments.
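Here is a minimal sketch of the kind of mistake an unexpected-special-method-signature check is aimed at. The classes are invented for illustration. Without a linter, the wrong signature is only discovered when the comparison actually runs.

```python
class Money:
    def __init__(self, pence):
        self.pence = pence

    # Wrong: __eq__ must accept (self, other); a linter can flag this statically.
    def __eq__(self):
        return True


class FixedMoney:
    def __init__(self, pence):
        self.pence = pence

    def __eq__(self, other):
        if not isinstance(other, FixedMoney):
            return NotImplemented
        return self.pence == other.pence


try:
    Money(100) == Money(100)
except TypeError as exc:
    print("only found at runtime:", exc)

print(FixedMoney(100) == FixedMoney(100))  # True
```

Python calls `__eq__` with two arguments regardless of how it was declared, so the mismatch shows up as a `TypeError` at the point of use, possibly far from the class definition.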
Plotly 5.14.1
In this version, only one small but significant change was made - a fix for a problem with generating graphs that showed up when testing against a new version of the Pandas library (2.0.0rc0).
The data type timedelta64[ms] was returned (rather than float64), which was not compatible with Plotly and caused an error when trying to generate the graph. The solution was to convert timedelta64[ms] to timedelta64[ns] and then divide by a specific value to get the float64 data type.
Lifelines 0.27.3 and 0.27.4 updates (latest version 0.27.6 on pypi)
Lifelines now works with Python version 3.11! Some warnings and bugs were resolved - especially regarding the to_latex function that converts data to LaTeX format. Lifelines now uses a newer version of a tool called Pandas Styler to help with this conversion process.
Finally, there were some changes to the way that summary objects (special types of objects that give you information about your data) work - developers hid some functions called to_* that were on these objects. This means that you won't be able to use those functions anymore, but it also means that the summary objects are simpler and easier to use overall.
Hypothesis 6.75.1 and 6.75.0
In this new version, instead of returning None, an error will be raised when someone tries to access an attribute that doesn't exist. This will help users understand what went wrong and fix the problem more easily.
Additionally, the hypothesis.example() function can be automatically called using a Pytest plugin. However, to use this you need LibCST (which is included when you install Hypothesis with the "codemods" option) and Python 3.9 or later.
And now for something different - charity fund raising
This is a personal note unrelated to data science. Later in the year I'm taking part in a charity car drive, we'll be raising money for charity (probably for Parkinson's research). Soon there will be a JustGiving page and I'd be humbled if you'd donate.
If you know me enough to drink beer with me I might tell you about the misadventure of buying our first "banger car" that met the event's criteria. It ended with a fire engine (but thankfully just a lot of smoke). It turns out buying a banger that works is a bit of a challenge. This story will develop...
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers. If you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it'll go to all 1,500 subscribers 3 times over 6 weeks, subsequent posts are charged.
Lead Data Scientist - Experimentation and Causal Inference at Just-Eat Takeaway.com, Permanent, Amsterdam or London
We’re searching for opportunities to perfect our processes and implement methodologies that help our teams to thrive locally and globally. That’s where you come in! As a Lead Data Scientist in our Experimentation and Causal Inference team, you will work at the intersection of data science, statistics, and analytics to help build the best customer experience possible.
Enabling data-driven decision making through fast, trustworthy and reliable experimentation at scale, you will work alongside data scientists, analysts and engineers to define our experimentation pipeline which enables causal inference and automates insight extraction from hundreds of experiments. Your work will be used across every product and business function, and enable a standard way of working for causal inference across tech and non-tech functions to evaluate the impact of changes made throughout our business. As lead data scientist, it is expected that you will teach your team new tricks from a scientific perspective as well as way of working. Most importantly, your work will directly contribute to building the world’s most customer-friendly food delivery app.
- Rate:
- Location: Amsterdam or London (central)
- Contact: Please apply through the links below (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Analyst (Commercial Insights) at Catawiki, Permanent, Amsterdam
We’re looking for a Senior Data Analyst / Senior Data Scientist Insights to become the Data Business Partner of our Categories (Luxury Goods, Art, Interiors, and Collectables), in close collaboration with their dedicated Finance Business Partners.
Your role will be to analyse current business trends and to derive actionable insights in order to support our categories in shifting their commercial strategy when and where needed (supply, demand, curation etc). You will be a trusted business partner for them, who is actively participating in the decision-making, and regularly identifying new business opportunities.
- Rate:
- Location: Amsterdam, The Netherlands
- Contact: j.den.hamer@catawiki.nl (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist / Data Engineer at 10 Downing Street
The No10 data science team, 10DS, offers an unparalleled opportunity to develop your career personally through these demanding and intellectually stimulating roles. Formed in mid-2020, 10DS has a remit to radically improve the way in which key decisions are informed by data, analysis and evidence.
We are looking for exceptional candidates with great mathematical reasoning. In return you will be provided an unparalleled opportunity to develop your technical skills and support advice to help improve your country.
- Rate:
- Location: London
- Contact: avarotsis@no10.gov.uk (please mention this list when you get in touch)
- Side reading: link
Scientist/Senior Scientist for Machine Learning at ECMWF
The European Centre for Medium-range Weather Forecasts (ECMWF) is looking to hire multiple (senior) scientists for machine learning. We're now in 3 locations across Europe (with adjusted salaries), working on improving weather forecasts for Europe.
If you have experience on the traditional HPC + ML or ML + Earth Systems Science side of things, or are more of an ML coordinator, you may be a great fit. Especially if you're part of an under-represented minority, please consider that the job post is written for both senior and regular scientists and you may not have to match every single bullet point to be a great fit. Our machine learning ecosystem: a lot of deep learning from CNNs to transformers and GNNs on lots of GPUs.
The ECMWF itself is an intergovernmental organisation created in 1975 by a group of European nations and is today supported by 35 Member and Co-operating States, mostly in Europe with the world's largest archive of meteorological data. So definitely a very unique place to work and push state-of-the-art machine learning these days.
- Rate: £68,374 GBP to €103,517 EUR NET of tax annual basic salary + other benefits
- Location: Reading, UK / Bonn, Germany / Bologna, Italy
- Contact: jesper.dramsch@ecmwf.int (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist at Made With Intent, Permanent, Remote
We’re a new, revolutionary platform that helps online retailers show a bit more care for their customers. We give retailers the ability to understand their customer intent by listening to hundreds of micro-behaviours on site, modelling them together, and creating a predictor metric between 0 and 1. We predict which customers will buy, and where they are in their journey, and serve them appropriate content that nudges their intent in a caring, nurturing and educating way.
The ideal data scientist candidate will have at least 2 years of hands-on programming experience in AI, a postgraduate degree, experience in NLP, Tensorflow/Pytorch, MLOps, AWS and advanced SQL skills. They must work with unstructured data, have excellent communication skills, and collaborate with stakeholders. We value those who learn new tools quickly and hustle when required.