Successfully dealing with leadership politics, making `apply` in Pandas even faster
Further below are 6 jobs including:
- Data Scientist (FTC - 12 Months)
- Senior Data Scientist
- Principal Data Scientist - MOPAC
- Data Engineer at Airtime Rewards, Permanent, Manchester
- Analytics Engineer at Yoto, Permanent, London
- Software engineers at the Bennett Institute of Applied Data Science
In this issue I talk about "dealing with politics" through my RebelAI leadership group and share a tip on making a Numba-compiled `apply` call in Pandas even faster than the "usual" way of doing it (which was a surprise for me when benchmarking material for my new Fast Pandas course). I've also added a summary of recent updates to PyPI data science packages at the very end (is this useful? I'd love some feedback).
The PyDataLondon 2024 schedule is live and the talks look really good - tickets are on sale and they've always sold out in the past (note tutorials are now sold out).
On the Saturday morning I'll run another of my regular leadership discussions (reply to this and I can add you to the GCal reminder).
The goal for the leadership discussions is to dig into opportunities and challenges for team leaders and to get advice from the crowd on how they've solved similar issues, so members can get closer to repeatable success.
This format evolved into my RebelAI private leadership group (noted below). I've been running these sessions for 7 or so years at PyData conferences now, and people tell me that they meet very interesting people in them. If you're looking for other leaders, advice and networking, I'd strongly suggest you get a ticket and attend on the Saturday morning (and mail me back to get a GCal invite).
I'm also always looking for conversations around strategic leadership (I help teams by bringing clarity to their data science process). If you know a team that needs some help, I'd love an introduction.
Training
I have new dates to announce for my upcoming training in July and September. The links on my training page show the July and September dates - if you fill in my training notification form I'll happily send you a 10% discount code valid for this year. In a few months I'll be running:
- Fast Pandas July 18-19 - make your existing Pandas codebase 2-30x faster per bottleneck by addressing common issues with powerful speed-ups
- Software Engineering for Data Scientists July 8-10 - increase your speed of delivery by modularising, running code reviews, testing for increased confidence and preparing for production from early on
- Successful Data Science Projects July 11-12 - decrease failures and make success more likely with better project planning and execution
If you'd like your Pandas code to run faster, your team to write more maintainable DS code and your projects to succeed more frequently - check out the above and fill in my training survey.
RebelAI - dealing with politics
RebelAI is my private group for "excellent data scientists turned leaders". We meet once a month to constructively critique an opportunity or challenge, and each week we pose questions to the group.
A recent Monday Morning Question from a member asked "Have you ever had to deal with politics sidelining projects? What did you do to fix this?". This led to some lovely discussion. The top points were:
- Influence probably beats tech skills, so improve your ability to influence
- Build a light PoC, show value quickly, let people try it, remove reasons to deny progress
- Talk to the stakeholders, buy them a coffee, listen to their problems, see if you can remove doubts
- "Fight hard, lose well, move on quickly"
- Find other key influencing teams (e.g. finance), get them to sign off, then maybe you're unstoppable
- "Take down their shields, don't put more gunpowder behind the bullet" (super quote!)
If you face topics like this, maybe you need to join RebelAI? Reply to this and I can send you a 3-page PDF.
Don't forget that at the PyDataLondon conference in 2 weekends I'll run a leadership discussion (reply to this and I can put you on a GCal reminder). This is separate from RebelAI but laid the groundwork for me to create RebelAI.
Fast Pandas - `apply` with Numba compilation (the right path, the wrong path)
A common function in Pandas is `apply`, typically for applying a function row-by-row. The general advice is "don't do this because it is slow" and then we all do it because it is convenient.
If you're using NumPy for your columns (not Arrow, which has been supported in the last year or two), you can probably speed up your applied functions by 2-20x. The trick is to use the Numba just-in-time compiler.
Typically you'd write your function, then put the `@jit` decorator in front of it. The first time it is used it'll be compiled (which takes 1-30 seconds depending on complexity), and optionally the compiled result can then be cached to disk. After that it'll typically run 10x faster, maybe much more, assuming you're mostly doing numeric work.
I've used this method in Pandas for years and it can be great for very little work - you pass in your function to `apply` as usual and it gets automatically compiled.
Separately there's an option to run an `apply` with the relatively new `engine` parameter (i.e. `engine='numba'` in the `apply` arguments). This used to be synonymous with passing in an already-compiled function.
What I've discovered for Pandas 2.1+ is that there seems to be a newer fast-path if you pass in the `engine` parameter. So rather than pre-decorating your function, you leave it undecorated and pass it in to `apply` as usual with something like `raw=True, engine='numba'`. If you do a timing comparison between these two approaches you'll see another significant speed-up by taking this second approach.
If you're not sure why you'd need `raw=True`, how NumPy and Arrow storage differ, or how and why Numba can give you a 10x+ speed-up, attend my next Fast Pandas course in July and all shall be revealed. You can also read the docs and experiment to get there on your own.
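As a starting point for that experimenting, here is a small sketch of what `raw=True` changes. It only uses Pandas, and the names `seen_types` and `record_type` are hypothetical helpers invented for this example. It records the type of object that `apply` hands to your function in each mode:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

seen_types = {}


def record_type(label):
    # Build a function that notes what kind of object apply passes in
    def inner(row):
        seen_types[label] = type(row).__name__
        return 0.0

    return inner


df.apply(record_type("raw=False"), axis=1)
df.apply(record_type("raw=True"), axis=1, raw=True)

print(seen_types)
```

By default each row arrives as a Series, while with `raw=True` it arrives as a plain NumPy array. Numba is built around NumPy arrays rather than Pandas objects, which is a likely reason the Numba path wants `raw=True`.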
Recent package updates from PyPI
Powered by the PyPI API and the lovely pypistats package.
- polars 0.20.31 Blazingly fast DataFrame library
- scikit-learn 1.5.0 A set of python modules for machine learning and data mining
- spacy 3.7.5 Industrial-strength Natural Language Processing (NLP) in Python
- pyomo 6.7.3 Pyomo: Python Optimization Modeling Objects
- dask 2024.5.2 Parallel PyData with Task Scheduling
- scikit-optimize 0.10.2 Sequential model-based optimization toolbox.
- pandera 0.19.3 A light-weight and flexible data validation and testing tool for statistical data objects.
- modin 0.30.0 Modin: Make your pandas code run faster by changing one line of code.
- matplotlib 3.9.0 Python plotting package
- pytest 8.2.2 pytest: simple powerful testing with Python
This list is a new feature - if it is useful, please reply and let me know!
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers. If you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all of them 3 times over 6 weeks; subsequent posts are charged.
Data Scientist (FTC - 12 Months)
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
This post is ideally suited to someone who is keen to break into the world of data science, with excellent statistical, technical, and interpersonal skills. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £39,604.00 - £45,411.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
This post is ideally suited to someone with data science experience who wants to be hands on in a data role, is a curious flexible thinker, with excellent statistical, technical, and interpersonal skills. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £46,597.00 - £53,209.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Principal Data Scientist - MOPAC
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
The Principal Data Scientist role is ideally suited to someone with management experience, with excellent data science knowledge to lead the way on our journey into data science. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £55,009.00 - £62,860.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Data Engineer at Airtime Rewards, Permanent, Manchester
Design and implement robust, scalable data pipelines to ingest data from internal platforms into our data warehouse. Monitor and maintain data pipelines, ensuring data quality, integrity, and availability. Optimise data pipelines to enhance performance and reduce cloud computing costs. Understand, gather, and document detailed business requirements. Take ownership of data projects from planning to delivery, collaborating with other departments as needed. Innovate and automate current processes, driving continuous improvement.
- Rate: £35,000 - £45,000
- Location: Manchester, Hybrid (2 days/week in office)
- Contact: oguzcan.koncagul@airtimerewards.com (please mention this list when you get in touch)
- Side reading: link
Analytics Engineer at Yoto, Permanent, London
We’re looking for an Analytics Engineer to join our team to accelerate the business and help us make sense of the terabytes of data we receive every day.
We’re a small team at the heart of all the decisions Yoto makes. We work in a mature, high-trust environment with a lot of independence. Everyone can contribute ideas and be part of the decision making process. We tackle a broad range of problems, from developing cutting-edge data products to building and maintaining our data orchestration platform. Our work spans across all the key strategic projects throughout the company.
- Rate: £30,000 - £40,000 based on experience.
- Location: Kings Cross, London (Hybrid)
- Contact: jeena.lakshmanan@yotoplay.com (please mention this list when you get in touch)
- Side reading: link
Software engineers at the Bennett Institute of Applied Data Science
We're looking for software developers, at all stages of their careers, to help build, maintain, and operate OpenSAFELY -- a revolutionary open source platform for secure clinical research. We're also looking for a team lead, a project manager, and a research software advocate (think "developer evangelist" for research).
Led by Ben Goldacre (clinician, researcher, and author of Bad Science and Bad Pharma), we’re a truly interdisciplinary team with a strong track record of delivering useful tools in a globally leading research setting. You’ll have the chance to use your software skills to save lives and further the state of medical data research. Our software delivery teams are collaborative, supportive, thoughtful and kind, and we support hybrid or fully remote working, with in person team events throughout the year.