Reflections on Polars and Pandas 2 & a new Salary Guide
Reflections on Polars and Pandas 2
Further below are 7 jobs including: Scientist/Engineer for Machine Learning, Senior Data Scientist at CEFAS, Data Scientist at Arena.Online, Manager Data Science - Business Analytics at Catawiki, Senior Software Engineer at Qualis Flow Limited, Analytics & Digital Data Scientist at JLR, 3-6 month Applied AI Residency at Gradient Labs
I'll share some thoughts on Polars and Pandas 2 (with Dask to follow in the next newsletter) based on a talk I gave at a couple of conferences recently. Further below I share an update on the charity car drive (including the engine-melted Volvo) several of us are doing in September in aid of Parkinsons Research. At the end I've got a new Data Science/Eng Salary Guide, if you're looking for a job or hiring there's some useful detail there - that's right before this edition's 7 jobs.
New public training courses in November
I'm planning to run new editions of my Software Engineering for Data Scientists, Higher Performance Python and Successful Data Science Projects courses in November. If you'd like a heads-up on the dates, reply to this and I can let you know.
Course outlines can be found on my training page.
Talking on "Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine" at PyDataLondon 2023 and ODSC 2023
In June with Giles Weaver we spoke on Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine, this talk gave us both a good chance to have a crack at learning the interesting Polars dataframe library whilst seeing how it stacked up against the venerable Dask. Behind the scenes it uses Arrow rather than NumPy for data storage and Pandas 2 now also has the option of using Arrow - so I really fancied catching up on what new opportunities these tools might give us.
All of my past talks are on speakerdeck along with the slides for Pandas 2, Polars and Dask (PyDataLondon 2023) (and the ODSC variant which has a couple of extra slides). If you find this useful, please check my fund-raising section below for the MotoScape Car Rally and consider donating to our charity event.
At the conference I also ran a high performance discussion session, I'll talk about that and Dask in the next edition.
The video isn't yet available. I'll summarise our slides.
Pandas 2
Pandas 2 adds Arrow vectors in addition to NumPy vectors - these generally cost the same in speed and memory or are better. If you have strings then you definitely want to try the new Arrow string datatype.
Pandas 2 generally uses less RAM than earlier versions (e.g. this old bug of mine now uses much less RAM). It doesn't necessarily run faster and typically is still single-core, but by copying less RAM it both lowers the RAM footprint and wastes less time with RAM duplication. See slide 7 in the presentation for benchmarks.
For the most part if we're using Arrow in Pandas 2 you can continue to use Seaborn (and indeed Seaborn works fine with Polars too).
As Alex noted on LI:
Arrow whenever possible is a must for me nowadays - too tired of missing NaN in numpy. (along with a bunch of other great points)
Jesper 3 notes the following after our talk:
My conclusion: consider giving Polars a try if you prioritize performance or prefer its API. Otherwise, Pandas2 with the Arrow backend for strings is a solid choice. For handling huge datasets, Dask proves to be an excellent option.
We also had some other lovely feedback
Polars
Whilst Polars is young, it is growing in popularity quickly. Just yesterday I saw the announcement (hn) that Polars is now VC funded and will become its own company.
Here's a high-level review - Polars is impressively fast, once you get past the different syntax (that's fine, it is just different to Pandas). You can do more in-RAM as it uses RAM more efficiently. Probably it is faster, sometimes 10x faster (e.g. slide 11), as it uses multi-cores natively.
On the flip side it uses Arrow internally so projects that want a NumPy 2D array (e.g. sklearn
see this) won't necessarily work. You can convert to a Pandas DataFrame
easily enough, but that probably doubles your RAM usage. For this use case maybe Polars isn't a great fit. I did have some success with sklearn
from Polars with 2D homogenous arrays, but any mixed-dtype DataFrames will likely need to be converted to NumPy arrays.
We did find some bugs when using Polars causing seg-faults, we found these very early on. The issues however were fixed quickly, I'm certainly impressed with author Ritchie's speed of response to the bugs we found - the main issues were fixed in 24 hours. The fact that we found such bugs (which could cause the machine to reset) early on makes me a little cautious in recommending Polars for anything other than research work at this stage.
I certainly think for R&D projects (not mission-critical work yet), where the underlying Arrow representation doesn't cause performance issues with other tools, Polars is a very interesting tool. It is the first time in years that I'm actively paying attention to a Pandas competitor.
One of the really impressive things with Polars is switching from the eager query API (each query is evaluated as required, just like in Pandas) to the lazy API where queries are squashed together and optimised. In addition this mode can use a streaming interface to larger-than-RAM files on disk.
If you check slide 16 in our slidedeck you'll see we swap from a single in-memory Parquet file to the full (much larger than RAM) parquet set and "it just works". I'll reflect on how much faster this was than with Dask in the next issue.
There are other differences between Polars and Pandas which might trip you up whilst on-boarding. I'll make some notes in the next issue.
Conclusion
So which to use? Pandas 2 is an easy upgrade from Pandas 1.5. If you switch to Arrow from NumPy datatypes then everything "probably works ok". Hopefully you've got appropriate tests to check that your migration is successful - right? I teach this in my Software Engineering class for good reason.
If you do switch to Arrow then you might find a few API calls in Pandas have reduced functionality, for the most part it should all be there and it is probably quicker with less RAM usage.
If you're doing R&D projects and speed is an issue then I'd strongly suggest you try Polars. Expect the occasional speed-bump (e.g. no direct sklearn
support if you have large arrays and no Numba support), but if you're doing data transformation and exploration then it does just "seem to work fine". And for the most part it is faster, with less RAM usage.
The star-usage chart for Polars from their funding announce gives you an idea of the growth behind the project:
Interestingly DuckDB (in yellow) also seems to have a similar hockey-stick growth pattern and that too can use SQL to query Parquet files. I've been told that in some quarters DuckDB is the dominant choice over Pandas, so I suspect we're seeing a convergence towards a common tool-set. Which tool will win is obviously still very open to debate.
I have some corrections to make from the last issue (thanks Marco Gorelli!):
- Pandas 2 does not have Copy on Write enabled by default, it is disabled but can be turned on - it may be enabled by default in Pandas 3 in a year+
- Pandas 2's Arrow support isn't 100% yet and will continue to improve
- Some of the Polars speed-ups we saw will be caused not just by better efficiencies but also because different algorithms are being used behind the scenes
MotoScape Car Rally - charity banger run to Venice in September
Many thanks to those of you who have kindly donated to our JustGiving page for the charity car "banger drive" to Venice and back in a month, we drive at the start of September through France, Belgium, Luxembourg, Germany, Austria and end in Italy, then drive back - assuming the car is still running!
For those of you who read my newsletter and who haven't donated - if you find value in what I write, please do consider making a donation. All money raised goes to Parkinsons Research, we're doing this for charity (all car costs come out of our own pocket). If I'm helping you move forwards, and definitely if I've helped you find a job or a useful process or tool for your career - please do consider donating.
As noted in the last issue lockdown and becoming a new parent was pretty taxing - doing something silly is a nice way to change things up. Driving a 24 year old car (which hasn't blown up, I'll tell you about the one that did) for a 2k mile round-trip, is definitely a fun distraction.
You can see the current car on our Just Giving page, it is a 24 year old Volkswagen Passat petrol estate that is being dressed up as a surf car:
Making the car reliable is quite the challenge. We did a weekend's work on it recently and driver/engineer Ed has done a whole heck of a lot of work. Pulling the engine apart (see JustGiving for images) to do things like pressure tests on the cylinders has been an interesting "diagnostics challenge" - no print
statements, no variable logs, just lots of oil and mechanical gizmos:
Prior to this we had a Volvo V50 turbo diesel estate. This was referenced in our Polars talk. At this point we'd owned it for 23 hours, at the 15 mile mark (admittedly on a 181k mileage engine), the turbo decided to consume its own oil causing a "turbo run-on event". That's pretty terrifying because if you stop the car (I did) and take the key out (I did), your engine is running at 7k revs (max to the limiter) and is trying to melt itself, with you still in the car. Pretty terrifying.
I'd stopped as the car went crazy whilst I was driving. The rev counter started picking up on its own, the hit maximum whilst my foot was off the accelerator. Did I mention that this was pretty terrifying? This is before the engine started to melt (whilst I was still slowing us down safely) and white smoke poured out of the engine obscuring the country lane I was on.
After I called the fire brigade (the road was obscured, the engine wouldn't stop, I'd run away down the road!) the engine finally popped. The liquids you see are a mix of diesel, oil and coolant - even the coolant tank ruptured. I'd have been more impressed with the mess had I not been driving at the time. That black smoke on the left? That's what was pumped out of the exhaust whilst the engine melted itself:
But heck - YOLO right? So we got this one scrapped and bought a second car and now we're very close to our initial fund-raising target. All funds raised go to charity (Parkinsons Research), all costs come out of our own pocket. If you haven't donated yet - might I please suggest that you consider doing A Good Thing and donating a bit of your money? At the very least go have a look at the pictures on our page and maybe share them on the socials or with petrol-head friends.
Data Scientist Salaries in 2023
In one of my forums the question of salary ranges came up, a colleague shared the Harnham Global Salary Guide and it met the taste test. If you're recruiting, or looking to check what you're worth, you might want to take a look. If you come here you can get the employer's PDF report, it is pretty good. As an individual you can put in details here and it'll say for example that a Data Scientist in England in Greater London could be on £47k as a junior, £77k as a mid-level and £118k as a senior. There's more detail in the PDF so that's worth getting.
A couple of colleagues agreed that a DS in London would probably see £45-60 as a junior, £60-80 as a mid-level and £80-100 as a senior, with company leadership at £150k+. The PDF notes that most companies offer 2-3 days on-site, hardly any require fully on-site. What surprised me in the PDF was that most data-related roles, including BI, mentioned Python as a very important skill (pretty much all of them did), more frequently than SQL. Another surprise was that across job titles, in the same location, the salary bands looked pretty similar. Hopefully this helps you judge your market-worth.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks, subsequent posts are charged.
Scientist/Engineer for Machine Learning (multiple vacancies)
Forecasting the weather accurately saves lives. At the ECMWF we have been predicting the weather 24/7 since 1975 with now 35 member and co-operating countries with a highly regarded physical system.
In this next chapter we're looking for 4 more colleagues to round off our team in creating a cutting-edge machine learning models to supplement our physics-based model and make our predictions faster, more accurate, and more energy efficient. Normally, this kind of impactful work comes at the expense of a decent salary... not in this case! So if you have the relevant deep learning experience, you might be able to make that impact in the world with us! (If you're part of an under-represented minority, please consider applying. The vacancy note is written to cover 4 positions, which means we don't expect everyone to cover every aspect!)
- Rate: £68,374 to 103,517€ (net of tax) + benefits
- Location: Bonn, Germany or Reading, UK
- Contact: jesper.dramsch@ecmwf.int (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist at CEFAS, Permanent
The UK Government's Centre for Environment, Fisheries and Aquaculture Science (CEFAS) is looking for data scientists and senior data scientists to work on computer vision and machine learning projects. We're tackling the serious global problems of climate change, biodiversity loss and food security to secure a sustainable blue future for all.
Projects include the detection, classification and quantification of benthic organisms in sea floor video, remote electronic monitoring of fishing vessels, beach litter in remotely piloted aircraft imagery, and work with innovative ship-based instruments such as plankton cameras and flow cytometers.
Closing date 28th August.
- Rate: 37 - 41K
- Location: Remote, Lowestoft or Weymouth
- Contact: robert.blackwell@cefas.gov.uk (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist at Arena.Online, Permanent, Droitwich (Worcestershire)
Arena.Online is the UK's leading ethical flower delivery service, specialising in personalised D2C and B2B fulfilment. The Insight & Data Science team is central in informing company decision-making and identifying growth opportunities. In an industry with sparse data, we're inventive and flexible in creating and maximising data sources.
Key responsibilities in this role include maintaining and improving our datasets (quality, functionality, and scope) and reporting capability. This already covers a wide range of sources and techniques with everything from formal order data to web scraping to customer review analysis. You will also be running projects independently and as part of our close-knit team which require creativity, adaptability, and determination. We aim to refine our data stack from data discovery to collection, preparation, reporting, dashboard building, and ultimately analysis and recommendations. We’re looking for somebody intelligent, curious, and full of exciting new ideas. As our team's third member, you'll face significant responsibility, but mega opportunity in our supportive, fun, and always learning-focused environment.
- Rate: £45k
- Location: Droitwich, Worcestershire, UK (Hybrid, must be able to travel to the office at least 3 days per week)
- Contact: Please email alina.ali@arenaflowers.com with your CV and tailored cover letter (please mention this list when you get in touch)
- Side reading: link, link
Manager Data Science - Business Analytics at Catawiki, Permanent, Amsterdam, The Netherlands
We’re looking for a Data Science Manager for our Commercial Data Insights team who will manage a team of Data Scientists / Analysts that support all the commercial departments of Catawiki (Marketing, Experts, Sales, Categories & Clusters, Finance) in using data to better understand our marketplace dynamics, to take the right decisions, and to identify opportunities to build a better Catawiki.
- Rate:
- Location: Amsterdam, The Netherlands
- Contact: Justin den Hamer - j.den.hamer@catawiki.nl (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Software Engineer at Qualis Flow Limited
Your team and your role
We’re looking for someone that will be responsible for designing and developing the software that powers our products. You’ll need to collaborate with other teams, write high-quality code and ensure the codebase follows best practices.
You’ll be working in our Engineering team, working closely with Product and other technical teams and reporting to the team lead.
- Design, develop, and maintain the core engine that powers our products, ensuring scalability, performance, and reliability
- Write high-quality, maintainable code that is well-documented and tested. We are a fan of TDD
- Deeply collaborate with other engineers. We are fans of promiscuous pair-programming and mob programming
- Ensure the codebase follows best practices for software development, such as using appropriate design patterns, writing clean and modular code, and ensuring that the codebase is easy to understand and maintain
- Continuously improve our development processes and technologies to ensure high-quality software delivery
- Participate in code reviews and provide feedback to other engineers on their code
- Work closely with the Product team to translate product requirements into technical specifications
- Collaborate with other internal teams to develop software that meets the needs of the business and our customers
- Contribute to the technical direction of the team and provide ideas and input on architectural decisions
- Provide technical guidance and mentorship to more junior members of the team
- Always have an eye on the big picture to avoid getting lost in the weeds
Your skills
- Strong experience in an engineering environment, with a focus on building complex, scalable, and performant systems
- Expertise in Python
- Having NLP or Image processing experience is a bonus, but not mandatory
- Experience with modern AI applications, prompt engineering and LLMs is a bonus, but not mandatory
- Previous backend development experience beneficial
- Strong knowledge of Cloud architecture, instrumentation, and CI/CD and git (not just commit and push 😀)
- Excellent communication and interpersonal skills, with the ability to convey complex ideas and data to diverse audiences
-
Proven experience in guiding and mentoring other engineers
-
Rate: 80-90k
- Location: Hybrid, Office at London Bridge
- Contact: sam.joseph@qualisflow.com (please mention this list when you get in touch)
- Side reading: link, link, link
Analytics & Digital Data Scientist at JLR, Permanent, Gaydon UK
This role sits within the new JLR Supply Chain Analytics Team – an embedded cross-disciplinary group of data professionals who move fast using the latest technologies, sharing best practices with colleagues in JLR central digital. In this delivery role you’ll be part of a fun, inclusive, learning, flexible, value focussed team backed by growing supply chain domain expertise and fearless in the face of a thorny technical problem or highly complex data. You will assist in the design and delivery of transformational Digital products to help streamline the way we work, enabling us to meet future vehicle sales volumes and customer delivery expectations in this unprecedented era of global supply chain disruption.
You will use your mathematical skills and scientific approach to help with the formulation and implementation of our Digital strategy for Planning, particularly the use of Optimisation and forecasting models to guide planning scenarios. You will use your technical knowledge to translate the business question into the correct mathematical formulation and technically implement it, assisting the business with interpreting and diagnosing the output and proactively suggesting improvements.
- Rate:
- Location: Gaydon (Midlands) UK, hybrid
- Contact: aburrel1@jaguarlandrover.com or apply via the link (please mention this list when you get in touch)
- Side reading: link, link
3-6 month Applied AI Residency at Gradient Labs
Gradient Labs is a new AI startup in London. We’re launching an AI Residency Program: we’re looking for an early-career AI researcher to join us for 3-6 months to experience what day 0 at an AI startup in London is like.
We’re building a suite of LLM-based agents that can safely automate customer support conversations in complex environments. You’ll be working directly with a small team— us (me, Danai Antoniou, Dimitri Masin) and any other early hires we make. Join us to spent your time researching, prototyping, optimising, and analysing what our agents can or should do. Help us build that functionality and evaluate how well it works.
- Rate: Approx £60k
- Location: London (in person 1-2 days per week)
- Contact: neal@gradient-labs.ai, dimitri@gradient-labs.ai (please mention this list when you get in touch)
- Side reading: link