Pandas 2 or something else? Predicting on scant data with Metaculus
Further below are 4 jobs: AI and Cloud Engineers at the Incubator for Artificial Intelligence (UK wide), Data Science Intern at Coefficient, Data Scientist at Coefficient Systems Ltd, and Data Science Educator - FourthRev with the University of Cambridge.
A few months back I listed my first prediction challenge around UK solar production on the Metaculus prediction site - I'm using a tiny bit of data science to forecast an interesting real-world outcome. Maybe you'd like to make a prediction too? Details below.
I've also written up the Pandas 2, Polars or Dask talk that Giles and I gave at PyDataGlobal late last year with a link to the slide deck.
Further below I reflect on the potential for a GIL-less Python 3.13, the possible uses for PyPy and have a note on writing your own plugin for Polars.
I'll be giving a lightning talk on cuDF at PyDataLondon soon, so I'll link to those slides in the next issue - you can (with a bit of pain) get a 10-100x performance increase out of Pandas with a GPU!
Successful Data Science Projects and NEW Fast Pandas course coming in February
If you're interested in being notified of any of my upcoming courses (plus a 10% discount code) please fill in my course survey and I'll email you back.
In February I run another of my Successful Data Science Projects courses virtually, aimed at anyone who sets up their own projects, who has had failures and who wants to turn these into successes. We deconstruct failures, talk about best practice and set you up with new techniques to make valuable deliveries.
- Successful Data Science Projects (22-23rd February)
- Fast Pandas (date TBC in March) - lots of ways to make your Pandas run faster even to GPUs - reply to this for details
- Software Engineering for Data Scientists (March 6-8)
- Higher Performance Python (April 10-12) - profile to find bottlenecks, compile, make Pandas faster and scale with Dask
- Scientific Python Profiling (in-house only) - focused on quants who want to profile to find opportunities for speed-ups
Can you predict UK Solar Growth with scant data?
Some months back I wrote about the Metaculus prediction site - a Kaggle-like competition site where you predict future events. If you're at all into the idea of predicting complex future outcomes with little current knowledge, this is a really interesting place to hang out. Since it is played for points, not money, people freely discuss their prediction methodology, so it can be quite educational.
I posted a prediction challenge on What will Great Britain's maximum solar power capacity (MW) be for October 2024? and I'd love it if you added your prediction to the pool - more predictions help to narrow down the uncertainty. I'm running this prediction with a view to running some more interesting ones once we're closer to the UK election cycle (whenever that might be this year...).
I'm using a Monte Carlo simulation on 12 months of data, forecasting N months forwards to the target date, with monthly updates of data from the Government website, driven by Polars (just for practice). Sanity checking comes from following the news.
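The core of that approach is simple to sketch. Here's a minimal stdlib-only illustration (the capacity figures are made up for the example, not the real Government data, and the real pipeline uses Polars):

```python
import random

# Hypothetical last-12-months GB solar capacity figures in MW
# (illustrative only - not the real Government data).
monthly_capacity = [15200, 15350, 15400, 15600, 15750, 15900,
                    16050, 16150, 16300, 16500, 16650, 16800]

# Month-on-month additions observed over the last year.
additions = [b - a for a, b in zip(monthly_capacity, monthly_capacity[1:])]

def simulate(months_ahead, n_runs=10_000, seed=42):
    """Project capacity forward by resampling historical monthly additions."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_runs):
        capacity = monthly_capacity[-1]
        for _ in range(months_ahead):
            capacity += rng.choice(additions)  # bootstrap one month's growth
        outcomes.append(capacity)
    return outcomes

outcomes = sorted(simulate(months_ahead=9))
low, mid, high = (outcomes[int(len(outcomes) * q)] for q in (0.05, 0.5, 0.95))
print(f"9-month projection: median {mid} MW, 90% interval {low}-{high} MW")
```

The spread of the simulated outcomes gives you an uncertainty interval to enter on Metaculus, rather than a single point estimate.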
I'm interested in the green(er) energy transition and by predicting on competitions like this I'm able to keep my thinking aligned with what's realistic (rather than get caught up in hyperbole in the news). If you want to practice your ballpark estimation skills - with data analysis or just by following your gut - this is a great place to try. The Government Excel sheet contains useful graphs you can easily project forwards.
The on-boarding tutorial for new members is brilliant to get you thinking about your own calibration. I've certainly improved my own uncertainty calibration by playing on here.
If you're not very good at predicting things like deadlines and the complexity of a solution whilst at work - practicing your prediction skills on Metaculus is likely to help.
"Pandas 2, Polars or Dask?" - PyDataGlobal 2023 December talk
Late last year Giles Weaver and I gave an updated talk on Pandas 2 vs Polars vs Dask at PyDataGlobal 2023 (abstract), the key take-aways are:
- Pandas 2 has strong Arrow support
- Pandas 2 Copy on Write looks really interesting
- Pandas with Arrow shows mixed benchmarks - some operations get faster, others slower
- Polars keeps getting faster, LazyDataFrames are great for bigger data
- Dask-Expressions start to edge into Polars' optimisation territory
The strong Arrow support means that strings take up much less space, but Arrow-backed operations aren't always as performant - that same slide shows a much slower groupby compared to using the default NumPy storage.
The maturing Copy on Write mechanism looks super interesting; I'll be covering it in my new Fast Pandas public course in a couple of months (use the form to get a notification). DataFrame copies are expensive - by only making them when absolutely necessary we can significantly reduce the cost of DataFrame manipulations, using less RAM overall and giving a significant speed boost. There are some operational considerations and I suspect the switch to CoW-by-default in Pandas 3 will cause at least some headaches, hence I'm keen to talk about it here and in my courses so folk can get ahead of any issues.
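A minimal sketch of what CoW changes in practice (pandas 2, where it is still opt-in):

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # becomes the default in pandas 3

df = pd.DataFrame({"capacity_mw": [15200, 15400, 15600]})
view = df["capacity_mw"]   # no data is copied at this point
view.iloc[0] = 0           # the copy happens lazily, here, so df is untouched

print(df["capacity_mw"].iloc[0])  # still 15200 - no chained-assignment surprises
```

The copy is deferred until a write actually occurs, so chains of selections and transformations that never mutate anything pay no copying cost at all.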
Generally Polars keeps on getting faster and the development team are really pushing interesting changes into Polars. Scanning a larger-than-RAM dataset is a one-line change in Polars, making it much easier to scale to single-machine big-datasets than Pandas.
Dask has a new dask-expressions extension which takes your "basic Dask" code and pushes some of the same query optimisations that Polars performs down into a new query planner, resulting in faster execution. It is young and not yet as powerful as the Polars approach, but gives an easy-win speed-up in Dask with only small code changes.
Coiled has some recent blog posts and benchmarks on this new feature. They're also responding quickly to requests, so if you're curious do think about making your needs known.
Both Polars and Dask need hand-tuning for certain dataset sizes, which is a little frustrating but not hard. We don't have a "solves everything" solution yet but we're certainly getting closer.
Matt Rocklin of Coiled gave a talk around the same time on Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale subtitled "The battle that nobody wins". This includes notes about a collaboration with Polars and attempts to make this a "fairly fair" benchmark, if you're curious about how these libraries compare for larger workloads I think you'll find this an interesting watch.
I couldn't help but slip in a slide about the underlying dataset and the fact that we won the Motoscape rally, with a lovely sunset photo plus one of the car that tried to blow up with me inside it. Thanks to all of you who supported my Motoscape trip - I'm doing it again later this year in the same Volkswagen (well, assuming it passes next week's MOT!).
If you're interested in this topic you may want to attend my upcoming Higher Performance Python in April or register your interest for my forthcoming Fast Pandas course - that's 1 solid day focused on everything I know about maximising Pandas performance.
Python's future - GIL-less?
In the last issue I talked about Python 3.13's forthcoming optional JIT, which will compile pure-Python code (it won't impact Pandas, sklearn or other compiled libraries). This is on my mind as I've just updated the opening chapter of the forthcoming 3rd edition of my High Performance Python book with notes on the GIL and the JIT.
Another interesting change is the potential removal of the Global Interpreter Lock (GIL). PEP 703, introduced early last year, describes a way to remove the GIL from Python. Past attempts have always resulted in slow-downs in Python's benchmarks for the GIL-less build compared to the GIL build, and that wasn't considered acceptable. This proposal was tentatively accepted last June.
This new proposal probably does introduce some slow-downs, but offers many benefits - notably for scientific computing. The PEP link is worth a read as almost the entire focus is on multi-core AI use cases - so this change will directly affect us, hopefully for the better.
There's more background and controversy if you want to dig further, and a clean take on what might happen here. Behind the scenes the reference-counting mechanism that powers our automatic memory management will need updating, as will the memory allocator, and the "faster Python" project's recent improvements like the JIT may also need to be modified. Quite a lot will be going on and the plan is to make it seamless - so it doesn't impact us the way the Python 2.x->3.x transition did (that cost about a decade!).
In your day-to-day use we wouldn't expect to see much difference. If you're using multi-processing to use all of your cores then you might have to express things a little differently, but hopefully we'd see some dramatic performance improvements under certain conditions - possibly we could get away without having to serialise large objects between processes or manage shared memory (which is possible, but painful - I've got a whole chapter on that in my book!).
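To make that concrete, here's a hedged sketch of the kind of code that stands to gain: pure-Python CPU-bound work spread across threads. Today the GIL serialises it, so the threads add nothing; on a free-threaded (PEP 703) build this same code could use all four cores with no pickling of arguments and no shared-memory plumbing:

```python
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    # Pure-Python CPU work - under the GIL only one thread
    # can execute this bytecode at a time.
    return sum(i * i for i in range(n))

# With the GIL these four tasks run effectively serially despite the pool;
# on a free-threaded build the identical code could run truly in parallel,
# unlike multiprocessing which must serialise data between processes.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_bound, [200_000] * 4))

print(results[0])
```

The appeal is that the programming model stays the same - only the scaling behaviour changes underneath you.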
These changes will go on in the background on a branch, to test their real-world impact and the impact upon the wider development community. It is likely that NumPy, Pandas and related tools will need to be modified to cope with the (usual) GIL and GIL-less builds.
Is PyPy used for "real work"?
I've kept an eye on this thread from last year on "Is anyone using PyPy for real work?". Having taught my Higher Performance Python course for years, I used to include PyPy but dropped it in the end as nobody I taught used it or (typically) even could use it.
The issue has always been that whilst this JIT-enabled alternative Python implementation is technically correct (and it runs pure-Python code like Django just fine and much faster), it doesn't interface well with C-extension libraries. Each call into Pandas, sklearn and the like carried a relatively large penalty - on the order of a significant fraction of a second - which made it practically useless for scientific work (imagine every call you make into Pandas having a 0.5s delay attached, with no speed-up within Pandas itself).
In the thread there's definitely evidence for RDF and text and general Python work showing significant speed-ups, but notably nothing from the scientific community (which sadly meets my expectation).
It is still an active project; its relevance may change when the JIT enters Python 3.13+, but it is worth considering if you're running a pure-Python server farm and you'd like to switch some machines off.
Writing a Polars plugin
Marco Gorelli has created a guide to writing a Polars plugin:
"I'd like to postulate that the material covered here gives you enough tools that you can address at least 99% of inefficient map_elements tasks."
But then he goes on to say: "If you pick up The Rust Programming Language and can make it through the first 10 chapters, then I postulate that you'll have enough knowledge to replace the vast majority of inefficient map_elements calls." So maybe it requires a bit more effort!
The prereq section digs into the values-and-validity representation that Arrow uses, so reading this will give you some insight not just into Polars but also into Arrow's representation (which is different to NumPy's).
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks, subsequent posts are charged.
AI and Cloud Engineers at the Incubator for Artificial Intelligence, UK wide
The Government is establishing an elite team of highly empowered technical experts at the heart of government. Their mission is to help departments harness the potential of AI to improve lives and the delivery of public services.
We're looking for AI, cloud and data engineers to help build the tools and infrastructure for AI across the public sector.
- Rate: £64,700 - £149,990
- Location: Bristol, Glasgow, London, Manchester, York
- Contact: ai@no10.gov.uk (please mention this list when you get in touch)
- Side reading: link
Data Science Intern at Coefficient
We are looking for a Data Science Intern to join the Coefficient team full-time for 3 months. A permanent role at Coefficient may be offered depending on performance. You'll be working on projects with multiple clients across different industries, including the UK public sector, financial services, healthcare, app startups and beyond. You can expect hands-on experience delivering data science & engineering projects for our clients as well as working on our own products. You can also expect plenty of mentoring and guidance along the way: we aim to be best-in-class at what we do, and we want to work with people who share that same attitude.
We'd love to hear from you if you: Are comfortable using Python and SQL for data analysis, data science, and/or machine learning. Have used any libraries in the Python Open Data Science Stack (e.g. pandas, NumPy, matplotlib, Seaborn, scikit-learn). Enjoy sharing your knowledge, experience, and passion. Have great communication skills. You will be expected to write and contribute towards presentation slide decks to showcase our work during sprint reviews and client project demos.
- Rate: £28,000
- Location: London, Hybrid
- Contact: jobs@coefficient.ai (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist at Coefficient Systems Ltd
We are looking for a Data Scientist to join the Coefficient team full-time. You can expect hands-on experience delivering data science & engineering projects for our clients across multiple industries, from financial services to healthcare to app startups and beyond. This is no ordinary Data Scientist role. You will also be delivering Python workshops, mentoring junior developers and taking a lead on some of our own product ideas. We aim to be best in class at what we do, and we want to work with people who share the same attitude.
You may be a fit for this role if you: Have at least 1-2 years of experience as a Data Analyst or Data Scientist, using tools such as Python for data analysis, data science, and/or machine learning. Have used any libraries in the Python Open Data Science Stack (e.g. pandas, NumPy, matplotlib, Seaborn, scikit-learn). Can suggest how to solve someone’s problem using good analytical skills e.g. SQL. Have previous consulting experience. Have experience with teaching and great communication skills. Enjoy sharing your knowledge, experience, and passion with others.
- Rate: £40,000-£45,000 depending on experience
- Location: London, Hybrid
- Contact: jobs@coefficient.ai (please mention this list when you get in touch)
- Side reading: link, link, link
Data Science Educator - FourthRev with the University of Cambridge
As a Data Science Educator / Subject Matter Expert at FourthRev, you will leverage your expertise to shape a transformative online learning experience. You will work at the forefront of curriculum development, ensuring that every learner is equipped with industry-relevant skills, setting them on a path to success in the digital economy. You’ll collaborate in the creation of content from written materials and storyboards to real-world case studies and screen-captured tutorials.
If you have expertise in subjects like time series analysis, NLP, machine learning concepts, linear/polynomial/logistic regression, decision trees, random forest, ensemble methods: bagging and boosting, XGBoost, neural networks, deep learning (Tensorflow) and model tuning - and a passion for teaching the next generation of business-focused Data Scientists, we would love to hear from you.
- Rate: Approx. £200 per day
- Location: Remote
- Contact: Apply here (https://jobs.workable.com/view/1VhytY2jfjB3SQeB75SUu1/remote-subject-matter-expert---data-science-(6-month-contract)-in-london-at-fourthrev) or contact d.avery@fourthrev.com to discuss (please mention this list when you get in touch)
- Side reading: link, link, link