Specialist explains why your Python 3.11 run-times are faster
Further below are three job roles, including senior roles in Data Science and Data Engineering at organisations like Causaly and MDSol.
Below I have notes on the reasons for the speed-ups in Python 3.11, plus a diagnostic tool that shows where they come from.
Executives at PyDataGlobal for Dec 1st
Are you a data science leader? Would you like to raise leadership questions in a like-minded group to get answers and share your hard-won process solutions? I’m organising another of my Executives at PyData sessions for the upcoming PyDataGlobal (virtual, worldwide) conference for December 1-3. On Thursday Dec 1st I’ll run a session over a couple of hours focused on leaders; anyone who is approaching leadership or who runs a team is welcome to join.
I have a plan to make this more problem-solving focused than previous sessions, with a write-up to be shared after the conference so there’s something to take away. Attendance for these sessions is free if you have a Global ticket. This builds on the sessions I’ve volunteered to run in the past and the Success calls I’ve organised via this newsletter earlier this year.
Reply to this (or write to me - ian at ianozsvald com) if you’d like to be added to a reminder and a GCal calendar entry (there’s no obligation, these just remind you and set it in your calendar).
Python 3.11 - better exceptions that help you code faster
PEP 657 introduces fine-grained exception location information. Click through to see the examples. The upshot is that when you get an exception, you get more detail than before about where your error is.
If, for example, I add a deliberate 1/0 error in my code (see the speed-up snippet further below), in Python 3.11 I get:
Traceback (most recent call last):
  File ".../estimate_pi_err.py", line 14, in <module>
    print(monte_carlo_pi(1_000_000))
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../estimate_pi_err.py", line 11, in monte_carlo_pi
    1/0 # deliberate error
    ~^~
ZeroDivisionError: division by zero
but in Python 3.10 I only got:
Traceback (most recent call last):
  File ".../estimate_pi_err.py", line 14, in <module>
    print(monte_carlo_pi(1_000_000))
  File ".../estimate_pi_err.py", line 11, in monte_carlo_pi
    1/0 # deliberate error
ZeroDivisionError: division by zero
The PEP link has a lot more examples - in short, we get a better visual guide (especially in long tracebacks) to where our error is, at each step in the call chain, inside compound statements (note that in the example above the carets underline the monte_carlo_pi call rather than the whole print statement).
For a longer read, check out this Python 3.11 new-features write-up; it goes into more depth on the new exception reporting along with discussing other improvements to async, TOML config, faster startup times and more.
Python 3.11 - Specialist explains where your speed-ups come from
Specialist is a Python package that annotates your code to explain which bits can be specialised for faster execution, and which can’t. Using the code snippet in the last issue we can ask Specialist “what can be specialised?” using:
$ specialist estimate_pi.py
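If you don’t have the last issue to hand, here’s a minimal sketch of the sort of estimator being analysed - my reconstruction, with names inferred from the tracebacks above and the notes below:

import random

def monte_carlo_pi(n_samples):
    # count random points that land inside the unit quarter-circle
    acc = 0
    for _ in range(n_samples):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            acc += 1
    return 4.0 * acc / n_samples

print(monte_carlo_pi(1_000_000))  # approx 3.14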
Red means “adaptive” code: these instructions are a slightly slower form of the original code that checks to see if it can be specialised. Green has been “specialised” - here the maths has been type-specialised. Yellow and orange mean partial success with specialisation, depending on the ratio of successful specialisations to un-specialised instructions. Un-specialised means that the specialisation failed (e.g. a certain type was expected but later a different type was presented), so the instruction had to be switched back to the more general “adaptive” state.
Why x is never specialised, and why random seems to fail specialisation, is currently a mystery to me and may be fixed in a later Python release.
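If you want to peek at the adaptive bytecode without Specialist, Python 3.11’s dis module can show it once the function has warmed up - a minimal sketch, assuming the monte_carlo_pi function above:

import dis

# run the function enough times for the specialising interpreter to kick in
for _ in range(20):
    monte_carlo_pi(10_000)

# adaptive=True shows the specialised instructions rather than the originals
dis.dis(monte_carlo_pi, adaptive=True)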
The yellow around 4.0 * acc is explained by the combination of float and int operations, which currently aren’t optimised.
I noted in the last issue that this code snippet runs in about 65% of the time in Python 3.11 compared to the same code in Python 3.10.
Changing the iterations from 1M to 10M doesn’t change the specialisation colours, so this sample is stable - the specialising code checks thousands of times before deciding if a specialism will stick.
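To reproduce the 3.10 vs 3.11 comparison yourself, a rough timing harness might look like this - a sketch, assuming the monte_carlo_pi function above; run the same file under each interpreter and expect absolute numbers to vary by machine:

import timeit

# average over a few calls so the specialising interpreter has warmed up
t = timeit.timeit(lambda: monte_carlo_pi(1_000_000), number=10)
print(f"{t / 10:.3f}s per call")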
One obvious issue in this code is the 4.0 in the final calculation. Changing the 4.0 to 4 lets Specialist show that the final computation is fully green. Since this line runs only once per call we don’t get a speed-up, but it is nice to see that this manipulation helps the current version of the specialiser.
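For illustration, the tweak is just the literal on the return line - my sketch of the change:

# 4 * acc is now an int * int operation the specialiser can handle;
# dividing by n_samples still returns a float
return 4 * acc / n_samples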
What if we try a numpy equivalent? The following bit of code is much faster - most of the work happens at the C layer. It generates all the samples in one big go, then builds the final result in acc with a couple of C-based operations. It runs 10x faster than the pure Python version (circa 0.02s on my laptop), and the execution time is the same between Python 3.10 and Python 3.11:
import numpy as np
rng = np.random.default_rng()  # NumPy's modern random generator

def monte_carlo_pi_np(n_samples):
    # all the work happens in vectorised C-level operations
    xs = rng.random(size=n_samples)
    ys = rng.random(size=n_samples)
    acc = ((xs**2 + ys**2) < 1.0).sum()  # points inside the quarter-circle
    return 4.0 * acc / n_samples

print(monte_carlo_pi_np(1_000_000))  # 3.1422 - approx!
...
$ specialist estimate_pi_np.py
3.143044
specialist: No quickened code found in estimate_pi_np.py! Try modifying it to run longer, or use the --targets option to analyze different source files.
Here Specialist is telling us that nothing can be optimised in this code, as nothing at the Python level can be specialised. Because all of the calculations happen at the C level (in numpy), there’s nothing happening thousands of times at the Python level to be optimised. The Python 3.11 improvements don’t help our numpy or Pandas code get any faster.
Should you upgrade to 3.11?
Probably yes. The improved exceptions alone will sharpen your developer experience, saving you time and frustration.
Should you modify your code for more speed? Almost certainly not. For starters, this is a work in progress and the Faster CPython project will get better at identifying parts of your code that can be specialised. I’d expect int/float operations to be early wins, so if you rewrite your code now you’re probably only saving a bit of time before Python learns to specialise that code for you.
More importantly, within a year we’ll see the fruits of the new JIT support (for 3.12), so you’ll get much better wins when 3.12 arrives. Tweaking your code now for hard-to-reach improvements is unlikely to be a sensible investment - just enjoy what you get for free by upgrading to Python 3.11.
Random “and now for something different”
Whilst reading the Python 3.11 release notes I see we’ve now got an “And now for something completely different” section - this one talks about:
When a spherical non-rotating body of a critical radius collapses under its own gravitation under general relativity, theory suggests it will collapse to a single point. This is not the case with a rotating black hole (a Kerr black hole)…
Irreverently we get a discussion of singularity physics at the end of the release notes :-)
It turns out that the Python 3.10 release notes had one too - that one covered Schwarzschild black holes. And now that I’m checking, so did 3.9 (string) and 3.8 (spotted camels). But no further back, it seems.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it’ll go to all 1,500 subscribers 3 times over 6 weeks; subsequent posts are charged.
Senior Cloud Platform Applications Engineer, Medidata
Our team at Medidata is hiring a Senior Cloud Platform Applications Engineer in the London office. Medidata is a massive software company for clinical trials and our team focuses on developing the Sensor Cloud, a technology for ingesting, normalizing, and analyzing physiological data collected from wearable sensors and remote devices. We offer a good salary and great benefits!
- Rate:
- Location: Hammersmith, London
- Contact: kmachadogamboa@mdsol.com (please mention this list when you get in touch)
- Side reading: link
Natural Language Processing Engineer
In this role, NLP engineers will:
- Collaborate with a multicultural team of engineers whose focus is building information extraction pipelines operating on various biomedical texts
- Leverage a wide variety of techniques, ranging from linguistic rules to transformers and deep neural networks, in their day-to-day work
- Research, experiment with and implement state-of-the-art approaches to named entity recognition, relationship extraction, entity linking and document classification
- Work with professionally curated biomedical text data to both evaluate and continuously iterate on NLP solutions
- Produce performant and production-quality code following best practices adopted by the team
- Improve (in performance, accuracy, scalability, security etc.) existing solutions to NLP problems
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 2+ years experience working as an NLP or ML Engineer solving problems related to text processing
- Excellent knowledge of Python and related libraries for working with data and training models (e.g. pandas, PyTorch)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of modern natural language processing tools and techniques
- Excellent understanding of the fundamentals of machine learning
- A product and user-centric mindset
- Rate:
- Location: London/Hybrid
- Contact: david.sparks@causaly.com 07730 893 999 (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Engineer at Causaly
We are looking for a Senior Data Engineer to join our Applied AI team.
- Gather and understand data based on business requirements
- Import big data (millions of records) from various formats (e.g. CSV, XML, SQL, JSON) to BigQuery
- Process data on BigQuery using SQL, i.e. sanitize fields, aggregate records, combine with external data sources
- Implement and maintain highly performant data pipelines with the industry’s best practices and technologies for scalability, fault tolerance and reliability
- Build the necessary tools for monitoring, auditing, exporting and gleaning insights from our data pipelines
- Work with multiple stakeholders including software, machine learning, NLP and knowledge engineers, data curation specialists, and product owners to ensure all teams have a good understanding of the data and are using them in the right way
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 5+ years experience in backend data processing and data pipelines
- Excellent knowledge of Python and related libraries for working with data (e.g. pandas, Airflow)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of data processing principles
- A product and user-centric mindset
- Proficiency in Git version control