Develop faster with Python 3.12, multi-process speed profiling, polars-business and sequencing sourdough DNA
Further below are 5 jobs including: Data Scientist Insights (Customer Experience & Product) at Catawiki (Permanent, Amsterdam); Senior Machine Learning Engineer - NLP/DL; Senior Data Engineer at Hackney Council; ML Researcher at Velo (stealth startup); and Senior Data Engineer at Hertility Ltd (Permanent, Remote).
Python 3.10 and the just-released 3.12 help pinpoint errors with more useful exception messages, which can improve your software development process; I give an example below. If you're tackling multi-core programming and you want to understand which bit is slow, check out VizTracer below.
I also give an example of polars-business for date/time logic and, totally off topic, I talk a tiny bit about DNA sequencing in sourdough yeast starters and custom mechanical keycaps.
Software Engineering and Higher Performance Python courses in November
I've listed dates for the following public courses, they'll each run as virtual courses via Zoom in the UK mornings. You can also add your email for future date notifications. Early bird tickets are available for each:
- Software Engineering for Data Scientists (November 20th, 21st, 22nd)
- Higher Performance Python (November 27th, 28th, 29th)
The Software Engineering course is for people who don't write tests and who don't yet benefit from collaborating during development. We start by identifying "what's wrong" with a badly written Notebook, then move through refactoring, adding unit tests and assumption-tests, structuring code for maintenance and getting to a decent project structure that supports a setup.py for installation. This is great for anyone who feels they don't understand topics like unit testing, Python's sys.path, modules and folder structures. It'll also help you gain seniority faster.
The Higher Performance Python class focuses on figuring out what's slow and then making it faster. We start with profiling (critical if you're to work on the actual slow part of your code!), then move through making numeric and Pandas code faster, scaling up with Dask and looking at the new Polars to understand why it's a compelling alternative to Pandas. If you're stuck with slow code and expensive bills then profiling and applying the right speed-up will make your team more productive.
If you've got queries, just reply to this newsletter; I'm happy to chat about the content, and of course these courses can be run privately.
Multi-process Profiling with VizTracer
I'm about to run a private Python profiling course for quants and I've added a deeper dive into multi-process profiling tools. I was really stoked to see that VizTracer makes it easy to visualise the concurrent processing of CPU-bound code. VizTracer is pretty cool - you can click through the report to dive right into your code and you can check the timeline to see which methods are called over time.
What I hadn't tried was digging into multi-process support. During my Higher Performance Python course I work through making a CPU-bound simulator both very fast (100x speed-ups on a naive implementation) and then parallelise it to use all the cores. I'd never tried to dive into that parallel execution.
Using viztracer --min-duration 0.2ms I could profile my simulator (without the switch the trace is too large to analyse). The result, with a parent process and two subprocesses, is shown below; time runs left to right. You can't see it here, but if you click and zoom in you can see each method call (the simulator makes the same calls many times in a loop) and dig into the source, per process. Neat!
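If you want a toy to trace yourself, here's a minimal CPU-bound multi-process sketch - the sim.py name and simulate function are mine, not from my course:

```python
# sim.py - a hypothetical CPU-bound script to trace with VizTracer.
# Each worker burns CPU so the subprocess timelines are easy to see.
from multiprocessing import Pool


def simulate(seed):
    """A deliberately slow, CPU-bound loop."""
    total = 0
    for i in range(1_000_000):
        total += (i * seed) % 7
    return total


if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(simulate, [1, 2, 3, 4])
    print(results)
```

Run viztracer --min-duration 0.2ms sim.py and open the generated report with vizviewer to browse the parent and worker timelines.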
If you have a team of quants and they need to understand what's slow in their Python code so they can make it faster, I'm happy to chat about this course.
Open Source
polars-business
polars-business by the esteemed Marco Gorelli is an extension to Polars to handle date-time manipulation with respect to business rules. Marco notes that it is
about an order of magnitude faster than pandas...Polars will take care of parallelisation for you
If you're doing business logic in your Pandas workflow and you've been thinking about Polars, maybe this helps you give Polars a try?
from datetime import date

import holidays
import polars as pl
import polars_business as plb

df = pl.DataFrame(
    {"date": [date(2023, 4, 3), date(2023, 9, 1), date(2024, 1, 4)]}
)
england_holidays = holidays.country_holidays("UK", subdiv="ENG", years=[2023, 2024])

df.with_columns(
    date_shifted=plb.col("date").bdt.offset_by(
        by="5bd",
        weekend=("Sat", "Sun"),
        holidays=england_holidays,
    )
)
# leading to
shape: (3, 2)
┌────────────┬──────────────┐
│ date ┆ date_shifted │
│ --- ┆ --- │
│ date ┆ date │
╞════════════╪══════════════╡
│ 2023-04-03 ┆ 2023-04-12 │
│ 2023-09-01 ┆ 2023-09-08 │
│ 2024-01-04 ┆ 2024-01-11 │
└────────────┴──────────────┘
Initially it seemed that there might be an offset bug, but it turned out that one of the April bank holidays wasn't UK-wide (it doesn't exist in Scotland), so the subdiv='ENG' argument was added above (so it was right all along!).
Do you want to learn Polars?
I'm starting work on a Polars course and I need your input. If you reply and send me an email with some feedback, I'll give you a solid discount for this course which can also apply to my other courses (above, even for this November).
Here's what I want to know - why do you want to learn Polars?
- Is it because Pandas is slow? (Polars is generally faster at execution than Pandas, sometimes 10x or more.)
- Is it because Pandas can be confusing and that hurts development speed? (Polars can get you focused far quicker, and it has more useful error messages.)
- Would you want advice on converting Pandas code over to Polars?
- How much testing would you want to put around using Polars in production?
- Do you have medium-data (bigger-than-RAM) needs which Pandas can't cope with, but Polars could? (It does a really nice job here.)
Just reply to this and let me know why you're thinking of moving to Polars, I'll keep you informed about dates for the course and I'll give you a discount for this and my other courses.
Python 3.12
Python 3.12 was released earlier this month and I've started to use it on some personal projects. The what's-new list has a lot in it, but only a few things really interest me.
f-strings have a better-defined grammar via PEP 701, so nesting f-strings inside each other is possible; they can also span multiple lines, so they're easier to read.
There's yet more extension of the explicit type system in Python (e.g. PEP 695's new type-parameter syntax), but I'm totally not excited by this. I left the C++ style of early type declarations a decade back and I've not looked back; I find it weird that some of the Python community wants to turn duck-typing into the strictures that Java/C++ already provide, and I prefer to ignore the issue entirely.
I find teams use type hints to help their IDE while they're coding, but rarely do I find a team with explicit mypy checks that validate the type hints are correct - and if they're not checked, they're just like comments that lie, and I hate lies in code. You're welcome to tell me I'm wrong; just reply and give me a good reason to care about type declarations in Python and I'll engage!
The ability to expose Python functions to the command-line perf tool looks pretty interesting, as the perf ecosystem is deep. Have any of you had a chance to play with this yet?
...it is capable of statistical profiling of the entire system (both kernel and userland code).
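I haven't run this in anger yet, but based on the 3.12 docs the workflow on Linux looks roughly like this (assuming perf is installed; my_script.py is a placeholder):

```shell
# Enable the Python "perf trampoline" so Python function names show up
# in perf's call graphs (Linux only, Python 3.12+):
perf record -F 999 -g -o perf.data python -X perf my_script.py

# Alternatively set PYTHONPERFSUPPORT=1 instead of passing -X perf.
perf report -i perf.data
```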
PEP 684 introduces a per-interpreter GIL, so sub-interpreters may now be created with a unique GIL per interpreter. This allows Python programs to take full advantage of multiple CPU cores. This is currently only available through the C-API, though a Python API is anticipated for 3.13.
So that's a nice multi-core step that might interest data scientists (though IPC between interpreters probably still incurs a cost), and the Python API is about a year away.
The last bit that's more pragmatically useful right now is the continuing evolution towards better error messages. I got caught by the following and thought the improvement came from 3.12; it turns out this NameError similar-name suggestion came in 3.10:
  File "/home/....py", line 76, in <module>
    print(calculate_bootstrap_ci(arr, repeats=10_000))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/....py", line 49, in calculate_bootstrap_ci
    aggs = calculate_bootstrap(arr, repeats, agg_fn)
           ^^^^^^^^^^^^^^^^^^^
NameError: name 'calculate_bootstrap' is not defined. Did you mean: 'calculate_bootstraps'?
The calculate_bootstraps example comes from the Metaculus solar prediction challenge I mentioned last issue.
If you're using earlier versions of Python with less-useful error messages, I'd strongly urge you to check the "What's New" sections for 3.10 through 3.12. These little clues really help calm some of the irritations in the dev process.
line_profiler needs help for Python 3.12 support
I'm going to be working on an updated 3rd edition of my High Performance Python book and we're looking to use Python 3.12. I was momentarily stuck when I saw that the excellent line_profiler was not easy to update for 3.12, but just a week later - success, and now it has 3.12 support.
If you're looking for slow code - particularly slow numeric code - I'll be teaching line_profiler in my next Higher Performance Python class.
Off topic - sequencing Sourdough bacteria and custom mechanical keycaps
I've talked in the past about sourdough baking and every now and again I try to dig further into the science behind it. It hadn't occurred to me that perhaps the finer details of the underlying yeasts (which are fungi) and bacteria are still not well understood.
I was interested to read Using different flours for sourdough fosters different bacteria—and flavors which talks about how different flours produce different colonies:
"In other words, our findings show that bakers can influence the aroma of their sourdough by using different flours, because those flours will foster different communities of bacteria."
I've been using rye flour for my starter for over a year, rather than all-purpose white or strong white (the latter is what you'd normally use for bread). Since the total weight of rye flour (which develops a low amount of gluten) is small in a typical bread recipe, and rye "makes my starter taste better", I've not worried or really thought about why I use this flour. It turns out there's evidence that interesting stuff is going on (whether I can actually taste it - who knows?):
"We found more than 30 types of bacteria in the rye starters at maturity. The next highest was buckwheat, which had 22 types of bacteria. All of the other flours had between three and 14."
There's a related article where an intercontinental study sheds light on the microbial life of sourdough by collecting starters from around the world:
For this study, the researchers collected 500 samples of sourdough starter, primarily from home bakers in the United States and Europe, though there were also samples from Australia, New Zealand and Thailand.
Custom arty keycaps for your mechanical keyboard
My friend Natalia designs and sells custom keycaps on Etsy. She sells a bunch of other things there too (origami, earrings, magnets and more), but her keycaps really caught my eye. How about a custom solar system space bar, cute cats or lots more?
If you use a mechanical keyboard then you can swap out keys with these custom designs. Let me know if you give them a try?
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you're growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks; subsequent posts are charged.
Data Scientist Insights (Customer Experience & Product) at Catawiki, Permanent, Amsterdam
We are looking for a Data Scientist to work with our Customer Experience (CX) and Product teams to enable them to make operational and strategic decisions backed by data. You will help them not only to define and measure their success metrics but also to provide insights and improvement opportunities.
You will be part of the Data Science Insights team, helping us make sense of our data, finding actionable insights and creating self-service opportunities by combining data from multiple sources and working closely together with your colleagues in other departments.
- Rate: -
- Location: Amsterdam, The Netherlands
- Contact: Iria Kidricki i.kidricki@catawiki.nl (please mention this list when you get in touch)
- Side reading: link, link
Senior machine learning engineer - NLP/DL
You will be part of the ML team at Mavenoid, shaping the next product features to help people around the world get better support for their hardware devices. The core of your work will be to understand users’ questions and problems to fill the semantic gap.
The incoming data consists mostly of textual conversations, search queries (more than 600K conversations and 2M search queries per month) and documents. You will help process this data and assess new NLP models to build and improve the set of ML features in the product.
- Rate: 90K+ euros
- Location: remote - EU
- Contact: gdupont@mavenoid.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Engineer at Hackney Council
The Data & Insight team's vision is to help Hackney maximise the value of its data so that the council can make better decisions, do more with less, and proactively support our residents. We're a collaborative, curious and friendly team made up of data analysts, data engineers and cloud engineers, supported by a product manager and delivery manager. We're looking for a Senior Data Engineer to help us continue developing and expanding the adoption of one of local government's most advanced data analytics platforms.
As the most senior data engineer in the organisation, we'll look to you to help us: design and implement a data warehouse layer; select and implement appropriate technologies to deliver efficient data pipelines; set our code standards; productionise ML models; and mentor others in the team. Our ideal candidate wants to deliver data products that are a force for good; has a track record of delivering efficient and scalable data pipelines in AWS; is skilled in using Python, SQL and Apache Spark; and enjoys learning new things as well as supporting others to learn.
- Rate: £61,163 to £63,206 (includes £6,338 salary supplement)
- Location: London
- Contact: lisa.stidle@hackney.gov.uk (please mention this list when you get in touch)
- Side reading: link, link, link
ML Researcher at Velo (stealth startup)
Velo is a (seed-funded) stealth start up working on code generation using LLMs. We're looking to hire an ML researcher to help build out and optimise our system. Right now we're just using Pytorch (but open to suggestions!).
We're looking for someone with the following skills:
- Experience in a research setting:
  - Discussing potential experiments and deciding which to run
  - Making any (software) changes required to run them
  - Analysing the results and communicating outcomes, taking successful findings forward (either into system improvements or further experiments)
- Proficient with Python (and able to follow our existing way of doing things)
- Knowledge of LLMs
- Comfortable with remote work
- Rate:
- Location: Remote
- Contact: rob.hudson@gmail.com (please mention this list when you get in touch)
Senior Data Engineer at Hertility Ltd, Permanent, Remote
Hertility is a women’s health company built by women, for women. We’re shaping the future of reproductive healthcare by pioneering unique diagnostic testing that provides data-driven and advanced insights into reproductive health, fertility decline and the onset of menopause. We provide expert advice, education and access to care - all from the comfort of your home.
We’re looking for a Senior Data Engineer to help us build the world’s first data platform for women to manage their hormonal health. This is an exciting opportunity to work with a variety of data from multiple sources. You will be building out scalable data solutions for clinical services that are changing the lives of women everywhere.
Key responsibilities will include building out data infrastructure on AWS, developing ETL code for data cleaning/linking and collaborating with data scientists/machine learning experts to design cutting-edge AI tools that will revolutionise healthcare. We're looking for someone with 5+ years of experience, a degree in Computer Science, IT, or a similar field from a top university and a 'can-do' product mindset.
- Rate: £80k - £90k
- Location: Remote
- Contact: matt@hertilityhealth.com (please mention this list when you get in touch)
- Side reading: link