Do you want your Python code to run faster? Good news! You’ll soon get easier debugging in sklearn too
Further below are 5 job roles including Senior roles in Data Science and Data Engineering at organisations like Causaly, Cultivate and MDSol.
Two issues back I spoke a little about the upcoming Python 3.11 release. That release is now out and I give a little benchmark below; the speedups for no work on your behalf are very nice.
Do you use scikit-learn with Pandas? If so, there’s news below of a nice upcoming change where Transformers (e.g. the preprocessing tools) will maintain DataFrames rather than converting down to numpy array objects, which will ease debugging. I’m also experimenting with mamba in place of conda, notes further below.
Executives at PyDataGlobal for Dec 1st
Are you a data science leader? Would you like to raise leadership questions in a like-minded group to get answers and share your hard-won process solutions? I’m organising another of my Executives at PyData sessions for the upcoming PyDataGlobal (virtual, worldwide) conference on December 1-3. On Thursday Dec 1st I’ll run a session over a couple of hours focused on leaders; anyone who is approaching leadership or who runs a team is welcome to join.
I have a plan to make this more problem-solving focused than previous sessions, with a write-up to be shared after the conference so there’s something to take away. Attendance for these sessions is free if you have a Global ticket. This builds on the sessions I’ve volunteered to run in the past and the Success calls I’ve organised via this newsletter earlier this year.
Reply to this (or write to me - ian at ianozsvald com) if you’d like to be added to a reminder and a GCal calendar entry (there’s no obligation, these just remind you and set it in your calendar).
Python 3.11
Python 3.11 has just been released and I’ve had a tiny play directly from Anaconda (see the demo of using mamba as a replacement for conda for Python 3.11 below). There’s a lot of information out there about the new Faster CPython project spearheaded by Mark Shannon. The bottom line for this release is that pure Python code can be sped up “10-60%” (depending on what you’re doing), but it is unlikely to impact any Pandas or Numpy code (as the slow stuff there is delegated to compiled C routines).
Normally I use the following code snippet as an introduction to how the Numba compiler makes math functions faster during my Higher Performance Python course. It estimates Pi really inefficiently.
# approximately guess at pi using slow pure Python
import random

def monte_carlo_pi(n_samples):
    acc = 0
    for i in range(n_samples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / n_samples

print(monte_carlo_pi(1_000_000)) # 3.1422 - approx!
%timeit monte_carlo_pi(1_000_000)
Using Python 3.10.6 this takes 443 ms; using Python 3.11.0 this takes 302 ms. So it runs in about 70% of the time (302/443 ≈ 0.68) using the latest version of Python, with no code changes, just by upgrading the CPython interpreter.
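The %timeit line above only works inside IPython or Jupyter. If you’d like to repeat the comparison from a plain script under each interpreter, a minimal sketch using the standard library’s timeit module might look like this. The smaller sample count and the best-of-5 reporting are my assumptions to keep it quick, so the numbers are only loosely comparable to what %timeit shows.

```python
import random
import sys
import timeit


def monte_carlo_pi(n_samples):
    """Estimate pi by sampling points in the unit square (pure Python, deliberately slow)."""
    acc = 0
    for _ in range(n_samples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / n_samples


N_SAMPLES = 200_000  # assumption: fewer samples than the 1_000_000 above, to finish quickly

# Best of 5 single runs; run this same file under 3.10 and 3.11 and compare
best_seconds = min(timeit.repeat(lambda: monte_carlo_pi(N_SAMPLES), number=1, repeat=5))
print(f"Python {sys.version.split()[0]}: {best_seconds * 1000:.0f} ms for {N_SAMPLES:,} samples")

# Ratio of the two timings quoted above (302 ms on 3.11.0 vs 443 ms on 3.10.6)
ratio = 302 / 443
print(f"3.11 runs in {ratio:.0%} of the 3.10 time, a {1 / ratio:.2f}x speedup")
```

Run the same file with each interpreter and compare the first printed line; your own timings will depend on your machine.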
I’m going to look more into the changes in a later issue. See the release notes for more details. I’d suggest holding off on any significant updates until at least the first bugfix release comes out.
Pandas support coming to sklearn Transformers
Soledad (author of feature engine) tweeted out about the new Pandas DataFrame support for sklearn transformers, which is introduced in this 15 min demo video. This is coming in the next version (i.e. it isn’t available right now); the video is a feature preview for v1.2.
The short story is that whilst some parts of sklearn preserved a DataFrame if you had one (e.g. train_test_split), the transformers such as the StandardScaler always turned your DataFrame into a numpy array. Historically sklearn only supported numpy; Pandas support came later. You turn the new behaviour on using set_config(transform_output="pandas").
At 5:00 in the video we see the pretty Pipeline visual representation that you can interact with. You can see a longer demo via Binder here and that link has a few clickable elements. Somehow I’d missed this when it got introduced.
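If you’d like to poke at that diagram outside of a notebook, here’s a small sketch that writes it to an HTML file using estimator_html_repr from sklearn.utils. The choice of pipeline steps and the output filename are my own assumptions; in Jupyter you would normally just display the pipeline object, and depending on your sklearn version you may first need set_config(display="diagram").

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Hypothetical two-step pipeline, only here to have something to draw
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Build the same interactive HTML that a notebook would render
html = estimator_html_repr(pipe)

# Assumed filename; open it in a browser to click through the steps
with open("pipeline_diagram.html", "w", encoding="utf-8") as f:
    f.write(html)
```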
What tricks in sklearn and Pandas have helped you recently?
Replacing conda with mamba
You may have seen references to replacing the Anaconda installation tool conda with QuantStack’s open source mamba. The big sell is that it is much faster at resolving the right set of versioned packages to install. I’d somewhat given up on using conda for anything other than base packages as it could take 30-60 minutes for a complex new environment, and I make lots of new lightweight environments.
mamba offers a much faster solver, is available as both a tool and a library, and is helping conda evolve to accept alternative solvers (as libraries). For you, it is much faster, so try it if conda is too slow.
Building a 3.11 environment looks just the same as with conda. Being simple, this didn’t need much dependency resolution, so mamba create -n tmp311_mamba python=3.11 ipython ran pretty quickly - it also shows some nice graphics in the terminal whilst it sets up the new environment.
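Since the two tools take the same arguments for this command, a script that builds environments can pick whichever is available. This is a minimal sketch under that assumption; build_create_command is a hypothetical helper of mine, not part of either tool, and it only builds the command rather than running it.

```python
import shutil


def build_create_command(env_name, packages, prefer="mamba"):
    """Return a create command as a list, using mamba when it is on PATH, otherwise conda."""
    tool = prefer if shutil.which(prefer) else "conda"
    return [tool, "create", "-n", env_name, *packages]


# Same environment name and packages as the command above
cmd = build_create_command("tmp311_mamba", ["python=3.11", "ipython"])
print(" ".join(cmd))

# To actually build the environment you could pass cmd to subprocess.run(cmd, check=True)
```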
From what I’ve read it is pretty stable now and generally always faster than setting up or adding packages using conda. I installed it with conda install mamba -n base -c conda-forge from here.
Have you switched to mamba already? Have you been happy with the experience? I’m guessing there are some weird edge cases that might be worth knowing about?
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers. If you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it’ll go to all 1,500 subscribers 3 times over 6 weeks; subsequent posts are charged.
Senior Cloud Platform Applications Engineer, Medidata
Our team at Medidata is hiring a Senior Cloud Platform Applications Engineer in the London office. Medidata is a massive software company for clinical trials and our team focuses on developing the Sensor Cloud, a technology with capabilities in ingesting, normalizing, and analyzing physiological data collected from wearable sensors and remote devices. We offer a good salary and great benefits!
- Rate:
- Location: Hammersmith, London
- Contact: kmachadogamboa@mdsol.com (please mention this list when you get in touch)
- Side reading: link
Natural Language Processing Engineer
In this role, NLP engineers will:
- Collaborate with a multicultural team of engineers whose focus is on building information extraction pipelines operating on various biomedical texts
- Leverage a wide variety of techniques ranging from linguistic rules to transformers and deep neural networks in their day to day work
- Research, experiment with and implement state of the art approaches to named entity recognition, relationship extraction, entity linking and document classification
- Work with professionally curated biomedical text data to both evaluate and continuously iterate on NLP solutions
- Produce performant and production quality code following best practices adopted by the team
- Improve (in performance, accuracy, scalability, security etc…) existing solutions to NLP problems
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 2+ years experience working as an NLP or ML Engineer solving problems related to text processing
- Excellent knowledge of Python and related libraries for working with data and training models (e.g. pandas, PyTorch)
- Solid understanding of modern software development practices (testing, version control, documentation, etc…)
- Excellent knowledge of modern natural language processing tools and techniques
- Excellent understanding of the fundamentals of machine learning
- A product and user-centric mindset
- Rate:
- Location: London/Hybrid
- Contact: david.sparks@causaly.com 07730 893 999 (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Engineer at Causaly
We are looking for a Senior Data Engineer to join our Applied AI team.
- Gather and understand data based on business requirements.
- Import big data (millions of records) from various formats (e.g. CSV, XML, SQL, JSON) to BigQuery.
- Process data on BigQuery using SQL, i.e. sanitize fields, aggregate records, combine with external data sources.
- Implement and maintain highly performant data pipelines with the industry’s best practices and technologies for scalability, fault tolerance and reliability.
- Build the necessary tools for monitoring, auditing, exporting and gleaning insights from our data pipelines.
- Work with multiple stakeholders including software, machine learning, NLP and knowledge engineers, data curation specialists, and product owners to ensure all teams have a good understanding of the data and are using them in the right way.
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 5+ years experience in backend data processing and data pipelines
- Excellent knowledge of Python and related libraries for working with data (e.g. pandas, Airflow)
- Solid understanding of modern software development practices (testing, version control, documentation, etc…)
- Excellent knowledge of data processing principles
- A product and user-centric mindset
- Proficiency in Git version control
- Rate:
- Location: London/Hybrid
- Contact: david.sparks@causaly.com 07730 893 999 (please mention this list when you get in touch)
- Side reading: link, link
Data Engineering Lead - Purpose
This is an exciting opportunity to join a diverse team of strategists, campaigners and creatives to tackle some of the world’s most pressing challenges at an impressive scale.
- Rate:
- Location: London OR Remote
- Contact: sarah@cultivateteam.org (please mention this list when you get in touch)
- Side reading: link
Mid/Senior Python Software Engineer at an iGambling Startup (via recruiter: Difference Digital)
This role is for a software start-up, although it is part of a much larger established group, so they have solid finance behind them. You would be working on iGaming/online Gambling products. As well as working on the product itself you would also work on improving the backend application architecture for performance, scalability and robustness, reducing complexity and making development easier.
Alongside Python, experience of one or more of the following would be useful: Flask, REST, APIs, OOP, TDD, databases (Datastore, MySQL, Postgres, MongoDB), Git, Microservices, Websocket, Go, Java, PHP, Javascript, GCP.
- Rate: Up to £90k
- Location: Hybrid - 2 days per week in office opposite Victoria station
- Contact: davina@makeadifference.digital (please mention this list when you get in touch)