Faster development with Polars and a bit of ChatGPT, plus some leadership tips
Develop faster in Polars via better error messages and a couple of leadership resources
Further below are 5 jobs including: Senior machine learning engineer - NLP/DL, Senior Data Engineer at Hackney Council, ML Researcher at Velo (stealth startup), Senior Data Engineer at Hertility Ltd, Permanent, Remote, Software Engineer at Qualis Flow Ltd
I've started work on a Polars course and I'd love to hear if and why you'd like to learn Polars. I'm going to document some of my findings, starting below with a to_datetime observation plus a Pandas helper. I got to discuss some of this during a recent Software Engineering for Data Scientists course; I share a few insights below, notably on getting ChatGPT to write unit tests (not my recommendation, but it certainly can get you started).
I've also kicked off my RebelAI leadership group with an initial group of excellent data science leaders. I've posted a couple of resources that came from our kick-off chat below. If you're in a leadership role and joining a peer-support leadership group might help you, just reply to this and we'll have a chat.
Improve your Software Engineering for Higher Performance Python in November
I've listed dates for the following public courses, they'll each run as virtual courses via Zoom in the UK mornings. You can also add your email for future date notifications. Early bird tickets are available for each:
- Software Engineering for Data Scientists (November 20th, 21st, 22nd)
- Higher Performance Python (November 27th, 28th, 29th)
The Software Engineering course is for people who don't write tests and who don't yet benefit from collaborating during development. We start by identifying "what's wrong" with a badly written Notebook and move through refactoring, adding unit tests and assumption-tests, structuring code for maintenance and getting to a decent project structure that supports a setup.py for installation. This is great for anyone who feels that they don't understand topics like unit testing, Python's sys.path, modules and folder structures. It'll also help you gain seniority faster.
The Higher Performance Python class focuses on figuring out what's slow and then making it faster. We start with profiling (critical if you're to work on the actual slow part of your code!), then move through making numeric and Pandas code faster, scaling up with Dask and looking at the new Polars to start to understand why it is a compelling alternative to Pandas. If you're stuck with slow code and expensive bills then profiling and trying the right speed-up will make your team more productive.
If you've got queries - just reply to this newsletter, I'm happy to chat about the content.
Do you want to learn Polars?
I'm starting work on a Polars course and I need your input. If you reply and send me an email with some feedback, I'll give you a solid discount for this course which can also apply to my other courses (above, even for this November).
Here's what I want to know - why do you want to learn Polars? Is it because Pandas is slow? (Polars is generally faster at execution than Pandas - sometimes 10x or more.) Pandas can certainly be confusing and that hurts development speed (Polars can get you focused far quicker). Polars has more useful error messages (see an example below for date/time conversion). Would you want advice on converting Pandas code over to Polars? How much testing would you want to put around using Polars in production? Do you have medium-data needs (bigger-than-RAM) which Pandas can't cope with, but Polars could (it does a really nice job here)?
Just reply to this and let me know why you're thinking of moving to Polars, I'll keep you informed about dates for the course and I'll give you a discount for this and my other courses.
RebelAI peer leadership group up and running
My plan for a data science leadership forum has been in preparation for a while and I'm very pleased to say we kicked off a couple of days back.
Topics requested for internal discussion include the following and I'll share insights here when relevant:
- Increasing DS project success rates - finding the projects that'll deliver value
- Increasing the sense of urgency to get projects deployed
- Figuring out why a project is valuable to the organisation
- Showing value in data governance
- Managing teams and effectively collaborating with other business units
- Increasing efficiency in distributed teams
- Staying sane in the sea of hype
Talk of "how to improve leadership ability" led to a discussion of leadership courses. It seems there really aren't that many for data science leadership, but Meri Williams' Be a Brilliant People Developer was recommended as an open course, albeit with no current dates. Do you have a recommended course on leadership? If so, please reply and let me know, I'd love to be able to share it here.
On the subject of staying sane in the LLM hype storm we talked a bit about "how the heck do you keep up?!". One recommendation was AI Explained; it looks like it started 8 months back and now there are 50+ shortish videos on the current state of the art.
ChatGPT and Software Engineering
During a recent private Software Engineering for Data Scientists course we got to talk about using LLMs to generate code (see the next public course here for end of November).
Before this we got to talk about the need for code reviews - we talked about tools that help, but had to agree that if a team isn't yet doing code reviews then adopting this practice is a great first step. Along with this, a standard folder structure, some unit tests and a library of shared code enable a group of individuals to collaborate and move faster. There is more on this in my upcoming Software Engineering class if useful to you.
Earlier in the year I asked ChatGPT to write a function get_domain
that'd complete this unit test:
EMAIL1 = ("jane@google.com", "google.com")
EMAIL2 = ("geoff@bbc.co.uk ", "bbc.co.uk")
EMAIL3 = ("'julie@bbc.co.uk'", "bbc.co.uk")
EMAIL4 = (" 'julie@BBC.CO.UK'", "bbc.co.uk")
EMAILS = [EMAIL1, EMAIL2, EMAIL3, EMAIL4]
@pytest.mark.parametrize("email, domain", EMAILS)
def test_str(email, domain):
    assert get_domain(email) == domain
Back then the solution written was pretty close to what the students wrote and ChatGPT proclaimed that it was complete, but it passed only 3 of the 4 examples in practice (I then told it this and it fixed the mistake). In this week's course I asked the same and I got a different and more-correct answer:
import re

def get_domain(email):
    # Use a regular expression to find the domain in the email
    match = re.search(r'@([\w.-]+)', email)
    # If a match is found, return the domain part, otherwise return None
    if match:
        return match.group(1).strip().lower()
    else:
        return None
On the one hand it does work and it passes the test cases. It is more involved than I'd like: the if match: check is unnecessary in this context (the regular expression will always match these examples) and the character class includes a - (hyphen) which isn't needed for these test cases. I asked ChatGPT and it responded that it wrote the code this way to handle domains with hyphens in the domain name (even though we don't have any).
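For comparison, a plain string-methods version passes the same four cases with no regular expression at all - this is my own sketch, not code from the course:

```python
def get_domain(email):
    # strip outer whitespace and stray quotes, take the part
    # after the '@' and normalise the case
    return email.strip().strip("'").split('@')[1].lower()
```

It is terser, though arguably the regex version copes better with inputs that lack an '@' entirely - neither version handles that case gracefully.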
I'm dead impressed that the code works (and presumably there's a pile of training data for this kind of coding-homework example), but it still feels like the result of a machine that's regurgitating canned examples (which of course it is) rather than something that actually understands the problem it is solving. I'd expect a junior dev to produce this if they'd copied from StackOverflow, rather than if they'd built up the solution in a test-driven-development approach, completing one example at a time. Impressive, but still something I need to manually review.
Inside RebelAI and elsewhere I've asked about how LLMs will impact the tech part of the organisation and the main issue raised (correctly IMHO!) is that context is always missing, regardless of the size of the input context window. Your domain knowledge is critical and is likely to always be missed - that's technical knowledge, client relationship knowledge, politics, the shape and meaning of the data. All these things can't be seen by an LLM. Being integral to the organisation is probably going to grow as a critical skill for all of us, and rightly so.
What's the most impressive thing you've seen an LLM do with code generation?
Polars for digging into data quicker than Pandas
Recently I've been working with the prediction site Metaculus to set up a time-series prediction task for UK solar installations. This prediction site plays for points (not money), so you do it for the intellectual challenge. This particular challenge is to predict the growth in solar installations in a year, when we only know a limited amount about past trends, with laggy reporting and an uncertain set of political and self-driven needs behind solar installations.
Why would you care about such things? Aside from intellectual curiosity - typically humans are pretty bad at estimating things, especially things outside of our area of expertise (and we don't become well-calibrated without practice). This site is useful as it helps me improve my estimation of timescales, change and risk, and this pays off on client strategic work. You might find the challenges help you improve your estimation ability (the on-boarding tutorial is pretty good for this).
Predicting on this challenge involves taking some UK Government XLS spreadsheet data and parsing it. I've decided to do it all in Polars, starting with to_datetime. Polars parses date formats using Rust's chrono library, which is similar to the Pandas equivalent but with some subtle differences.
With a parsed time-series of MW (mega watts) of installed capacity, I can use a monte-carlo approach to estimate possible future scenarios and this is one way to predict the future with a necessary error-bound.
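To sketch the monte-carlo idea (with made-up capacity figures - the numbers below are my illustration, not the real UK data): resample the historical month-on-month growth with replacement, sum each simulated path, and read off percentiles as the error bound:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical monthly installed capacity in MW (illustrative values only)
capacity_mw = np.array([14000.0, 14120.0, 14260.0, 14380.0, 14530.0, 14660.0])
monthly_deltas = np.diff(capacity_mw)

# simulate 12 months ahead by resampling historical deltas with replacement
n_sims, horizon = 10_000, 12
sampled = rng.choice(monthly_deltas, size=(n_sims, horizon), replace=True)
final_capacity = capacity_mw[-1] + sampled.sum(axis=1)

# percentiles of the simulated outcomes give a forecast with an error bound
low, mid, high = np.percentile(final_capacity, [5, 50, 95])
print(f"12-month forecast: {mid:.0f} MW (90% interval {low:.0f}-{high:.0f})")
```

Bootstrapping the deltas like this assumes next year's month-on-month growth looks like past growth - a strong assumption, but a useful baseline before layering on policy or seasonality effects.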
Having identified the relevant date row (which I transpose
to a Series) I see entries that look like Apr 2023 May 2023 Jun 2023 Jul 2023 Aug 2023
but which actually contain carriage returns, spurious whitespace and are obviously hand-typed.
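A quick way to expose those hidden characters (shown here on hypothetical values of my own, not the real spreadsheet data) is to print the reprs rather than the strings themselves:

```python
raw = ["Apr\n2011", "Aug  2012", " Nov 2013"]  # made-up examples
for value in raw:
    # repr makes the '\n' and double spaces visible
    print(repr(value))
```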
Working up a first date-time converter I tried:
In[]: ser.str.to_datetime('%b %Y')
Out[]: # stripping lots of stack trace
ComputeError: strict datetime[μs] parsing failed for 165 value(s) (165 unique): ["Apr
2011", "Aug
2012", … "Sep
2010"]
Ouch - so 165 (of 165) rows are in error, due to apparent carriage returns. Further investigation shows some double-spaces too, so let's try:
In[]: ser.str.replace_all('\n', ' ').str.replace_all('\W+', ' ') \
.str.to_datetime('%b %Y')
Out[]: # long stack trace removed
ComputeError: strict datetime[μs] parsing failed for 2 value(s) (2 unique): [" Nov 2013", "June 2022"]
You might want to try:
- setting `strict=False`
- setting `exact=False` (note: this is much slower!)
- checking whether the format provided ('%b %Y') is correct
I rather like the short error report on [" Nov 2013", "June 2022"]
and I was quickly able to extend my processing chain to fix this:
In[]: ser.str.replace_all('\n', ' ').str.replace_all('\W+', ' ') \
.str.strip_chars().str.replace('June', 'Jun') \
.str.to_datetime('%b %Y')
This gave me the right result and it took all of 5 minutes of iteration, having not used the function before.
I returned to Pandas to try the same and reminded myself how painful it is there - you get no error report, so you have to generate your own mask to filter out the errors as a piece of debug code. I also realised that the whitespace had to be carefully handled in Polars, while Pandas seems far more generous about it (more than I'd expected).
Just in case it is useful, here is the equivalent Pandas helper function if you need an error report: pass in your string Series and a Pandas to_datetime format and it'll print examples of the first set of errors:
def to_datetime_helper(ser, format="%b %Y", trim_at=10):
    """Show conversion errors (as NaT) from original strings during `to_datetime` conversion

    A `format` of `%b %Y` corresponds to e.g. 'Jan 2023'
    `to_datetime` seems to skip whitespace"""
    ser_nat = pd.to_datetime(ser, errors="coerce", format=format)
    # ser_nat can have NaT if error occurred
    mask = ser_nat.isna()
    print(f"{mask.sum()} errors seen in conversion")
    # show the errors, trim if there are too many to show
    mask_cum = mask.cumsum()
    if mask.sum() > trim_at:
        print(f"{mask.sum()} is too many errors, trimming to {trim_at}")
        mask[mask_cum > trim_at] = False  # get the items up until the trim point
    for idx, value in ser[mask].items():
        print(f"Row {idx} '{value}'")
    return ser_nat
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks, subsequent posts are charged.
Senior machine learning engineer - NLP/DL
You will be part of the ML team at Mavenoid, shaping the next product features to help people around the world get better support for their hardware devices. The core of your work will be to understand users’ questions and problems to fill the semantic gap.
The incoming data consists mostly of textual conversations, search queries (more than 600K conversations and 2M search queries per month), and documents. You will help to process this data and assess new NLP models to build and improve the set of ML features in the product.
- Rate: 90K+ euros
- Location: remote - EU
- Contact: gdupont@mavenoid.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Engineer at Hackney Council
The Data & Insight team's vision is to help Hackney maximise the value of its data so that the council can make better decisions, do more with less, and proactively support our residents. We're a collaborative, curious and friendly team made up of data analysts, data engineers and cloud engineers, supported by a product manager and delivery manager. We're looking for a Senior Data Engineer to help us continue developing and expanding the adoption of one of local government's most advanced data analytics platforms.
As the most senior data engineer in the organisation, we'll look to you to help us: design and implement a data warehouse layer; select and implement appropriate technologies to deliver efficient data pipelines; set our code standards; productionise ML models; and mentor others in the team. Our ideal candidate wants to deliver data products that are a force for good; has a track record of delivering efficient and scalable data pipelines in AWS; is skilled in using Python, SQL and Apache Spark; and enjoys learning new things as well as supporting others to learn.
- Rate: £61,163 to £63,206 (includes £6,338 salary supplement)
- Location: London
- Contact: lisa.stidle@hackney.gov.uk (please mention this list when you get in touch)
- Side reading: link, link, link
ML Researcher at Velo (stealth startup)
Velo is a (seed-funded) stealth start up working on code generation using LLMs. We're looking to hire an ML researcher to help build out and optimise our system. Right now we're just using Pytorch (but open to suggestions!).
We're looking for someone with the following skills: - Experience in a research setting: - Discussing potential experiments and deciding which to run - Making any (software) changes required to run them - Analysing the results and communicating outcomes, taking successful findings forward (either into system improvements or further experiments) - Proficient with Python (and able to follow our existing way of doing things) - Knowledge of LLMs - Comfortable with remote work
- Rate:
- Location: Remote
- Contact: rob.hudson@gmail.com (please mention this list when you get in touch)
Senior Data Engineer at Hertility Ltd, Permanent, Remote
Hertility is a women’s health company built by women, for women. We’re shaping the future of reproductive healthcare by pioneering unique diagnostic testing that provides data-driven and advanced insights into reproductive health, fertility decline and the onset of menopause. We provide expert advice, education and access to care - all from the comfort of your home.
We’re looking for a Senior Data Engineer to help us build the world’s first data platform for women to manage their hormonal health. This is an exciting opportunity to work with a variety of data from multiple sources. You will be building out scalable data solutions for clinical services that are changing the lives of women everywhere.
Key responsibilities will include building out data infrastructure on AWS, developing ETL code for data cleaning/linking and collaborating with data scientists/machine learning experts to design cutting-edge AI tools that will revolutionise healthcare. We’re looking for someone with 5+ years of experience, a degree in Computer Science, IT, or a similar field from a top university and a ‘can-do’ product mindset.
- Rate: £80k - £90k
- Location: Remote
- Contact: matt@hertilityhealth.com (please mention this list when you get in touch)
- Side reading: link
Software Engineer at Qualis Flow Ltd
We’re looking for someone who will be responsible for designing and developing the software that powers our products. You’ll need to collaborate with other teams, write high-quality code and ensure the codebase follows best practices. You are curious and enthusiastic with a drive to constantly learn and acquire new knowledge.
You’ll be working in our Engineering team, working closely with Product and other technical teams and reporting to the team lead.
- Design, develop, and maintain the core engine that powers our products, ensuring scalability, performance, and reliability
- Write high-quality, maintainable code that is well-documented and tested (we are fans of TDD)
- Extensive collaboration with other engineers, including pair-programming and mob programming
- Ensure the codebase follows best practices for software development, such as using appropriate design patterns, writing clean and modular code, and ensuring the codebase is easy to understand and maintain
- Continuously improve our development processes and technologies to ensure high-quality software delivery
- Participate in code reviews and provide feedback to other engineers on their code
- Work closely with the Product team to translate product requirements into technical specifications
- Collaborate with other internal teams to develop software that meets the needs of the business and our customers
- Contribute to the technical direction of the team and provide ideas and input on architectural decisions
- Provide technical guidance and mentorship to more junior members of the team
- Always have an eye on the big picture to avoid getting lost in the weeds
- Rate: £60,000 – £75,000
- Location: Remote (Access to London office)
- Contact: sam.joseph@qualisflow.com (please mention this list when you get in touch)
- Side reading: link