Hiring tips from RebelAI & Fast Pandas advice
Further below are 3 jobs: Data Engineer at Airtime Rewards (Permanent, Manchester); Analytics Engineer at Yoto (Permanent, London); Software engineers at the Bennett Institute of Applied Data Science (note <- Ben Goldacre!)
Oh heck, so several months passed - oops. I became happily busy building my RebelAI leadership community and my new Fast Pandas course, and working on updates to my High Performance Python book (3rd edition due early next year), and somehow the time popped by. Below I've got a RebelAI update, plus one of the easy speed-ups I teach in the new Fast Pandas course.
The PyDataLondon 2024 schedule is live and the talks look really good - tickets are on sale and they've always sold out in the past. On the Saturday morning I'll run another of my regular leadership discussions (reply to this and I can add you to the GCal reminder).
The goal for the leadership discussions is to dig into what's not working in teams and to get advice from the crowd on how they've solved similar issues, so members can get closer to repeatable success. This format evolved into my RebelAI private leadership group (noted below). I've been running these sessions at PyData conferences for 7 or so years now - people tell me that they meet very interesting people there. If you're looking for other leaders, I'd strongly suggest you get a ticket and attend on the Saturday morning (and mail me back to get a GCal invite).
Training
I have new dates to announce for my upcoming training. The links on my training page will be updated in a few days to point at eventbrite - if you fill in my training notification form I'll happily send you a 10% discount code valid for this year. In a few months I'll be running:
- Fast Pandas - make critical sections of your existing Pandas codebase 2-30x faster by addressing common bottlenecks with powerful speed-ups (details will follow on the training page)
- Successful Data Science Projects - decrease failures and make success more likely with better project planning and execution
- Software Engineering for Data Scientists - increase your speed of delivery by modularising, running code reviews, testing for increased confidence and preparing for production from early on
If you'd like your Pandas code to run faster, your team to write more maintainable DS code and your projects to succeed more frequently - check out the above and fill in my training survey.
Rebel AI
My RebelAI leadership group for "excellent data scientists turned leaders" is growing very well, now 7 months in with 20+ leaders attending monthly Zoom calls to talk through opportunities and challenges, backed by weekly conversation in the Slack group. I've got a 3 page PDF doc which explains what we do - reply to this if you'd like a copy. I'm actively seeking new members for our next intake.
Recent deep discussions have included "communicating uncertainty" to the wider leadership team, balancing "productivity and growth" for the team in small organisations, positioning data science within the larger org (to avoid just being "another reporting unit" and to push the org towards opportunity), digging into corporate data governance and loads more. One of our asynchronous practices is the "Monday morning question", where someone asks a question and everyone shares their thoughts.
A recent such question focused on "How do you balance efficiency and objectivity (eliminating bias) in an interview process?". You might have a long, involved multi-person process designed to remove bias, or a lightweight fast process that may enable implicit bias but improves the chance that you don't lose great candidates. What to do? If you're hiring - or in the hiring process - you might find this summary of one of our discussions useful:
- Several members use 1 or 2 people to screen CVs with a structured process leading to a shortlist, then a short screening call, then a deeper call which includes either a light (emphasis - light) exercise or a walk-through of a past project - this moves quickly, involves 2+ people and follows a well-codified structure
- Some have been involved in very long processes - the danger here is that you lose candidates (especially seniors who are short of time)
- Light take-home tasks are common, designed to take 1-2 hours (I've seen up to a week's worth of work demanded for some roles in the past, which de-selects pretty much anyone with a family - and that never helps diversity in a team!)
- Typically members liked having 2 people on each step, even if perhaps the more junior person would be biased by the more senior
- Letting everyone in the hiring process say "hard no - stop now!" feels like a key step for an efficient hiring pipeline
- Having a strong job spec and making sure the recruiter understands it was noted as the best way to set the funnel up correctly, to avoid subsequent wasted time for everyone, which means planning and forethought are critical
- Some feel strongly that a candidate should talk about "we" not just "I" as a signal for team-fit (strong team members with weaker skills are likely to add more than virtuosos who can't collaborate)
- It was also noted that an interviewee who couldn't articulate "the value I brought was..." would be a bad signal
- Takehome or in-process assignments must match the "real work" as closely as possible, with data that's as real as possible
- If you ask a candidate to fix a test on the call, or to explain some code while you watch them use the IDE through a screenshare, the general feeling was that you could tell whether a candidate can actually code (rather than falling back on Stack Overflow or ChatGPT)
Colleague Ryan Varley has been writing up his tips for doing great recruitment on LinkedIn; this one on "if I made you an offer, why would you turn it down?" and this one on "of your strengths, what's not been demonstrated here?" offer nice dives into tips Ryan has picked up. He's got a set and they're well worth reading.
Fast Pandas - get large speed-ups with minor changes to your Pandas codebase
I've written a new 1 day Fast Pandas course packing in everything I know about making in-memory Pandas work faster - faster ways to do joins, concats and applies, compilation, parallelisation to use all the cores, getting faster strings, making fewer copies with the new Copy on Write mechanism, and a discussion of getting your code ready for the upcoming Pandas 3, which may require code changes (so my best advice - get ahead of that now). I'll share some tips from this course in each issue. Fill in my training notification form and I'll give you a 10% discount code for this and other training courses.
Did you know that depending on how you configure a merge between index or regular columns you can achieve a 4-5x speed-up vs the slowest combination? Annoyingly it isn't always the index/index combination that is fastest - if merging is a possible bottleneck in a repeated workflow then it's worth investigating.
In the course I look at setting up 4 different combinations of a pair of dataframes - using an index/index, index/column, column/index and column/column combination. You can easily measure the speed difference of a merge operation, and depending on the cardinality and type of the merge column you can see some very large differences. In Pandas 1.5 and 2.0 these differences could be very large; in 2.2 they're smaller but a 4x difference is still easy to observe. Imagine that that's in a repeated loop of code (e.g. a reporting process) on large dataframes - improving that could offer a huge win for very little effort.
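To give a flavour of the experiment, here's a minimal sketch of timing the four merge combinations - the dataframe sizes, column names and random data are illustrative, not from the course material, and your timings will vary by Pandas version, key cardinality and dtype:

```python
# Sketch: time a merge using index/index, index/column, column/index
# and column/column key combinations on the same pair of dataframes.
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
left = pd.DataFrame({"key": rng.integers(0, n, n), "x": rng.random(n)})
right = pd.DataFrame({"key": np.arange(n), "y": rng.random(n)})

variants = {
    "column/column": lambda: left.merge(right, on="key"),
    "index/index": lambda: left.set_index("key").merge(
        right.set_index("key"), left_index=True, right_index=True
    ),
    "column/index": lambda: left.merge(
        right.set_index("key"), left_on="key", right_index=True
    ),
    "index/column": lambda: left.set_index("key").merge(
        right, left_index=True, right_on="key"
    ),
}

for name, fn in variants.items():
    t0 = time.perf_counter()
    result = fn()
    print(f"{name:14s} {time.perf_counter() - t0:.4f}s rows={len(result)}")
```

All four variants produce the same joined rows, so once you've found the fastest combination for your data you can swap it in without changing results.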
In the course I use my ipython_memory_usage profiler tool in a Notebook to record RAM usage, overall speed and CPU utilisation, to investigate the effects of different merge choices. A few past students have confirmed that they've improved their code by checking these combinations; most have noted that they'd no idea this could even be an issue. This is just one of a large number of speed-ups that we look at in the course. If you want to experiment I'd suggest trying my ipython_memory_usage tool - it quickly gives you some nice insights.
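Another of the course topics, Copy on Write, is easy to try today. Here's a minimal sketch of the opt-in flag available in recent Pandas 2.x releases (it becomes the default behaviour in Pandas 3) - the dataframe contents are illustrative:

```python
# Sketch: opting in to Copy-on-Write, the default in Pandas 3.
# With CoW enabled, a selected column shares data with its parent
# and the data is only copied when one side is actually modified.
import pandas as pd

pd.options.mode.copy_on_write = True  # becomes the default in Pandas 3

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
col = df["b"]           # shares data with df; no copy made yet
col.iloc[0] = 99.0      # the write triggers a copy of `col` only
print(df["b"].iloc[0])  # df is untouched: still 4.0
```

Turning the flag on now is a cheap way to surface any code that silently relied on the old write-back behaviour before Pandas 3 forces the issue.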
Motoscape - the return of the Five-LOW
Many of you will remember me talking about the charity rally I drove on last year, for a week in September driving a 24 year old "surf car" from London to Venice and back. We raised £4k for Parkinson's Research. This year we're doing it again, taking the longer drive through Venice and up to Prague, in the same car with the same team. Just as one driver had a father with Parkinson's, now another of my drivers has family experience with Alzheimer's (and I've that experience via extended family), so we're raising money for them. That means I'd love to get a donation from you.
Right now we don't have much on the new JustGiving page, except for a kindly first donation - could you do the second? Last year's page has all the photos from our various adventures, including our first car (a Volvo) trying to set itself alight after we bought it, us fixing up the second car and then lots of foolishness from the drive.
The Passat is running well (well, we've sunk a bunch more time into getting it to pass its MOT) and all going well we'll get some telemetry from a Bluetooth OBD unit (a VCDS/VAG-COM Volkswagen unit) for real time sensor data. Maybe there's a further public talk in that down the line.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks, subsequent posts are charged.
Data Engineer at Airtime Rewards, Permanent, Manchester
Design and implement robust, scalable data pipelines to ingest data from internal platforms into our data warehouse. Monitor and maintain data pipelines, ensuring data quality, integrity, and availability. Optimise data pipelines to enhance performance and reduce cloud computing costs. Understand, gather, and document detailed business requirements. Take ownership of data projects from planning to delivery, collaborating with other departments as needed. Innovate and automate current processes, driving continuous improvement.
- Rate: £35,000 - £45,000
- Location: Manchester, Hybrid (2 days/week in office)
- Contact: oguzcan.koncagul@airtimerewards.com (please mention this list when you get in touch)
- Side reading: link
Analytics Engineer at Yoto, Permanent, London
We’re looking for an Analytics Engineer to join our team to accelerate the business and help us make sense of the terabytes of data we receive every day.
We’re a small team at the heart of all the decisions Yoto makes. We work in a mature, high-trust environment with a lot of independence. Everyone can contribute ideas and be part of the decision making process. We tackle a broad range of problems, from developing cutting-edge data products to building and maintaining our data orchestration platform. Our work spans across all the key strategic projects throughout the company.
- Rate: £30,000 - £40,000 based on experience.
- Location: Kings Cross, London (Hybrid)
- Contact: jeena.lakshmanan@yotoplay.com (please mention this list when you get in touch)
- Side reading: link
Software engineers at the Bennett Institute of Applied Data Science
We're looking for software developers, at all stages of their careers, to help build, maintain, and operate OpenSAFELY -- a revolutionary open source platform for secure clinical research. We're also looking for a team lead, a project manager, and a research software advocate (think "developer evangelist" for research).
Led by Ben Goldacre (clinician, researcher, and author of Bad Science and Bad Pharma), we’re a truly interdisciplinary team with a strong track record of delivering useful tools in a globally leading research setting. You’ll have the chance to use your software skills to save lives and further the state of medical data research. Our software delivery teams are collaborative, supportive, thoughtful and kind, and we support hybrid or fully remote working, with in person team events throughout the year.