Leadership lessons from my RebelAI leadership community
Further below are 7 jobs including: AI and Cloud Engineers at the Incubator for Artificial Intelligence (UK wide); Data Science Intern at Coefficient; Data Scientist at Coefficient Systems Ltd; Data Science Educator - FourthRev with the University of Cambridge; Data Scientist Insights (Customer Experience & Product) at Catawiki (permanent, Amsterdam); Senior Machine Learning Engineer - NLP/DL; Senior Data Engineer at Hackney Council
In past issues I've talked about my new data science peer-support leadership community called RebelAI. I'm very pleased to say that it has gone rather well and we've unblocked a set of challenges that larger organisations pose for the leaders in the group. I'll give a summary of some of the discussions below.
I'm also happy to say that, based on recent research, I've got a new training class on "Fast Pandas" that I plan to run in February, focused on all the ways I've learned to make Pandas run quickly (there are many things to learn). I'm also running another of my Successful Data Science Projects classes, which in part will be updated based on what I've learned in the RebelAI community (details for these below).
Recently Giles Weaver and I gave an updated Pandas 2 vs Polars vs Dask talk at PyData Global 2023. The slides are up and I believe the videos will be freely available on YouTube in the next month. I'll talk about this in the next issue - in short, Dask closed some of the gap to Polars, and Pandas' Copy on Write also makes things faster. This was a nice update on our talk from PyData London 6 months back.
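If you want to experiment with Copy on Write yourself, it can be switched on globally in Pandas 2 - a minimal sketch:

import pandas as pd

# enable Copy on Write globally (it becomes the default behaviour in Pandas 3)
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": range(1_000_000)})
subset = df[["a"]]   # no defensive copy is taken yet
subset["a"] = 0      # the copy only happens on this write, leaving df untouched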
Looking forward - Micha and I are signing a contract for the 3rd edition of our High Performance Python book (for 2025). That'll include Dask, Polars, probably some Rust and most chapters will get an update. This work will be reflected here in the newsletter and in my courses.
Through my work building RebelAI and doing strategic consulting with clients you'll also get more data science leadership thoughts here. If your team might benefit from some support, do drop me a line.
Further below I also share some notes on the following packages: ipython_memory_usage, skimpy, pytorch and line_profiler.
Successful Data Science Projects and NEW Fast Pandas course coming in February
If you're interested in being notified of any of my upcoming courses (plus receiving a 10% discount code) please fill in my course survey and I'll email you back.
In February I'm running another of my Successful Data Science Projects courses virtually, aimed at anyone who sets up their own projects, has had failures and wants to turn these into successes. We deconstruct failures, talk about best practice and set you up with new techniques for success.
- Successful Data Science Projects (22nd-23rd February - a few early bird tickets remain)
- Fast Pandas (date TBC in February) - lots of ways to make your Pandas run faster - reply to this for details
- Higher Performance Python (date TBC in March) - profile to find bottlenecks, compile, make Pandas faster and scale with Dask
- Software Engineering for Data Scientists (date TBC in March)
- Scientific Python Profiling (in-house only) - focused on quants who want to profile to find opportunities for speed-ups
RebelAI - successful first months
Earlier this year I talked about setting up a peer leadership community called RebelAI for excellent data scientists turned leaders.
My observation during my strategic work with companies is that whilst I think my opinion is very useful (and indeed it has led to some $1M outcomes for a couple of clients), I couldn't help but think that a trusted range of opinions might be even more valuable.
I started RebelAI in November and we've so far unblocked problems faced by 4 of our 10 members during video-chat sessions, shared a lot of mutually-useful leadership experience and explored a set of topics asynchronously via Slack.
For one member we explored how to build a solid backlog of projects in an organisation that was relatively new to data science. Processes that had worked for other members in different organisations included "learning the business units inside out" (to avoid vanity projects and focus on value) whilst aiming to "save money, save time or save effort" as ways to evaluate potential projects.
For public companies you can go as far as reading the public annual report to identify units in the company that are "under pressure/in remediation", as they're likely to be open to new opportunities. Finally, running internal conferences to show off specific PoCs to chosen teams had high value when coupled with a sales focus on "here's your current state, now here's the future happier state via this PoC if you make the right investment".
This particular discussion (I call them a "crit", a constructive critique of the challenge you have, what you've tried and what you need) was very rich as all members have faced it in varying degrees, so many ideas were shared.
I also ask "Monday questions" through Slack, which everyone has a week to respond to. One on "what's the most constructive change you've made that has been clearly valuable for data science" yielded answers that included: hiring a team lead with a strong commercial intent; using a scoping phase to identify the goals, potential ROI, risks and complexity of projects before coding; having collaborative on-site days that mix DS and business units; using a hackathon to raise morale; and then getting the right seniors together across the business to figure out how collaboration will occur.
The "crit" synchronous discussion sessions are high-bandwidth on specific questions and they yield several strong opinions on process and direction per question, giving the presenter a clear set of ideas to prioritise back at work. The "Monday questions" run slower, giving more time to tease out answers to a range of topical points. Both have yielded answers for everyone in the group.
If you think you could benefit from joining this peer-supported leadership group to help you through the challenges you face, reply to this newsletter. I'm looking to add a small second group to my original 10 from February.
Open Source
Here are 4 short updates on useful open source projects.
ipython_memory_usage
This is one of mine, ipython_memory_usage records RAM and CPU usage per Jupyter Notebook cell (and for every entry at the IPython shell, which is where I'd originally developed it). After the cell you get a short report telling you how slow and expensive it was.
This is brilliant for benchmarking and testing ideas on larger datasets and it runs automatically. I'd love to hear if you've given it a go. You'll see references to this in the benchmarking talks I've done recently, e.g. on Pandas 2 vs Polars vs Dask - here comparing Copy on Write on/off in Pandas 2.
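Enabling it looks roughly like the following - a sketch from memory, so do check the project's README if the API has moved on:

# in a Jupyter Notebook or IPython session, after pip install ipython_memory_usage
import ipython_memory_usage.ipython_memory_usage as imu

imu.start_watching_memory()   # report RAM and CPU usage after every subsequent cell

# ...run your expensive cells here...

imu.stop_watching_memory()    # switch the per-cell reports off again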
You get an output like
# your expensive Notebook cell...
In [10] used 763.1 MiB RAM in 3.25s (system mean cpu 10%, single max cpu 100%),
peaked 0.0 MiB above final usage,
current RAM usage now 898.9 MiB
From this you can learn that the cell took 3.25 seconds and used nearly 800MB of RAM; one CPU ran at 100% while the mean CPU usage was 10% (so only one core was busy, not all of them); at the end of execution this process was using almost 900MB (up from around 100MB before the cell ran); and 0MB of additional RAM was temporarily used during execution.
I teach this in my Scientific Python Profiling and Higher Performance Python classes.
line_profiler now on Python 3.12
Some of you know that I'm a huge fan of line_profiler for numeric profiling - notably for finding out where Pandas, numpy or sklearn are slow. Amongst other tools I teach it in my Higher Performance Python and Scientific Python Profiling courses. Until recently it didn't work on Python 3.12, which was a bit of a pain.
Thankfully support got added quickly once the issue was noted. I've seen a set of tools not supporting Python 3.12 (or 3.11 even) recently, where support is quickly added once someone files an issue.
If you've never used it - in a Jupyter Notebook or at the command line you can figure out which lines (not just methods!) are slow, so you can narrow down your search for your bottlenecks.
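A minimal Notebook sketch looks something like this (the feature_engineering function and columns are invented for illustration - substitute your own slow code):

%load_ext line_profiler

import numpy as np
import pandas as pd

def feature_engineering(df):
    # two candidate bottleneck lines - line_profiler reports the cost of each
    df = df.copy()
    df["ratio"] = df["a"] / df["b"]
    df["log_a"] = np.log1p(df["a"])
    return df

df = pd.DataFrame({"a": np.random.rand(1_000_000),
                   "b": np.random.rand(1_000_000) + 1})

# -f names the function to instrument; the trailing expression is what gets executed
%lprun -f feature_engineering feature_engineering(df)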
Pandera will have Polars schema support
Those of you who have been on my Software Engineering for Data Scientists training will know that I'm a firm believer in adding run-time schema checks to Pandas DataFrames using Pandera.
Amongst other things I always teach adding both a single unit test with PyTest and a schema check with Pandera early on in a new research project. If you've got one of each, it is easy to extend them and add more checks, and they'll save you time in the long run. If you "save time" by avoiding adding them, it'll cost you in the long run.
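As a rough sketch of that starting point (the orders schema, column names and checks below are invented for illustration):

import pandas as pd
import pandera as pa

# hypothetical schema for an "orders" table - swap in your own columns and checks
orders_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, unique=True),
    "price": pa.Column(float, pa.Check.ge(0)),
    "country": pa.Column(str, pa.Check.isin(["GB", "NL", "DE"])),
})

def load_orders(path):
    df = pd.read_csv(path)
    return orders_schema.validate(df)   # raises a SchemaError at run time on bad data

# one PyTest unit test to pair with the schema check
def test_orders_schema_accepts_good_rows():
    good = pd.DataFrame({"order_id": [1, 2],
                         "price": [9.99, 0.0],
                         "country": ["GB", "NL"]})
    orders_schema.validate(good)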
One of the challenges for any non-Pandas equivalent (like Polars) is that Pandas is a foundation layer for other projects - like Pandera. Backfitting another DataFrame library is harder work. I'm glad to see that Polars support in Pandera is being worked on, even though it is at an early stage. You can always ask Polars to generate a temporary Pandas DataFrame to run the schema check, but by integrating Polars directly we'll be able to check for Polars-specific datatypes with no duplication of data.
skimpy and dataprep
Skimpy is a lightweight dataframe describer for Pandas and (recently added) Polars. It is better than describe and much simpler than a tool like ydata-profiling. You call skim(df) and you'll get a rich-text output describing your columns.
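A minimal sketch (the toy DataFrame is just for illustration):

import pandas as pd
from skimpy import skim

df = pd.DataFrame({
    "age": [25, 31, 47, None],
    "city": ["London", "Leeds", "London", "York"],
})
skim(df)   # prints a rich-text summary of each column: types, missing values, distributions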
It also uses the dataprep package, which is new to me; it tries to simplify both the EDA exploration process and cleaning the data. It looks a bit US-centric (though with a handful of international clean-ups), but maybe it shortcuts some of the clean-up process you currently handle manually in your org?
PyTorch now on Python 3.11 (not 3.12)
I was playing with PyTorch and GPUs recently and was surprised to see that not only was Python 3.12 not yet supported, but Python 3.11 had only gained support 3 weeks earlier.
I suspect the dependency chain for a project as complex as PyTorch, coupled with the internal speed-ups that arrived in Python 3.11+, makes these deeply integrated tools harder to update.
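If you're juggling interpreter and wheel versions, a quick sanity check might look like this minimal sketch:

import sys
import torch

# confirm which interpreter this torch wheel is running under and whether it sees a GPU
print(sys.version_info)
print(torch.__version__)
print(torch.cuda.is_available())   # False if the CUDA driver/runtime isn't matched up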
Light Therapy in these darker months
If you've ever noticed that your motivation to work declines in the winter months you might want to read up on light therapy. Personally I'm not "SAD affected" - or so I had thought - but I know others who can't work in the winter unless they have a SAD lamp and a supplement like Vitamin D.
I had however noted that my office felt a bit too dingy and I preferred to work in a different, lighter room. As a consequence of these two excellent articles I spent £60 on a corn-cob 100W LED light (1000W filament equivalent) - that's 1.5 metres from my head right now (burning away like a mini sun).
It actually does make a difference and I'm considering buying a second. If your mood is at all less-positive in these darker months, you may want to read up on how changing your lighting might help you.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you're growing your team then reply to this and we can add a relevant job here. This list has 1,600+ subscribers. Your first job listing is free and it'll go to all 1,600 subscribers 3 times over 6 weeks; subsequent posts are charged.
AI and Cloud Engineers at the Incubator for Artificial Intelligence, UK wide
The Government is establishing an elite team of highly empowered technical experts at the heart of government. Their mission is to help departments harness the potential of AI to improve lives and the delivery of public services.
We're looking for AI, cloud and data engineers to help build the tools and infrastructure for AI across the public sector.
- Rate: £64,700 - £149,990
- Location: Bristol, Glasgow, London, Manchester, York
- Contact: ai@no10.gov.uk (please mention this list when you get in touch)
- Side reading: link
Data Science Intern at Coefficient
We are looking for a Data Science Intern to join the Coefficient team full-time for 3 months. A permanent role at Coefficient may be offered depending on performance. You'll be working on projects with multiple clients across different industries, including the UK public sector, financial services, healthcare, app startups and beyond. You can expect hands-on experience delivering data science & engineering projects for our clients as well as working on our own products. You can also expect plenty of mentoring and guidance along the way: we aim to be best-in-class at what we do, and we want to work with people who share that same attitude.
We'd love to hear from you if you: Are comfortable using Python and SQL for data analysis, data science, and/or machine learning. Have used any libraries in the Python Open Data Science Stack (e.g. pandas, NumPy, matplotlib, Seaborn, scikit-learn). Enjoy sharing your knowledge, experience, and passion. Have great communication skills. You will be expected to write and contribute towards presentation slide decks to showcase our work during sprint reviews and client project demos.
- Rate: £28,000
- Location: London, Hybrid
- Contact: jobs@coefficient.ai (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist at Coefficient Systems Ltd
We are looking for a Data Scientist to join the Coefficient team full-time. You can expect hands-on experience delivering data science & engineering projects for our clients across multiple industries, from financial services to healthcare to app startups and beyond. This is no ordinary Data Scientist role. You will also be delivering Python workshops, mentoring junior developers and taking a lead on some of our own product ideas. We aim to be best in class at what we do, and we want to work with people who share the same attitude.
You may be a fit for this role if you: Have at least 1-2 years of experience as a Data Analyst or Data Scientist, using tools such as Python for data analysis, data science, and/or machine learning. Have used any libraries in the Python Open Data Science Stack (e.g. pandas, NumPy, matplotlib, Seaborn, scikit-learn). Can suggest how to solve someone’s problem using good analytical skills e.g. SQL. Have previous consulting experience. Have experience with teaching and great communication skills. Enjoy sharing your knowledge, experience, and passion with others.
- Rate: £40,000-£45,000 depending on experience
- Location: London, Hybrid
- Contact: jobs@coefficient.ai (please mention this list when you get in touch)
- Side reading: link, link, link
Data Science Educator - FourthRev with the University of Cambridge
As a Data Science Educator / Subject Matter Expert at FourthRev, you will leverage your expertise to shape a transformative online learning experience. You will work at the forefront of curriculum development, ensuring that every learner is equipped with industry-relevant skills, setting them on a path to success in the digital economy. You’ll collaborate in the creation of content from written materials and storyboards to real-world case studies and screen-captured tutorials.
If you have expertise in subjects like time series analysis, NLP, machine learning concepts, linear/polynomial/logistic regression, decision trees, random forest, ensemble methods: bagging and boosting, XGBoost, neural networks, deep learning (Tensorflow) and model tuning - and a passion for teaching the next generation of business-focused Data Scientists, we would love to hear from you.
- Rate: Approx. £200 per day
- Location: Remote
- Contact: Apply here (https://jobs.workable.com/view/1VhytY2jfjB3SQeB75SUu1/remote-subject-matter-expert---data-science-(6-month-contract)-in-london-at-fourthrev) or contact d.avery@fourthrev.com to discuss (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist Insights (Customer Experience & Product) at Catawiki, Permanent, Amsterdam
We are looking for a Data Scientist to work with our Customer Experience (CX) and Product teams to enable them to make operational and strategic decisions backed by data. You will help them not only to define and measure their success metrics but also to provide insights and improvement opportunities.
You will be part of the Data Science Insights team, helping us make sense of our data, finding actionable insights and creating self-service opportunities by combining data from multiple sources and working closely together with your colleagues in other departments.
- Rate: -
- Location: Amsterdam, The Netherlands
- Contact: Iria Kidricki i.kidricki@catawiki.nl (please mention this list when you get in touch)
- Side reading: link, link
Senior machine learning engineer - NLP/DL
You will be part of the ML team at Mavenoid, shaping the next product features to help people around the world get better support for their hardware devices. The core of your work will be to understand users’ questions and problems to fill the semantic gap.
The incoming data consists mostly of textual conversations, search queries (more than 600K conversations or 2M search queries per month), and documents. You will help to process this data and assess new NLP models to build and improve the set of ML features in the product.
- Rate: 90K+ euros
- Location: remote - EU
- Contact: gdupont@mavenoid.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Engineer at Hackney Council
The Data & Insight team's vision is to help Hackney maximise the value of its data so that the council can make better decisions, do more with less, and proactively support our residents. We're a collaborative, curious and friendly team made up of data analysts, data engineers and cloud engineers, supported by a product manager and delivery manager. We're looking for a Senior Data Engineer to help us continue developing and expanding the adoption of one of local government's most advanced data analytics platforms.
As the most senior data engineer in the organisation, we'll look to you to help us: design and implement a data warehouse layer; select and implement appropriate technologies to deliver efficient data pipelines; set our code standards; productionise ML models; and mentor others in the team. Our ideal candidate wants to deliver data products that are a force for good; has a track record of delivering efficient and scalable data pipelines in AWS; is skilled in using Python, SQL and Apache Spark; and enjoys learning new things as well as supporting others to learn.