Further leadership notes from PyDataLondon 2024, a new commercial scikit-learn support model
Further below are 5 jobs including:
- Data Scientist (FTC - 12 Months)
- Senior Data Scientist
- Principal Data Scientist - MOPAC
- Data Engineer at Airtime Rewards, Permanent, Manchester
- Analytics Engineer at Yoto, Permanent, London
Below I have the 2nd part of the write-up from the Leadership discussion at the PyDataLondon 2024 conference, which builds on part 1 in the last issue. I also have a note from Gaël Varoquaux of scikit-learn on a new funding model they're putting together in Probabl.
There's been a recent paper on a certain definition of "bullshit" that could be used for LLMs in place of "hallucination". This feels like a change in the vocabulary used to talk about the current wave of AI, I quote a bit below. Finally I end with a sample of recent library updates and a note on the NumPy 2.0 release which has caused a wobble for some popular projects.
I work with clients to help make data science teams more effective, often by bringing clarity to their process. If you've got too much on your plate, lots of opportunity which is hard to prioritise and a challenge describing the impact you're having - I'm happy to have a chat and perhaps share some advice. Hit me up by replying to this email.
Training
I have new dates to announce for my upcoming training in July and September. The links on my training page show the July and September dates - if you fill in my training notification form I'll happily send you a 10% discount code valid for this year. In a few months I'll be running:
- Software Engineering for Data Scientists July 8-10 - increase your speed of delivery by modularising, running code reviews, testing for increased confidence and preparing for production from early on
- Fast Pandas July 18-19 - make your existing Pandas codebase 2-30x faster per bottleneck by addressing common issues with powerful speed-ups
- Successful Data Science Projects September 26-27 - decrease failures and make success more likely with better project planning and execution
If you'd like your Pandas code to run faster, your team to write more maintainable DS code and your projects to succeed more frequently - check out the above and fill in my training survey.
Data Science Leadership at PyDataLondon 2024 (part 2 of 2)
In the last issue I wrote about the Leadership discussion that 45 of us had at the PyDataLondon 2024 conference. There I covered the first question we discussed - "Leadership vs demotivation" in the team.
Here I'll talk a bit about the subsequent questions "Specialisation vs collaboration" and "Enabling change".
Specialisation vs collaboration
How do we encourage the team to think beyond specialisation of their skills and instead to having a broader impact within the organisation?
- Think on circles of impact - things you're supposed to do around you, then your team, the department and the company - how can you multiply your impact?
- Getting your team to step outside of the DS day to day and go collaborate in the business would be a great start to open their eyes
- Going further afield and listening to anyone your team could interact with would be an excellent further step
- One of the larger organisations uses a structure that keeps an eye each on career growth, business impact and skill-driven growth - sometimes people have to focus on business impact, other times they can focus on skill-driven ("CV driven") development - this is acknowledged and balanced out
- This avoids people needing to "sneak tech in" to try to enable skills growth when it might not be relevant
- Juniors need to focus on career development and maybe they are motivated personally to gain skills - you have to help them manage their expectations on business need vs personal need
- In another large org they're experimenting with 3-person self-directed teams who self-organise (under a "better to ask forgiveness than permission" experiment), they figure out the best way to engage with the wider business
- Perhaps also moving to a Data Mesh Architecture encompassing principles around domain ownership, data as a product, self serve data infrastructure platform and federated governance to enable teams to engage without being silos in a growing organisation
Enabling change
How do you deal with managers and leaders who resist change with arguments like "I can do it with intuition well enough already" - even when it is clear that valuable improvements can be made beyond this?
- Avoid any big-change projects that'll surprise the end users at the "successful end" of the development phase - instead get several end users involved early so that confidence builds and positive change occurs from early on
- Getting buy-in with end users can still be forgotten by a DS team in their rush to demonstrate value from a model
- The buy-in typically brings critical institutional knowledge back to the DS team to help align the data product with the user's actual day to day needs
- What numbers do they use already to drive decisions? What causes them to get buy-in? Could you couple your ideas with the results they're seeking?
- Again talking to the end users and figuring out how they judge their own progress was felt to be a great way to learn to talk in their language, rather than pushing perhaps alien metrics back at them
- With a "conservative" team you might take them slowly on a journey from simple automation and reporting to increasing complexity, reducing fear and increasing the size of outcomes
- Some people talked about the need to go very slowly, building up trust in iterations and only adding automation in simple but productive steps
- Can you find ambassadors who help to bridge the gaps of fear and mis-understanding, then give them examples of what's worked elsewhere in the organisation to build confidence?
- People can be afraid of new technology and process change so support anything that derisks the new approach and solves whatever causes the resistance
You can see the full write-up on my blog.
If you face topics like this, maybe you need to join my RebelAI private leadership group for "excellent data scientists turned leaders"? Reply to this and I can send you a 3 pager PDF which explains what we've done so far.
Probabl - commercial support behind scikit-learn
One of the core developers for scikit-learn, Gaël Varoquaux, has a blog post on a new funding model called Probabl for scikit-learn and a set of associated libraries. Whilst the open source nature of the package won't change, to enable stronger financial support a separate entity is being created in collaboration with the French government. He cites the ongoing challenge of getting companies that get value from these packages to actually provide financial support back to them, and in the article gives some ideas of where they're going:
Our sustainability model is still being finetuned. What I can tell is that it will involve a mix of professional service, support & sponsorship agreement, as well as a product-based offer, where we supplement scikit-learn with enterprise features. Our focus will be on features that are typically not the focus of open-source developers: integration in large structures, such as access control, LDAP connection, regulatory compliance. We will not shoehorn scikit-learn in open core or dual licensing approaches: we want our incentives to be aligned with scikit-learn, and its ecosystem, being as complete as possible.
That's a pretty exciting direction for a commercial team sitting on top of a foundational and very popular open source library.
ChatGPT is "bullshit(ing)"
ChatGPT is Bullshit (Hicks, Humphries, Slater 2024) makes the case to stop talking about "hallucinations" and instead talk about the "bullshit" generation from LLMs, notably ChatGPT.
The authors start with "a question about the nature and meaning of the text produced, and of its connection to truth" and later note "[LLMs] are not designed to represent the world at all; instead they are designed to convey convincing lines of text" and "it is aimed at being convincing rather than accurate".
They go on to discuss a distinction between "hard bullshit", where there's an intention to deceive, and "soft bullshit", where text is produced without concern for truth. They settle on "soft bullshit" as being what's probably happening, and maybe "hard bullshit" under certain conditions.
Personally I don't like the common use of "hallucination" when talking about the errors that LLMs make and I prefer to talk of "lies" as that's closer to a useful and painful description - hallucinations sound almost innocent and weird, lies feel like they might hurt the value we could get from the technology and make people pay a bit more attention. I'm not sure if adding the label "soft bullshit" helps move the domain forwards, but it does make me think we're in the downslope of the Gartner adoption curve where we're past the peak of inflated expectations and perhaps we're all a bit more sober about what's going on in these machines.
Here's a short summary.
Recent package updates from PyPI
Be aware that NumPy 2.0 was released recently and it has broken some projects - see this compatibility list. Notably the following seem to lack NumPy 2.0 support: TensorFlow, Polars, CatBoost.
Itamar's newsletter has a nice piece on getting ready for NumPy 2 support.
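If you want a quick look at your own environment before upgrading, here's a minimal sketch using only the standard library. It reports the major version of the installed NumPy and lists which installed packages declare a requirement on it, so you know which pins to check. The helper names (major_version, installed_major, numpy_dependents) are my own for illustration, and the version parsing is deliberately naive - it tells you what depends on NumPy, not whether each package actually works with 2.0.

```python
import re
from importlib import metadata
from typing import Dict, Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.0.0'."""
    return int(version.split(".")[0])


def installed_major(package: str) -> Optional[int]:
    """Major version of an installed package, or None if it isn't installed."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


def numpy_dependents() -> Dict[str, str]:
    """Map each installed package that requires numpy to its requirement string."""
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if not name:
            continue  # skip distributions with missing or broken metadata
        for req in dist.requires or []:
            # A requirement string starts with the project name, e.g. "numpy>=1.22"
            match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req)
            if match and match.group(0).lower() == "numpy":
                found[name] = req
    return found


if __name__ == "__main__":
    print("Installed NumPy major version:", installed_major("numpy"))
    for name, req in sorted(numpy_dependents().items()):
        print(f"{name}: {req}")
```

Any package whose requirement string caps NumPy below 2 is worth checking against the compatibility list before you upgrade.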
This is a random sample from a set of popular projects that have been updated very recently.
- numba 0.60.0 compiling Python code using LLVM
- polars 1.0.0rc2 Blazingly fast DataFrame library
- numpy 2.0.0 Fundamental package for array computing in Python
- scikit-image 0.24.0 Image processing in Python
- flake8 7.1.0 the modular source code checker: pep8 pyflakes and co
- sktime 0.30.1 A unified framework for machine learning with time series
- coverage 7.5.4 Code coverage measurement for Python
- scikit-optimize 0.10.2 Sequential model-based optimization toolbox.
- modin 0.27.1 Modin: Make your pandas code run faster by changing one line of code.
- ruff 0.4.10 An extremely fast Python linter and code formatter, written in Rust.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,700+ subscribers. Your first job listing is free and it'll go to all 1,700 subscribers 3 times over 6 weeks, subsequent posts are charged.
Data Scientist (FTC - 12 Months)
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
This post is ideally suited to someone who is keen to break into the world of data science, with excellent statistical, technical, and interpersonal skills. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £39,604.00 - £45,411.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
This post is ideally suited to someone with data science experience who wants to be hands on in a data role, is a curious flexible thinker, with excellent statistical, technical, and interpersonal skills. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £46,597.00 - £53,209.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Principal Data Scientist - MOPAC
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
The Principal Data Scientist role is ideally suited to someone with management experience, with excellent data science knowledge to lead the way on our journey into data science. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £55,009.00 - £62,860.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Data Engineer at Airtime Rewards, Permanent, Manchester
Design and implement robust, scalable data pipelines to ingest data from internal platforms into our data warehouse. Monitor and maintain data pipelines, ensuring data quality, integrity, and availability. Optimise data pipelines to enhance performance and reduce cloud computing costs. Understand, gather, and document detailed business requirements. Take ownership of data projects from planning to delivery, collaborating with other departments as needed. Innovate and automate current processes, driving continuous improvement.
- Rate: £35,000 - £45,000
- Location: Manchester, Hybrid (2 days/week in office)
- Contact: oguzcan.koncagul@airtimerewards.com (please mention this list when you get in touch)
- Side reading: link
Analytics Engineer at Yoto, Permanent, London
We’re looking for an Analytics Engineer to join our team to accelerate the business and help us make sense of the terabytes of data we receive every day.
We’re a small team at the heart of all the decisions Yoto makes. We work in a mature, high-trust environment with a lot of independence. Everyone can contribute ideas and be part of the decision making process. We tackle a broad range of problems, from developing cutting-edge data products to building and maintaining our data orchestration platform. Our work spans across all the key strategic projects throughout the company.
- Rate: £30,000 - £40,000 based on experience.
- Location: Kings Cross, London (Hybrid)
- Contact: jeena.lakshmanan@yotoplay.com (please mention this list when you get in touch)
- Side reading: link