Talking about derisking, building a backlog and estimating Data Science projects and PyDataLondon 2022 Conf videos
PyDataLondon 2022 happened six weeks back (schedule); all the videos are now online, alongside the videos from all the other conferences, on the PyDataTV channel. Go fill your boots with great content. I talk about my own video further below.
Further below are 5 job roles, including Senior and Staff roles in Data Science and Data Engineering, at organisations like Deliveroo and Regulatory Genome.
In the last issue I talked about the Zoom call I ran recently (thanks to those who joined!) discussing “DS Team Structure and Growth”. On Monday 8th Aug (next week) I’ll run another on making a Backlog, Derisking and Estimation. Reply to this if you’d like an invite; we’ll have an open discussion for an hour on a prioritised list of topics. I’ll write up the notes for the next newsletter and I’ll share the recording, but if you have questions on this topic, you’re best off joining the call (even if just to lurk).
Please forward this email to a colleague if they’d benefit from attending this session - I’m happy to invite them along. Note that these sessions are for leaders in data science teams, not recruiters or solo consultants.
I had some feedback that whilst the book reference (Clean Code) in the last issue was OK, the book’s author is perhaps not as aligned with the community values of the Python world as might be hoped. I’ll ask you all - do you have books on software engineering that you recommend to your teams? I’ll happily share any useful resources you tell me about.
Building Successful Data Science Projects - a list of lessons from my talk at PyDataLondon 2022
I spoke on Building Successful Data Science Projects (video, slides) at PyDataLondon six weeks back; really it was a talk on “15 years of my failures and the lessons I’ve learned, so you don’t have to”. If you’ve been reading my newsletter for a while you’ll know full well that I like to share tips on how we “do things that work”, because repeated failures are just so painful. I gave an earlier version of this talk as the keynote for PyDataBudapest earlier in the year. It is useful to get the chance to properly reflect on what’s worked and what’s failed in your career - making the same mistakes again is boring and best avoided.
The crux of my talk is that a lot of my observations on failed projects come down to human issues. Whilst sometimes the technology isn’t good enough, it is consistently more likely that we’re missing things like a well-understood problem to solve, good-enough data, or a feasible route to deployment.
I talk on:
- Understanding a clear problem that’s actionable, where we can write a specification that all parties can agree to (plus I give a screenshot of the spec I use in my Success course)
- The need to talk to the people who understand the business before trying to write code (or mess around with the data) - they’re the ones most likely to know whether signal might exist and whether an actionable outcome is possible
- Thinking about the Data Maturity Model of your org - if the team you work with is immature, you probably can’t do strong data science (yet)
- Making sure you can pivot to alternate problem areas so knowledge you gain can be built on - not thrown away if you have a failure
- Building trust through diagnostics and by building the simplest models you can get away with that show value
- Confirming you have a valid route to delivery (avoid being told to re-write your Random Forest in SQL!)
Do you ask these questions? Think about a recent failure - could the above questions have helped you derisk sooner and with less pain? Are there any questions you ask that I’ve not covered above? If so, I’d love to hear them!
I wrap this up with a reflection on how following this advice recently let me help a client find $2M in recoverable fraud and over-billing across multiple insurance projects, and how that’s leading to further work. I really believe there’s a lot of business value to be had in covering the “basics” sensibly at the start of each new project.
The talk before mine, by Dillon, was also rather good: AUC is worthless: Lessons in transitioning from academic to business data science (video). Dillon took a deeper look at why metrics like AUC aren’t so useful in a business setting. If you want to think more deeply about how to communicate useful metrics at work, do take a look at his video. Do you have any preferred references for thinking at a deeper level about metrics that work best for the business?
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Training dates now available
I’m pleased to say I’ve fixed the dates for my next online (Zoom) courses. I’ve been running these remotely with great success during the pandemic and I’m going to continue the virtual format. A limited number of early-bird tickets is available for each course:
- Successful Data Science Projects (aimed at leaders and project owners) - October 3-4
- Higher Performance Python (aimed at anyone who needs faster NumPy, Pandas and scaling to Dask) - October 31 - November 2
I’m happy to answer questions about the above, just reply here. If you want a notification for future dates please fill in this form.
For those of you who have been waiting a while for me to get these listed - apologies, being father to an infant has eaten a lot of time this year and I’ve had to pace things sensibly before scheduling new courses.
Open source - testing Jupyter notebooks, SciPy hosting by NumFOCUS
PyData member Eduardo Blancas is building a new library to help with testing Jupyter Notebooks. You flag certain cells in the Notebook as containing “the reference output” and then compare newer runs against that reference. The blog post describes how it works and how to add your Notebooks to a Continuous Integration pipeline. Testing Notebooks can be very useful if you’re auto-running them to make reports, so embedding data-focused tests where you generate your reporting seems like a useful idea. Eduardo notes:
nbsnapshot is an open-source library to test Jupyter notebooks. It works by comparing cell outputs to historical values and flagging anomalies. So, for example, if you have a notebook that trains a model, you can use nbsnapshot to record the performance. If, at any point, the performance deviates from an expected range, nbsnapshot will alert you so you can check what happened. Read more in the blog post.
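To make the idea concrete, here’s a minimal sketch of snapshot-style testing of a notebook metric - this is not nbsnapshot’s actual API, just the general technique. It assumes a hypothetical notebook `train_model.ipynb` with a cell tagged `snapshot` whose output is a single number (e.g. model accuracy), a hypothetical `metric_history.json` file of past values, and a simple three-sigma rule to flag anomalies:

```python
# Sketch of notebook output snapshot testing (hypothetical names, not nbsnapshot's API):
# read an executed notebook's JSON, pull a numeric output from a tagged cell,
# and flag it if it falls outside the historical mean +/- 3 standard deviations.
import json
import statistics
from pathlib import Path

HISTORY_FILE = Path("metric_history.json")  # hypothetical store of past metric values


def read_tagged_output(notebook_path, tag="snapshot"):
    """Return the first numeric text output from a cell tagged `tag`."""
    nb = json.loads(Path(notebook_path).read_text())
    for cell in nb["cells"]:
        if tag in cell.get("metadata", {}).get("tags", []):
            for output in cell.get("outputs", []):
                text = "".join(output.get("data", {}).get("text/plain", []))
                if text:
                    return float(text)
    raise ValueError(f"No numeric output found in a cell tagged '{tag}'")


def within_historical_range(value, history, n_sigma=3):
    """True if `value` is close to the historical mean (or too little history to judge)."""
    if len(history) < 3:  # not enough data yet, just accept and record it
        return True
    mean, std = statistics.mean(history), statistics.stdev(history)
    return abs(value - mean) <= n_sigma * max(std, 1e-9)


if __name__ == "__main__":
    value = read_tagged_output("train_model.ipynb")  # hypothetical executed notebook
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
    if within_historical_range(value, history):
        history.append(value)
        HISTORY_FILE.write_text(json.dumps(history))
        print(f"OK: {value}")
    else:
        raise SystemExit(f"Anomaly: {value} is outside the historical range")
```

Run as a CI step after executing the notebook, this would fail the build when the tracked metric drifts - which is roughly the workflow Eduardo describes in his blog post.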
Nic Crane tweets about their rstudio::conf talk and the PR process, noting “Learning not to interpret brevity as rudeness is a skill I had to develop” and “…pointing the contributor in the right direction when they make a mistake, and expanding on reasoning for decisions, can also help.”. This is a reflection on open source communication - maybe this thread helps some of you writing (or receiving) PR feedback, both in the open source community and in the workplace.
Kyle Niemeyer notes that the SciPy conference series is handing over the reins to NumFOCUS. Typically SciPy has been more scientifically focused whilst PyData is focused on data science & business. If you like PyData then you’ll love SciPy; I’ve spoken at the EuroSciPy variant a number of times in the past and the talks are always top notch. Kyle goes on to say:
Wow, big news at #SciPy2022: @enthought is handing the reins of @SciPyConf to @NumFOCUS! Exciting change, will definitely make it easier to hold in other locations in the future. (Also great to have an even tighter connection with the scientific open source world, & a nonprofit)
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400+ subscribers 3 times over 6 weeks; subsequent posts are charged.
Senior Data Scientist, J2-Reliance, Permanent, London
As a Senior Data Scientist at J2-Reliance, you will be responsible for developing Data Science solutions and Machine Learning models tailored to the needs of J2-Reliance’s clients. You will typically act as a Full-Stack Project Owner: you will be the main point of contact for a well-delimited client problem, and you will be responsible for conceiving and implementing the end-to-end solution to solve it. You will be supported by a program manager (your direct supervisor) and a Data Engineer who helps with industrialisation. The specific nature and level of their involvement will depend on your areas of expertise and the specifics of the project.
- Rate: >60000 p.a.
- Location: Fleet Street, Central London
- Contact: damien.arnol@j2reliance.co.uk (please mention this list when you get in touch)
- Side reading: link
Data Scientist (Plants for Health) at Royal Botanic Gardens, Kew
The Royal Botanic Gardens, Kew (RBG Kew) is a leading plant science institute, UNESCO World Heritage Site, and major visitor attraction. Our mission is to understand and protect plants and fungi for the well-being of people and the future of all life on Earth.
Kew’s new Plants for Health initiative aims to build an enhanced resource for data about plants used in food supplements, allergens, cosmetics, and medicines to support novel research and the correct use and regulation of these plants.
We are looking for a Data Scientist with experience in developing data mining tools to support this. The successful candidate’s responsibilities will include developing semi-autonomous tools to mine published literature for key medicinal plant data that can be used by other members of the team and collaborators at partner institutes.
- Rate: £32,000
- Location: Hybrid, Kew (London)
- Contact: b.alkin@kew.org (please mention this list when you get in touch)
- Side reading: link
Data Engineer at IndexLab
IndexLab is a new research and intelligence company specialising in measuring the use of AI and other emerging technologies. We’re setting out to build the world’s first index to publicly rank the largest companies in the world on their AI maturity, using advanced data gathering techniques across a wide range of unstructured data sources. We’re looking for an experienced Data Engineer to join our team to help set up our data infrastructure, put data gathering models into production and build ETL processes. As we’re a small team, this role comes with the benefit of being able to work on the full spectrum of data engineering tasks, right through to the web back-end if that’s what interests you! This is an exciting opportunity to join an early stage startup and help shape our tech stack.
- Rate: £50-70K
- Location: London (Mostly Remote)
- Contact: Send CV to careers@indexlab.com (please mention this list when you get in touch)
- Side reading: link
Staff & Principal Machine Learning Engineer at Deliveroo
We are looking for Staff, Senior Staff & Principal ML Engineers to design and build the algorithmic and machine learning systems that power Deliveroo. Our MLEs work in cross-functional teams alongside engineers, data scientists and product managers, developing systems that make automated decisions at massive scale.
We have many problems available to solve across the company, including optimising our delivery network, optimising consumer and rider fees, building recommender systems and search-and-ranking algorithms, detecting fraud and abuse, time-series forecasting, building an ML platform, and more.
- Rate:
- Location: London / Remote
- Contact: james.dance@deliveroo.co.uk (please mention this list when you get in touch)
- Side reading: link
Data Scientist, Regulatory Genome Development Ltd
The Regulatory Genome Project (RGP), part of the Cambridge Centre for Alternative Finance, was set up in 2020 to promote innovation by unlocking information hidden on regulators’ websites and in PDFs. We’re a commercial spin-out from the University of Cambridge’s Judge Business School and our proposition is to make the world’s regulatory information machine-readable, thereby enabling an active ecosystem of partners, law firms, standard-setting bodies and application providers to address the world’s regulatory challenges.
We’re looking for a data scientist to join our remote-friendly technical team of software engineers, machine learning experts, and data scientists who’ll work closely with skilled regulatory analysts to engineer features and guide the work of a dedicated annotation team. You’ll help develop, train, and evaluate information extraction and classification models against the regulatory taxonomies devised by the RGP as we scale our operations from 100 to over 600 publishers of regulation worldwide.