Two lessons to help you avoid failed data science projects
Thoughts
Thanks for the good wishes some of you have sent. We're all much better now - sleep deprivation is crap (and nasty infant illness is just awful), but everyone is on the mend. So much so that I've just been off surfing - w00t! Never is repeatedly falling off a board so much fun.
Covid in the house passed without consequence; my wife got over hers and, by isolating in my roof office, she didn't pass it on (my boy and I were negative all week). We used the 3M mask I'd noted a couple of issues back plus fabric gloves for plate passing. It was a pain. I then got myself an antibody test and it seems I've never had it (it claims pretty good sensitivity & specificity), which is annoying - I'd hoped I'd been asymptomatic and had just missed it at some point.
So - I'm still wearing masks when out and about and still trying to be sensible. I recommend that you stay safe, especially in busy places - the long-Covid stats are still very unpleasant; see the Independent for ONS stats - 1.2M people in the UK (a non-trivial fraction of those who had Covid) have had debilitating long-Covid symptoms for 3 months or more.
Some of you ask if we "go out" and yes, cautiously, rarely to London but certainly to pubs early in the week when they're not busy. We're hardly as gregarious as we once were, but we're no longer hiding at home. Country pubs are brilliant for a lunch around a nice walk, that's our mainstay.
See further below for 13 jobs including Head of Data Science, Senior roles, quants and more for organisations like Monzo, Metropolitan Police and DeepMind.
Last week I had the pleasure of giving the opening keynote talk on Building Successful Data Science Projects at the combined PyDataBudapest and BudapestML 2022 conference (see link for the agenda - scroll down, the middle track is written in English). Organiser Bence set this up to start with my "how to make it work by diagnosing failures", followed by a bunch of solid talks on technique, finishing with "how to successfully communicate your result" by author Bill Franks. I presented remotely - I'm not quite ready to fly just yet.
My keynote focused on the multitude of failures I've had in my career - and how all of these failures come down to human factors, not "I wish I had a better algorithm!". Reflecting on these human factors was sobering (and it's something I do in post-client retrospectives to see what I can learn from engagements). I'll outline a story below and will add some more in coming issues.
Learning through our failures
This is a new section, I'm going to contribute some of my own stories to see what can be learned - maybe you can take a shortcut around some of my failures to improve your own career. I'll be asking experienced colleagues to share some of their own too, I figure we've a lot to learn here.
Now - do the Wayne's World wiggly-hand-timeshift routine and let's jump back to the early 00s and the first company I was in. We were an AI speciality house (this was back before anyone thought to call the domain "data science"). My colleagues and I in the startup were working on a Sentiment Analysis engine for text. This was back when 100s of MB of RAM was "big", a disk was GBs if you were lucky, Python was young, we mostly wrote in C++ and ANNs were a complicated extravagance that nobody used. We had to solve "is this news document negative, neutral or positive about this subject" for a selection of entities in the text (so Sentiment Analysis plus Named Entity Extraction). We were going to beat IBM's solution and become rich and famous. Ultimately it was the start of my consulting career, because I bailed over the way things turned out.
One of our biggest lessons was learning - late - that if you build the best thing ever (we scored on a 5-point scale, better than IBM's 3-point scale) but haven't figured out who your users are, you won't be selling it to anyone. There's a whole start-up lesson in here which is beside the point (we consulted in many verticals using evolutionary programming, GAs and NLP), but it is safe to say that we hadn't given any real thought to what the end users would need or where we'd find them.
That sales job fell to me and I read my first "how to sell stuff" book; as you may imagine, putting an MSc in charge of sales because he's the one who least resents the idea wasn't a recipe for success. This is where I first started to realise that if we build really clever stuff, there's no guarantee that anyone will buy it - or even know we're there. Ouch. It took a bit of random cold-calling (yep, that was my best first attempt) to realise that most companies have a good-enough manual solution and don't want to hear from a bunch of academics with a half-baked set of scripts about a possible better future that isn't ready yet. Ouch again.
Most of the market didn't need automation - being very cutting-edge (this was around 2003, so 19 years back) can make finding clients really tricky.
Once we got a client we learned another hard lesson. They did sentiment analysis using humans - people read documents and scored them - but they wanted to scale up and go faster, so an AI solution would be ideal. They also wanted "very few errors", and it took a while to figure out that their humans only agreed with each other about 80% of the time - so really we just needed to agree with the consensus most of the time to be "equivalent" (we didn't need "very few errors" at all). It turns out humans disagree or make mistakes when they're tired, excited, distracted or have different points of view - who knew?
So two useful lessons here would be:
- Make sure you've got your client(s) in mind before you spend the better part of a year building something complex - this is just as true for internal organisational clients, users of your site or your consulting clients. Know thy user.
- Figure out what the "definition of done" looks like and make sure it is reasonable, not aspirational.
Do you have a data science story, coupled with a useful lesson, that you could share? Please drop me an email!
Tooling
In the last issue I linked to a tweet thread on using GBMs without one-hot encoding to get superior results. At the PyDataBudapest conference last week Szilárd Pafka spoke on using GBMs as the "generally best tabular data tool" and I quite agree with his point. Rather than chasing DNNs, focus on the tool that's 99.9% likely to be the best (a good GBM like XGB), fix your data and move on to harder challenges - that's probably your best quick win.
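To make the "no one-hot encoding" point concrete, here's a minimal sketch of my own (not the thread's code, with made-up data) using LightGBM, which splits on pandas' category dtype natively; XGBoost has similar experimental support via enable_categorical=True:

```python
# Minimal sketch: a GBM consuming a categorical feature directly,
# with no one-hot encoding step. Data and columns are invented.
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "city": ["London", "Leeds", "London", "York"] * 250,
    "spend": [10.0, 5.0, 7.5, 2.0] * 250,
    "churned": [0, 1, 0, 1] * 250,
})
df["city"] = df["city"].astype("category")  # the only "encoding" needed

X_train, X_test, y_train, y_test = train_test_split(
    df[["city", "spend"]], df["churned"], random_state=0)

clf = lgb.LGBMClassifier(n_estimators=100)
clf.fit(X_train, y_train)  # category-dtype columns are handled natively
print(clf.score(X_test, y_test))
```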
Have you found situations with tabular data (i.e. spreadsheet-like data) where a GBM is beaten by a DNN? Reply and let me know, I'd love to update my knowledge and I'd happily share an anecdote here.
In Practical Automated Machine Learning for the AutoML Challenge 2018 the authors describe their winning entry to the AutoML competition (PoSH Auto-sklearn, now found here), the last before the competition switched to automated deep-learning only. I'll describe some of their process; the short message is that, out of a number of pre-built pipelines and options, they ended up using just some XGB pipelines, as these consistently outperformed the other non-DNN ML solutions in their ensemble. That's pretty sobering.
Building on a previous entry they built a set of pre-defined pipelines using a variety of estimators, including those from sklearn and XGB, for regression and classification tasks. These were assembled using a large pre-collected set of test datasets, building "good pipelines" that worked well across all the known training datasets. These seed pipelines can then be refined by a Bayesian meta-learner on a target challenge. This avoids a cold-start automation problem, as they begin with a set of "pretty good starting points". So far so good.
During evaluation they note that the XGB pipelines consistently outperformed the other sklearn estimators, so they just stuck with the XGB pipelines and dropped all the others. Personally I love starting with sklearn but often find I jump to XGB and then "we're done, let's get it shipped". Obviously this'll vary depending on your project, but often getting 95%+ of the signal is more than good enough, and fighting over each incremental percent incurs huge additional time and resource costs. Using the above process they won the 2018 AutoML challenge, which involved a variety of datasets and RAM+runtime constraints.
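For flavour, here's a hedged sketch (my own, not the authors' PoSH Auto-sklearn code) of the kind of pre-defined sklearn-plus-XGB pipeline the paper describes - impute, encode, then boost; the column names are hypothetical:

```python
# A "pretty good starting point" pipeline: preprocessing + XGBoost.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBClassifier

numeric = ["age", "balance"]            # hypothetical numeric columns
categorical = ["country", "product"]    # hypothetical categorical columns

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                                  unknown_value=-1)),
    ]), categorical),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("model", XGBClassifier(n_estimators=200, max_depth=6,
                            learning_rate=0.1, n_jobs=-1)),
])
# pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)
```

A handful of such pipelines with varied hyperparameters, scored across many datasets, gives you the "pretty good starting points" that a meta-learner can then refine.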
Do you have a contrasting view? I'd love to hear it, along with the tabular-data problem you're tackling that needs more than XGB. Just reply to this email.
Pandas request to you
What’s the hardest thing you had to learn in Pandas? I’m working on writing an intermediate-focused course for Pandas, taking the stuff that’s hard to find in the likes of Stack Overflow and summarising it into a solid intermediate+ course. Please cast your mind back - what hurt you the most to learn in Pandas? If you email me (just reply to this) with a question about Pandas, I’ll endeavour to answer it.
0x26res (thanks!) shares these useful links - I quite agree they're worth a read:
- Modern Pandas by Tom Augspurger (I really don't like the chaining approach and I think "modern" is dating a bit now, but worth a read - see the small sketch after this list)
- Uwe's post on the Block Manager (it is still mystical, but this helps shed a bit of light onto it)
- Wes' thoughts on solving hard problems in Pandas - again dated, but worth a read - things like Categorical and the new extension types have been solved; a parallelised groupby().apply() has not!
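For anyone who hasn't met the chaining style the first link advocates, here's a tiny sketch of mine (with made-up data) combining it with the Categorical dtype mentioned above:

```python
# Method chaining plus the Categorical dtype - each step returns a new
# DataFrame/Series, so the whole transformation reads top to bottom.
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Leeds", "London", "York"],
    "sales": [10, 5, 7, 2],
})

result = (
    df.assign(city=lambda d: d["city"].astype("category"))  # Categorical
      .query("sales > 3")                  # keep the bigger sales only
      .groupby("city", observed=True)["sales"]
      .sum()
)
print(result)
```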
Open source
The current version of Python is 3.10, which introduced better error messages. The next major release will be 3.11, which has just had an alpha release. Excitingly we see even better tracebacks, which make for easier debugging, plus a set of speed-ups giving a faster CPython (including in the re module) which could lead to a 10-60% speed improvement in regular Python code. That's pretty cool!
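As a small illustration of those better tracebacks (my own sketch of PEP 657's fine-grained error locations, not taken from the release notes), 3.11 underlines the exact failing sub-expression rather than just naming the line:

```python
# demo.py - run under a Python 3.11 alpha to see fine-grained locations
data = {"a": {"b": None}}
print(data["a"]["b"]["c"])  # fails: data["a"]["b"] is None

# 3.10 reports only the line; 3.11 marks the failing sub-expression,
# roughly like this:
#   File "demo.py", line 3, in <module>
#     print(data["a"]["b"]["c"])
#           ~~~~~~~~~~~~~~~^^^^^
# TypeError: 'NoneType' object is not subscriptable
```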
This new work started from Mark Shannon's proposal a year ago, which went on to attract significant Microsoft support. The final release is still months away, but it's worth watching if you run pure-Python code.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it'll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
Analyst and Lead Analyst at the Metropolitan Police Strategic Insight Unit
The Met is looking for an analyst and a lead analyst to join its Strategic Insight Unit (SIU). This is a small, multi-disciplinary team that combines advanced data analytics and social research skills with expertise in, and experience of, operational policing and the strategic landscape.
We're looking for people who can work with large datasets in R or Python, and who care about using empirical methods to answer the most critical public safety questions in London! We're a small, agile team who work throughout the police service, so if you're keen to do some really important work in an innovative, evidence-based but disruptive way, we'd love to chat.
- Rate: £30,294 - £45,189 plus location allowance
- Location: New Scotland Yard, London
- Contact: andreas.varotsis@met.police.uk (please mention this list when you get in touch)
- Side reading: link, link, link
Principal Population Health & Care Analyst
An exciting opportunity has arisen for a Principal Population Health Analyst to join the Population Health and Care team at Lewisham and Greenwich Trust (LGT) where the post holder will be instrumental in leading the analytics function and team for Lewisham's Population Health and Care system.
Lewisham is the only borough in South East London to have a population health management information system (Cerner HealtheIntent) that is capable of driving change, innovation and clinical effectiveness across the borough. The post-holder will therefore work closely with public health consultants, local stakeholders and third-party consultancies to explore epidemiology through the use of HealtheIntent, and design new models of transformative care that will deliver proactive and more sustainable health care services.
LGT is therefore seeking an experienced Principal Population Health Analyst who is passionate about transforming and improving the lives and care of patients through data analytics and who can draw key, actionable insights from our data. The successful candidate will be an experienced people manager with strong communication skills, able to lead a team of analysts and manage the provision of data analytics to a diverse range of stakeholders across Lewisham, with a particular focus on population health, bringing together best practice and innovative approaches.
- Rate: £61,861 - £70,959
- Location: Laurence House, 1 Catford Rd, London SE6 4RU
- Contact: rachael.crampton@nhs.net (please mention this list when you get in touch)
- Side reading: link
Data Engineer (Python, Node.js, or Go)
We import mid/large-scale data simultaneously from multiple sources (large databases, proprietary data stores, gigabyte spreadsheets) and merge it into a single queryable data store. We need someone with a DevOps, Data Science, or Back-End Engineering background to impose order on the chaos. This role is a mix of data science and engineering-for-scale, taking real-world data and inventing automated, scalable systems to deal with it.
This is a chance to join a well-funded startup (with revenue and customers) at the beginning of a new growth phase. Working with our Lead Back-End Engineer and CTO, you’ll be designing the new systems and taking the lead on implementing and maintaining them. Ideally you have experience of implementing backends using a variety of frameworks, techs, languages - we’re agnostic on specific tech, in most cases using the best tool for each job.
- Rate: £70k-90k
- Location: Remote
- Contact: adam@plural.ai (please mention this list when you get in touch)
Data Research Engineer at DeepMind, Permanent, London
We are looking for Data Research Engineers to join DeepMind’s newly formed Data Team. Data is playing an increasingly crucial role in the advancement of AI research, with improvements in data quality largely responsible for some of the most significant research breakthroughs in recent years. As a Data Research Engineer you will embed in research projects, focusing on improving the range and quality of data used in research across DeepMind, as well as exploring ways in which models can make better use of data.
This role encompasses aspects of both research and engineering, and may include any of the following: building scalable dataset generation pipelines; conducting deep exploratory analyses to inform new data collection and processing methods; designing and implementing performant data-loading code; running large-scale experiments with human annotators; researching ways to more effectively evaluate models; and developing new, scalable methods to extract, clean, and filter data. This role would suit a strong engineer with a curious, research-oriented mindset: when faced with ambiguity your instinct is to dig into the data, and not take performance metrics at face value.
- Rate: Competitive
- Location: London
- Contact: Please apply through the website (and please mention this list and my name, Katie Millican, when you get in touch)
- Side reading: link
Full Stack Python Developer at Risilience
Join us in our mission to help tackle climate change, one of the biggest systemic threats facing the planet today. We are a start-up providing analytics and software to assist companies in navigating climate uncertainty and transitioning to net zero. We apply research frameworks pioneered by the Centre for Risk Studies at the University of Cambridge Judge Business School and are already engaged by some of Europe's biggest brands. The SaaS product that you will be working on uses cloud and Python technologies to store, analyse and visualise an organisation's climate risk and to define and monitor net-zero strategies. Your focus will be on full-stack web development, delivering the work of our research teams through a scalable analytics platform and compelling data visualisation. The main tech stack is Python, Flask, Dash, Postgres and AWS. Experience of working with scientific data sets and test frameworks would be a plus. We are recruiting developers at both junior and senior levels.
- Rate:
- Location: Partial remote, Cambridge
- Contact: mark.pinkerton@risilience.com (please mention this list when you get in touch)
- Side reading: link
Python Quant Developer at Balyasny Asset Management
We want to hire an enthusiastic Python developer to help put together a modern analytics pipeline on the cloud. The ideal candidate will be able to contribute to the overall architecture, thinking about how we can distribute our data flows and calculations, our choice of databases and messaging components, and then working with devops and the rest of our team, to implement our decisions, etc. We want someone who’s an expert Python programmer, and can work well in a multi-disciplinary team, from liaising with Portfolio Managers through to our researchers writing in C++ and numpy / pandas / Jupyter / etc, and front-end developers working in react.js or Excel. Interest and experience in tech like microservices (FastAPI / Flask), Kafka, SQL, Redis, Mongo, javascript frameworks, PyXLL, etc, a bonus. But above all, leading by example when it comes to writing modern, high quality Python, and helping to put in place the structure and SDLC to enable the rest of the team to do so.
- Rate: very competitive + benefits + bonus
- Location: London / New York / hybrid
- Contact: Tony Gould tgould@bamfunds.com (please mention this list when you get in touch)
- Side reading: link, link
Research Data Scientist at Callsign, Permanent, London
We are looking for a Research Scientist who will help build, grow and promote the machine learning capabilities of Callsign's AI-driven identity and authentication solutions. The role will principally involve developing and improving machine learning models which analyse behavioural, biometric, and threat-related data. The role is centred around the research skill set - the ability to devise, implement and evaluate new machine learning models is a strong requirement. Because the role involves the entire research and development cycle from idea to production-ready code, we require some experience of good software development practices, including unit testing. There is also the opportunity to explore the research engineer pathway. Finally, because the role also entails writing technical documentation and whitepapers, strong writing skills are essential.
- Rate:
- Location: St. Paul's - London & Flexible
- Contact: daniel.maldonado@callsign.com (please mention this list when you get in touch)
- Side reading: link
Senior to Director Data Scientists
Data Scientists at Monzo are embedded into nearly every corner of the business, where we work on all things data: analyses and customer insights, A/B testing, metrics to help us track against our goals, and more. If you enjoy working within a cross-disciplinary team of engineers, designers, product managers (and more!) to help them understand their products, customers, and tools and how they can leverage data to achieve their goals, this role is for you!
We are currently hiring for Data Scientists across several areas of Monzo: from Monzo Flex through to Payments, Personal Banking, User Experience, and Marketing; we are additionally hiring for a Manager in our Personal Banking team and Head of-level roles in Marketing. I’ve linked to some recent blog posts from the team that capture work they have done and the tools they use; if you have any questions, feel free to reach out!
- Rate: Varies by level
- Location: London / UK remote
- Contact: neal@monzo.com (please mention this list when you get in touch)
- Side reading: link, link
Head of Machine Learning
Monzo is the UK’s fastest-growing app-only bank. We recently raised over $500M, valuing the company at $4.5B, and we’re growing the entire Data Science discipline in the company over the next year! Machine Learning is a specific sub-discipline of data: people in ML work across the end-to-end process, from idea to production, and have recently been focusing on several real-time inference problems in financial crime and customer operations.
We’re currently hiring more than one Head of Machine Learning, as we migrate from operating as a single, centralised team to being deeply embedded across product engineering squads all over the company. In this role, you’ll be maximising the impact and effectiveness of machine learning in an entire area of the business, helping projects launch and land, and growing and developing a diverse team of talented ML people. Feel free to reach out to Neal if you have any questions!
- Rate: >100k
- Location: London / UK remote
- Contact: neal@monzo.com; https://www.linkedin.com/in/nlathia/ (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Scientist at Caterpillar, permanent, Peterborough
Caterpillar is the world’s leading manufacturer of construction and mining equipment, diesel and natural gas engines, industrial gas turbines and diesel-electric locomotives. Data is at the core of our business at Caterpillar, and there are many roles and opportunities in the Data Science field. The Industrial Power Systems Division of Caterpillar currently has an opportunity for a Senior Data Scientist to support power system product development engineers with data insights, and to develop digital solutions for our customers to maximise the value they get from their equipment through condition monitoring.
As a Senior Data Scientist, you will work across, and lead, project teams to implement analytical models and data insights on a variety of telemetry and test data sources, in a mechanical product development environment.
- Rate: £50,000 to £55,000 (depending on experience) with up to 12% bonus
- Location: Peterborough (flexible working considered)
- Contact: sheehan_dan@cat.com (please mention this list when you get in touch)
- Side reading: link, link
NLP Data Scientist at Shell, Permanent, London
Curious about the role of NLP in the energy transition? Wondering how we can apply NLP to topics such as EV charging, biofuels and green hydrogen? If you are enthusiastic about all things NLP and are based in the UK, come and join us. We have an exciting position for an NLP Data Scientist to join Shell’s AI organization. Our team works on several projects across Shell’s businesses focusing on developing end-to-end NLP solutions.
As an NLP Data Scientist, you will work hands-on across project teams focusing on research and implementation of NLP models. We offer a friendly and inclusive atmosphere, time to work on creative side projects and run a biweekly NLP reading group.
- Rate:
- Location: London (Hybrid)
- Contact: merce.ricart@shell.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at EDF Energy, Permanent
If you’re an experienced Data Scientist looking for your next challenge, then we have an exciting opportunity for you. You’ll be joining a team that is striving to build a world-class data centre of excellence, helping Britain achieve Net Zero and delivering value across the customers' business.
If you’re a self-starter with hands-on experience of deploying data science and machine learning models in a commercial environment, then this is the perfect role for you.
You will also be committed to building an inclusive, diverse and value-focussed culture within the Data & CRM team, with a dedication to leading by example and acting as a mentor for junior members of the team.
- Rate:
- Location: Remote with occasional travel to our offices in Croydon or Hove
- Contact: gavin.hurley@edfenergy.com (please mention this list when you get in touch)
- Side reading: link