On “making $1M for a client” and speeding-up Dask
Recently one of my clients and I celebrated finding over $1M in recoverable overbilling and fraud at a large fintech. I’ve been reflecting on the “data science” (deliberate double-quotes!) journey that got us to this high-value success so quickly on a greenfield project. I’d love to hear if you’ve got anecdotes to share about this sort of process.
Monzo (and others!) are hiring for DS and DEng positions; the job ads are down below as usual. I asked Neal Lathia (Dir. of ML at Monzo) about their hiring process and he pointed me at a public link that details it. If you want to see a good example of how you might set up your own hiring process - read that link. If you want to work at Monzo then, obviously, do the same (and check the ads below).
For the last issue my link to the Metaculus prediction site was the most clicked (13% of clicks). A couple of you asked how to go about learning to forecast; for me, I’m enjoying seeing how others reason their way to an answer - I discussed a few in the last issue. On the Will Russia control Kyiv by June challenge (current forecast - 3% likely) there’s a nice comment by quant_speaking reasoning between the June and April challenges to look at the implied daily probability of success as estimated by the community, noting that the implied daily probabilities of the two challenges aligned reasonably well. I had to spend a little while thinking about the math and found that a fun challenge (I’d not tried unwinding a long-horizon forecast into a daily implied probability, then compounding it forward again). This is the sort of comment I’d suggest reading if you want to get your head into these challenges.
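If you fancy trying the unwinding yourself, here’s a minimal sketch of the arithmetic. It assumes each day is independent with the same daily probability (a simplification - the commenter’s reasoning may be richer), and the horizon lengths are illustrative numbers of mine, not Metaculus’s:

```python
# Sketch: unwind a long-horizon forecast into an implied daily
# probability, then compound it forward to a different horizon.
# Assumes independent, identically-likely days -- a simplification.

def implied_daily(p_horizon: float, days: int) -> float:
    """Daily probability d such that 1 - (1 - d)**days == p_horizon."""
    return 1 - (1 - p_horizon) ** (1 / days)

def compound_forward(p_daily: float, days: int) -> float:
    """Probability of at least one 'success' over `days` days."""
    return 1 - (1 - p_daily) ** days

# A 3% chance over a ~90-day horizon...
d = implied_daily(0.03, 90)
print(f"implied daily: {d:.5f}")                          # ~0.00034
# ...implies roughly a 1% chance over a ~30-day horizon
print(f"30-day implied: {compound_forward(d, 30):.4f}")   # ~0.0101
```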
There are 11 Data Sci and Data Eng jobs further below.
Better Business Process - finding a spare $1,000,000 with “data science”
So I love my data science (no double quotes) and, as you well know, quite often we can achieve high business value without much of it at all (hence “data science”). Here’s a short story on how I’ve worked with a team to find a first $1M-equivalent in recoverable overbilling and fraud at a European fintech, with another $1M+ to come.
It boils down to:
- Making sure we understand the real problem by asking the right questions
- Having good examples and metrics in place
- Iterating with human feedback using simple methods
- Validating that we find actionable & valuable cases
At the start we had a long list of potential problems, some of which might yield to good ML techniques but with a little digging we realised we knew very little about the domain. We had few positive examples, some anecdotes, little understanding of the business process that enabled these positive cases to occur and no understanding of what the business could do with new cases we might discover.
We spent a couple of months learning the domain, finding key business people, discovering more positive cases and learning where the experts thought we might find new wins. Getting ad-hoc meetings with a variety of busy experts can be a slow business, so baking in enough time for this is important.
Next we evaluated all the projects using a scoring system (something I teach in my Success class). This let us rank the “good” projects at the top, not all of which were obvious when we’d started the project. Thankfully it also pushed some stinkers to the bottom of the list.
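As a flavour of the ranking step, here’s a minimal sketch - the projects, criteria and weights are all invented for illustration and are not the rubric from my class:

```python
# Hypothetical project scoring -- names, criteria and weights invented.
projects = {
    "duplicate-invoice detection": {"value": 5, "data_ready": 4, "actionable": 5},
    "supplier-risk model":         {"value": 4, "data_ready": 2, "actionable": 3},
    "free-text categorisation":    {"value": 2, "data_ready": 3, "actionable": 2},
}
weights = {"value": 0.5, "data_ready": 0.3, "actionable": 0.2}

def score(criteria: dict) -> float:
    """Weighted sum of the 1-5 scores for one project."""
    return sum(weights[k] * v for k, v in criteria.items())

# Rank best-first; the stinkers sink to the bottom of the list
for name, criteria in sorted(projects.items(), key=lambda kv: -score(kv[1])):
    print(f"{score(criteria):.1f}  {name}")
```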
Then we focused on using simple measures - counting, ratios, percentiles - to try to isolate the positive cases that we knew of from a subset of 10M+ invoices. Thankfully our derisking had identified that the data was largely detailed, clean and trusted (phew).
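The “simple measures” stage looks roughly like the sketch below - the invoice schema (supplier_id, amount) and file name are hypothetical stand-ins, not the client’s data:

```python
import pandas as pd

# Hypothetical invoice data -- 10M+ rows in the real project
df = pd.read_parquet("invoices.parquet")

# Counting: invoices per supplier, plus simple aggregates
by_supplier = df.groupby("supplier_id")["amount"].agg(["count", "sum", "mean"])

# Ratios: how far does each supplier's mean invoice sit from the norm?
by_supplier["mean_ratio"] = by_supplier["mean"] / df["amount"].mean()

# Percentiles: hand the extreme tail to the human validators
cutoff = by_supplier["mean_ratio"].quantile(0.99)
suspects = by_supplier[by_supplier["mean_ratio"] > cutoff]
print(suspects.sort_values("mean_ratio", ascending=False).head(20))
```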
By iterating on 2 projects in a small team we tackled the highest value ideas, learning more about the business and about what they could use and what they would not use (some fraud is such a pain to chase, or of such low value, that there’s no point discovering it), which let us refine our efforts. Within a month we had our first deliverable of value, and every few weeks we’d deliver new “looking good” cases, some of which turned into actionable cases for investigation.
Just recently we realised we’d delivered over $1M equivalent in actionable results using reproducible methodologies that’ll continue to find new cases for the business, and we see a route to doubling this output. I’m pretty chuffed with this and the team have done an amazing job.
One critical point is that we established early on who our human validators would be - without critical-thinking feedback from business experts we’d have had nobody to validate our results, since we weren’t fraud-domain experts ourselves.
After several months we’re now starting to exploit some ML (notably Isolation Forests and some classic supervised methods) which helps to identify a richer set of cases that aren’t so easy to spot with human-derived rules. Starting with these techniques would have been a losing proposition as we didn’t know what we needed to solve.
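For the curious, the Isolation Forest step looks something like this sketch - the feature columns are invented stand-ins (the real features took months of domain learning to choose), and the contamination rate is a guess you’d tune with your validators:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical schema; the real features came from domain expertise
df = pd.read_parquet("invoices.parquet")
features = df[["amount", "line_item_count", "days_to_payment"]]

# contamination is a guess at the outlier share -- tune with feedback
clf = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
df["is_outlier"] = clf.fit_predict(features) == -1  # -1 marks anomalies

# Outliers feed the human review queue -- they are not "fraud" yet
review_queue = df[df["is_outlier"]].sort_values("amount", ascending=False)
```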
I figured this was an interesting anecdote to share - sometimes teams rush to ML without building a strong foundation, perhaps forgetting to look for the easiest wins (which buy confidence and political “wins” with the wider stakeholders) and ignoring the tried-and-tested simple techniques like “counting stuff and looking for the biggest oddity”. Another win in this project was having a number of adjacent problems we could explore, allowing us to pivot if needed between ideas.
David MacIver explores the topic of a problem-rich environment in his (well-recommended) newsletter. Tackling a domain where there’s a set of well identified “wins” to be had, with flexibility to pivot as you go, maximises the chance of serendipity and cumulative gains across the projects and pivots you undertake. I’ve failed to do this before, working on hard isolated problems (not always with success), and now I always look for a more flexible and problem-rich environment when taking on new challenges.
How have you derisked projects to get to a big gain? I’d happily share a tip back here if you reply to this.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Open Source - dask and cookiecutter
I’m running another (private) Higher Performance Python class next week and so I’ve had some fun refreshing a few parts of my course. It is nice to see PyPy give a straight 10x performance boost on a numeric (pure-Python, no `numpy`) challenge by just changing the Python executable. If you don’t use `numpy` or `pandas` then you might want to give it a go. Numba gives a similar gain, and more if you’ve got vectorised `array` expressions.
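As a taste of the kind of loop that benefits, here’s a minimal Numba sketch - the Monte Carlo pi estimate is my illustration, not an exercise from the course:

```python
from numba import njit
import random

# A pure-Python numeric loop of the sort PyPy or Numba can speed up
# ~10x; Numba compiles it to machine code via the @njit decorator.
@njit
def estimate_pi(n: int) -> float:
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

estimate_pi(1_000)              # first call pays the compilation cost
print(estimate_pi(10_000_000))  # later calls run at compiled speed
```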
Whilst refreshing my Dask “bigger-than-RAM” material I ran into a bug whereby Dask wouldn’t even start! It turns out it’ll optionally use a diagnostic library that isn’t installed by Dask but was installed for one of my other demos. If you get `AttributeError: 'str' object has no attribute 'decode'` due to `return pynvml.nvmlDeviceGetName(h).decode()` then come here to see what needs uninstalling.
This looks like an odd case of a helpful optional library being installed by one tool - the combined CPU+RAM profiler scalene uses `pynvml` for GPU profiling (which I don’t need), and Dask picks it up and uses it even though I have no GPU installed, which causes the break. Dealing with complex real-world environments is, as ever, tricky.
If you ever wonder “how can I help contribute back to open source?” do take a look at my addition to the above bug report - I validated the original submitter’s issue from a different angle, diagnosed where it came from and provided a very simple reproducible demo. Another colleague (Oscar - thanks!) mailed to say he’d had the same issue, and his +1 on GitHub helped raise the issue’s profile with the core devs. Always consider adding helpful annotations, diagnostics or a path to reproducibility to a bug to help other developers figure out the necessary fix. Lightweight contributions like these reduce the maintainers’ workload and are a lovely way to give back to the community.
This reminded me of a nice article on “making Dask 40x faster” on a large geographic dataset. This image annotates one of the Dask diagnostic screens to show the issue - lots of spilling to disk (slow due to low RAM), slow data transfers and lots of unutilised CPUs (wasted processing time). BUT they then dig further using more of the Dask diagnostics for some nice reveals; one tip was to replace a `groupby` with `map_partitions` to remove a dependency. The critical point is that focusing on profiling gets you to the truth - don’t just act on the first result (this is a key topic in my own course and a massive reveal for lots of students).
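A rough sketch of the flavour of that change is below - the dataset and column names are mine, not the article’s. If each partition already holds complete groups, a per-partition aggregation avoids the cluster-wide shuffle that a global `groupby` can trigger:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical partitioned dataset
ddf = dd.read_parquet("points.parquet")

# A global groupby may shuffle rows between workers:
# result = ddf.groupby("region")["value"].mean().compute()

# If each partition already holds whole regions, aggregate locally
# inside each partition instead -- no cross-worker shuffle needed:
def per_partition_mean(part: pd.DataFrame) -> pd.DataFrame:
    return part.groupby("region", as_index=False)["value"].mean()

result = ddf.map_partitions(per_partition_mean).compute()
```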
“Dask Dataframes are great at providing a simple Pandas-like API. But its simplicity can be deceiving: large machinery hides behind its operations. As much as the clean API tries to hide this fact, one often requires a detailed understanding of what actually happens under the hood to use Dask effectively.”
I plan to run another of my Higher Performance Python courses around June (aligned with PyDataLondon 2022) - reply to this if you’d like an early reminder when I’ve fixed the date. See all my courses here; dates will be confirmed soon.
PyDataLondon co-organiser Marco has a “Pandas beyond the basics” training course coming up that you might want to read about.
cookiecutter needs some more maintainers
One of the core devs for `cookiecutter` is asking for help; anyone with experience and a desire to help maintain a popular project is welcome. I use `cookiecutter` in my Software Engineering class as it leads to simplified and standard folder structures. It would be a good project to learn from if you want to give back a bit and, in turn, learn some new skills.
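If you’ve not used it, here’s a minimal sketch of why templating helps - the template named here is a popular community data-science one, not necessarily the one from my class:

```python
# Stamp out a standard project skeleton (src/, data/, notebooks/, ...)
# from a template; any cookiecutter template works the same way.
from cookiecutter.main import cookiecutter

cookiecutter(
    "gh:drivendata/cookiecutter-data-science",  # community template
    no_input=True,                               # accept the defaults
    extra_context={"project_name": "demo-project"},
)
```

Everyone on a team starting from the same skeleton means less bikeshedding over folder layout and easier code review across projects.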
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Conferences
We’ve got PyDataBerlin coming on April 11-13 as a part of PyConDE, and PyDataLondon on June 17-19 at the usual location (Tower Hotel, not far from London Bridge station). For PyDataLondon we’re arranging a smaller capacity to enable more distancing; the Call for Proposals is open for both.
PyDataLondon will be an in-person (not hybrid) conference and I believe that PyDataBerlin is doing the same. It is both exciting and a bit scary (well, for me) to think about being around a lot of other people again. I’ve had a child during lockdown and practiced very-safe distancing, to the extent that my wife and I went a bit nuts and had to remember to re-engage with the wider world to save our heads.
If you have questions about PyDataLondon - particularly about sponsoring - you can reply to me directly. I’m a part of the organising group, and if you want to meet smart people to hire, sponsoring is a very sensible idea. Hit me up if you’d like to discuss what that means. I’m likely to help organise the “Execs at PyData” session again, aimed at managers visiting PyData, and I’ll probably also set up another pre-conference Briefings event on state-of-the-art topics.
If you could retweet this announcement it would be appreciated. All money raised from our volunteer-run conferences goes back to NumFOCUS, who support our PyData core packages and much more - see supported projects here.
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
Research Data Scientist at Callsign, Permanent, London
We are looking for a Research Scientist who will help build, grow and promote the machine learning capabilities of Callsign’s AI-driven identity and authentication solutions. The role will principally involve developing and improving machine learning models which analyse behavioural, biometric, and threat-related data. The role is centred around the research skill set - the ability to devise, implement and evaluate new machine learning models is a strong requirement. Because the role involves the entire research and development cycle from idea to production-ready code, we require some experience with good software development practices, including unit testing. There is also an opportunity to explore the research engineer pathway. Finally, because the role also entails writing technical documentation and whitepapers, strong writing skills are essential.
- Rate:
- Location: St. Paul’s - London & Flexible
- Contact: daniel.maldonado@callsign.com (please mention this list when you get in touch)
- Side reading: link
Senior to Director Data Scientists
Data Scientists at Monzo are embedded into nearly every corner of the business, where we work on all things data: analyses and customer insights, A/B testing, metrics to help us track against our goals, and more. If you enjoy working within a cross-disciplinary team of engineers, designers, product managers (and more!) to help them understand their products, customers, and tools and how they can leverage data to achieve their goals, this role is for you!
We are currently hiring for Data Scientists across several areas of Monzo: from Monzo Flex through to Payments, Personal Banking, User Experience, and Marketing; we are additionally hiring for a Manager in our Personal Banking team and Head Of-level roles in Marketing. I’ve linked to some recent blog posts from the team that capture work they have done and the tools they use; if you have any questions, feel free to reach out!
- Rate: Varies by level
- Location: London / UK remote
- Contact: neal@monzo.com (please mention this list when you get in touch)
- Side reading: link, link
Head of Machine Learning
Monzo is the UK’s fastest growing app-only bank. We recently raised over $500M, valuing the company at $4.5B, and we’re growing the entire Data Science discipline in the company over the next year! Machine Learning is a specific sub-discipline of data: people in ML work across the end-to-end process, from idea to production, and have recently been focusing on several real-time inference problems in financial crime and customer operations.
We’re currently hiring more than one Head of Machine Learning as we migrate from operating as a single, centralised team to being deeply embedded across product engineering squads all over the company. In this role, you’ll be maximising the impact and effectiveness of machine learning in an entire area of the business, helping projects launch and land, and growing and developing a diverse team of talented ML people. Feel free to reach out to Neal if you have any questions!
- Rate: >100k
- Location: London / UK remote
- Contact: neal@monzo.com; https://www.linkedin.com/in/nlathia/ (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Scientist at Caterpillar, permanent, Peterborough
Caterpillar is the world’s leading manufacturer of construction and mining equipment, diesel and natural gas engines, industrial gas turbines and diesel-electric locomotives. Data is at the core of our business at Caterpillar, and there are many roles and opportunities in the Data Science field. The Industrial Power Systems Division of Caterpillar currently has an opportunity for a Senior Data Scientist to support power system product development engineers with data insights, and to develop digital solutions for our customers to maximise the value they get from their equipment through condition monitoring.
As a Senior Data Scientist, you will work across, and lead, project teams to implement analytical models and data insights on a variety of telemetry and test data sources, in a mechanical product development environment.
- Rate: £50,000 to £55,000 (depending on experience) with up to 12% bonus
- Location: Peterborough (flexible working considered)
- Contact: sheehan_dan@cat.com (please mention this list when you get in touch)
- Side reading: link, link
NLP Data Scientist at Shell, Permanent, London
Curious about the role of NLP in the energy transition? Wondering how we can apply NLP to topics such as EV charging, biofuels and green hydrogen? If you are enthusiastic about all things NLP and are based in the UK, come and join us. We have an exciting position for an NLP Data Scientist to join Shell’s AI organization. Our team works on several projects across Shell’s businesses focusing on developing end-to-end NLP solutions.
As an NLP Data Scientist, you will work hands-on across project teams focusing on research and implementation of NLP models. We offer a friendly and inclusive atmosphere, time to work on creative side projects and run a biweekly NLP reading group.
- Rate:
- Location: London (Hybrid)
- Contact: merce.ricart@shell.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at EDF Energy, Permanent
If you’re an experienced Data Scientist looking for your next challenge, then we have an exciting opportunity for you. You’ll be joining a team striving to build a world-class data centre of excellence, helping Britain achieve Net Zero and delivering value across the customers’ business.
If you’re a self-starter with hands-on experience deploying data science and machine learning models in a commercial environment, then this is the perfect role for you.
You will also be committed to building an inclusive, diverse and value-focussed culture within the Data & CRM team, with a dedication to leading by example and acting as a mentor for the junior members of the team.
- Rate:
- Location: Remote with occasional travel to our offices in Croydon or Hove
- Contact: gavin.hurley@edfenergy.com (please mention this list when you get in touch)
- Side reading: link
Principal Consultant at Semantic Partners
Semantic Partners are experiencing significant demand for people with semantic technology skills, and to this end we are looking to hire around 50 people over the next 2 years. Ideally you have some practical Knowledge Graph experience, but for those looking to get into Semantics we are offering the chance to cross-train into the technology skills listed below, with full product training across several vendor graph products. About you: fast learner, critical thinking, requirements capture, logical reasoning, conceptual modelling, investigation. Engineering: Python/Java/C#/JavaScript, HTML, CSS etc. Preferred skills: SQL, API design, HTTP, system architecture.
- Rate: Competitive
- Location: Remote
- Contact: Dan Collier dan.collier@semanticpartners.com (please mention this list when you get in touch)
- Side reading: link
Cloud Engineer (Python) - Anglo American
We are a new team at Anglo American (a large mining and minerals company), working on image data at scale. There are several problems with how things are currently, including: storage limited to local hard drives, compute limited to desktops, data silos and difficulties in finding and sharing image data.
The solution that you will help build will use cloud and web technologies to store, search, visualize and run compute on global scale image archives (Terabyte to Petabyte). Your focus will be on using cloud technology to scale up capabilities (e.g. storage, search and compute). You will be building and orchestrating cloud services including serverless APIs, large databases, storage accounts, Kubernetes clusters and more, all using an Infrastructure as Code approach. We work on the Microsoft Azure cloud platform, and are building on top of open-source tools and open-standards such as the Spatio-Temporal Asset Catalog, webmap tiling services such as Titiler, and the Dask parallel processing framework.
- Rate: Competitive day rate
- Location: Remote (UTC +/- 2)
- Contact: samuel.murphy@angloamerican.com (please mention this list when you get in touch)
- Side reading: link
Full Stack developer (Python & React) at Anglo American
We are a new team at Anglo American (a large mining and minerals company), working on image data at scale. There are several problems with how things are currently, including: storage limited to local hard drives, compute limited to desktops, data silos and difficulties in finding and sharing image data.
The solution that you will help build will use cloud and web technologies to store, search, visualize and run compute on global scale image archives (Terabyte to Petabyte). Your focus will be full stack web development, building back-end APIs and front-end interfaces for users to easily access these large image archives. We will be building on open tools and standards written in Python (such as the Spatio-Temporal Asset Catalog, Titiler and Dask), and you will be extending and modifying these, as well as writing new serverless APIs in Python. Front-end development will be within the context of Anglo American frameworks, primarily using React, and will involve map visualization tools such as Leaflet and/or OpenLayers.
- Rate: Competitive day rate
- Location: Remote (UTC +/- 2)
- Contact: samuel.murphy@angloamerican.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist, Experience Team at Spotify
We are looking for a Senior Data Scientist to join our Experience insights team to help us drive and support evidence-based design and product decisions throughout Spotify’s product development process. As part of our team, you will study user behaviour, strategic initiatives, product features and more, bringing data and insights into every decision we make.
What you will do: 1) Co-operate with cross-functional teams of data scientists, user researchers, product managers, designers and engineers who are passionate about our consumer experience. 2) Perform analysis on large sets of data to extract impactful insights on user behaviour that will help drive product and design decisions. 3) Communicate insights and recommendations to stakeholders across Spotify. 4) Be a key partner in our work to build out our product strategy so that we are relevant in the daily lives of consumers.
- Rate:
- Location: Covent Garden, London or Remote within the EMEA region
- Contact: annabeller@spotify.com (please mention this list when you get in touch)
- Side reading: link, link
Backend & Data Engineer
At Good With, you’ll work at the heart of a dynamic multidisciplinary agile team to develop a platform and infrastructure connecting a voice-enabled intelligent mobile app, financial OpenBanking data sources, state of the art intelligent analytics and real-time recommendation engine to deliver personalised financial guidance to young and vulnerable adults.
As a founding member, you’ll get share options in an innovative business, supported by Innovate UK, Oxford Innovation and SETsquared accelerator, with ambitions and roadmap to scale internationally.
Supported by Advisors: Cambridge / FinHealthTech, Paypal/Venmo & Robinhood Brand Exec, Fintech4Good CTO & cxpartners CEO.
Working with: EPIC e-health programme for financial wellbeing & ICO Sandbox for ‘user always owns data’ approaches.
- Rate: £50-65K + Share Options
- Location: Flexible, remote working; Cornwall HQ
- Contact: gabriela@goodwith.co (please mention this list when you get in touch)
- Side reading: link