Metrics used whilst “making $1M for my client”, saving power wastage through data, finding effective Covid masks
Below I talk about system profiling before trying to make things faster, Covid masks (maybe you have better recommendations?), metrics for monitoring research progress (building on my comments in the last issue), extracting Pandas DataFrames from PDF tables, Python 3.10, and lowering your energy bill by recording data around the house.
Further below there are 13 job ads for data science and data engineering, including roles with DeepMind, Risilience (really interesting!), Monzo, BAMFunds and more.
PyDataLondon 2022 is coming on June 17th. The Call for Proposals is open, we’re looking for sponsors (totally useful if you’re hiring) and tickets are on early-bird pricing for less than a week, so buy in the next few days to get them cheaper. This conference is in-person, with a reduced capacity to make it safer. Reply directly to me if you’re interested in sponsoring.
I recently ran another of my Higher Performance Python private classes. During the discussion I was asked “but how do we know when to rewrite our Python app from processes to threads, or whether that’s even our bottleneck?”, which led to a nice discussion on profiling the whole system. My advice: start with dstat to figure out where the bottleneck is on the machine. Is network capacity low due to flooding? Are disk reads or writes slow due to slow or misconfigured disks and high data volumes? Is the CPU maxed out due to the code’s implementation? Maybe low RAM is causing processes to die and restart?
Until you know your first bottleneck, there’s no point arguing fine points about an implementation - you have to know where to focus first. I’ll run a public version of this course in a few months - reply to this if you’d like a one-off notification for the course (or any of my other courses). I reference successful profiling with household power monitoring at the end of this newsletter too.
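dstat streams these counters per second on Linux and is the right tool for the job. Purely to illustrate the kind of signals you’re looking for first (this is my own stdlib-only sketch, not how dstat works), you can grab a rough one-off snapshot from Python:

```python
import os
import shutil


def resource_snapshot(path="/"):
    """Rough, one-off look at two of the signals a profiler like
    dstat streams continuously: CPU pressure and disk fullness
    (Unix only - getloadavg is unavailable on Windows)."""
    load_1min, _, _ = os.getloadavg()  # runnable processes, 1-minute average
    disk = shutil.disk_usage(path)     # capacity, not throughput
    return {
        "load_1min": load_1min,
        "disk_used_frac": disk.used / disk.total,
    }


snapshot = resource_snapshot()
```

For real diagnosis you want a continuous, side-by-side stream of CPU, disk, network and memory counters, which is exactly what dstat (or the psutil library) gives you.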
Covid-19
We’ve got Covid in the house - just my wife so far (not too bad, she’s isolating in my roof office) and my son and I test negative. Given that we’re going to see more strains we’re still practicing “fairly safe behaviour”. We started with these FFP2 ctc masks but they fit my face poorly. I then went for these FFP2 Medisana which fit better, but have settled on these FFP2 3M masks which fit my face best, have a below-chin section and a padded nose bridge which forms the best seal of them all. I feel comfy wearing these as I can see them puff in and out as I breathe - there’s a pressure difference to the external environment, so most of the air I breathe is filtered. Have you found anything better for public travel?
Better research process - progress metrics
In the last issue I wrote about making $1M for a client and some of the process behind that. This time I’ll talk a little about the metrics we’ve used.
Recently I wrote about leading vs lagging indicators - the notion that tracking a fast-moving leading indicator is better for marking progress than following a slower-moving lagging indicator. Discovering fraud is a slow game and it is subject to lots of noise - plenty of unusual behaviour isn’t fraud, but still requires investigation to understand. After many false positives you lose confidence and motivation can wane. To keep motivation high I designed a set of metrics - I’d value your feedback on how you’ve tackled similar situations.
With my team we focused on a range of indicators, each of decreasing frequency and with an increasing lag to feedback:
- number of analyses performed each week per project (always >1)
- number of deliveries made to the internal fraud experts (generally 1 across the team per week)
- number of times the fraud expert says “ohh - now that’s interesting” plus a count of the value of these deliveries (once every few deliveries)
- number of confirmations back from the fraud experts and related industrial collaborators (some take days, some will take many months - we have a handful of very positive results)
- amount of confirmed recoverable fraud (this started at $10ks and occasionally jumped by large fractions of a million - super exciting and so hard to predict)
One of our earliest deliveries is unlikely to yield a clear answer from a domain expert for 3 more months (and we delivered a few months back). Other deliveries get a clear answer in days. You can see why having a range of progress-markers is important. With these we can communicate progress to the bosses, keep the team’s motivation high and keep reviewing where we think we might want to focus more effort.
Have you faced similar challenges? How have you solved the discontinuity between rare discoveries and slow feedback cycles?
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Open Source
Convert PDF tables into Pandas DataFrames with Tabula-Py, noted by Marlene Mhangami. Have you tried this? It looks like it’ll pull out multiple tables from a document.
Cython turned 20 - I used to do a lot of high performance work in Cython (but now use and teach Numba). Pre-Numba I used OpenMP to accelerate physics and fluid dynamics simulations at commercial orgs using Cython (with prange). Writing in the Cython meta language was a bit annoying but the results via gcc were awesome. Numba gets you most of the way there now for very little work but there’s still a place in my heart for Cython. Notably Cython integrates with other libraries (Numba can’t) and builds a static lib that’s easily added to your distribution. You may not realise that Cython is heavily used inside sklearn and Pandas for predictable performance.
Do you use the Python array module for storing blocks of homogeneous data with low RAM cost? There was a discussion on Twitter about this and I noted that I’m still not sure why it is there, except as the reference implementation - do you know otherwise?
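As a quick illustration of the RAM saving (exact sizes vary by Python build, so treat the numbers as indicative):

```python
import sys
from array import array

values = [float(i) for i in range(100_000)]
typed = array("d", values)  # contiguous, unboxed 8-byte doubles

# a list stores pointers to individually boxed float objects,
# so count the container plus every element it points at
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = sys.getsizeof(typed)
```

On CPython the array comes out several times smaller than the equivalent list; a NumPy array gives you the same compact layout plus vectorised maths on top.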
Little reminder - Python 3.10 introduces better error messages which help you to debug your code faster. Anything that keeps you in the zone longer has got to be a good thing. You might want to upgrade to give this a go, I’ve upgraded some of my projects specifically because of this.
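For example, on 3.10+ the parser suggests a fix for the classic assignment-in-condition typo, where older versions just said “invalid syntax”:

```python
# "if x = 1" is a SyntaxError; Python 3.10's improved parser adds a
# hint along the lines of "Maybe you meant '==' or ':=' instead of '='?"
bad_source = "if x = 1:\n    pass\n"

try:
    compile(bad_source, "<demo>", "exec")
    raised = False
except SyntaxError as exc:
    raised = True
    message = str(exc)
```

The exact hint text depends on the interpreter version, but the idea is the same: the error points you at the likely fix rather than just the failing line.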
Doing some good
If you want to do something useful right now - consider donating to a UK foodbank given the rise in food and fuel poverty. You could donate to Ukraine too (military or humanitarian aid). Personally I’ve donated over a grand a year to my local food bank, along with other donations to Ukraine, Wikipedia, Save the Children and others. As usual - cash beats light activities like a retweet or wishful thinking.
Lowering your energy bill and saving the climate through data
I’ve long been interested in power usage in the house - I last made a power inventory back in 2010 (cripes, 12 years ago!). I’ve moved house many times since, but once I’ve gathered data at home I’ll do whatever comparison I can.
I’m using a Power Meter Plug to measure the wattage of various appliances at home. Using our generic home energy meter I can see that overnight - when everything is “probably off” - we’re still drawing a baseline load of 0.3kW. This (at circa 6p per hour) equates to circa £500/year of baseline power draw (out of an estimated £2-2.5k combined energy bill). Can I improve on this?
Using the above device I noted that:
- overnight my front-room amplifier and bass box are on, drawing over 30W (sitting idle, every night)
- overnight my office LCD monitor fails to powersave even though the laptop is off - so it cycles between 4W and 35W modes whilst seeking an input (and has done for years) - the backlight is the expensive part
In the same rooms the TV+media laptop and work laptop respectively are off. By turning these extra devices off I’m saving over 50W overnight and for a chunk of the day, worth circa £100/year on my energy bill. I care less about the £100 and more about the unnecessary waste - burning (probably carbon-based) fuel to keep devices on with no utility. Given that the Power Meter Plug cost £20, it has obviously paid for itself. I’m hoping to figure out where the rest of the overnight draw goes to see what else might be saved.
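The back-of-envelope sums above are easy to check (the unit price below is my assumption, chosen to match the quoted 6p/hour for a 0.3kW draw):

```python
# assumed unit price in pence per kWh: 0.3 kW * 20p/kWh = 6p per hour
PENCE_PER_KWH = 20
HOURS_PER_YEAR = 24 * 365


def annual_cost_pounds(kw_draw):
    """Yearly cost of a constant draw left on around the clock."""
    return kw_draw * PENCE_PER_KWH * HOURS_PER_YEAR / 100


baseline = annual_cost_pounds(0.3)   # the 0.3kW overnight baseline
idle_kit = annual_cost_pounds(0.05)  # the 50W of always-on devices found above
```

That puts the baseline at roughly £525/year and the 50W of idle kit at roughly £90/year if left on permanently - in line with the circa £500 and £100 figures quoted above.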
Along the way I found a couple of power-bricks plugged into sockets, but not connected to devices, drawing 1W each (e.g. I had my previous Virgin power-brick despite having replaced the Virgin unit a year back). Some of these savings are trivial in cash but what I’m more interested in is behaviour - what did I forget about that’s just wasting resources? What might be old and very inefficient? Only the data can give me a straight answer. As I noted at the start of the newsletter - it all comes down to profiling if you want to make sensible decisions.
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
Data Research Engineer at DeepMind, Permanent, London
We are looking for Data Research Engineers to join DeepMind’s newly formed Data Team. Data is playing an increasingly crucial role in the advancement of AI research, with improvements in data quality largely responsible for some of the most significant research breakthroughs in recent years. As a Data Research Engineer you will embed in research projects, focusing on improving the range and quality of data used in research across DeepMind, as well as exploring ways in which models can make better use of data.
This role encompasses aspects of both research and engineering, and may include any of the following: building scalable dataset generation pipelines; conducting deep exploratory analyses to inform new data collection and processing methods; designing and implementing performant data-loading code; running large-scale experiments with human annotators; researching ways to more effectively evaluate models; and developing new, scalable methods to extract, clean, and filter data. This role would suit a strong engineer with a curious, research-oriented mindset: when faced with ambiguity your instinct is to dig into the data, and not take performance metrics at face value.
- Rate: Competitive
- Location: London
- Contact: Please apply through the website (and please mention this list and my name, Katie Millican, when you get in touch)
- Side reading: link
Full Stack Python Developer at Risilience
Join us in our mission to help tackle climate change, one of the biggest systemic threats facing the planet today. We are a start-up providing analytics and software to assist companies in navigating climate uncertainty and transitioning to net zero. We apply research frameworks pioneered by the Centre for Risk Studies at the University of Cambridge Judge Business School and are already engaged by some of Europe’s biggest brands. The SaaS product that you will be working on uses cloud and Python technologies to store, analyse and visualize an organization’s climate risk and to define and monitor net-zero strategies. Your focus will be on full stack web development, delivering the work of our research teams through a scalable analytics platform and compelling data visualization. The main tech-stack is Python, Flask, Dash, Postgres and AWS. Experience of working with scientific data sets and test frameworks would be a plus. We are recruiting developers at both junior and senior levels.
- Rate:
- Location: Partial remote, Cambridge
- Contact: mark.pinkerton@risilience.com (please mention this list when you get in touch)
- Side reading: link
Python Quant Developer at Balyasny Asset Management
We want to hire an enthusiastic Python developer to help put together a modern analytics pipeline on the cloud. The ideal candidate will be able to contribute to the overall architecture, thinking about how we can distribute our data flows and calculations, our choice of databases and messaging components, and then working with devops and the rest of our team, to implement our decisions, etc. We want someone who’s an expert Python programmer, and can work well in a multi-disciplinary team, from liaising with Portfolio Managers through to our researchers writing in C++ and numpy / pandas / Jupyter / etc, and front-end developers working in react.js or Excel. Interest and experience in tech like microservices (FastAPI / Flask), Kafka, SQL, Redis, Mongo, javascript frameworks, PyXLL, etc, a bonus. But above all, leading by example when it comes to writing modern, high quality Python, and helping to put in place the structure and SDLC to enable the rest of the team to do so.
- Rate: very competitive + benefits + bonus
- Location: London / New York / hybrid
- Contact: Tony Gould tgould@bamfunds.com (please mention this list when you get in touch)
- Side reading: link, link
Research Data Scientist at Callsign, Permanent, London
We are looking for a Research Scientist who will help build, grow and promote the machine learning capabilities of Callsign’s AI-driven identity and authentication solutions. The role will principally involve developing and improving machine learning models which analyse behavioural, biometric, and threat-related data. The role is centred around the research skill set: the ability to devise, implement and evaluate new machine learning models is a strong requirement. Because the role involves the entire research and development cycle from idea to production-ready code we require some experience around good software development practices, including unit testing. There is also opportunity to explore the research engineer pathway. Finally, because the role also entails writing technical documentation and whitepapers, strong writing skills are essential.
- Rate:
- Location: St. Paul’s - London & Flexible
- Contact: daniel.maldonado@callsign.com (please mention this list when you get in touch)
- Side reading: link
Senior to Director Data Scientists
Data Scientists at Monzo are embedded into nearly every corner of the business, where we work on all things data: analyses and customer insights, A/B testing, metrics to help us track against our goals, and more. If you enjoy working within a cross-disciplinary team of engineers, designers, product managers (and more!) to help them understand their products, customers, and tools and how they can leverage data to achieve their goals, this role is for you!
We are currently hiring for Data Scientists across several areas of Monzo: from Monzo Flex through to Payments, Personal Banking, User Experience, and Marketing; we are additionally hiring for Manager in our Personal Banking team and Head Of-level roles in marketing. I’ve linked to some recent blog posts from the team that capture work they have done and the tools they use; if you have any questions, feel free to reach out!
- Rate: Varies by level
- Location: London / UK remote
- Contact: neal@monzo.com (please mention this list when you get in touch)
- Side reading: link, link
Head of Machine Learning
Monzo is the UK’s fastest growing app-only bank. We recently raised over $500M, valuing the company at $4.5B, and we’re growing the entire Data Science discipline in the company over the next year! Machine Learning is a specific sub-discipline of data: people in ML work across the end-to-end process, from idea to production, and have recently been focusing on several real-time inference problems in financial crime and customer operations.
We’re currently hiring more than one Head of Machine Learning, as we migrate from operating as a single, centralised team into being deeply embedded across product engineering squads all over the company. In this role, you’ll be maximising the impact and effectiveness of machine learning in an entire area of the business, helping projects launch and land, and grow and develop a diverse team of talented ML people. Feel free to reach out to Neal if you have any questions!
- Rate: >100k
- Location: London / UK remote
- Contact: neal@monzo.com; https://www.linkedin.com/in/nlathia/ (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Scientist at Caterpillar, permanent, Peterborough
Caterpillar is the world’s leading manufacturer of construction and mining equipment, diesel and natural gas engines, industrial gas turbines and diesel-electric locomotives. Data is at the core of our business at Caterpillar, and there are many roles and opportunities in the Data Science field. The Industrial Power Systems Division of Caterpillar currently has an opportunity for a Senior Data Scientist to support power system product development engineers with data insights, and to develop digital solutions for our customers to maximise the value they get from their equipment through condition monitoring.
As a Senior Data Scientist, you will work across, and lead, project teams to implement analytical models and data insights on a variety of telemetry and test data sources, in a mechanical product development environment.
- Rate: £50,000 to £55,000 (depending on experience) with up to 12% bonus
- Location: Peterborough (flexible working considered)
- Contact: sheehan_dan@cat.com (please mention this list when you get in touch)
- Side reading: link, link
NLP Data Scientist at Shell, Permanent, London
Curious about the role of NLP in the energy transition? Wondering how we can apply NLP to topics such as EV charging, biofuels and green hydrogen? If you are enthusiastic about all things NLP and are based in the UK, come and join us. We have an exciting position for an NLP Data Scientist to join Shell’s AI organization. Our team works on several projects across Shell’s businesses focusing on developing end-to-end NLP solutions.
As an NLP Data Scientist, you will work hands-on across project teams focusing on research and implementation of NLP models. We offer a friendly and inclusive atmosphere, time to work on creative side projects and run a biweekly NLP reading group.
- Rate:
- Location: London (Hybrid)
- Contact: merce.ricart@shell.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at EDF Energy, Permanent
If you’re an experienced Data Scientist looking for their next challenge, then we have an exciting opportunity for you. You’ll be joining a team who are striving to build a world class Data Centre of Excellence, helping Britain achieve Net Zero and delivering value across the customers’ business.
If you’re a self-starter and someone who has hands-on experience with deploying data science and machine learning models in a commercial environment, then this is the perfect role for you.
You will also be committed to building an inclusive, diverse and value focussed culture within the Data & CRM team with a dedication to lead by example and act as a mentor for the junior members within the team.
- Rate:
- Location: Remote with occasional travel to our offices in Croydon or Hove
- Contact: gavin.hurley@edfenergy.com (please mention this list when you get in touch)
- Side reading: link
Principal Consultant at Semantic Partners
Semantic Partners are experiencing significant demand for people with semantic technology skills, and to this end we are looking to hire around 50 people over the next 2 years. Ideally you have some practical Knowledge Graph experience, but for those looking to get into Semantics we are offering the chance to cross-train into the technology skills listed below, with full product training across several vendor graph products. About you: fast learner, critical thinking, requirements capture, logical reasoning, conceptual modelling, investigation. Engineering: Python/Java/C#/JavaScript, HTML, CSS etc. Preferred skills: SQL, API design, HTTP, system architecture.
- Rate: Competitive
- Location: Remote
- Contact: Dan Collier dan.collier@semanticpartners.com (please mention this list when you get in touch)
- Side reading: link
Cloud Engineer (Python) - Anglo American
We are a new team at Anglo American (a large mining and minerals company), working on image data at scale. There are several problems with how things are currently, including: storage limited to local hard drives, compute limited to desktops, data silos and difficulties in finding and sharing image data.
The solution that you will help build will use cloud and web technologies to store, search, visualize and run compute on global scale image archives (Terabyte to Petabyte). Your focus will be on using cloud technology to scale up capabilities (e.g. storage, search and compute). You will be building and orchestrating cloud services including serverless APIs, large databases, storage accounts, Kubernetes clusters and more, all using an Infrastructure as Code approach. We work on the Microsoft Azure cloud platform, and are building on top of open-source tools and open-standards such as the Spatio-Temporal Asset Catalog, webmap tiling services such as Titiler, and the Dask parallel processing framework.
- Rate: Competitive day rate
- Location: Remote (UTC +/- 2)
- Contact: samuel.murphy@angloamerican.com (please mention this list when you get in touch)
- Side reading: link
Full Stack developer (Python & React) at Anglo American
We are a new team at Anglo American (a large mining and minerals company), working on image data at scale. There are several problems with how things are currently, including: storage limited to local hard drives, compute limited to desktops, data silos and difficulties in finding and sharing image data.
The solution that you will help build will use cloud and web technologies to store, search, visualize and run compute on global scale image archives (Terabyte to Petabyte). Your focus will be full stack web development, building back-end APIs and front-end interfaces for users to easily access these large image archives. We will be building on open tools and standards written in Python (such as the Spatio-Temporal Asset Catalog, Titiler and Dask), and you will be extending and modifying these, as well as writing new serverless APIs in Python. Front-end development will be within the context of Anglo American frameworks, primarily using React, and will involve map visualization tools such as Leaflet and/or OpenLayers.
- Rate: Competitive day rate
- Location: Remote (UTC +/- 2)
- Contact: samuel.murphy@angloamerican.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist, Experience Team at Spotify
We are looking for a Senior Data Scientist to join our Experience insights team to help us drive and support evidence-based design and product decisions throughout Spotify’s product development process. As part of our team, you will study user behaviour, strategic initiatives, product features and more, bringing data and insights into every decision we make.
What you will do: 1) Co-operate with cross-functional teams of data scientists, user researchers, product managers, designers and engineers who are passionate about our consumer experience. 2) Perform analysis on large sets of data to extract impactful insights on user behaviour that will help drive product and design decisions. 3) Communicate insights and recommendations to stakeholders across Spotify. 4) Be a key partner in our work to build out our product strategy so that we are relevant in the daily lives of consumers.