Memory use diagnosis, Ibis 3, "XGB for all"
Thoughts
This will be a short issue: my infant has been very ill in recent weeks and I confess that sleep deprivation is catching up. I had some great replies to the Coronavirus masks question that I need to summarise; that'll happen in the next issue (when I've had some sleep). I also took an antigen test which says I've never had it (go great masks!), more next time.
There are 15 jobs below including full stack devs, quants, research and senior roles at orgs including greentech, healthcare and DeepMind.
Open source - memray, Ibis, an awful bug report
I see a new memory profiler called memray from Bloomberg, described in this lovely tweet sequence. A couple of us shared some observations on its relationship to Scalene, VizTracer and memory-profiler. I teach these three in my Higher Performance Python course and I need to test memray to see how it stacks up. Have you tried it? Do you have an observation to share? Reply to this if you want to know the date for my next course.
I see that Ibis v3 has been released, see the release notes. Ibis lets you query a variety of data sources with an SQL-like interface with Pandas integration. It aims to be more concise than SQLAlchemy and abstracts away the SQL implementation (e.g. for SQLite, PostgreSQL, BigQuery). There's a set of short demos on the homepage. Have you had any value trying Ibis?
Take a look here to see an energy-sapping crappy bug report on sklearn. Obviously none of us would want to write something like this, but it happens nonetheless. If you don't yet contribute back and you'd like to support a project you like - please consider checking new bug reports and providing some polite triage. This can be as simple as pointing out that a bug report isn't usefully complete through to running a supplied test case and messaging that you verify (or not) the issue.
If you worry that you don't "have anything useful to share back", acting as a back-office admin helping with triage is a very helpful way to support the projects that help you earn your salary.
Pandas request to you
What's the hardest thing you had to learn in Pandas? I'm working on writing an intermediate-focused course for Pandas, taking the stuff that's hard to find in the likes of stackoverflow and summarising it into a solid intermediate+ course.
Please cast your mind back - what hurt you the most to learn in Pandas? If you email me (just reply to this) and you have a question about Pandas, I'll endeavour to answer it. Topics like speeding-up groupby operations, understanding the BlockManager, figuring out correct logical masks, understanding datetime objects (and their limits) and resampling and generally "getting stuff done correctly the first time" feel like firm favourite issues.
Data Science Technique - XGB for all?
Here's a short tweet thread on why you probably don't want to use One Hot Encoding with XGB if you have many levels, with Kaggle Grand Master Bojan agreeing. In my (admittedly ancient now) experimentation I found the same - OHE is good for linear models but mostly not useful for RF/XGB. Also keeping the column-count down means the model trains faster and explanations and feature importances make more sense. What's your experience with GBTs and categorical columns?
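To make the column-count point concrete, here's a minimal sketch using only the standard library. It assumes a made-up `postcodes` column with 200 levels, and the helper names `one_hot_encode` and `ordinal_encode` are hypothetical, written just for this illustration (in practice you'd reach for your usual Pandas or sklearn encoders). It only shows how many columns each encoding produces, it doesn't train a model or say anything about accuracy.

```python
def one_hot_encode(values):
    """Return (levels, rows): one 0/1 column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def ordinal_encode(values):
    """Return (mapping, codes): a single integer column."""
    mapping = {level: code for code, level in enumerate(sorted(set(values)))}
    return mapping, [mapping[v] for v in values]


# Hypothetical high-cardinality categorical column:
# 1,000 rows spread evenly over 200 made-up postcodes.
postcodes = [f"PC{i % 200:03d}" for i in range(1000)]

ohe_levels, ohe_rows = one_hot_encode(postcodes)
mapping, codes = ordinal_encode(postcodes)

print("OHE columns:", len(ohe_levels))  # one column per level
print("Ordinal columns:", 1)            # always a single column
```

With OHE every new level adds another sparse column that the trees have to consider at each split, whereas the integer-coded version stays at one column however many levels turn up. Note that the integer codes impose an arbitrary ordering, which is an assumption that trees often cope with but linear models generally don't.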
I'll be giving a keynote talk on Building Successful Data Science Projects at the BudapestML conference next week (more on that in the next issue). Pafka Szilárd will be talking on Best Algorithm for Tabular/Business Data: Sorry, it’s not deep learning, covering how XGB is probably all you need for most of your tabular data needs. I quite agree and I'm looking forward to the talk. I've no idea if/when these talks will be public so you may want to buy a ticket to virtually attend (I'm live streaming my talk as I'm not travelling at present).
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it'll go to all 1,400 subscribers 3 times over 6 weeks, subsequent posts are charged.
Principal Population Health & Care Analyst
An exciting opportunity has arisen for a Principal Population Health Analyst to join the Population Health and Care team at Lewisham and Greenwich Trust (LGT) where the post holder will be instrumental in leading the analytics function and team for Lewisham's Population Health and Care system.
Lewisham is the only borough in South East London to have a population health management information system (Cerner HealtheIntent) that is capable of driving change, innovation and clinical effectiveness across the borough. The post-holder will therefore work closely with public health consultants, local stakeholders and third-party consultancies to explore epidemiology through the use of HealtheIntent, and design new models of transformative care that will deliver proactive and more sustainable health care services.
LGT is therefore seeking an experienced Principal Population Health Analyst who is equally as passionate about transforming and improving the lives and care of patients through data analytics and can draw key and actionable insights from our data. The successful candidate will be an experienced people manager with strong communication skills to lead a team of analysts and manage the provision of data analytics to a diverse range of stakeholders across Lewisham, with particular focus on population health and bring together best practice and innovative approaches.
- Rate: £61,861 - £70,959
- Location: Laurence House, 1 Catford Rd, London SE6 4RU
- Contact: rachael.crampton@nhs.net (please mention this list when you get in touch)
- Side reading: link
Data Engineer (Python, Node.js, or Go)
We import mid/large-scale data simultaneously from multiple sources (large databases, proprietary data stores, gigabyte spreadsheets), and merge it into a single queryable data-store. We need someone with a DevOps, DataScience, or Back End Engineering background to impose order on the chaos. This role is a mix of data-science and engineering-for-scale, taking real-world data and inventing automated, scalable, systems to deal with it.
This is a chance to join a well-funded startup (with revenue and customers) at the beginning of a new growth phase. Working with our Lead Back-End Engineer and CTO, you’ll be designing the new systems and taking the lead on implementing and maintaining them. Ideally you have experience of implementing backends using a variety of frameworks, techs, languages - we’re agnostic on specific tech, in most cases using the best tool for each job.
- Rate: £70k-90k
- Location: Remote
- Contact: adam@plural.ai (please mention this list when you get in touch)
Data Research Engineer at DeepMind, Permanent, London
We are looking for Data Research Engineers to join DeepMind’s newly formed Data Team. Data is playing an increasingly crucial role in the advancement of AI research, with improvements in data quality largely responsible for some of the most significant research breakthroughs in recent years. As a Data Research Engineer you will embed in research projects, focusing on improving the range and quality of data used in research across DeepMind, as well as exploring ways in which models can make better use of data.
This role encompasses aspects of both research and engineering, and may include any of the following: building scalable dataset generation pipelines; conducting deep exploratory analyses to inform new data collection and processing methods; designing and implementing performant data-loading code; running large-scale experiments with human annotators; researching ways to more effectively evaluate models; and developing new, scalable methods to extract, clean, and filter data. This role would suit a strong engineer with a curious, research-oriented mindset: when faced with ambiguity your instinct is to dig into the data, and not take performance metrics at face value.
- Rate: Competitive
- Location: London
- Contact: Please apply through the website (and please mention this list and my name, Katie Millican, when you get in touch)
- Side reading: link
Full Stack Python Developer at Risilience
Join us in our mission to help tackle climate change, one of the biggest systemic threats facing the planet today. We are a start-up providing analytics and software to assist companies in navigating climate uncertainty and transitioning to net zero. We apply research frameworks pioneered by the Centre for Risk Studies at the University of Cambridge Judge Business School and are already engaged by some of Europe’s biggest brands. The SaaS product that you will be working on uses cloud and Python technologies to store, analyse and visualize an organization’s climate risk and to define and monitor net-zero strategies. Your focus will be on full stack web development, delivering the work of our research teams through a scalable analytics platform and compelling data visualization. The main tech-stack is Python, Flask, Dash, Postgres and AWS. Experience of working with scientific data sets and test frameworks would be a plus. We are recruiting developers at both junior and senior levels.
- Rate:
- Location: Partial remote, Cambridge
- Contact: mark.pinkerton@risilience.com (please mention this list when you get in touch)
- Side reading: link
Python Quant Developer at Balyasny Asset Management
We want to hire an enthusiastic Python developer to help put together a modern analytics pipeline on the cloud. The ideal candidate will be able to contribute to the overall architecture, thinking about how we can distribute our data flows and calculations, our choice of databases and messaging components, and then working with devops and the rest of our team, to implement our decisions, etc. We want someone who’s an expert Python programmer, and can work well in a multi-disciplinary team, from liaising with Portfolio Managers through to our researchers writing in C++ and numpy / pandas / Jupyter / etc, and front-end developers working in react.js or Excel. Interest and experience in tech like microservices (FastAPI / Flask), Kafka, SQL, Redis, Mongo, javascript frameworks, PyXLL, etc, a bonus. But above all, leading by example when it comes to writing modern, high quality Python, and helping to put in place the structure and SDLC to enable the rest of the team to do so.
- Rate: very competitive + benefits + bonus
- Location: London / New York / hybrid
- Contact: Tony Gould tgould@bamfunds.com (please mention this list when you get in touch)
- Side reading: link, link
Research Data Scientist at Callsign, Permanent, London
We are looking for a Research Scientist who will help build, grow and promote the machine learning capabilities of Callsign's AI-driven identity and authentication solutions. The role will principally involve developing and improving machine learning models which analyse behavioural, biometric, and threat-related data. The role is centred around the research skill set--the ability to devise, implement and evaluate new machine learning models is a strong requirement. Because the role involves the entire research and development cycle from idea to production-ready code we require some experience around good software development practices, including unit testing. There is also opportunity to explore the research engineer pathway. Finally, because the role also entails writing technical documentation and whitepapers, strong writing skills are essential.
- Rate:
- Location: St. Paul's - London & Flexible
- Contact: daniel.maldonado@callsign.com (please mention this list when you get in touch)
- Side reading: link
Senior to Director Data Scientists
Data Scientists at Monzo are embedded into nearly every corner of the business, where we work on all things data: analyses and customer insights, A/B testing, metrics to help us track against our goals, and more. If you enjoy working within a cross-disciplinary team of engineers, designers, product managers (and more!) to help them understand their products, customers, and tools and how they can leverage data to achieve their goals, this role is for you!
We are currently hiring for Data Scientists across several areas of Monzo: from Monzo Flex through to Payments, Personal Banking, User Experience, and Marketing; we are additionally hiring for Manager in our Personal Banking team and Head Of-level roles in marketing. I’ve linked to some recent blog posts from the team that capture work they have done and the tools they use; if you have any questions, feel free to reach out!
- Rate: Varies by level
- Location: London / UK remote
- Contact: neal@monzo.com (please mention this list when you get in touch)
- Side reading: link, link
Head of Machine Learning
Monzo is the UK’s fastest growing app-only bank. We recently raised over $500M, valuing the company at $4.5B, and we’re growing the entire Data Science discipline in the company over the next year! Machine Learning is a specific sub-discipline of data: people in ML work across the end-to-end process, from idea to production, and have recently been focusing on several real-time inference problems in financial crime and customer operations.
We’re currently hiring more than one Head of Machine Learning, as we migrate from operating as a single, centralised team into being deeply embedded across product engineering squads all over the company. In this role, you’ll be maximising the impact and effectiveness of machine learning in an entire area of the business, helping projects launch and land, and grow and develop a diverse team of talented ML people. Feel free to reach out to Neal if you have any questions!
- Rate: >100k
- Location: London / UK remote
- Contact: neal@monzo.com; https://www.linkedin.com/in/nlathia/ (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Scientist at Caterpillar, permanent, Peterborough
Caterpillar is the world’s leading manufacturer of construction and mining equipment, diesel and natural gas engines, industrial gas turbines and diesel-electric locomotives. Data is at the core of our business at Caterpillar, and there are many roles and opportunities in Data Science field. The Industrial Power Systems Division of Caterpillar currently has an opportunity for a Senior Data Scientist to support power system product development engineers with data insights, and to develop digital solutions for our customers to maximise the value they get from their equipment through condition monitoring.
As a Senior Data Scientist, you will work across, and lead, project teams to implement analytical models and data insights on a variety of telemetry and test data sources, in a mechanical product development environment.
- Rate: £50,000 to £55,000 (depending on experience) with up to 12% bonus
- Location: Peterborough (flexible working considered)
- Contact: sheehan_dan@cat.com (please mention this list when you get in touch)
- Side reading: link, link
NLP Data Scientist at Shell, Permanent, London
Curious about the role of NLP in the energy transition? Wondering how we can apply NLP to topics such as EV charging, biofuels and green hydrogen? If you are enthusiastic about all things NLP and are based in the UK, come and join us. We have an exciting position for an NLP Data Scientist to join Shell’s AI organization. Our team works on several projects across Shell’s businesses focusing on developing end-to-end NLP solutions.
As an NLP Data Scientist, you will work hands-on across project teams focusing on research and implementation of NLP models. We offer a friendly and inclusive atmosphere, time to work on creative side projects and run a biweekly NLP reading group.
- Rate:
- Location: London (Hybrid)
- Contact: merce.ricart@shell.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at EDF Energy, Permanent
If you’re an experienced Data Scientist looking for their next challenge, then we have an exciting opportunity for you. You’ll be joining a team who are striving to build a world class Data Centre of excellence helping Britain achieve Net Zero and delivering value across the customers business.
If you’re a self-starter and someone who has hands-on experience with deploying data science and machine learning models in a commercial environment, then this is the perfect role for you.
You will also be committed to building an inclusive, diverse and value focussed culture within the Data & CRM team with a dedication to lead by example and act as a mentor for the junior members within the team.
- Rate:
- Location: Remote with occasional travel to our offices in Croydon or Hove
- Contact: gavin.hurley@edfenergy.com (please mention this list when you get in touch)
- Side reading: link
Principal Consultant at Semantic Partners
Semantic Partners are experiencing significant demand for people with semantic technology skills, and to this end we are looking to hire around 50 people over the next 2 years. Ideally you have some practical Knowledge Graph experience but for those looking to get into Semantics, we are offering the chance to cross train into the technology skills listed below and offer full product training across and into several vendor graph products. About you - fast learner, critical thinking, requirements capture, logical reasoning, conceptual modelling, investigation, Engineering; Python/Java/C#/Javascript, HTML, CSS etc, Preferred skills - SQL, API design, HTTP System architecture
- Rate: Competitive
- Location: Remote
- Contact: Dan Collier dan.collier@semanticpartners.com (please mention this list when you get in touch)
- Side reading: link
Cloud Engineer (Python) - Anglo American
We are a new team at Anglo American (a large mining and minerals company), working on image data at scale. There are several problems with how things are currently, including: storage limited to local hard drives, compute limited to desktops, data silos and difficulties in finding and sharing image data.
The solution that you will help build will use cloud and web technologies to store, search, visualize and run compute on global scale image archives (Terabyte to Petabyte). Your focus will be on using cloud technology to scale up capabilities (e.g. storage, search and compute). You will be building and orchestrating cloud services including serverless APIs, large databases, storage accounts, Kubernetes clusters and more, all using an Infrastructure as Code approach. We work on the Microsoft Azure cloud platform, and are building on top of open-source tools and open-standards such as the Spatio-Temporal Asset Catalog, webmap tiling services such as Titiler, and the Dask parallel processing framework.
- Rate: Competitive day rate
- Location: Remote (UTC +/- 2)
- Contact: samuel.murphy@angloamerican.com (please mention this list when you get in touch)
- Side reading: link
Full Stack developer (Python & React) at Anglo American
We are a new team at Anglo American (a large mining and minerals company), working on image data at scale. There are several problems with how things are currently, including: storage limited to local hard drives, compute limited to desktops, data silos and difficulties in finding and sharing image data.
The solution that you will help build will use cloud and web technologies to store, search, visualize and run compute on global scale image archives (Terabyte to Petabyte). Your focus will be full stack web development, building back-end APIs and front-end interfaces for users to easily access these large image archives. We will be building on open tools and standards written in Python (such as the Spatio-Temporal Asset Catalog, Titiler and Dask), and you will be extending and modifying these, as well as writing new serverless APIs in Python. Front-end development will be within the context of Anglo American frameworks, primarily using React, and will involve map visualization tools such as Leaflet and/or OpenLayers.
- Rate: Competitive day rate
- Location: Remote (UTC +/- 2)
- Contact: samuel.murphy@angloamerican.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist, Experience Team at Spotify
We are looking for a Senior Data Scientist to join our Experience insights team to help us drive and support evidence-based design and product decisions throughout Spotify's product development process. As part of our team, you will study user behaviour, strategic initiatives, product features and more, bringing data and insights into every decision we make.
What you will do: 1) Co-operate with cross-functional teams of data scientists, user researchers, product managers, designers and engineers who are passionate about our consumer experience. 2) Perform analysis on large sets of data to extract impactful insights on user behaviour that will help drive product and design decisions. 3) Communicate insights and recommendations to stakeholders across Spotify. 4) Be a key partner in our work to build out our product strategy so that we are relevant in the daily lives of consumers.