Part 2 of the "Polars - faster than Pandas" interview, Pandera for dataframe checks, and "How To Measure Anything"
Did you know that Netacea are hiring for a Head of DS and a Data Engineer, Signal AI need a Senior Data Scientist, Aflorithmic need a Data Engineer and a Software Engineer, and VivacityLabs have a Special Projects role? Details for all of these and more are down below.
Thoughts and an interview with the author of Polars
Below I have the second part of the interview with Ritchie Vink on his "faster than Pandas" Polars dataframe library - part one came last issue. It is only 2 years old and beats Pandas (sometimes significantly) in many benchmarks. Previously I interviewed Kaggle Competition Expert Mani Sarkar which contained lots of great ML tips.
Building on the "beginner mind" note from the last issue I'll add that I keep a GDoc with high-level notes I've taken from books I've read. I read a lot and whilst lots of little ideas get integrated into my thinking, it is so easy to forget other useful points, especially "easy stuff". I've spent over a year building this doc - it contains bullets of points and is deliberately light. Looking back on it always reminds me of something useful. Do you have any practices to build a second long-term memory that you've found valuable?
Successful Data Science Projects course for Feb 9th+10th
My next Success course will give you the tools you need to derisk projects and increase the likelihood that you deliver on time with happy clients. Send me an email if you have questions. It runs on Feb 9th+10th, virtually, in UK hours.
We'll look at good process to make new projects work well, work through common project failings, look at tools that make it easy to derisk new datasets and practice prioritisation and estimation. The project specification document is a key highlight in the course - this is especially useful if your team rarely writes things down or doesn't write a document that actually supports the team.
Business strategy - estimating successfully
Early in my career I was awful at estimating "how long it'd take" to complete a task. It's a rarely taught skill and part of the wider "estimate anything" problem set. I've increasingly become interested in estimating the value behind new projects. How does one estimate the financial value or other metric for a never-done-before data science problem, on unseen data, for a team who may not want it (but who may need it)?
If this interests you I recommend How To Measure Anything for a very sane dig into techniques that work - particularly around value estimation. I've also stumbled onto Metaculus, a forecasting site that uses a similar approach to How To Measure Anything. I've found myself sucked into little games on predicting financial markets, dates for the prevalence of Omicron and how many electric cars might be sold in the USA in 2022. Figuring out how to come up with a convincing range (not just a point value) is an excellent exercise. I'd strongly recommend going through the Metaculus tutorial if you're at all interested in this.
Once you can come up with a defensible range estimate for a timescale or a range for expected business value you'll be in a much stronger position to influence projects and their success.
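As a toy illustration of the How To Measure Anything style of range estimation (all the numbers here are invented for illustration, not from the book), a quick Monte Carlo can combine two 90% intervals into a defensible 90% range for their product:

```python
import random

random.seed(0)

def sample_90ci(low, high):
    # Treat a 90% confidence interval as a normal distribution:
    # the midpoint is the mean, and 3.29 standard deviations span the bounds.
    mu = (low + high) / 2
    sigma = (high - low) / 3.29
    return random.gauss(mu, sigma)

# Invented inputs: users affected (1k-5k) times value per user (0.50-2.00)
trials = sorted(sample_90ci(1_000, 5_000) * sample_90ci(0.5, 2.0)
                for _ in range(10_000))
low, high = trials[500], trials[9_500]  # empirical 90% range for project value
```

The point is that a range for the combined estimate falls out naturally, rather than a misleadingly precise point value.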
We'll talk a little about value estimation during my next Success course.
Open Source - writing better code faster
For my upcoming private Software Engineering for Data Scientists course I've replaced Bulwark with Pandera for data quality checks on Pandas dataframes (Pandera seems to have much more momentum). Helping folk spot that bad data hurts their pipelines has been a consistent win and Pandera appears to have quite a strong ecosystem - take a look if you know what you want in your dataframes. On young projects the next data dump I process always seems to have yet another oddity that trips me up.
For Notebook quality I've adopted flake8, pandas-vet, bugbear, variables-names and flake8-builtins driven by the excellent nbQA.
Respectively: flake8 checks your general code quality (and is less whiney than PyLint), pandas-vet checks Pandas idioms ("don't call it df!", "avoid .ix"), BugBear helps you avoid silly things like mutable default arguments in function signatures, and the last two help you write better variable names (e.g. avoid result or list as names). nbQA lets you apply a usually-script-only tool like flake8 to a Notebook. Black of course auto-cleans Notebooks now as well as scripts. When I play at Project Euler I'm surprised by the number of silly mistakes that flake8 and friends help me spot before I run my code, which helps me keep up momentum.
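To make the BugBear point concrete, here's the classic mutable-default-argument trap it flags (rule B006) - the function names are mine, for illustration:

```python
# Flagged by flake8-bugbear (B006): the default list is created once at
# function definition time and then shared across every call.
def append_bad(item, items=[]):
    items.append(item)
    return items

# Safer pattern: use None as a sentinel and create a fresh list per call.
def append_good(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```

Calling `append_bad(1)` then `append_bad(2)` returns `[1, 2]` on the second call - the leftover state from the first call is still there, which is exactly the surprise the linter saves you from.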
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
Interview on "faster than Pandas" Polars with Ritchie Vink (Part 2 of 2)
Ritchie Vink is the creator of Polars, a faster-than-Pandas dataframe library that he started during the first lockdown of 2020. His "I wrote one of the fastest dataframe libraries" blog post outlines a number of the technical design points that make Polars so fast for in-memory single machine dataframe manipulation and is a great read. Let's learn some more...
You've developed a well engineered tool. What software engineering practices do you use? Do you like tests?
Of course we test! There are a lot of moving parts and you want to trust that your changes don't break behaviour or cause bugs. Unit tests are essential for this in a library, and a good Continuous Integration system means all your tests run whenever you make big changes, like refactoring in a new idea, so you know you didn't break something else.
Luckily almost all of the code is written in Rust, which gives great compile-time guarantees. I would have given up on the project already if it were written only in Python, because I would not have dared some of the tremendous refactors we've done. The CI system automates a part of "me" so I don't have to check silly things like the code style of a contribution - in Python projects you'd use black for this.
My development style focuses on keeping it simple. Given the choice I generally prefer composition over inheritance, and I definitely keep everything simple.
How could a new contributor make a useful addition?
You could explore the issues or make a suggestion of a feature you’d like to see implemented (and maybe implement it). Helping write small examples for the documentation would also be very valuable, because it is far from complete yet. The code repo has marked "good first issues" and the book repo has tasks to tackle if you'd like a jumping-off point.
Pandas is famous for its time series support - what does Polars offer to a quant?
This weekend we had a large boost to our time series support. We now have specific grouping on time windows (alongside other groupby keys), which lets you do aggregations with the full expression API - meaning you can do very complex things that you could not do in Pandas without running slow Python bytecode.
And of course Polars is super fast, so you can see roughly 5-50x speedups over Pandas.
What's the fastest way to get started with Polars? Do you have any tutorials to suggest, and ideas on easy wins to try whilst someone experiments with Polars?
Start reading the expression guide in the user guide and follow the quick start guide.
What video would you recommend for people who want to dig more into performance?
Scott Meyers has a great video from ten years back, CPU Caches and Why You Care (with slides), if you want to learn why keeping data local to CPU caches is essential for higher performance.
Pandas was never designed to be multi-core; it leans on NumPy, which was never designed to share its data into a heterogeneous container of same-length blocks, and it is used by Dask, which knows that Pandas wasn't designed to scale beyond 1 CPU. Some of those issues can easily be traced back to cache locality - watch the video to get an expert view on performant computation.
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you're growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it'll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
Senior Data Scientist & Data Science Manager & Head of Data Engineering at Infogrid, Permanent, London
Infogrid is helping protect the planet and improve the lives of billions of people by making every building a smart building. Our goal is to be the global provider for connected devices in smart buildings. We already handle millions of events every day from tens of thousands of sensors and we’d like you to help us scale that by an order of magnitude over the coming months.
Sustainability is at our heart; buildings account for 39% of global carbon emissions and we're creating real solutions to impact this! We are still early in our journey but have already achieved a lot; we raised a successful series A funding round, grew 5x in employee numbers within 12 months, and were voted one of the top 10 most flexible places to work.
- Rate:
- Location: Remote UK
- Contact: myriam@infogrid.io (please mention this list when you get in touch)
- Side reading: link, link, link
Data Engineer at Netacea
Netacea is an industry-leading provider of bot detection & mitigation capabilities to businesses struggling with automated threats against their websites, apps and APIs. We ingest and predict on vast quantities of streamed real-time data, sometimes millions of messages per second. As a successful start-up that is now scaling up substantially, having robust and high quality data pipelines is more important than ever. We are looking for an experienced data engineer with a passion for technology and data to help us build a stable and scalable platform.
You will be part of a strong and established data science team, working with another data engineer and with our chief technical architect to research, explore and build our next generation pipelines & processes for handling vast quantities of data and applying our state-of-the-art bot detection capabilities. You will get the opportunity to explore new technologies, face unique challenges, and develop your own skills and experience through training opportunities and collaboration with our other highly skilled delivery teams.
- Rate: Up to £70k, dependent on experience.
- Location: UK-based remote, with office in Manchester
- Contact: katie.slater@netacea.com (please mention this list when you get in touch)
- Side reading: link, link, link
Lead Data Scientist and Data Scientist roles at Netacea
We have open positions for two mid-level data scientists on our team at Netacea. You will be joining a strong and established team of data scientists and data engineers, working on unique problems at a vast scale. You will be building an industry-leading bot detection product, solving new emerging threats for our customers, and developing your own skills and experience through training opportunities and collaboration with our other highly skilled delivery teams.
We also have two Lead Data Scientist roles with one of these specialised towards supporting long-term technical customer relationships. Both Lead roles will be fundamental to the success and growth of the data science function at Netacea. You will be a technical leader, driving quality and innovation in our product, and supporting a highly competent team to deliver revolutionary data science for our customers.
Application links:
- Lead Data Scientist (Commercial): https://apply.workable.com/netacea-1/j/4B7ACCC80D/?utm_medium=social_share_link
- Lead Data Scientist: https://apply.workable.com/j/F3A4E8F82F/?utm_medium=social_share_link
- Data Scientist: https://apply.workable.com/j/D58EA8DCE2/?utm_medium=social_share_link
- Rate: Mid-level roles up to £55k dependent on experience; Lead roles up to £85k dependent on experience.
- Location: UK-based remote, with office in Manchester
- Contact: katie.slater@netacea.com (please mention this list when you get in touch)
- Side reading: link, link, link
Head of Data Science at Netacea
Netacea is a Manchester-based business providing revolutionary products, including a website queuing system that prevents traffic surges from taking down websites, and a bot management solution that protects websites, mobile apps and APIs from heavy traffic and malicious attacks such as scraping, credential stuffing and account takeover. Netacea was recently categorised by Forrester as a leader in this rapidly expanding market.
We are looking for an outstanding leader to spearhead the growth and development of their data science team. As Head of Data Science, you will lead a department of skilled engineers to deliver outstanding solutions to the most interesting problems in cybersecurity. You will feel comfortable working in an agile way, taking ownership of data science strategy, effectiveness, delivery, and quality. You will grow, nurture, and develop your team and encourage them to explore their full potential. This is a mainly hands-off role, but you should feel confident talking about data science technology with internal and external stakeholders and partners. You will be passionate about data, and understand how it can be used to deliver value to customers.
- Rate:
- Location: UK-based remote, with office in Manchester
- Contact: katie.slater@netacea.com (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist (Platform) - Signal AI, Full-time, London (UK)
You will be a core player in the growth of our platform. You will work within one of our platform teams to innovate, collaborate, and iterate in developing solutions to difficult problems. Our teams are autonomous and cross-functional, encompassing every role required to build and improve on our products in whatever way we see best. You will be hands-on working on end-to-end product development cycles from discovery to deployment. This encompasses helping your team discover problems and explore the feasibility and value of potential ML-driven solutions; building prototype solutions and conducting offline and online experiments for validation; collaborating with engineers and product managers on bringing further iterations for those solutions into the products through integration, deployment and scaling.
This particular role will initially be within a team whose responsibilities include effectiveness and efficiency of our labelling processes and tool, training, monitoring and deployment of systems and models for entity linking, text classification and sentiment analysis, among others, across multiple data types. This team also works closely with the operation teams to ensure systems and models are properly maintained.
- Rate:
- Location: London (Old Street) - Hybrid model (2 days a week in the office)
- Contact: jiyin.he@signal-ai.com (please mention this list when you get in touch)
- Side reading: link, link, link
Software Engineer, Data Engineer and TPM at Aflorithmic Labs, London (Hybrid)
We're an audio as a service startup, building an API first solution to add audio to applications. We have customers and we're fast growing.
As an Audio-as-a-Service, API-first voice tech company, our aim is to democratise the way audio is produced. We use AI and "Deepfake for Good" to create beautiful voice and audio from simple text-to-speech - making creating beautiful audio content (from simple text) as easy as writing a blog. Join a 23-person international engineering, voice, R&D and business team made up of 13 nationalities (backgrounds include: ex-University of Edinburgh, PhDs, European Space Agency, SAP, Amazon).
We're looking for a data engineer to work on the core data pipelines for our voice-as-a-service and to support our growing team. Our stack includes Kubernetes, Python and NodeJS, and we use a lot of Kubeflow and the serverless stack.
- Rate:
- Location: Bermondsey, London (hybrid)
- Contact: peadar@aflorithmic.ai (please mention this list when you get in touch)
- Side reading: link, link
Special Projects - Solutions Developer at Vivacity Labs, permanent, London
At Vivacity, we make cities smarter. Using Reinforcement Learning techniques at the forefront of academic and research thinking, our award winning teams optimise traffic lights to prioritise cyclists and improve air quality. Our work makes a real difference to real people using 'privacy by design' principles.
We’re looking for a confident developer / ML engineer, who is comfortable working in an adaptive setting: get familiar with complex concepts, implement accurately, and communicate your plans effectively with various stakeholders. We'd like to see 1-2 years of industry experience in a relevant field. Our software is in many modern programming languages (Python, Golang, C++ etc) so you will need a willingness to learn. We'd also like to see good capability with Python or Golang.
- Rate: £45,000 - £60,000pa
- Location: Kentish Town, London
- Contact: lindsey.noakes@vivacitylabs.com (please mention this list when you get in touch)
- Side reading: link, link, link
Zarr Community Manager, NumFOCUS, Inc.
Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. Built originally in Python for working with NumPy arrays, Zarr is now supported in more than half a dozen languages. With funding from the Chan Zuckerberg Initiative, we are looking to hire a full-time, open-source enthusiast for two years to work as our community manager.
- Rate: This role can be either a contract position or an employed position with fringe benefits. $60,000 – $80,000 per year dependent on position type and experience.
- Location: Remote
- Contact: hiring@numfocus.org (please mention this list when you get in touch)
- Side reading: link
SunPy Scientific Software Developer, NumFOCUS, Inc.
NumFOCUS is seeking a Scientific Software Developer to support the SunPy project. SunPy is a Python-based open source scientific software package supporting solar physics data analysis. Contract is available for U.S. residents only. This is a 1-year contract but work may be completed in less time.
- Rate: $80.00 per hour, not to exceed $51,000 for the duration of the contract (approximately 637 hours).
- Location: Remote
- Contact: hiring@numfocus.org (please mention this list when you get in touch)
- Side reading: link
Jupyter Community Events Manager at NumFOCUS, Inc.
The primary role of the Project Jupyter Community Events Manager will be to manage two event programs: JupyterCon and Jupyter Community Workshops. In conjunction with NumFOCUS and Project Jupyter leadership, you will create and implement a strategy to connect the international Jupyter community through both online and in-person events.
- Rate:
- Location: Austin, TX
- Contact: hiring@numfocus.org (please mention this list when you get in touch)
- Side reading: link
Software Engineer (Python Dev with AWS) at Inawisdom Ltd: Permanent; UK /WFH
Inawisdom are a Data Science & Machine Learning Consultancy, and an AWS Premier Partner. We are looking for mid+ level Python developers with AWS experience (or OO programmers with AWS who are willing to learn Python, or vice versa!) for a permanent role. This is an exciting opportunity for someone to make an impact implementing and delivering cloud native solutions and serverless applications in a Data Science business. You will be required to develop software with the latest and greatest tech for high profile, enterprise clients.
• Knowledge of functional and object oriented programming.
• Knowledge of synchronous and asynchronous programming.
• 2 or more years developing in Python 2.6 or 3.x.
• Experience in using Python frameworks (e.g. Flask, Boto 3).
• Familiarity with Amazon Web Services (AWS) and REST APIs.
• Understanding of databases and SQL.
• Understanding of NoSQL databases.
• Experience in unit testing and TDD.
Desirable requirements:
• Experience in AWS serverless services (Lambda, API GW, SNS, SQS, and DynamoDB).
• Has developed solutions using AWS SAM or the Serverless Framework and defined APIs in Swagger.