Ian Ozsvald's Thoughts - Faster Pandas GroupBys
New look and more higher performance Python
I'm moving to a new newsletter provider, hence the rather sudden change (I'll get my CSS in order in coming issues). It is still me (Ian Ozsvald), just sans CSS.
If you haven't seen this in a while, maybe my last provider sent it to your spam folder. There's an unsubscribe link directly above this if you no longer want these emails.
And back to the usual service...
I ran another of my Successful DS Projects courses two weeks back - we got to discuss a nice set of problems and talked through using my Project Spec to help derisk projects. If you aren't currently derisking your projects I'd humbly suggest you think through all the risks - from data to deployment - and ask which might impact you so you can get in front of any issues. There are some other notes in recent newsletters.
I recently spoke on Faster Pandas at the NDR.ai and DevDays conferences. My new tip is the numpy-groupies library. It uses Numba to accelerate aggregations and can outperform Pandas' own groupby. One wrinkle is that you need to factorize your data ahead of time; pd.factorize is fast but I've prototyped an even faster Numba solution. I plan to contribute some code back soon. Do you suffer from slower-than-desired groupbys? I know that can be annoying for repeat-run analysis code.
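As a rough illustration of the factorize-then-aggregate idea, here is a minimal sketch. The DataFrame and its column names are made up for the example, and np.bincount stands in for the accelerated aggregation so that the snippet only needs NumPy and Pandas; it is not the Numba prototype mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: the column names are made up for illustration.
df = pd.DataFrame({
    "key": ["a", "b", "a", "c", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Step 1: factorize turns arbitrary labels into dense integer codes 0..n-1,
# numbered in order of first appearance.
codes, uniques = pd.factorize(df["key"])

# Step 2: aggregate over the integer codes. np.bincount with weights is a
# grouped sum; numpy-groupies offers the same kind of operation through its
# aggregate function, e.g. aggregate(codes, values, func="sum").
sums = np.bincount(codes, weights=df["value"].to_numpy(), minlength=len(uniques))

result = pd.Series(sums, index=uniques, name="value")

# Reference answer from Pandas' own groupby, for comparison.
expected = df.groupby("key", sort=False)["value"].sum()
```

The integer codes are what make the fast path possible: once every group is a small integer, the aggregation becomes a single pass over two arrays.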
On this subject we'll be discussing faster Pandas code in my next Higher Performance Python course on July 21-23 via Zoom. We'll also use profiling to understand what's slow in your code, use Numba and vectorisation for acceleration and scale to larger data with Dask. Do you want tools that'll make you shine at work? Early bird tickets are still available.
Ages back I asked "which tools make you faster?". Dave Kirby kindly submitted VisiData - "VIM for tabular data". The videos look great; I've yet to need it but want a chance to play. You can either run it in the terminal or import it during an IPython session to navigate a DataFrame. Thanks Dave!
Georgina Roughsedge has shared some notes about her Data ETL data-checking library, written around financial data but applicable to many fields. She's actively after feedback, so feel free to engage on GitHub. Thanks Georgina! Here's a summary:
"When curating a data set there is always more data in the future. This data is often in the same format and should follow the same patterns and assumptions made on the historic set, but sometimes processes change at the source and you may not be informed. It needs to be checked and understood to make sure it's not meaningfully different before use. The main motivation for this package was so that data ingest could be set up with sensible blocks of checks at each step of a process and corrections could be objectively asked to be made by the data owners rather than data users. This was my solution to a problem that was taking up a lot of time."
"Data owners may add extra fields, relabel an existing field or overhaul the whole file / schema structure. Being able to test all assumptions for all users of the data becomes quite powerful if you can test it while the data is young and fresh to the data owners; going back months down the line, people begin to forget what was done and why. Testing all assumptions applies just as importantly to data coming from a system as to data in a static file, as it also has opportunities for assumptions to deviate from the expected data behaviour and for manual errors to creep in."
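To make the idea of a "block of checks" concrete, here is a minimal sketch using plain Pandas. It is not the API of Georgina's library; the function name, the expected columns and the price rule are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical schema agreed with the data owner; names are illustrative only.
EXPECTED_COLUMNS = {"trade_id", "price", "currency"}


def check_batch(df):
    """Return a list of human-readable problems found in a new batch."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    extra = set(df.columns) - EXPECTED_COLUMNS
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    # Example of a value-level assumption carried over from the historic data.
    if "price" in df.columns and (df["price"] <= 0).any():
        problems.append("non-positive price found")
    return problems


good = pd.DataFrame({"trade_id": [1, 2], "price": [9.5, 10.0], "currency": ["GBP", "GBP"]})
# Here the owner has relabelled "currency" to "ccy" and a bad price has crept in.
bad = pd.DataFrame({"trade_id": [3], "price": [-1.0], "ccy": ["GBP"]})

good_problems = check_batch(good)
bad_problems = check_batch(bad)
```

Returning a list of problems rather than raising on the first one gives the data owner a complete, objective report to act on.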
Now some jobs...
Jobs are provided by readers. If you're growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers.
Consultant/Full-stack Developer
I'm looking for data-focussed software consultants on behalf of Sahaj.ai. Sahaj are a premium consultancy who focus on the intersection of data science, data engineering and platform engineering to solve complex problems for clients across a variety of industries. Their culture is built on trust and transparency, they have open salaries and there is a flat structure with no job titles or grades.
You can expect to work in small 2-5 people teams, working very closely with clients in iteratively developing and evolving solutions. You will play different roles and wear multiple hats, including analysis, solution design and coding.
You will have a passion for data and software engineering, craftsman-like coding prowess and great design and solutioning skills. You will be happy coding across the stack: front, back and DevOps, and have the desire and ability to learn new technologies and adapt to different situations. As a guide, you are likely to have 7 to 20 years+ experience.
- Rate: £60k-£120k dependent on experience
- Location: London (Tottenham Court Rd)
- Contact: davina@makeadifference.digital (please mention this list when you get in touch)
- Side reading: link
Data Scientist / Senior Data Scientist at FreeAgent, Permanent
FreeAgent removes the stress and pain of dealing with business finances, allowing business owners to focus on running their business. Our data science and data platform teams have created a machine learning model to categorise business banking transactions that’s currently applied to over 100,000 customers in production. We have big ambitions to further our use of machine learning and artificial intelligence and you could be a part of that! We primarily work with Python/pandas/scikit-learn and use AWS SageMaker to build and deploy our models. Our regular company hack days and wiggle weeks provide a great opportunity for data scientists to pursue their own ideas.
- Rate: £34k-£65k + comprehensive benefits package
- Location: Edinburgh or hybrid/remote depending on experience
- Contact: jobs@freeagent.com (please mention this list when you get in touch)
- Side reading: link, link, link
Data Scientist
We are the fastest growing online travel agent in the UK, and we help people find their dream holiday. In the data science team, we collaborate with people across the company on business problems using programming, statistics, and machine learning.
If you love solving abstract problems, have an outstanding university degree, know SQL and Python, and have experience with machine learning, we want to hear from you!
- Rate: 50000 pa
- Location: Hammersmith or partly remote
- Contact: ben.auffarth@loveholidays.com (please mention this list when you get in touch)
- Side reading: link
Data engineer at NHS Test and Trace/ Joint Biosecurity Centre, permanent, grades from junior to senior
NHS Test and Trace and the Joint Biosecurity Centre are looking for data engineers to help the UK deal with COVID. We are looking for people at junior, mid and senior level to help the UK analyse COVID data to save lives and help the UK respond. Whilst we can't compete on salary, we have a modern cloud tech stack (AWS, Azure, Github), fascinating health datasets and some really interesting technical work that is helping the UK move forward.
We have access to the UK’s testing data and a variety of interesting related datasets, some of which pose challenges on scale and complexity for our data scientists. It is likely the challenges will grow as we get more granular data as we merge with PHE.
Technology stack: Azure SQL, Azure DevOps, Azure Pipelines; AWS: Athena, SageMaker. Most code is in Python; we use black for PEP8 and GitHub Actions for CI.
You will probably have seen some of our work in government communications over the last 12 months. These are permanent civil service roles with the associated benefits.
- Rate: £28-62k (HEO, SEO, G7)
- Location: Remote in the UK, there may be some contact time in offices in the future but very likely to be remote friendly
- Contact: Jaymin.mistry@dhsc.gov.uk (please mention this list when you get in touch)
- Side reading: link, link, link
Data scientist at NHS Test and Trace/ Joint Biosecurity Centre, permanent, grades from junior to senior
NHS Test and Trace and the Joint Biosecurity Centre are looking for data scientists and engineers to help the UK deal with COVID. We are looking for people at junior, mid and senior level to help the UK analyse COVID data to save lives and help the UK respond. Whilst we can't compete on salary, we have a modern cloud tech stack (AWS, Azure, GitHub), fascinating health datasets and some really interesting technical work that is helping the UK move forward. You will probably have seen some of our work in government communications over the last 12 months. Examples of projects include identifying new clusters of cases using network theory (network and GraphX), determining the effect of lockdowns (Causalimpact) and agent modelling for epidemiological models. The teams work in Python or R and there are roles that range from data analysis to running complex epidemiological models with academic partners and everything in between. These are permanent civil service roles with the associated benefits.
- Rate: £28-62k (HEO, SEO and G7)
- Location: Remote in the UK for now. Potential for some office contact time but very likely to be remote friendly
- Contact: Jaymin.mistry@dhsc.gov.uk (please mention this list when you get in touch)
- Side reading: link, link, link
Python Data Developer at Kindred Group, Permanent, London
Kindred's ambition is to be the most insight-driven gambling company and in the last few years we've invested heavily in our data and analytics capabilities. We are now at the next stage of our journey, embarking on an initiative to enhance our sports and racing modelling and quantitative analysis capabilities.
The Quantitative Team work closely with the existing data science function to play an important role in delivering a truly innovative and unparalleled experience for the customers of our sportsbook brands. This work builds upon a culture of “data as a product” to significantly extend our proof-of-concept efforts in this area.
We are looking for a software engineer with a strong interest in sporting applications and experience in building solutions to handle varied external data sources. On joining, you will be responsible for creating exceptional quality data products, primarily based on sports event and market odds data, for use within the Quantitative Team and the wider business. Your work will be integral in the team's delivery of market-leading probability and machine learning models to support our commercial and operational functions and decision making processes.
- Rate: Highly Competitive
- Location: Wimbledon, London
- Contact: jack.morrow@kindredgroup.com (please mention this list when you get in touch)
- Side reading: link, link
Quantitative Analyst - Kindred Group, Permanent
Kindred's ambition is to be the most insight driven gambling company and in the last few years we've invested heavily in our data and analytics capabilities. The quantitative team work closely with the existing data science function to play an important role in delivering a truly innovative and unparalleled experience for the customers of our sportsbook brands. The work will build upon a culture of “data as a product” to significantly extend our proof-of-concept efforts in this area.
We are now looking for a talented Quantitative Analyst to join our team to help shape our sport and racing modelling efforts. This role provides an exciting opportunity to be a pivotal part of the team. On joining, you will be responsible for performing data analysis and building probability and machine learning models to derive descriptive and predictive insight about sporting events. Your work will help to deliver market-leading tools and capabilities to support our commercial and operational functions and decision making processes.
- Rate: Highly Competitive
- Location: London, Wimbledon
- Contact: shanice.Tatter@KindredGroup.com (please mention this list when you get in touch)
- Side reading: link, link
Lead Data Scientist - Kindred Group
Kindred Group use data to build solutions that deliver our customers the best possible gaming experience and we have ambitious plans to get smarter in how we use our data. As part of these plans we’re looking to recruit a Lead Data Scientist to drive our advanced analytics initiatives and build innovative solutions using the latest techniques and technologies.
Key Accountabilities
• To lead, manage and deliver our advanced analytics initiatives using cutting edge techniques and technologies to deliver our customers the best online gaming experience.
• Working in cross functional teams to deliver innovative data driven solutions.
• Able to advise on best practices and keep the company abreast of the latest developments in technologies and techniques.
• Building machine learning frameworks to drive personalisation and recommendations.
• Building predictive models to support marketing and KYC initiatives.
• Continually improving solutions through fast test and learn cycles
• Analysing a wide range of data sources to identify new business value
• Be a champion for advanced analytics across the business, educating the business about its capability and helping to identify use cases
- Rate: Competitive
- Location: London, Wimbledon
- Contact: Shanice.Tatter@kindredgroup.com (please mention this list when you get in touch)
- Side reading: link, link
Senior Machine Learning Engineer at GWI, Permanent, Athens, Greece
At GWI we are solving the problem of how we can enable users of different levels of data expertise to interpret and draw useful insights from market research data sets. We run the largest globally harmonised market research data set across nearly fifty countries and counting, as well as an increasing range of specialised data sets. Our machine learning engineers support the development of intelligent features in the next generation of our audience insights platform, and provide solutions for custom modeling and analytics projects requested by our clients.
We have an ambitious roadmap and are looking for senior machine learning engineers to bolster our ranks. The role involves a healthy mix of research, model training, coding and deployment, as well as communicating findings to various stakeholders both internal and external. Our culture and values ensure the team is well organised and always performing at a high level. We value learning and keeping abreast of the latest research very highly so we’re applying a wide range of techniques across various branches of machine learning to the services and features we build.
- Rate:
- Location: Athens, Greece
- Contact: rforte@globalwebindex.com (please mention this list when you get in touch)
- Side reading: link
Contract Senior Data Engineer, remote, at Realeyes (EU, no visa required)
We are working on a very exciting joint project with one of the largest tech companies in the world over the next 3-4 months and looking to top up our own data engineering skills with senior experts in this field. Our tech stack is a mix of AWS and GCP for historical and project requirement reasons. Predominantly working with Python, Spark, Kinesis, S3, Lambdas, Big Query.
We are looking for contractors who have skills in building scalable Big Data pipelines in the cloud, automating data quality testing, creating scalable architecture and then delivering its implementation. Our challenge is to strike the right balance between building scalable, reusable solutions while iterating quickly and ensuring timely project delivery. For the duration of the project contractors would join and work remotely with existing teams.
- Rate: Highly competitive
- Location: Remote
- Contact: adam.bernat@realeyesit.com (please mention this list when you get in touch)
- Side reading: link, link
Data engineer
As the UK’s most trusted free complaints website, Resolver works hard to find the right resolution for everyone with fast, jargon-free issue resolution. We're looking for a data engineer to join our data team to work with other engineers and developers across the business with a focus on data modelling and database design. The ideal candidate will have experience working across teams and delivering designs which are adapted to the specific needs of different products. The role is perfect for a keen collaborator who'd like to make a broad impact across a business.
- Rate: competitive
- Location: Remote for now - offices in London
- Contact: edl@resolvergroup.com (please mention this list when you get in touch)
- Side reading: link
Senior/Lead Data Engineer at Ministry of Justice
Come and join the data engineering team at MoJ!
- Growing team working on exciting challenges, unlocking the use of data in the justice system
- A great place to learn and improve your coding, cloud computing, and data skills, all of which will be thoroughly useful in your future career
- Roles at lead, senior and mid levels
- We welcome applicants from a range of technical backgrounds
Please see the job adverts for more details.
- Rate: £38k-£70k + excellent benefits
- Location: London / Remote
- Contact: samuel.tazzyman@digital.justice.gov.uk (please mention this list when you get in touch)
- Side reading: link, link
Senior & Lead Data Scientist positions at Abacai
Abacai is building the most customer friendly insurer in the world by blending a human touch with artificial intelligence.
We are currently hiring Senior and Lead Data Scientists. In the roles you will develop ML models for pricing, customer recommendations, fraud prediction, NLP digital servicing and image recognition for claims processing. You will work with large volumes of rich internal and external datasets while implementing cutting-edge ML/AI models using a modern AWS & Python tech stack. You will be part of a very experienced and collaborative team, work on innovative state-of-the-art projects and have significant influence and impact in a new, well-funded AI-driven Insurtech.
- Rate: Highly Competitive Salary
- Location: Shoreditch, London with flexible remote working
- Contact: dutoit.pierrej@gmail.com (please mention this list when you get in touch)
- Side reading: link, link, link
Data Architect
We are hiring a talented Data Architect. This is an exciting opportunity to join a winning company in a hyper-growth period for an exciting industry. Your central responsibility as a Data Architect will be designing and maintaining the new enterprise Data Warehouse. To succeed in this role, you should know how to examine current data, identify data needs and work closely with the data reporting team - as well as the business - on understanding report requests to build a report-friendly Data Warehouse.
The ideal candidate will also need to have proven experience in data analysis and be able to recommend database structures based on the data storage and retrieval needs within each company of the group. We require proven work experience as a Data Architect, Data Scientist or similar role, advanced working SQL knowledge, experience working with relational databases, and working knowledge of XML and JSON standards.