A new book on feature engineering, imminent Pandas 2 and more
Further below are 4 jobs including Senior Data Science positions at companies like Made with Intent and Coefficient Systems...
Soledad Galli has written the 2nd edition of her book on feature engineering - it is aimed at anyone looking for new inspiration with a focus on tabular data. I give a review below. I've also noted several new open source packages including the imminent Pandas 2, a new version of Modin (to accelerate Pandas), Redframes for simpler Pandas work and an update to Altair.
Software Engineering for Data Science course in a month
On April 12-14 I’ll run my next Software Engineering for Data Scientists course - you should attend if you need to write tests and move from Notebooks to building maintainable library code.
This course is aimed at anyone who is unconfident in testing their code with py.test and Pandera, wants to review code with colleagues and wants to understand the path system behind how Python loads modules and your code. Please reply directly to this if you have questions. After this course you'll be more confident talking to engineers and you'll know how to structure your projects for growth and collaboration.
Soledad Galli's Python Feature Engineering Cookbook (2nd ed)
A couple of issues back I noted I'd started to read Soledad's new book. Soledad is the author of the well-regarded feature_engine library and released the 2nd edition of Python Feature Engineering Cookbook around Christmas.
Why might you care? Typically on tabular or text data you don't have many useful raw features to work with. Tabular data is often encoded to fit a business process and text data always needs processing to create ML features. This book doesn't cover image processing, audio or video which we'd typically process using DNN techniques.
Most chapters use Soledad's feature_engine library to make features for scikit-learn, plus a couple of other well-regarded tools. Eleven chapters cover:
- Imputing missing data
- Encoding categorical variables
- Transforming numerical variables
- Performing variable discretization
- Working with outliers
- Extracting features from date and time variables
- Performing feature scaling
- Creating new features (by combining existing features)
- Extracting features from relational data with featuretools
- Creating features from a time series with tsfresh
- Extracting features from text variables
Each chapter gives a useful introduction to the problems you might face and how you might solve them. Common datasets from sklearn are used; they're small so they're easily retrieved, and they've been frequently referenced in the research domain for decades so there's a ton of material on them if you want to dig further. Around 70 recipes are demonstrated using code snippets which could be easily adapted to your problems.
In the "Creating new features" chapter a nice example focuses on combining features with decision trees. On the California housing dataset the two variables AveRooms and AveBedrms are combined using a Decision Tree to predict the target, including tuning the tree's depth. Next there's a couple of diagnostic graphs showing that the new feature has a far more linear relationship with the target variable than either of the two individual features. I noticed a lot of examples would work well if you're using linear models (and of course they work fine for non-linear ML) - and I rather like this.
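As a rough illustration of the idea (not the book's own recipe), here is a minimal sketch using plain scikit-learn. To keep it self-contained it uses synthetic stand-ins for AveRooms and AveBedrms and a made-up target; the new column name rooms_bedrms_tree and the depth grid are my own assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the two California housing columns (assumption:
# with real data you'd load them via sklearn.datasets.fetch_california_housing).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "AveRooms": rng.uniform(2, 10, size=500),
    "AveBedrms": rng.uniform(0.5, 3, size=500),
})
# Hypothetical target that depends non-linearly on both columns.
y = np.log(X["AveRooms"] / X["AveBedrms"]) + rng.normal(0, 0.1, size=500)

# Tune the tree's depth with cross-validation.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X[["AveRooms", "AveBedrms"]], y)

# The tuned tree's prediction becomes one new combined feature.
X["rooms_bedrms_tree"] = search.predict(X[["AveRooms", "AveBedrms"]])
```

The new column takes one value per leaf of the tree, so a shallow tree gives a small number of interpretable levels that a linear model can then use directly.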
Rather than rushing for complex models, the focus is on producing simple and clear features that can be understood. This is immensely useful if you're a) trying to build trust in your own data and b) trying to build trust with your business colleagues that your model isn't a complex uninterpretable bunch of math. This is what you need when you're building new models and you don't yet have institutional buy-in. You also need it when you need to get to the next level of data exploitation and you'll get a whole set of ideas from this book.
Elsewhere in the book you'll find a section on discretizing variables using decision trees (yeah, I like tree-based systems). This example changes the MedInc income column for the same Californian housing dataset into discrete buckets. I've had to do this by hand in the world of insurance to explain the relationship between variables and choosing the "right" bins is hard...why not let the machine make sensible choices - quickly - for us?
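One way to let a tree choose the bins, sketched with plain scikit-learn and pandas rather than the book's code: fit a shallow tree on the single column, then reuse its split thresholds as bin edges. The MedInc values and the target here are synthetic and the depth of 2 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the MedInc column and a step-shaped target (assumption).
rng = np.random.default_rng(1)
med_inc = pd.Series(rng.uniform(0.5, 15, size=1000), name="MedInc")
target = np.where(med_inc < 3, 1.0, np.where(med_inc < 8, 2.5, 4.0))
target = target + rng.normal(0, 0.2, size=1000)

# A shallow tree on the single variable: depth 2 gives at most 4 buckets.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(med_inc.to_frame(), target)

# Internal nodes have feature index >= 0; leaves are marked with a negative index.
thresholds = sorted(tree.tree_.threshold[tree.tree_.feature >= 0])

# Use the learned split points as bin edges.
edges = [-np.inf, *thresholds, np.inf]
med_inc_bin = pd.cut(med_inc, bins=edges, labels=False)
```

Each bucket then corresponds to a range of income where the target behaves similarly, which is the property you'd be hunting for when choosing bins by hand.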
The further reading section points to Winning the KDD Cup Orange Challenge with Ensemble Selection where the technique was first described - there's a set of further reading links throughout the book. The KDD paper was a nice read as it documented how they solved a time-restricted challenge through R&D to submission - if you're early in your ML journey, even though it is old this story will have value to you.
The time-series data section looks very useful. I've built various time-series features by hand for specific problems (e.g. day of week, weekday vs weekend, day/night, lunch, bank holidays, features offset by time periods), hand crafted to fit the business logic behind the data. This chapter uses tsfresh and it describes how it can extract 63 time series characterisation results and combine them to make up to 750 features automatically. I've not used this library and this chapter provides a clear introduction ending in using the new features in a pipeline.
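For contrast with the automated approach, this is roughly what the hand-built features mentioned above look like in plain Pandas. It is a small sketch on a made-up daily sales series, not tsfresh and not code from the book; the column names are hypothetical.

```python
import pandas as pd

# Made-up daily series covering four weeks, starting on a Monday.
idx = pd.date_range("2023-03-06", periods=28, freq="D")
df = pd.DataFrame({"sales": range(100, 128)}, index=idx)

# Calendar features derived from the index.
df["day_of_week"] = df.index.dayofweek      # Monday is 0
df["is_weekend"] = df["day_of_week"] >= 5   # Saturday and Sunday

# Features offset by a time period: the value from one week earlier.
df["sales_lag_7"] = df["sales"].shift(7)

# Trailing 7-day mean, shifted by one day so it only uses past values.
df["sales_roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
```

The shifts leave missing values at the start of the series, so you'd drop or impute those rows before modelling.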
The text variables chapter is a gentle first introduction to text processing using NLTK and text tools in scikit-learn like the CountVectorizer. NLTK is the solid, if old, standard text processing library with a focus on splitting text, cleaning and stemming (finding the roots of words, like "run" for "run", "runs", "running", "runner"). There are stronger libraries for NLP but if you're getting started, this chapter will open the first door for you.
Overall if you're actively working on ML across a variety of domains using tabular data and some text, you'll probably find lots of interesting ideas in this book. It is well written, easy to follow and has lots of explanations and pictures which makes it easy to dip into.
Open source - Pandas 2, Modin, Redframes and Altair
Pandas 2.0 Release Candidate
Pandas 2.0 is coming soon. It'll be a big upgrade which I believe will be backwards compatible. The what's new page is still a bit rough. Indexes change a bit - removing old specific types and generalising to numpy standard types. Copy on write is introduced which means memory usage should be more predictable - arrays only get duplicated if a modification occurs, views on the data won't trigger copies. PyArrow strings and timestamps seem to be further supported.
The copy on write behaviour will be a huge change and I suspect might reveal some bugs in the early release. We still have the inplace=True argument and there's a whole lot of ways to create or reference columns, blocks and dataframes. The CoW behaviour will hopefully give us some speed-ups (reducing unnecessary memory duplications which are slow), and reduce memory usage but maybe means tricks you've added for speed-ups might no longer be relevant or correct.
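If you want to see what changes, here's a small sketch of the behaviour on a toy dataframe. It assumes a Pandas version where copy on write is opt-in via the mode.copy_on_write option; the dataframe itself is made up.

```python
import pandas as pd

# Opt in to copy on write (assumption: a Pandas version that has this option).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Selecting a column doesn't copy the data up front...
col = df["a"]

# ...the copy only happens here, when we modify the selection.
col.iloc[0] = 100

# The parent dataframe is left untouched.
first_value_in_parent = df["a"].iloc[0]
```

Code that relied on a modification to a selection leaking back into the parent dataframe is the kind of thing to look for when you test a migration.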
It feels like this might be a tricky upgrade for some - you might want to put a plan together to start testing how well your code migrates.
Modin 0.19
An update to Modin is up. When I used to evaluate Modin it really didn't offer many opportunities - at least for my laptop based work. Last year I found more improvements on my own work, notably by switching from the Dask backend (which I'd used by default) to the Ray backend (which I had much less experience with). Ray "just made stuff faster" with Modin. All you need to do is change one line (import modin.pandas as pd) and maybe a bunch of your work will be parallelised.
For some operations on large data their papers show significant gains - on 100s of GBs of data with many cores - but maybe you'll see useful performance gains too. No word on if/when this is Pandas 2.0 compatible (1.5.3 is supported in this release). There's a nice summary of the difference between Modin and Dask.
The big change for 0.19 seems to be the introduction of numpy support but I don't see any benchmarks. Have you come across any benchmarks?
Redframes 1.4.1
Redframes is a wrapper around Pandas which reduces and refines the interface. There's a short demo directly behind that link vs Pandas. Redframes uses pure functions (with no side effects) - verbs - to interact with the data. This sounds like a sensible idea if Pandas is confusing, but the project is young so it is hard to know if enough functionality is covered. Have you had any joy using Redframes?
Altair 5.0rc1
Altair is at release candidate 1 for v5.
Altair is one of my preferred plotting libraries. It is more limited than matplotlib and far cleaner. It uses a declarative coding approach on top of a Pandas dataframe for easy plotting and, with a little more code, easy interaction. Altair builds on Vega-Lite as the underlying rendering library.
If you're going to try it - be aware that by default only 5,000 rows of data are supported (the entire dataset is copied into the HTML output and that quickly gets verbose) - but that limit is easily raised. Also the DataFrame.index isn't supported; you need to call reset_index to use it as a column. Aside from that it is really easy to get started and to use tooltips which is great for exploratory work.
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it'll go to all 1,500 subscribers 3 times over 6 weeks, subsequent posts are charged.
Scientist/Senior Scientist for Machine Learning at ECMWF
The European Centre for Medium-range Weather Forecasts (ECMWF) is looking to hire multiple (senior) scientists for machine learning. We're now in 3 locations across Europe (with adjusted salaries), working on improving weather forecasts for Europe.
If you have experience on the traditional HPC + ML or ML + Earth Systems Science side of things, or more of a ML coordinator, you may be a great fit. Especially if you're part of an under-represented minority, please consider that the job post is written for both senior and regular scientists and you may not have to match every single bullet point to be a great fit. Our machine learning ecosystem: a lot of deep learning from CNNs to transformers and GNNs on lots of GPUs.
The ECMWF itself is an intergovernmental organisation created in 1975 by a group of European nations and is today supported by 35 Member and Co-operating States, mostly in Europe with the world's largest archive of meteorological data. So definitely a very unique place to work and push state-of-the-art machine learning these days.
- Rate: £68,374 GBP to €103,517 EUR NET of tax annual basic salary + other benefits
- Location: Reading, UK / Bonn, Germany / Bologna, Italy
- Contact: jesper.dramsch@ecmwf.int (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist at Made With Intent, Permanent, Remote
We’re a new, revolutionary platform that helps online retailers show a bit more care for their customers. We give retailers the ability to understand their customer intent by listening to hundreds of micro-behaviours on site, modelling them together, and creating a predictor metric between 0 and 1. We predict which customers will buy, and where they are in their journey, and serve them appropriate content that nudges their intent in a caring, nurturing and educating way.
The ideal data scientist candidate will have at least 2 years of hands-on programming experience in AI, a postgraduate degree, experience in NLP, Tensorflow/Pytorch, MLOps, AWS and advanced SQL skills. They must work with unstructured data, have excellent communication skills, and collaborate with stakeholders. We value those who learn new tools quickly and hustle when required.
- Rate: £70k-£90k
- Location: Remote
- Contact: tom@madewithintent.ai (please mention this list when you get in touch)
- Side reading: link, link
Data Scientist
M&G plc is an international savings and investments business. As at 30 June 2022, we had £348.9 billion of assets under management and administration.
Analytics – Data Science Team is looking for a Data Scientist to work on projects ranging from Quantitative Finance to NLP. Some recent projects include:
- ML applications in ESG data
- Topic modelling and sentiment analysis
- Portfolio Optimization
The work will revolve around the following:
- Build data ingestion pipelines (with data sourced from SFTP & third party APIs)
- Explore data and extract insights using Machine Learning models like Random Forest, XGBoost and (sometimes) Neural Networks
- Productionize the solution (build CI/CD pipelines with the help of friendly DevOps engineer)
- Rate:
- Location: London
- Contact: sarunas.girdenas@mandg.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist at Coefficient Systems Ltd
We are looking for an enthusiastic and pragmatic Senior Data Scientist with 5+ years’ experience to join the Coefficient team full-time. We are a “full-stack” data consultancy delivering end-to-end data science, engineering and ML solutions. We’re passionate about open source, open data, agile delivery and building a culture of excellence.
This is our first Senior Data Science role, so you can expect to work closely with the CEO and to take a tech lead role for some of our projects. You'll be at the heart of project delivery including hands-on coding, code reviews, delivering Python workshops, and mentoring others. You'll be working on projects with multiple clients across different industries, including clients in the UK public sector, financial services, healthcare, app startups and beyond.
Our goal is to promote a diverse, inclusive and empowering culture at Coefficient with people who enjoy sharing their knowledge and passion with others. We aim to be best-in-class at what we do, and we want to work with people who share that same attitude.