Scikit-learn v1.0 upcoming
Did you know that Kindred are looking for a Lead Data Scientist, along with quants and Python developers? Details are below, along with other senior roles in 11+ job adverts.
Whilst working on a client project last week I started to see some sklearn warnings about named arguments for the upcoming version 1.0. Intrigued, I dug a bit further and indeed there's an in-progress changelog noting this change.
The new HistGradientBoostingRegressor, inspired by LightGBM, is indeed very fast. XGBoost has the same trick of pre-computing bins, so all three approaches are a little less accurate but significantly faster (5-10x faster than the default exact method in XGBoost, and the sklearn method is much faster than the equivalent GradientBoostingRegressor). It also handles NaN data and will internally encode categoricals.
As of 1.0 the new HistGradientBoostingRegressor and Classifier will no longer require the experimental flag activation before use, so the implementation is now stable. That's brilliant - in some circumstances you can avoid installing a 3rd-party library like LightGBM or XGBoost if you don't need their extra functionality; you'll get almost-as-good fits plus NaN support and categorical handling directly from sklearn.
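As a minimal sketch (assuming scikit-learn 0.24 or newer; the random data, the categorical mask and the hyperparameters below are made up purely for illustration), the estimator can be fitted directly on data containing NaNs and integer-encoded categoricals:

```python
import numpy as np
# Pre-1.0 the estimator sits behind an experimental flag; from 1.0 onwards the
# plain ensemble import alone should be enough.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
X[::10, 0] = np.nan                       # missing values are handled natively - no imputation step
X[:, 2] = rng.integers(0, 4, size=1_000)  # a categorical column, pre-encoded as small non-negative ints
y = 2 * X[:, 1] + rng.normal(scale=0.1, size=1_000)

est = HistGradientBoostingRegressor(
    max_iter=100,
    categorical_features=[False, False, True],  # mark which columns are categorical
)
est.fit(X, y)
print(f"R^2 on the training data: {est.score(X, y):.3f}")
```

The categorical_features mask only arrived in 0.24, so check your installed version if that argument isn't recognised.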
I don't know when v1.0 is coming - I'd guess in the coming months - do you know any better? Have any of you tried the HistGradientBoosting estimators?
Somewhere along the line I'd also missed the introduction of handle_unknown to the OneHotEncoder. This means that on a prediction pass, if a new category is seen you can ignore it (the corresponding feature row is all zeros); some time back this used to force an error, which was an absolute pain.
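A quick sketch (with made-up colour categories) of how that behaves - an unseen category at prediction time simply encodes to an all-zero row rather than raising:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"colour": ["red", "green", "blue"]})
test = pd.DataFrame({"colour": ["green", "purple"]})  # "purple" never appeared during fit

enc = OneHotEncoder(handle_unknown="ignore", sparse=False)
enc.fit(train)
print(enc.transform(test))
# [[0. 1. 0.]   <- "green" is one-hot encoded as usual (columns are blue, green, red)
#  [0. 0. 0.]]  <- unseen "purple" becomes an all-zero row instead of raising an error
```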
And over in the world of Pandas - the built-in string datatype can now be replaced, using experimental code, with Arrow's more memory-efficient string dtype. Matt Rocklin of Dask gives a short explainer video on the string[pyarrow] dtype that's now available (be warned - it's cutting-edge code - faster and more memory efficient, but we may not get quite the same results as we're used to).
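Here's a tiny sketch of trying it out (assumes pandas 1.3+ with pyarrow installed; the word list is made up):

```python
import pandas as pd

words = ["hist", "gradient", "boosting"] * 100_000

object_backed = pd.Series(words)                          # the classic object-dtype strings
arrow_backed = pd.Series(words, dtype="string[pyarrow]")  # Arrow-backed experimental dtype

print(object_backed.memory_usage(deep=True))  # object strings carry heavy per-item overhead
print(arrow_backed.memory_usage(deep=True))   # typically much smaller
print(arrow_backed.str.upper().head(2))       # the usual .str accessor still works
```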
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers - if you’re growing your team then reply to this email and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it'll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
Data Engineer/Scientist - Beatchain
Beatchain is a music distribution and social media marketing and analytics platform that works with up-and-coming artists as well as established record labels. We are looking for a data engineer/scientist working in Python to help users manage, understand and put their data into context. Data sources include social media and music platforms scraped over hundreds of thousands of accounts using Scrapy, APIs including over two million Spotify playlists, and large quantities of streaming data from our distribution and record label partners.
This is a junior to mid-level role; you would be working within a small back-end team alongside the lead data scientist. Alongside maintaining the day-to-day ingestion and transformation of data, we research ways of presenting data to users through visualizations and predictive analytics. Recently, we used graph embeddings to model relationships between artists and genres to recommend related artists for social media campaigns. We use the familiar PyData Python/Pandas/NumPy stack deployed via AWS Lambda, Step Functions and Batch. Data lives in AWS RDS, DynamoDB and Redshift, migrating to Google BigQuery.
- Rate: Up to £40K, subject to experience.
- Location: Central London / Remote from a suitable time-zone
- Contact: ed.godshaw@beatchain.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist (various levels)
Here at Gousto, we are on a mission to become the UK's favourite way to eat dinner!
We're hiring for multiple Data Science positions:
- Principal Data Scientist (Menu) https://apply.workable.com/gousto/j/3C7165186A/
- Principal Data Scientist (Supply) https://apply.workable.com/gousto/j/4709837FEC/
- Data Scientist (Growth) https://apply.workable.com/gousto/j/C9F991E124/
If you want to work on some seriously interesting projects and get discounted Gousto boxes as part of the benefits package, please apply using the links above, mentioning that this newsletter sent you there!
See here https://www.gousto.co.uk/jobs for benefits and check out our blog: https://medium.com/gousto-engineering-techbrunch
- Rate:
- Location: London (partly/mostly remote if you wish)
- Contact: marco.gorelli@gousto.co.uk (please mention this list when you get in touch)
Python Data Developer at Kindred Group, Permanent, London
Kindred's ambition is to be the most insight-driven gambling company and in the last few years we've invested heavily in our data and analytics capabilities. We are now at the next stage of our journey, embarking on an initiative to enhance our sports and racing modelling and quantitative analysis capabilities.
The Quantitative Team work closely with the existing data science function to play an important role in delivering a truly innovative and unparalleled experience for the customers of our sportsbook brands. This work builds upon a culture of “data as a product” to significantly extend our proof-of-concept efforts in this area.
We are looking for a software engineer with a strong interest in sporting applications and experience in building solutions to handle varied external data sources. On joining, you will be responsible for creating exceptional quality data products, primarily based on sports event and market odds data, for use within the Quantitative Team and the wider business. Your work will be integral in the team's delivery of market-leading probability and machine learning models to support our commercial and operational functions and decision making processes.
- Rate: Highly Competitive
- Location: Wimbledon, London
- Contact: jack.morrow@kindredgroup.com (please mention this list when you get in touch)
- Side reading: link, link
Quantitative Analyst - Kindred Group, Permanent
Kindred's ambition is to be the most insight-driven gambling company and in the last few years we've invested heavily in our data and analytics capabilities. The quantitative team work closely with the existing data science function to play an important role in delivering a truly innovative and unparalleled experience for the customers of our sportsbook brands. The work will build upon a culture of “data as a product” to significantly extend our proof-of-concept efforts in this area.
We are now looking for a talented Quantitative Analyst to join our team to help shape our sport and racing modelling efforts. This role provides an exciting opportunity to be a pivotal part of the team. On joining, you will be responsible for performing data analysis and building probability and machine learning models to derive descriptive and predictive insight about sporting events. Your work will help to deliver market-leading tools and capabilities to support our commercial and operational functions and decision making processes.
- Rate: Highly Competitive
- Location: London, Wimbledon
- Contact: shanice.Tatter@KindredGroup.com (please mention this list when you get in touch)
- Side reading: link, link
Lead Data Scientist - Kindred Group
Kindred Group use data to build solutions that deliver our customers the best possible gaming experience and we have ambitious plans to get smarter in how we use our data. As part of these plans we’re looking to recruit a Lead Data Scientist to drive our advanced analytics initiatives and build innovative solutions using the latest techniques and technologies.
Key Accountabilities
• Lead, manage and deliver our advanced analytics initiatives using cutting-edge techniques and technologies to deliver our customers the best online gaming experience.
• Work in cross-functional teams to deliver innovative data-driven solutions.
• Advise on best practices and keep the company abreast of the latest developments in technologies and techniques.
• Build machine learning frameworks to drive personalisation and recommendations.
• Build predictive models to support marketing and KYC initiatives.
• Continually improve solutions through fast test-and-learn cycles.
• Analyse a wide range of data sources to identify new business value.
• Be a champion for advanced analytics across the business, educating the business about its capability and helping to identify use cases.
- Rate: Competitive
- Location: London, Wimbledon
- Contact: Shanice.Tatter@kindredgroup.com (please mention this list when you get in touch)
- Side reading: link, link
Software Engineer IV, Recommenders, Elsevier
Recommenders is Elsevier’s suite of recommendation systems, which uses Data Science and machine learning techniques to keep researchers apprised of developments in their field and new funding opportunities, and to help them find peer reviewers and papers related to their work. We're looking for a data engineer to help us build the pipelines which extract features from the unparalleled collection of research data flowing through our systems.
You'll be working in a modern technology stack (AWS, Scala, Spark, Kafka, we're currently looking at SageMaker and Kedro) as part of a small cross-functional team. If you're interested in learning more, please contact Stuart White at the email address below.
- Rate:
- Location: London
- Contact: s.white.1@elsevier.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist (NLP) at Climate Policy Radar, Permanent, London
Climate Policy Radar is a not-for-profit climate AI startup on a mission to map the global policy landscape, harnessing machine learning to create the evidence base for informed decision-making. Our work helps governments, the private sector, researchers and civil society to advance effective climate policies rapidly, replicate successful approaches and avoid failed ones, enhance accountability and promote data democratisation.
We are building the capability to collect and structure climate policy documents from all around the world. Now, at the beginning of this exciting journey, we need an exceptional individual with broad practical experience of ML and NLP to extract information from large and complex unstructured documents. You will need the creativity and passion to write the playbook, and be comfortable working in situations where uncertainty is high, defining the problems as much as the solutions. You will be willing to roll up your sleeves and dive deep into working on a wide range of areas, including the design of data labelling strategies, stakeholder collaboration and model deployment.
- Rate: £50k - 60k depending on experience
- Location: London
- Contact: jobs@climatepolicyradar.org (please mention this list when you get in touch)
- Side reading: link
ML Engineer - Data at Lean Tech Ltd
Lean provides Payment and Data APIs to unlock the financial technology sector and enable financial innovation in the Middle East.
We launched our first products to market at the beginning of 2021 and now support over 90% of the retail banking market in the UAE. With ambitions to build an entire ecosystem for Fintech in the region we're now looking to expand to new regions and support stakeholders from end-users, to Fintechs, regulators and financial institutions.
As we collect more raw data and enable an increasing variety of use cases, our data science products and processes will play an important role in Lean's advancement within the Fintech ecosystem. We are looking for an ML Engineer with a software engineering background and a strong interest in innovative financial applications. Your role will be to extract exciting and scalable features from the river of data that flows through our system.
- Rate:
- Location: Shoreditch, London
- Contact: nadia@leantech.me (please mention this list when you get in touch)
- Side reading: link
Senior & Lead Data Science vacancies - M&S
Here at M&S the data science function builds end-to-end AI and machine learning solutions in retail and e-commerce, helping our colleagues in Food, Clothing & Home, Fashion, Marketing, Loyalty, Supply Chain, Growth, Customer Services etc. to drive value from data and create personalised experiences for our customers. We apply state-of-the-art machine learning techniques to solve a variety of problems such as outfit recommendations in fashion, personalised offers for our loyalty program, pricing optimization, demand forecasting for supply chain, product waste management for retail, and AI-powered campaigns for our marketing. We are hiring at both Senior and Lead levels. If you would be interested in finding out more, please contact me at the email address below.
- Rate: Competitive
- Location: London / Remote
- Contact: craig.parke@marks-and-spencer.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist at OVO
The OVO Group is a collection of companies with a single vision: to power human progress with clean, affordable energy for everyone. The data we collect is multi-faceted and complex. There is an opportunity to become a truly AI-driven business, with market-changing innovation and finely optimised processes leading to zero carbon and low cost for our energy customers.
We are looking for a Senior Data Scientist with hands-on experience building end-to-end data science products in a production setting. Primarily you will work within cross-functional product teams, but might be required to contribute to specific data science initiatives. You will take a lead role in demonstrating the value data science can add to teams across OVO Energy, working with stakeholders to understand their data science needs, owning the delivery of projects from start to finish, and evaluating value post-delivery. You will also be expected to coach junior Data Scientists and help to define data science best practices. Technology stack: SQL, Python, GCP (BigQuery, Composer, Cloud Functions, Dataflow, CloudRun), CircleCI for CI, GitHub for version control.