Higher quality data science code and derisking hard projects
Did you know that Netacea are hiring for a Head of Data Science and a Data Engineer, Signal AI need a Senior Data Scientist, and Aflorithmic need a Data Engineer and a Software Engineer? Details for all of these and more are below.
Better data science code and derisking projects
I’ve asked about the processes that improve your workflow. Further below I reflect on my use of Test Driven Development and linting tools, and I’ll have more pytest thoughts to share soon. I also share thoughts on derisking anomaly projects, plus I’ve been asked to share the AIQC library - all further below.
Dave Kirby has kindly shared his observations on Obsidian for note-taking - thanks Dave:
While doing my Master’s degree last year I discovered Obsidian, a freemium tool that lets you write richly hyperlinked notes in markdown. It can display a network graph of your notes and the connections between them, and it has a strong ecosystem of plug-ins that let you do things like refactor notes, manage tasks, add a kanban board, edit diagrams, search the notes with a dedicated query language and much more. Since everything is stored in markdown it can be version controlled and stored on GitHub, and is inherently future-proof. I used to use OneNote for all my notes, but found that Obsidian is much more flexible and useful. (note - there are mobile apps too).
I also recommend the UK’s Royal Statistical Society’s Data Science newsletter. This is a take on data science seen through our old and deeply experienced statistics society - not gushy, unlike so many other sources, and with a more thoughtful look at our industry. Definitely take a look.
If you’ve missed the recent faster-than-Pandas Polars interview or Mani’s Kaggle Competition Expert interview, do go back and take a look at those issues for great tips and new tools to try.
Successful Data Science Projects course for Feb 9th+10th
My next Success course will give you the tools you need to derisk projects and increase the likelihood that you deliver on time with happy clients. It runs virtually on February 9th and 10th in UK hours - send me an email if you have questions.
We’ll work through good processes that make new projects succeed, examine common project failings, look at tools that make it easy to derisk new datasets, and practise prioritisation and estimation. The project specification document is a key highlight of the course - this is especially useful if your team rarely writes things down, or doesn’t write a document that actually supports the team.
Business strategy - what goals are you heading for?
For one of my clients we’ve gone from derisking a large set of projects to working on proofs of concept for a pair of well-scoped ideas. Both projects need to identify anomalies - but we lack good examples at this early stage. To get past this we’re developing good hypotheses about what a good anomaly looks like (and yes, we do have just enough examples to know this isn’t going to leave us wandering in the dark!). How have you handled this situation?
We’re building up hypotheses using our best knowledge of these two domains, past examples from related projects, and a couple of pertinent examples that business domain experts have worked on. Without good examples you’d be lost - you go fishing and “hope”. With good examples you can build a score function and rank your known examples with it, alongside other scored candidates which might be useful new true positives or unhelpful false positives; then you iterate and improve the scoring function. If you’ve tackled this, do you have other ways through the challenge?
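To make the ranking idea concrete, here is a minimal sketch under invented assumptions - the feature names (magnitude, rarity, recency) and the weights are purely illustrative, not from any client project:

```python
# Hypothetical sketch: rank candidate events with a hand-built anomaly score.
# Feature names and weights are invented for illustration; in practice they
# come from your domain hypotheses and your few known-good examples.

def anomaly_score(event, weights=None):
    """Combine domain-inspired signals (each pre-scaled to [0, 1])
    into a single score in [0, 1]."""
    if weights is None:
        weights = {"magnitude": 0.5, "rarity": 0.3, "recency": 0.2}
    return sum(weights[name] * event[name] for name in weights)

candidates = [
    {"id": "a", "magnitude": 0.9, "rarity": 0.8, "recency": 0.1},
    {"id": "b", "magnitude": 0.2, "rarity": 0.1, "recency": 0.9},
]

# Rank highest-scoring (most anomaly-like) first, then inspect the top of
# the list for new true positives and the weights for needed adjustments.
ranked = sorted(candidates, key=anomaly_score, reverse=True)
```

Checking the top-ranked candidates against your known examples is the iteration loop: promote good finds to the example set, demote false positives by adjusting the weights.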
If you find this useful please share this issue or archive on Twitter, LinkedIn and your communities - the more folk I get here sharing interesting tips, the more I can share back to you.
We’ll talk a little about tackling hard projects in my upcoming Success course.
Open Source - writing better code faster
I’ve run the first part of my next Software Engineering for Data Scientists course and we got to talk about better tooling that helps me (and you!) detect bugs early. I’ve found that spotting silly mistakes, missed variables, unused code and inconsistent use of the Pandas API helps me stay focused - it’s harder to get sidetracked by bugs. For Notebook quality I’ve adopted flake8, pandas-vet, flake8-bugbear, flake8-variables-names and flake8-builtins, driven by the excellent nbQA.
As noted last issue: flake8 checks your general code quality (and is less whiney than PyLint), pandas-vet checks Pandas idioms (“don’t call it df!”, “avoid .ix”), bugbear helps you avoid silly things like mutable default arguments in function signatures, and the last two help you write better variable names (e.g. avoid result or list as names). nbQA lets you apply a usually-script-only tool like flake8 to a Notebook, and Black of course now auto-cleans Notebooks as well as scripts. When I play at Project Euler I’m surprised by the number of silly mistakes flake8 and friends help me spot before I run my code, which helps me keep up momentum.
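The mutable-default pitfall that bugbear flags (its rule B006) is easy to show in a few lines - this is an illustrative sketch, not code from the course:

```python
# Sketch of the bug flake8-bugbear's B006 rule catches: a mutable default
# argument is created once at function definition time and silently shared
# by every subsequent call.

def append_bad(item, items=[]):  # flagged by B006
    items.append(item)
    return items

def append_good(item, items=None):  # the idiomatic fix: default to None
    if items is None:
        items = []  # a fresh list on every call
    items.append(item)
    return items

first = append_bad(1)
second = append_bad(2)  # same list object again: [1, 2], not [2]
fresh = append_good(2)  # [2], as you'd expect
```

Catching this class of bug at lint time, before the code ever runs, is exactly the momentum-saver described above.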
I’ve also ramped up my use of Test Driven Development - prototyping a bit of code (that doesn’t work), using this to write a test (which will fail), then backfilling my code so the test passes. This inevitably and quickly flushes out new edge cases, which go into further tests. Just last night I learned that I can capture stdout from print or display in a pytest test - brilliant. I’ve also switched to Python 3.10, as that gives me more useful error messages too.
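For reference, here’s a minimal sketch of that stdout-capture pattern using pytest’s built-in capsys fixture - the report function is an invented example:

```python
# Minimal sketch of capturing stdout in pytest via the built-in `capsys`
# fixture - useful for asserting on what `print` actually emitted.

def report(values):
    print(f"max={max(values)}")

def test_report(capsys):
    report([3, 1, 2])
    captured = capsys.readouterr()  # everything printed so far
    assert captured.out == "max=3\n"
```

Run it with pytest as usual; capsys.readouterr() returns an object with .out and .err attributes holding the captured streams.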
One tool I’ve yet to try is friendly which gives you a clue about what you did wrong - have you tried this? Have you got other tools to recommend? Please reply or tweet at me if you do!
Layne Sadler, author of AIQC, asks if anyone here would like to try his growing library for Quality Control:
AIQC is an open source framework for MLOps - “AI Quality Control”. It accelerates scientific research with an API for data preprocessing, experiment tracking, & model evaluation. It does so by providing high and low-level APIs for object-oriented machine learning (Feature, Label, Algorithm, etc.). While I was mining the biobanks with pharma, I was frustrated that association studies (Victorian Era statistics) were the only tool for such a challenging task. So I created AIQC to make rigorous & reproducible deep learning more accessible to researchers in all fields. Although the project received a small grant from the Python Software Foundation for time series analysis, it’s now time to establish real-world validation in the form of research collaborations & domain-specific integrations. So if you, your team, or anyone in your network would like help adopting deep learning in their research - I would be grateful for the opportunity to assist.
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
Senior Data Scientist & Data Science Manager & Head of Data Engineering at Infogrid, Permanent, London
Infogrid is helping protect the planet and improve the lives of billions of people by making every building a smart building. Our goal is to be the global provider for connected devices in smart buildings. We already handle millions of events every day from tens of thousands of sensors and we’d like you to help us scale that by an order of magnitude over the coming months.
Sustainability is at our heart; buildings account for 39% of global carbon emissions and we’re creating real solutions to impact this! We are still early in our journey but have already achieved a lot: we raised a successful Series A funding round, grew 5x in employee numbers within 12 months, and were voted one of the top 10 most flexible places to work.
- Rate:
- Location: Remote UK
- Contact: myriam@infogrid.io (please mention this list when you get in touch)
- Side reading: link, link, link
Data Engineer at Netacea
Netacea is an industry-leading provider of bot detection & mitigation capabilities to businesses struggling with automated threats against their websites, apps and APIs. We ingest and predict on vast quantities of streamed real-time data, sometimes millions of messages per second. As a successful start-up that is now scaling up substantially, having robust, high-quality data pipelines is more important than ever. We are looking for an experienced data engineer with a passion for technology and data to help us build a stable and scalable platform.
You will be part of a strong and established data science team, working with another data engineer and with our chief technical architect to research, explore and build our next generation pipelines & processes for handling vast quantities of data and applying our state-of-the-art bot detection capabilities. You will get the opportunity to explore new technologies, face unique challenges, and develop your own skills and experience through training opportunities and collaboration with our other highly skilled delivery teams.
- Rate: Up to £70k, dependent on experience.
- Location: UK-based remote, with office in Manchester
- Contact: katie.slater@netacea.com (please mention this list when you get in touch)
- Side reading: link, link, link
Lead Data Scientist and Data Scientist roles at Netacea
We have open positions for two mid-level data scientists on our team at Netacea. You will be joining a strong and established team of data scientists and data engineers, working on unique problems at a vast scale. You will be building an industry-leading bot detection product, solving new emerging threats for our customers, and developing your own skills and experience through training opportunities and collaboration with our other highly skilled delivery teams.
We also have two Lead Data Scientist roles with one of these specialised towards supporting long-term technical customer relationships. Both Lead roles will be fundamental to the success and growth of the data science function at Netacea. You will be a technical leader, driving quality and innovation in our product, and supporting a highly competent team to deliver revolutionary data science for our customers.
Application links:
- Lead Data Scientist (Commercial): https://apply.workable.com/netacea-1/j/4B7ACCC80D/?utm_medium=social_share_link
- Lead Data Scientist: https://apply.workable.com/j/F3A4E8F82F/?utm_medium=social_share_link
- Data Scientist: https://apply.workable.com/j/D58EA8DCE2/?utm_medium=social_share_link
- Rate: Mid-level roles up to £55k dependent on experience; Lead roles up to £85k dependent on experience.
- Location: UK-based remote, with office in Manchester
- Contact: katie.slater@netacea.com (please mention this list when you get in touch)
- Side reading: link, link, link
Head of Data Science at Netacea
Netacea is a Manchester-based business providing revolutionary products, including a website queuing system that prevents traffic surges from causing website failure, and a bot management solution that protects websites, mobile apps and APIs from heavy traffic and malicious attacks such as scraping, credential stuffing and account takeover. Netacea was recently categorised by Forrester as a leader in this rapidly expanding market.
We are looking for an outstanding leader to spearhead the growth and development of their data science team. As Head of Data Science, you will lead a department of skilled engineers to deliver outstanding solutions to the most interesting problems in cybersecurity. You will feel comfortable working in an agile way, taking ownership of data science strategy, effectiveness, delivery, and quality. You will grow, nurture, and develop your team and encourage them to explore their full potential. This is a mainly hands-off role, but you should feel confident talking about data science technology with internal and external stakeholders and partners. You will be passionate about data, and understand how it can be used to deliver value to customers.
- Rate:
- Location: UK-based remote, with office in Manchester
- Contact: katie.slater@netacea.com (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist (Platform) - Signal AI, Full-time, London (UK)
You will be a core player in the growth of our platform. You will work within one of our platform teams to innovate, collaborate, and iterate in developing solutions to difficult problems. Our teams are autonomous and cross-functional, encompassing every role required to build and improve on our products in whatever way we see best. You will be hands-on working on end-to-end product development cycles from discovery to deployment. This encompasses helping your team discover problems and explore the feasibility and value of potential ML-driven solutions; building prototype solutions and conducting offline and online experiments for validation; collaborating with engineers and product managers on bringing further iterations for those solutions into the products through integration, deployment and scaling.
This particular role will initially be within a team whose responsibilities include effectiveness and efficiency of our labelling processes and tool, training, monitoring and deployment of systems and models for entity linking, text classification and sentiment analysis, among others, across multiple data types. This team also works closely with the operation teams to ensure systems and models are properly maintained.
- Rate:
- Location: London (Old Street) - Hybrid model (2 days a week in the office)
- Contact: jiyin.he@signal-ai.com (please mention this list when you get in touch)
- Side reading: link, link, link
Software Engineer, Data Engineer and TPM at Aflorithmic Labs, London (Hybrid)
We’re an audio-as-a-service startup building an API-first solution for adding audio to applications. We have customers and we’re growing fast.
As an Audio-as-a-Service, API-first voice tech company, our aim is to democratise the way audio is produced. We use AI and “Deepfake for Good” to create beautiful voice and audio from simple text-to-speech - making beautiful audio content (from simple text) as easy to create as writing a blog. Join a 23-strong international engineering, voice, R&D and business team made up of 13 nationalities (backgrounds include: ex-University of Edinburgh, PhDs, the European Space Agency, SAP and Amazon).
We’re looking for a data engineer to work on the core data pipelines behind our voice-as-a-service and to support our growing team. Our stack includes Kubernetes, Python and NodeJS, and we make heavy use of Kubeflow and the serverless stack.