How much do you get done in a day? Derisking anomaly detection projects and useful tools to improve your engineering
Did you know that Signal AI needs a Senior Data Scientist and Aflorithmic needs a Data Engineer and a Software Engineer? Details for all of these and more are down below.
Thoughts
I reflect on how much useful work I get done in a day, share a pile of tips from my recent Software Engineering class, and share a thought on starting new high-risk anomaly detection projects successfully.
David MacIver recently wrote about how People Don’t Work as Much as You Think on his newsletter (there’s also an HN discussion). David is a smart cookie, author of the Hypothesis property-based testing library. He gives an honest opinion of how much “quality work” he gets done in a day. I’ve been using an Unschedule from The Now Habit (see the PDF link to print one), which shows me that I get 2-3 hours of productive work a day around “everything else”. If you’re at all worried about how much time you do or don’t get to focus, it is worth a read.
If you’ve missed the recent faster-than-Pandas Polars interview or Mani’s Kaggle Competition Expert interview, do go back and take a look at those issues for great tips and new tools to try.
Successful Data Science Projects course for Feb 9th+10th
My next Success course will give you the tools you need to derisk projects and increase the likelihood that you deliver on time with happy clients. It runs virtually on Feb 9th+10th in UK hours. Send me an email back if you have questions.
We’ll look at good processes to make new projects work well, work through common project failings, look at tools that make it easy to derisk new datasets, and practise prioritisation and estimation. The project specification document is a key highlight of the course - it is especially useful if your team rarely writes things down or doesn’t write documents that actually support the team.
Business strategy - what goal are you heading for?
I’ve mentioned my anomaly detection project with a client recently. I’m very pleased to say that by following the strategy of “ignore the code, go find a business expert and learn their expertise” we’ve identified some really nice anomalous examples (up from a count of 0 a few weeks back) along with building a set of business-driven hypotheses about what they’re expecting. Since they’re using Excel & SQL for their current analysis we can easily turn those ideas into a couple of loops and quickly iterate to see if we have hits.
Sometimes data science comes down to having the right expert in the room, asking the “simple” questions about the business, then writing some loops to prove out the idea. Do you have other routes through this sort of process you’d like to share? If so please just reply.
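As a sketch of what those “couple of loops” might look like once translated from Excel & SQL into Python - the column names, rules and data here are all invented for illustration, not taken from the client project:

```python
import pandas as pd

# Hypothetical invoice data standing in for the client's real dataset.
df = pd.DataFrame({
    "invoice_total": [120.0, 0.0, 5400.0, 89.5],
    "items": [3, 0, 1, 2],
})

# Each business-driven hypothesis becomes a small boolean rule;
# the expert supplies the idea, the loop checks for hits.
hypotheses = {
    "zero-value invoice": lambda d: d["invoice_total"] == 0,
    "high value, single item": lambda d: (d["invoice_total"] > 1000) & (d["items"] == 1),
}

for name, rule in hypotheses.items():
    hits = df[rule(df)]
    print(f"{name}: {len(hits)} hit(s)")
```

Keeping each hypothesis as a named rule makes it cheap to iterate with the expert in the room - add a rule, rerun the loop, discuss the hits.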
We’ll talk a little about tackling hard projects in my upcoming Success course.
Open Source - writing better code faster
Having finished my private Software Engineering for Data Scientists course, I’ll reflect on some of what we discussed and the tools that seemed to resonate. In the last couple of issues I’ve linked to the main tools (Pandera, PyTest, flake8, nbqa etc). Why they feel useful is the interesting topic:
- Litter your Notebook with assert statements to automatically check certain assumptions you have - it takes no time and gives you high-value wins when the assert trips up something you weren’t expecting
- Pandera lets you quickly add tests for data quality both at ingestion time and after you manipulate your dataframe - quickly catch mistakes at the source, not downstream in your code where it might hurt (and they’re dead easy to write too); it is even easier if you’re migrating assert statements into a Pandera schema checker
- Flake8 & nbqa - nice tools that help you simplify your code (make sure you get some flake8 extensions) so mutual code reviews are easier
- PyTest lets you unit test functions - great for processing code that transforms raw data into something more useful
- Adding a first test helps you avoid building monolithic code that’s later hard to test (always write 1 test early on!)
- Use a coverage check occasionally to see where you’re missing useful test coverage
- Always think in terms of utility - do I get more long-term utility from writing tests, refactoring code out of notebooks for reuse, adding new code, tidying up my diagrams etc? What actually gets you the longest-term gain? (‘cos a lack of testing will hurt in the long run, it always bites sooner or later)
This article on pytest tips includes gems like picking up from the last failing test with --lf (saving you time on the tests that run ok), using -l to get a dump of local variables on failure, -v to get more verbose output, and -x to stop early on a failure if you’re not expecting any. The article is worth a read if (I hope!) you believe that unit and integration tests bring value to your project.
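To make those flags concrete, here’s a minimal sketch of the kind of test file they operate on - the function and data are invented for illustration; you’d run it with something like pytest -l -x:

```python
# test_cleaning.py - a hypothetical data-processing function
# plus a PyTest-style unit test for it.
import pandas as pd


def drop_negative_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with impossible negative prices."""
    return df[df["price"] >= 0].reset_index(drop=True)


def test_drop_negative_prices():
    raw = pd.DataFrame({"price": [10.0, -1.0, 3.5]})
    cleaned = drop_negative_prices(raw)
    # if this assertion failed, -l would dump `raw` and `cleaned` for you
    assert len(cleaned) == 2
    assert (cleaned["price"] >= 0).all()
```

Small, pure transform functions like this are exactly the ones that are cheap to test early, before the code grows into an untestable monolith.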
Using Python 3.10 gives me more useful error messages, and the recent updates to IPython and Jupyter expand the tracebacks you get - they provide more debug info in your browser.
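For example, on Python 3.10+ an attribute typo now often comes with a suggested fix (the exact wording varies by version; this tiny demo is mine, not from the article):

```python
# Demonstrates Python 3.10+'s improved error messages:
# attribute typos can include a "Did you mean ...?" suggestion.
import collections

try:
    collections.OrderdDict  # deliberate typo of OrderedDict
except AttributeError as err:
    print(err)  # on 3.10+ this ends with a hint like: Did you mean: 'OrderedDict'?
```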
Have you got other tools to recommend? Please reply or tweet at me if you do!
If you find this useful please share this issue or the archive on Twitter, LinkedIn and in your communities - the more folk I get here sharing interesting tips, the more I can share back to you. I’d be happy if you could retweet it too.
If you’d like a notification for this course or you’d like to run it privately (see my course list), just reply to this email.
Random
The Highway Code in the UK has been updated to include “The Dutch Reach” (open the driver’s door with your far hand - the left hand in the UK - so you turn and look back for passing cyclists as you open it) and to explain the new priority rules - pedestrians first, then cyclists, then horse riders, motorbikes, cars and bigger stuff. The new rules are very sensible but require a bit of thought if you’re a long-time driver. Please take a read if you’re driving in the UK.
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers - if you’re growing your team, reply to this and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it’ll go to all 1,400 subscribers 3 times over 6 weeks; subsequent posts are charged.
Backend & Data Engineer
At Good With, you’ll work at the heart of a dynamic, multidisciplinary, agile team developing the platform and infrastructure that connect a voice-enabled intelligent mobile app, OpenBanking financial data sources, state-of-the-art intelligent analytics and a real-time recommendation engine to deliver personalised financial guidance to young and vulnerable adults.
As a founding member, you’ll get share options in an innovative business, supported by Innovate UK, Oxford Innovation and SETsquared accelerator, with ambitions and roadmap to scale internationally.
Supported by Advisors: Cambridge / FinHealthTech, Paypal/Venmo & Robinhood Brand Exec, Fintech4Good CTO & cxpartners CEO.
Working with: EPIC e-health programme for financial wellbeing & ICO Sandbox for ‘user always owns data’ approaches.
- Rate: £50-65K + Share Options
- Location: Flexible, remote working; Cornwall HQ
- Contact: gabriela@goodwith.co (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist & Data Science Manager & Head of Data Engineering at Infogrid, Permanent, London
Infogrid is helping protect the planet and improve the lives of billions of people by making every building a smart building. Our goal is to be the global provider for connected devices in smart buildings. We already handle millions of events every day from tens of thousands of sensors and we’d like you to help us scale that by an order of magnitude over the coming months.
Sustainability is at our heart; buildings account for 39% of global carbon emissions and we’re creating real solutions to impact this! We are still early in our journey but have already achieved a lot; we raised a successful Series A funding round, grew 5x in employee numbers within 12 months, and were voted one of the top 10 most flexible places to work.
- Rate:
- Location: Remote UK
- Contact: myriam@infogrid.io (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist (Platform) - Signal AI, Full-time, London (UK)
You will be a core player in the growth of our platform. You will work within one of our platform teams to innovate, collaborate, and iterate in developing solutions to difficult problems. Our teams are autonomous and cross-functional, encompassing every role required to build and improve on our products in whatever way we see best. You will be hands-on working on end-to-end product development cycles from discovery to deployment. This encompasses helping your team discover problems and explore the feasibility and value of potential ML-driven solutions; building prototype solutions and conducting offline and online experiments for validation; collaborating with engineers and product managers on bringing further iterations for those solutions into the products through integration, deployment and scaling.
This particular role will initially be within a team whose responsibilities include effectiveness and efficiency of our labelling processes and tool, training, monitoring and deployment of systems and models for entity linking, text classification and sentiment analysis, among others, across multiple data types. This team also works closely with the operation teams to ensure systems and models are properly maintained.
- Rate:
- Location: London (Old Street) - Hybrid model (2 days a week in the office)
- Contact: jiyin.he@signal-ai.com (please mention this list when you get in touch)
- Side reading: link, link, link
Software Engineer, Data Engineer and TPM at Aflorithmic Labs, London (Hybrid)
We’re an audio-as-a-service startup building an API-first solution to add audio to applications. We have customers and we’re growing fast.
As an Audio-as-a-Service, API-first voice tech company, our aim is to democratise the way audio is produced. We use AI and “deepfake for good” to create beautiful voice and audio from simple text-to-speech - making creating beautiful audio content (from simple text) as easy as writing a blog. Join a 23-strong international engineering, voice, R&D and business team made up of 13 nationalities (backgrounds include: ex-University of Edinburgh, PhDs, European Space Agency, SAP, Amazon).
We’re looking for a data engineer to work on the core data pipelines for our voice-as-a-service and to support our growing team. Our stack includes Kubernetes, Python and NodeJS, and we use a lot of Kubeflow and the serverless stack.