Patterns for successful projects and a bumper load of text-fixing OCR tools
Further below are 6 job roles, including senior DS and DEng roles at organisations like Causaly and Medidata.
PyData Global ran a week back with several thousand online attendees and several days of long tracks. If you weren’t a ticket holder then you’ll have to wait until the start of January for the general video release (ticket holders got an email in December with video access). The schedule was varied and the talks were of high quality.
“Data Science Project Patterns that Work” at PyData Global 2022
I spoke on Data Science Project Patterns that Work (slides) where I encoded the lessons I presented earlier in the year at PyDataLondon 2022 (Building Successful Data Science Projects) as a set of “patterns”.
Design patterns on Wikipedia:
The elements of this language are entities called patterns. Each pattern describes a problem that occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice.
I talked through the following patterns, which I believe help teams avoid common execution mistakes that can be expensive and painful:
- Choosing good projects (which will effect positive change, and enable you to learn and open doors even if this project fails)
- Derisking data (by understanding what the data means in the business context)
- Productive research (quickly figuring out if the data and challenge are “doable”)
- Delivering value early (deploy something very quickly to derisk value delivery to the end user)
- Creating change (so people value your work and give you their time)
I also discussed anti-patterns for each of these, where the “easy” way through is likely to lead to unnecessary pain. I can’t share the video yet as the talks won’t be released until after Christmas, but the slides are linked above and the London video from earlier in the year is online.
Learning valuable lessons from the “Executives at PyData Global” session
I ran another of my Executives sessions; this one took 2 hours and involved the co-creation of answers to a set of questions. We gathered valuable answers to questions such as “how do you build remote teams?”, “how do you build good backlogs when remote?”, “is the T-shaped team the best?” and “how do DS teams work well with other business units?”, along with new tool and process ideas.
The first set of questions was curated; the second was made up as we went, driven by our 30 attending leaders. We also had special guest Douglas Squirrel (AKA “Squirrel”), who runs the Squirrel Squadron private discussion group for CTOs (you should join if you’re leadership-leaning). Squirrel sees the world from the CTO/CEO perspective, so he added a very nice set of observations on how and why DS teams can work better with their organisations.
The video for this session will be public in January along with the rest of the conference videos. I’m writing up the results into a short report with co-conspirator and co-host Lauren Oldja of PyDataNYC; Squirrel also shared views and stories from his many and varied adventures.
The report will be shared via NumFOCUS in the new year and will contain valuable advice for anyone in a leadership role. I will share the write-up here for you all as well.
Reply to this email if you’d like early access (no cost - you’ll get a preview copy soon; do tell me what pains you face that you’re hoping to learn about too).
Open source - Fixing OCR issues with OCRmyPDF, OCRfixr and a bit of BERT
A colleague tried to OCR messy invoices, to augment an existing business OCR process. Trying pytesseract (the usual choice), they realised that a lot of work might be needed to deskew, denoise and clean up the images. Thankfully another colleague had recently used OCRmyPDF and “it just works”. On a test set of hundreds of previously-rejected PDFs (rejected due to image noise), they recovered useful data from 80% of the cases.
OCRmyPDF includes features to automatically clean up images, then passes the result to Tesseract. The clean-up includes automatic deskewing (to deal with slightly rotated scans) and denoising (to get rid of speckles which can make an “l” look like an “i”, a “1” or a “t”, for example). This process uses unpaper, which is new to me.
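If you want to try it, here’s a minimal sketch of the same idea via OCRmyPDF’s Python API (my sketch, not the colleague’s code; the file names are placeholders, and the deskew/clean options mirror the --deskew/--clean command-line flags):

```python
# Minimal sketch: add an OCR text layer to a noisy scanned PDF.
# File names are hypothetical; clean is the step that runs unpaper
# on each page image before the text is extracted.
import ocrmypdf

ocrmypdf.ocr(
    "scanned_invoice.pdf",    # hypothetical input scan
    "invoice_with_text.pdf",  # output PDF with a searchable text layer
    deskew=True,              # straighten slightly rotated pages
    clean=True,               # clean page images with unpaper before OCR
)
```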
Preprocessing a PDF before pushing it through OCR is a common problem and it can suck up a surprising amount of time.
Do let me know by replying if this tool helps!
Another major time-suck is post-processing the results (e.g. detecting and fixing typos like “Oear” for “Dear”), and often this is very problem-specific.
OCRfixr looks for errors like “The birds flevv south” (double-v) and fixes them to “The birds flew south”; it works by choosing spelling corrections that fit the context of the surrounding words. It uses symspellpy for spelling correction and the BERT language model for context checking.
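From memory of the project’s README, usage is a one-liner along these lines (treat this as a sketch rather than a guaranteed API):

```python
# Sketch based on the OCRfixr README: spellcheck() flags misspellings
# with symspellpy, then BERT picks the contextually plausible fix.
from ocrfixr import spellcheck

text = "The birds flevv south"
print(spellcheck(text).fix())  # expected output: "The birds flew south"
```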
Do you have any libraries or processes to share that make this kind of data clean-up easier? I’d happily take a look if you’d reply to this email with some links.
New sklearn release
I see that a Christmas release of sklearn is out as v1.2 (changelog). Highlights include a significantly faster k-nearest-neighbours, faster data-fitness checks, class_weight support in HistGradientBoostingClassifier for uneven classes, and what seem to be a lot of smaller speed improvements.
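The class_weight addition is handy for imbalanced problems; here’s a minimal sketch on synthetic placeholder data:

```python
# Sketch: class_weight="balanced" (new in sklearn 1.2) reweights samples
# inversely to class frequency, which helps when one class is rare.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

# Hypothetical imbalanced two-class dataset (roughly 95% / 5% split)
X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=0)

clf = HistGradientBoostingClassifier(class_weight="balanced")
clf.fit(X, y)
print(clf.score(X, y))
```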
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on Twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it’ll go to all 1,500 subscribers 3 times over 6 weeks; subsequent posts are charged.
Web Scraper/Data Engineer
Looking to hire a full-time Web Scraper to join our Data Engineering team. Scrapy and SQL are desired skills; as a bonus you’ll have an interest in art.
- Rate:
- Location: London, Remote
- Contact: s.mohamoud@heni.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at Kantar - Media Division
Our data science team (~15 people in London, ~5 in Brazil) is hiring a Data Scientist to work on data science and machine learning projects in the media industry, working with broadcasters, advertisers, media agencies and industry committees. A major area of focus for this role involves the integration of multiple data sources, with a view to creating new Hybrid Datasets that leverage and extract the best out of the original ones.
The role requires advanced programming skills with Python and related data science libraries (pandas, numpy, scipy, sklearn, etc.) to develop new and enhance existing mathematical, statistical and machine learning algorithms; assist our integration with cloud-based computing and storage solutions, through Databricks and Azure; take the lead in data science projects and run independent project management; and interact with colleagues, clients, and partners as appropriate to identify product requirements.
- Rate: £40k - £50K plus benefits
- Location: Hybrid remote and in Farringdon/Clerkenwell offices
- Contact: emiliano.cancellieri@kantar.com 07856089264 (please mention this list when you get in touch)
- Side reading: link
Data Scientist at Kantar - Media Division in London
Our data science team (~15 people in London, ~5 in Brazil) is hiring a Data Scientist to work on cross-media audience measurement, consumer targeting, market research and in-depth intelligence into paid, owned and earned media. Translating client requests into concrete items and the best modelling choices is a crucial part of this job. Attention to detail, critical thinking, and clear communication are vital skills.
The role requires advanced programming skills with Python and related data science libraries (pandas, numpy, scipy, etc.) to develop new and enhance existing mathematical, statistical and machine learning algorithms; assist our integration with cloud-based computing and storage solutions, through Databricks and Azure; take the lead in data science projects and run independent project management; and interact with colleagues, clients, and partners as appropriate to identify product requirements.
- Rate: £40K-£50K + benefits
- Location: Hybrid remote and in Clerkenwell/Farringdon offices
- Contact: emiliano.cancellieri@kantar.com 07856089264 (please mention this list when you get in touch)
- Side reading: link
Senior Cloud Platform Applications Engineer, Medidata
Our team at Medidata is hiring a Senior Cloud Platform Applications Engineer in the London office. Medidata is a massive software company for clinical trials, and our team focuses on developing the Sensor Cloud, a technology for ingesting, normalizing, and analyzing physiological data collected from wearable sensors and remote devices. We offer a good salary and great benefits!
- Rate:
- Location: Hammersmith, London
- Contact: kmachadogamboa@mdsol.com (please mention this list when you get in touch)
- Side reading: link
Natural Language Processing Engineer
In this role, NLP engineers will:
- Collaborate with a multicultural team of engineers whose focus is building information extraction pipelines operating on various biomedical texts
- Leverage a wide variety of techniques ranging from linguistic rules to transformers and deep neural networks in their day-to-day work
- Research, experiment with and implement state-of-the-art approaches to named entity recognition, relationship extraction, entity linking and document classification
- Work with professionally curated biomedical text data to both evaluate and continuously iterate on NLP solutions
- Produce performant and production-quality code following best practices adopted by the team
- Improve (in performance, accuracy, scalability, security etc.) existing solutions to NLP problems
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 2+ years experience working as an NLP or ML Engineer solving problems related to text processing
- Excellent knowledge of Python and related libraries for working with data and training models (e.g. pandas, PyTorch)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of modern natural language processing tools and techniques
- Excellent understanding of the fundamentals of machine learning
- A product and user-centric mindset
- Rate:
- Location: London/Hybrid
- Contact: david.sparks@causaly.com 07730 893 999 (please mention this list when you get in touch)
- Side reading: link, link
Senior Data Engineer at Causaly
We are looking for a Senior Data Engineer to join our Applied AI team.
- Gather and understand data based on business requirements
- Import big data (millions of records) from various formats (e.g. CSV, XML, SQL, JSON) to BigQuery
- Process data on BigQuery using SQL, i.e. sanitize fields, aggregate records, combine with external data sources
- Implement and maintain highly performant data pipelines with the industry’s best practices and technologies for scalability, fault tolerance and reliability
- Build the necessary tools for monitoring, auditing, exporting and gleaning insights from our data pipelines
- Work with multiple stakeholders including software, machine learning, NLP and knowledge engineers, data curation specialists, and product owners to ensure all teams have a good understanding of the data and are using it in the right way
Successful candidates will have:
- Master’s degree in Computer Science, Mathematics or a related technical field
- 5+ years experience in backend data processing and data pipelines
- Excellent knowledge of Python and related libraries for working with data (e.g. pandas, Airflow)
- Solid understanding of modern software development practices (testing, version control, documentation, etc.)
- Excellent knowledge of data processing principles
- A product and user-centric mindset
- Proficiency in Git version control