"Asking good questions about data science projects", ArcticDB, and Pandas in the browser
“Asking good questions about data science projects”, ChatGPT hallucinations, ArcticDB, and Pandas in the browser
Further below are 7 jobs including Senior/Lead/Head Data Science and Data Engineering positions at companies like Coefficient AI, MadeWithIntent, ECMWF, Global Canopy and Ocado…
Apologies for being slow on this issue - once again (infant == MASSIVE_TIME_SUCK) # True.
I see in The Guardian that our UK government is investing 900M GBP in a new supercomputer plus annual 1M GBP awards for groundbreaking AI research. It is nice to see us starting to play catch-up. After so much dismal news in the UK (thankfully this week’s Budget statement didn’t send the market into a dive, unlike the last one), seeing some focus on AI innovation feels welcome.
Below I share some reflections on a recent keynote talk, Man Group’s new ArcticDB release for large datasets, confident hallucinations in ChatGPT (with an opportunity for DS teams), PyScript for “python and DS in the browser” and some developer tools.
Software Engineering for Data Science course in a month
On April 12-14 I’ll run my next Software Engineering for Data Scientists course - you should attend if you need to write tests and move from Notebooks to building maintainable library code.
This course is aimed at anyone who is unconfident in testing their code with py.test and Pandera, wants to review code with colleagues and wants to understand the path system behind how Python loads modules and your code. Please reply directly to this if you have questions. After this course you’ll be more confident talking to engineers and you’ll know how to structure your projects for growth and collaboration.
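As a small taste of the path system mentioned above, here is a minimal sketch using only the standard library (it is an illustration, not course material, and the helper name `where_is` is made up). It prints the locations Python searches on import and reports where a given module would be loaded from.

```python
import importlib.util
import sys


def where_is(module_name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Directories (and zip files) searched, in order, when you `import something`
    for entry in sys.path:
        print(entry)
    print(where_is("json"))  # the standard library json package's __init__.py
    print(where_is("not_a_real_module_xyz"))  # None, nothing on sys.path matches
```

If your own package shows up as None here, it usually means its parent directory is not on `sys.path`, which is often the first thing to check when moving code out of a Notebook.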
Keynote talk for Man Group’s Investment Tech Conference
Recently I had the pleasure of speaking at the Man Group’s Investment Tech Conference, giving a keynote talk on “bad ways and good ways to build a Data Science team”. The audience was external partners and clients of the hedge fund, who were at various stages in their data science journey.
A takeaway message here was “are we asking the right question?”. I’ve seen so many teams rush for cool tech (DNNs!), insist on doing ML because that’s what an exec asked for, or build something complex with no thought for how it’ll be used or how it will impact a business process, so I’m increasingly cautious in how I approach a DS project now.
Are we asking the right question that leads to some sort of actionable outcome? Is there a quick and easy way to build a “v1” that’s not too smart, but can be diagnosed and offers some value that can be taken advantage of by users? Can we record and measure improvement? Can we see a reason to iterate to a better v2?
This led to some thoughtful discussion with teams about how to figure out why they might use the data that they have, how it might be stored and how they might build up a collection of “good questions” to ask to help them forward on their journey. As a speaker gift I got a rather lovely resin-encased research group Xeon chip (2009 vintage, well used over a decade).
One interesting note from one of the other talks was about the continuing rise in cost for cloud-based systems (Azure, AWS and the like), using evidence shared by other industry observers. Whilst “going to the cloud” was a common strategy, the price charged for the underlying hardware has hardly reduced in a decade. Once you’re locked in, there’s no need to compete.
The comparative price of rented hardware in a datacentre was far cheaper and a suggested strategy was to scale quickly using the commodity cloud infrastructure, but to prepare to move away to a private data centre once you’d stabilised and could bank your gains. I hadn’t given much thought to cloud costs (even though they’re eye-watering for a couple of my clients) - once you’re locked in, it is rather hard to move away, but maybe we need to rethink this architectural approach to scaling teams?
Man Group’s Arctic database
Also noted at the conference was a significant new release of the Man Group’s mostly-open Arctic database. Way-back-when it was a MongoDB based tick-data store. Over many years it has evolved and now is backend-agnostic and can serve a Pandas DataFrame very quickly on the client-side with no server component, from a variety of backends. I sat in on a demo and figured for large data retrieval loads this might interest some of you. Press release here.
The main code for ArcticDB is public with a Business Source License (you need an agreement for some use cases), while some of the backend systems are private and will be licensed. There’s a simple animation on that GitHub page that shows it in action. Some of the highlights noted include “Efficiently index and query time-series data across billions of rows”, “Travel back in time to see previous versions of your data”, “Accelerate analytics through concurrency in the C++ data-processing engine”.
If you have a large-scale dataset used with Pandas and you need quick read/write access, definitely for financial data but also for other arbitrary datasets, maybe this is worth checking.
PyData
The CfP for PyDataLondon 2023 (June 2-4) closed this week. We had nearly 200 submissions and the committee hopes to make an announcement in a month or so. The PyData UK Slack is building up and there’s been some nice chatter about ChatGPT recently.
ChatGPT and hallucinations
I’ve been playing with ChatGPT (sign-up here if you want to try, it is easier to play with free credit than it used to be) for a couple of tasks. Without much effort I hit two obvious points of error, maybe these are useful to keep in mind if your colleagues are overawed with “what might be possible” (it is impressive, but the limits feel tricky).
My wife had been asked to write a bio for her new senior role in a start-up (head of engineering, she’s a bit of a mover and shaker). She wondered if ChatGPT might do the job so I demo’d from my account:
Q: Write a biography in 1 paragraph for Ian Ozsvald
A: Ian Ozsvald is a data scientist, teacher, and entrepreneur who has made significant contributions to the field of machine learning and data science. With a PhD in Computer Science and an extensive background in software engineering, Ian has worked with some of the world’s leading companies and organizations to build data-driven solutions and help them make better decisions. He is also the founder of PyData London, one of the largest and most active data science communities in the world, and has authored several books and online courses on data science and machine learning. Ian’s passion for data science and commitment to sharing his knowledge and expertise with others have made him a respected and influential figure in the field.
I love the answer but I don’t have a PhD. Curious about why my High Performance Python book wasn’t called out more clearly I tried further:
Q: What books has Ian Ozsvald written?
A: Ian Ozsvald has authored several books on data science, machine learning, and related topics. Some of his notable publications include:
- High Performance Python: Practical Performant Programming for Humans # YES!
- O’Reilly’s “Intermediate Python” video course # NO (but I did make videos for Pluralsight, other colleagues did these ones)
- O’Reilly’s “Data Wrangling with Python” video course # NO (but I have done related past conference talks)
- O’Reilly’s “Real-time Image and Video Processing in Python” video course # NO (very much no, but I did run a computer vision startup in a past life)
Ian is also a co-author of the book “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, … # NO (I’m not Wes)
1 out of 5 isn’t so hot. It is darned impressive that it found my book - there’s a set of footnotes in Wikipedia for my book (conda, Anaconda, CuPy) but no page for the book, my co-author or me (and Wikipedia is a major source of raw text for the GPT models). Another common source of training data is Common Crawl and I’m guessing that has O’Reilly’s books and possibly my blog.
I then asked it to do the same for my wife in a fresh chat:
Q: Write a biography in 1 paragraph for Emily Ozsvald
A: Emily Ozsvald is a data scientist, consultant, and educator based in London, UK. She holds a Ph.D. in Computer Science from the University of Southampton and has worked on various data-related projects throughout her career, including machine learning, natural language processing, and data visualization. Emily is also the founder of PyData London, a community of data professionals who organize regular meetups and conferences to share knowledge and best practices in data science. She is a frequent speaker at international conferences and a co-author of the book “High Performance Python: Practical Performant Programming for Humans.”
So now my wife is my co-author (sorry Micha Gorelick!) - she isn’t. This time she has a PhD from Southampton University; on a previous fresh ask she had the same from Oxford (she has neither, nor do I). She’s not a data scientist and didn’t co-found PyDataLondon. She does share my surname and I’ve got a lot more written about me online than my wife does - I’m pretty sure the latent vectors around “Ozsvald” are wedded to me (this name is reasonably uncommon).
The previous attempt to ask for a bio did mention data science but didn’t include the “other achievements”. An obvious improvement is likely here as Microsoft invested heavily in OpenAI and owns LinkedIn, so LI profiles would be an easy data source. It doesn’t change the confident lie that has been told here.
Interestingly my bio generation was stable but asking for Emily’s resulted in lots of variation (e.g. she might have had a PhD or a Degree from a variety of UK institutions) - I wonder if the instability is useful as at least 1 measure of low-confidence?
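To make the “instability as a low-confidence signal” idea concrete, here is a minimal sketch. It assumes you have already asked the same question in several fresh chats and pulled out the one fact you care about by hand; the function name `agreement_score` and the example answers are made up for illustration and are not real ChatGPT output.

```python
from collections import Counter


def agreement_score(answers):
    """Fraction of answers that match the most common answer (1.0 means fully stable)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)


if __name__ == "__main__":
    # Hypothetical extracted facts from four fresh chats each (illustrative only)
    stable = ["PhD", "PhD", "PhD", "PhD"]
    unstable = ["Southampton", "Oxford", "Southampton", "a degree"]
    print(agreement_score(stable))    # 1.0
    print(agreement_score(unstable))  # 0.5
```

A low score flags an answer to distrust, but a high score proves nothing: my own bio was stable and still gave me a PhD I don’t have.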
Next I asked about aspell for spell-checking a text file (actually - this email). ChatGPT was helpful at first:
Q: Write some shell code that makes aspell read a text file and print a list of unrecognised words
A: Here's an example shell code that uses aspell to read a text file and print a list of unrecognized words:
----
# Define the name of the text file
file_name="example.txt"
# Use aspell to find unrecognized words in the file
aspell --lang=en_US --mode=none list < "$file_name" | sort | uniq
----
(I've truncated ChatGPT's explanation which seemed correct)
----
However, when I asked it a follow-up question it confidently gave me an answer that cannot work because, as best I can see, the argument does not exist (at least not on Linux, and it is mostly a Linux tool):
Q: How do I add a word to aspell?
A: To add a word to aspell, you can use the --add command-line option
followed by the word you want to add. Here's an example:
---
aspell --lang=en_US --add myword
---
This command adds the word "myword" to the English (US) dictionary used by aspell.
If you want to add a word to a different dictionary,
you can specify the appropriate language code instead of "en_US".
(I've truncated the rest of the ChatGPT explanation which was very confident)
However if I run this at the command line I get:
$ aspell --lang=en_US --add myword # Error: The key "add" is unknown.
Maybe there’s a different version of aspell that does have this command, but searching online suggests that this isn’t the right solution. What does work is to follow the official instructions to make a personal dictionary and reference it with:
$ aspell --home-dir=. --personal=.aspell.en.pws ...
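For anyone wanting to script the personal dictionary step, here is a minimal Python sketch that writes a word list in aspell’s personal word list format: a `personal_ws-1.1 <lang> <count>` header line followed by one word per line. The function name `write_personal_wordlist` is hypothetical, and I’m assuming plain words with no spaces; check the aspell manual if you need other encodings.

```python
from pathlib import Path


def write_personal_wordlist(path, words, lang="en"):
    """Write words to an aspell personal word list, replacing any existing file."""
    # De-duplicate and sort so the file is stable between runs
    unique = sorted({w.strip() for w in words if w.strip()})
    # Header format: personal_ws-1.1 <lang> <word count>; aspell treats the count as a hint
    header = f"personal_ws-1.1 {lang} {len(unique)}"
    Path(path).write_text("\n".join([header, *unique]) + "\n", encoding="utf-8")
    return unique
```

Calling `write_personal_wordlist(".aspell.en.pws", ["ArcticDB", "PyScript"])` in the current directory produces a file that the `--home-dir=.` and `--personal=.aspell.en.pws` options above can then pick up.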
I’m still super impressed, but can’t help but feel that the confident-but-very-wrong answers are a continuing blocker for sane use.
How are you using it? I’ve asked in the PyDataUK Slack General channel and a couple of useful tasks have been shared. I’m cautious about using it with client projects due to the confident lies. Figuring out how to use it for idea generation (within a domain you can verify) or for knowledge extraction (where you cross your fingers and have a separate validation path) feels safe enough.
What are your thoughts? After years of web scraping challenges my friend Simon tweets about sending the DOM (or a screen cap of a website) to GPT4 to ask it questions - that feels both reasonable and bonkers.
I strongly suspect that there’s a whole subfield of work for data science teams here working with non-DS teams who are jumping on the bandwagon, helping them understand the risks and estimating the value of trials. DS teams often fail to properly estimate business value or run sensibly-sized experiments (else your confidence intervals are so large you can’t draw useful conclusions) and non-DS teams are likely to struggle even more with this.
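To illustrate the point about experiment size, here is a minimal sketch using the textbook normal approximation for a proportion (my choice for simplicity, other interval methods exist). The 70% success rate and the trial sizes are made-up numbers and `ci_half_width` is a hypothetical helper name.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0 or not 0 <= p <= 1:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Hypothetical trial: a tool gives an acceptable answer about 70% of the time
    for n in (20, 100, 500, 2000):
        print(f"n={n:5d}  70% +/- {100 * ci_half_width(0.7, n):.1f} points")
```

With only 20 examples the interval is roughly plus or minus 20 points, which is too wide to tell a useful tool from a poor one; it takes around 2000 examples to get to about plus or minus 2 points.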
I’ll have more to say on this in the next newsletter.
Are you engaging with non-DS teams to help them trial tools like these in sensible ways? I’d love to hear if so.
Open source - PyScript and some new tools
PyScript in the browser including pandas!
So this blew my mind - apparently Pyodide has been extended into PyScript which lets you run a modern Python installation in the browser, including a lot of C-based libraries (e.g. numpy and pandas and matplotlib and more), with JavaScript and all the usual web UI goodness.
A talented developer colleague at Anaconda (hi Nicholas!) noted that the entire toolchain on the Python side is compiled from C into wasm (WebAssembly) using Emscripten via LLVM (as also used in the Numba compiler) - so numpy and friends all get compiled to run in a browser-based virtual machine. As noted in Wikipedia - Emscripten:
Emscripten allows applications and libraries written in C or C++ to be compiled ahead of time and run efficiently in web browsers, typically at speeds comparable to or faster than interpreted or dynamically compiled JavaScript. It even emulates an entire POSIX operating system, enabling programmers to use functions from the C standard library (libc)
Pyodide is “a Python distribution for the browser and Node.js based on WebAssembly”: it is a port of normal CPython to WebAssembly (wasm) using Emscripten. PyScript goes a step further:
PyScript is a framework that allows users to create rich Python applications in the browser using HTML’s interface and the power of Pyodide, WASM, and modern web technologies. The PyScript framework provides users at every experience level with access to an expressive, easy-to-learn programming language with countless applications.
The list of supported packages is extensive including many of our DS tools (e.g. OpenCV, NLTK, Gensim).
Go to the examples and select “REPL2” and type the following plus shift-enter:
import numpy as np
np.__version__ # you'll get 1.22.4 which is pretty recent
On that same page you can select the altair (interactive - click the bars) and Matplotlib (non-interactive) visual demos. The NYCity Taxi Tips render with slides is really nice.
The write-up on python.land explains a bit more and gives a different build which also includes pandas in the REPL. towardsdatascience has a deeper article if you want to dig in.
Currently you develop in text files which feels like it’ll be pretty painful, but there’s no reason this won’t turn into whatever tooling JS folks use when developing directly in the browser. I suspect this’ll be amazing for taking data analysis to the browser for interactive results. Right now it is very early stage, but if you want some more pretty demos to show in your organisation to help folk realise just how integrated Python is everywhere, this is a pretty good place to start.
I’m curious about whether doing large array operations (e.g. 10GB creations) inside a browser is as fast as doing it inside a regular OS process. I suspect there may be overheads at the wasm layers as it wasn’t designed for fast large-array creation (and I could totally be wrong too). I assume that deploying much of Python to the web will massively increase adoption again and will bring more data science into the hands of more users. Crazy times.
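If you want to test that suspicion yourself, here is a minimal timing sketch that could be pasted into both a regular Python session and the browser REPL to compare. It is an untested idea rather than a benchmark I have run, the function name `time_array_creation` is made up, and the sizes are kept far below 10GB since I’d expect a browser tab to have much less memory available.

```python
import time

import numpy as np


def time_array_creation(n_bytes, repeats=3):
    """Best-of-N wall-clock seconds to allocate and fill a float64 array of about n_bytes."""
    n_items = max(1, n_bytes // 8)  # float64 uses 8 bytes per item
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        arr = np.ones(n_items, dtype=np.float64)  # ones rather than empty so memory is touched
        arr += 1.0  # one simple in-place pass over the data
        best = min(best, time.perf_counter() - start)
        del arr  # release the memory before the next repeat
    return best


if __name__ == "__main__":
    for size_mb in (1, 10, 100):
        seconds = time_array_creation(size_mb * 1_000_000)
        print(f"{size_mb:4d} MB: {seconds:.4f} s")
```

Taking the best of a few repeats reduces noise from other processes, but timer resolution may differ between environments so treat small differences with suspicion.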
Developer-focused tools for help with testing, documentation and configuration
My friend Chris Withers has shared some of his very mature open source tools, maybe they’ll help in your engineering workflow:
- TestFixtures helps with automated testing with better diagnostics on nested/complex structures (sadly not for DataFrames but definitely for the likes of dict and list), plus help for logging, streams, exceptions, Twisted, Django and more
- Sybil extends doctest and enables testable code in comments written in Markdown, MyST and more
- Configurator enables layered configurations and as noted “It provides an easy interface for accessing configuration information sourced from overlaid config files or mapped in from environment variables or command line options.”
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I’m also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers, if you’re growing your team then reply to this and we can add a relevant job here. This list has 1,500+ subscribers. Your first job listing is free and it’ll go to all 1,500 subscribers 3 times over 6 weeks, subsequent posts are charged.
Scientist/Senior Scientist for Machine Learning at ECMWF
The European Centre for Medium-range Weather Forecasts (ECMWF) is looking to hire multiple (senior) scientists for machine learning. We’re now in 3 locations across Europe (with adjusted salaries), working on improving weather forecasts for Europe.
If you have experience on the traditional HPC + ML or ML + Earth Systems Science side of things, or are more of an ML coordinator, you may be a great fit. Especially if you’re part of an under-represented minority, please consider that the job post is written for both senior and regular scientists and you may not have to match every single bullet point to be a great fit. Our machine learning ecosystem: a lot of deep learning from CNNs to transformers and GNNs on lots of GPUs.
The ECMWF itself is an intergovernmental organisation created in 1975 by a group of European nations. Today it is supported by 35 Member and Co-operating States, mostly in Europe, and holds the world’s largest archive of meteorological data. So definitely a unique place to work and push state-of-the-art machine learning these days.
- Rate: £68,374 GBP to €103,517 EUR NET of tax annual basic salary + other benefits
- Location: Reading, UK / Bonn, Germany / Bologna, Italy
- Contact: jesper.dramsch@ecmwf.int (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Data Scientist at Made With Intent, Permanent, Remote
We’re a new, revolutionary platform that helps online retailers show a bit more care for their customers. We give retailers the ability to understand their customer intent by listening to hundreds of micro-behaviours on site, modelling them together, and creating a predictor metric between 0 and 1. We predict which customers will buy, and where they are in their journey, and serve them appropriate content that nudges their intent in a caring, nurturing and educating way.
The ideal data scientist candidate will have at least 2 years of hands-on programming experience in AI, a postgraduate degree, experience in NLP, Tensorflow/Pytorch, MLOps, AWS and advanced SQL skills. They must work with unstructured data, have excellent communication skills, and collaborate with stakeholders. We value those who learn new tools quickly and hustle when required.
- Rate: £70k-£90k
- Location: Remote
- Contact: tom@madewithintent.ai (please mention this list when you get in touch)
- Side reading: link, link
Data Scientist
M&G plc is an international savings and investments business. As at 30 June 2022, we had £348.9 billion of assets under management and administration.
Analytics – Data Science Team is looking for a Data Scientist to work on projects ranging from Quantitative Finance to NLP. Some recent projects include:
- ML applications in ESG data
- Topic modelling and sentiment analysis
- Portfolio Optimization
The work will revolve around the following:
- Build data ingestion pipelines (with data sourced from SFTP & third party APIs)
- Explore data and extract insights using Machine Learning models like Random Forest, XGBoost and (sometimes) Neural Networks
- Productionize the solution (build CI/CD pipelines with the help of friendly DevOps engineer)
- Rate:
- Location: London
- Contact: sarunas.girdenas@mandg.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist at Coefficient Systems Ltd
We are looking for an enthusiastic and pragmatic Senior Data Scientist with 5+ years’ experience to join the Coefficient team full-time. We are a “full-stack” data consultancy delivering end-to-end data science, engineering and ML solutions. We’re passionate about open source, open data, agile delivery and building a culture of excellence.
This is our first Senior Data Science role, so you can expect to work closely with the CEO and to take a tech lead role for some of our projects. You’ll be at the heart of project delivery including hands-on coding, code reviews, delivering Python workshops, and mentoring others. You’ll be working on projects with multiple clients across different industries, including clients in the UK public sector, financial services, healthcare, app startups and beyond.
Our goal is to promote a diverse, inclusive and empowering culture at Coefficient with people who enjoy sharing their knowledge and passion with others. We aim to be best-in-class at what we do, and we want to work with people who share that same attitude.
- Rate: £80-90K
- Location: London/remote (we meet in London several times per month)
- Contact: jobs@coefficient.ai (please mention this list when you get in touch)
- Side reading: link, link, link
(Senior) Data Scientist, Ocado Technology, Permanent, Hatfield UK
Ocado Technology has developed an end-to-end retail solution, the Ocado Smart Platform (OSP), with which it serves a growing list of major partner organisations across the globe. We currently have three open roles in the ecommerce stream.
Our team focuses on machine learning and optimisation problems for the web shop - from recommending products to customers, to ranking optimisation and intelligent substitutions. Our data is stored in Google BigQuery, we work primarily in Python for machine learning and use frameworks such as Apache Beam and TensorFlow. We are looking for someone with experience in developing and optimising data science products as we seek to improve the personalisation capabilities and their performance for OSP.
- Rate:
- Location: Hatfield, UK
- Contact: edward.leming@ocado.com (please mention this list when you get in touch)
- Side reading: link, link, link
Request for proposals – Database Integration Project
Global Canopy is looking for a consultancy that will work with us on the second phase of development of our Forest IQ database. This groundbreaking project brings together a number of leading environmental organisations, and the best available data on corporate performance on deforestation, to help financial institutions understand their exposure and move rapidly towards deforestation-free portfolios.
- Rate: The maximum budget available for this work is NOK 1,000,000 (including VAT), approximately £80,000 GBP.
- Location: Oxford/Remote
- Contact: tenders@globalcanopy.org (please mention this list when you get in touch)
- Side reading: link
Machine Learning Researcher - Ocado Technology, Permanent
We are looking for a Machine Learning Scientist specialised in Reinforcement Learning who can help us improve our autonomous bot control systems through the development of novel algorithms. The ideal candidate will draw on previous expertise from academia or equivalent practical experience. The role will report to the Head of Data in Fulfilment but will be given the appropriate latitude and autonomy to focus purely on this outcome. Roles and responsibilities will include: