LLM papers for robustness & the return of the 5-LOWer charity car
Further below are 6 jobs including:
- Python AI Engineer at Qualis Flow Ltd
- Data Scientist at ANNA
- Machine Learning Engineer - Lantern AI (Private Equity tech)
- Data Scientist (FTC - 12 Months)
- Senior Data Scientist
- Principal Data Scientist - MOPAC
Recently I've been working on the ARC AGI 2024 challenge, a set of geometric reasoning tasks that are "easy for humans but hard, maybe impossible, for LLMs". The creators' goal is to show that LLMs are a dead end and that we need new approaches to move towards artificial general intelligence.
I'm less interested in the philosophical goals and more interested in a test bed for automatically generating Python programs for rather hard tasks, to learn more about the limits of LLMs. A happy consequence is that I'm learning robust strategies for getting "better, consistent and correct" answers from the current crop of LLMs. I list a couple of papers below that have caught my eye and I'll share more in another issue.
Further below I give an update on Team Hawaii Five-LOWer, our charity car raising money for Alzheimer's Society (last year we raised £4k for Parkinson's Research). We're driving off at the start of September and would welcome your donation, if you like the charities we support.
I've got an update on our High Performance Python (3rd ed) and have a course link for the September Software Engineering for Data Scientists class.
High Performance Python 3rd ed (early release)
Micha Gorelick and I have also been working on the 3rd edition of our High Performance Python. The first 5 chapters are out and we've just updated "Compiling to C". Cython has seen a 3.0 release with more flexibility, including the use of pure Python type annotations (with limitations, but an interesting development). Micha is overhauling the GPU side of that chapter. Tools covered include Cython, PyPy, Numba and PyTorch.
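For illustration, here's a minimal sketch of Cython 3's pure-Python typing mode (my example here, not one from the book): the same file runs under plain CPython, while compiling it with Cython uses the annotations to emit typed C code.

```python
# A minimal sketch (not from the book) of Cython 3 pure-Python typing:
# under CPython the cython shadow module makes this ordinary Python,
# while cythonize uses the annotations to generate typed C.
import cython

def integrate_x_squared(a: cython.double, b: cython.double, n: cython.int) -> cython.double:
    dx: cython.double = (b - a) / n
    total: cython.double = 0.0
    i: cython.int
    for i in range(n):
        x: cython.double = a + i * dx
        total += x * x * dx  # rectangle rule for f(x) = x**2
    return total
```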
Work in the last couple of months included updating the profilers we cover (including Py-Spy, line-profiler, memory-profiler, scalene) plus a deep dive into the Python VM to look at hashing and related algorithmic design.
The manuscript should be fully updated by Christmas, with the book out early in 2025.
ARC AGI & some interesting LLM papers
A month back at PyDataLondon I gave a short talk on my first attempts at the Kaggle ARC AGI challenge - using llama.cpp (see my earlier 30 min talk) to run Llama 3 70B and 8B locally, split between my CPU and external GPU.
In the July lightning talk (linked) I spoke on the challenges of getting a textual representation that fitted this visual-like matrix task and what happens when you take raw Python emitted from an LLM and run it in your environment.
The success rate of these generated programs is low, so I typically generate many - often 100 (which can take hours, or overnight, on a large model) - then analyse what happened.
"Defensive execution" became the word of the day as emitted Python can cover all sorts of sins (see the slide on the exception handler I wrote). One of the solutions appeared to "score well" - it turns out it instantiated a global variable that matched my score-keeping variable and it over-wrote it. Others imported modules that didn't exist, hallucinated functions from other modules or just packed generators of lists in containers and made nonsense.
Running a fresh Python process via joblib for each evaluation protected against this, along with taking copies of anything being passed in, to avoid modifications to the original problem setup (!).
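Here's a minimal sketch of that defensive pattern (simplified, and using multiprocessing's "spawn" context rather than joblib, to make the fresh-process isolation explicit):

```python
# A minimal sketch of defensive execution: each LLM-generated solve() runs
# in a fresh interpreter so stray globals, bad imports and mutations of the
# inputs can't leak back into the scoring loop. Uses multiprocessing "spawn"
# here rather than the joblib harness described above.
from copy import deepcopy
from multiprocessing import get_context

def _worker(code_string, grid, queue):
    namespace = {}
    try:
        exec(code_string, namespace)               # define the candidate's solve()
        queue.put(("ok", namespace["solve"](grid)))
    except Exception as err:                       # emitted code fails in many ways
        queue.put(("error", repr(err)))

def run_candidate(code_string, grid, timeout=10):
    ctx = get_context("spawn")                     # "spawn" = a genuinely fresh process
    queue = ctx.Queue()
    # args are pickled into the new process; deepcopy makes the
    # copy-before-passing intent explicit
    proc = ctx.Process(target=_worker, args=(code_string, deepcopy(grid), queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():                            # kill runaway candidates
        proc.terminate()
        return ("timeout", None)
    return queue.get() if not queue.empty() else ("crashed", None)
```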
Interestingly, Llama 3 8B is "actually not bad" at writing code to solve the simpler challenges; its success rate is low (circa 8% on one challenge) but it is fast, producing complete solutions in around 30 seconds on my laptop plus external GPU. 70B has a far higher success rate (circa 33% on the same challenge), but since it spills out of my 24GB of VRAM it takes minutes to complete each answer.
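As a back-of-envelope comparison (the 70B timing below is an assumption on my part - "minutes" covers a range), dividing the time per sample by the success rate gives an expected wall-clock cost per correct program:

```python
# Back-of-envelope: expected seconds of sampling per correct program.
# The 70B per-sample time is an assumed figure for illustration only.
def seconds_per_correct(seconds_per_sample, success_rate):
    return seconds_per_sample / success_rate

print(seconds_per_correct(30, 0.08))   # Llama 3 8B: ~375 s per correct answer
print(seconds_per_correct(180, 0.33))  # Llama 3 70B (assumed 3 min/sample): ~545 s
```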
Analysing some of the results was the topic of my second lightning talk, which I might share in the next issue.
"More Agents is All You Need", Li, Zhang, Yu, Fu, Ye, 2024
In More Agents Is All You Need Li et al. contrast single instances of an LLM with ensembles of up to 35 instances, using a sample-and-vote method to choose a final answer.
In the following diagram you can see that Llama 2 13B in an ensemble of 20+ queries will beat a single Llama 2 70B call, and that an ensemble of 15+ Llama 2 70B calls will beat a single gpt-3.5-turbo instance.
Llama 2 13B instances will comfortably run on a home GPU setup (like my NVIDIA RTX 3090 with 24GB of VRAM), enabling fast and private experimentation, useful when dealing with confidential client data.
This is the first paper where I'd realised that ensembles of models were likely to be a future path - if you've got a task where you can choose or merge the multiple answers to make a single final answer.
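The core loop is simple; here's a minimal sketch of sample-and-vote as I read it (ask_llm is a stand-in for whatever completion call you use, not the authors' code):

```python
# A minimal sketch of sample-and-vote: query the same model N times at
# non-zero temperature and keep the most common answer.
from collections import Counter

def sample_and_vote(ask_llm, prompt, n_samples=20):
    """ask_llm is assumed to be any callable returning one answer string."""
    answers = [ask_llm(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples  # the answer plus its share of the vote
```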
"Self-Consistentcy Improves Chain of Thought Reasoning in LMs", Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdery, Zhou 2023
The paper above builds on the earlier Self-Consistency Improves Chain of Thought Reasoning in Language Models, where Wang et al. showed the novel idea of asking one LLM a question, then drawing diverse completions by sampling from the token-decoding process:
A salient aspect of humanity is that people think differently. It is natural to suppose that in tasks requiring deliberate thinking, there are likely several ways to attack the problem. We propose that such a process can be simulated in language models via sampling from the language model’s decoder. ... such solutions are less likely to arrive at the same answer. That is, we hypothesize that correct reasoning processes, even if they are diverse, tend to have greater agreement in their final answer than incorrect processes.
Overall their method (sampling 20 or 40 answers) beats Chain of Thought in the majority of their results, by a large margin. You can see a diagram below which explains the idea - different reasoning paths are generated and the majority answer is used. The reasoning paths themselves are discarded and I wonder if they'd be useful when determining the "best" answer (maybe a majority vote isn't the only approach worth using).
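Here's a hedged sketch of the self-consistency loop as I understand it (the answer-extraction regex assumes the prompt asks for answers ending "The answer is <number>"; sample_completion is a stand-in for a temperature-sampled call):

```python
# A sketch of self-consistency: sample several chain-of-thought completions,
# keep only the extracted final answer from each, and majority-vote on those.
import re
from collections import Counter

def extract_final_answer(completion):
    # Assumes the prompt asks the model to finish with "The answer is <X>."
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion, re.IGNORECASE)
    return match.group(1) if match else None

def self_consistent_answer(sample_completion, prompt, n_paths=20):
    answers = [extract_final_answer(sample_completion(prompt)) for _ in range(n_paths)]
    answers = [a for a in answers if a is not None]  # reasoning paths are discarded
    return Counter(answers).most_common(1)[0][0] if answers else None
```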
They show results for 1-40 samples; typically 20-40 samples show strongly diminishing returns on the benchmarks used. For the ARC challenge I've been using 50-300 samples given the often-low success rate, so I'm guessing the number of samples needed varies with the difficulty of the task.
Amongst other approaches they try an ensemble of 3 different prompts and vary the order of examples given in the prompt, noting that sampling 40 answers with a fixed prompt beat the other methods (though I can't help thinking that the computational budget varied a lot between the experiments, and their best results came from "the most expensive" computational path).
In the appendix they build ensembles of multiple different models (e.g. LaMDA, PaLM and GPT) and show that this is typically worse than sampling multiple paths from the best model - they note that the "worse model drags the others down".
A further result shows how imperfect prompts, which would result in lower success rates, can be improved using this majority-vote method - and that consistency between the results correlates with the quality of the prompt. In effect, poor prompts show up because they'll have a higher variance in results compared to "better prompts".
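A minimal sketch of that diagnostic (my reading of their observation, not their code) - measure how often the sampled answers agree and treat low agreement as a weak-prompt signal:

```python
# Agreement across samples as a prompt-quality signal: near 1.0 means the
# samples agree, values near chance suggest a noisy or poor prompt.
from collections import Counter

def answer_agreement(answers):
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

print(answer_agreement(["42", "42", "42", "17"]))  # 0.75 - fairly consistent
print(answer_agreement(["42", "17", "9", "3"]))    # 0.25 - high variance, weak prompt?
```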
For me this is another clue that sampling from many runs is a useful idea, both for getting a robust answer and for analysing the quality of my prompting process.
Have you got any LLM paper recommendations on improving the quality and consistency of results that I should know about?
Return of the Hawaii Five-(now)LOWer
Last weekend we did a lot of work on our charity car - you may remember that last year we drove for Parkinson's Research, and this year (in a month) we drive for the Alzheimer's Society.
I got my hands dirty under the bonnet, learning to diagnose the Mass Air Flow hot-wire system on a running car (fine) and the engine crankshaft sensor (not so fine at all). We also found more bare twisted wires, which we soldered, and replaced brake calipers and CV joints.
By fixing the sensors and a bound brake caliper we got the fuel economy up from 20 to 32mpg - 60% more efficient - which will help with the fuel bill this year (we drive 2,500km each way).
We say a huge thank you for our first donations. If you support charities like these, we'd love to receive your donation.
Last year we raised £4k for Parkinson's Research and we plan to do the same or better for the Alzheimer's Society, as there's a family impact for one of my co-drivers. If you'd like to donate, see our pictures and progress here.
We've got a Baywatch theme for this year, my co-drivers are choosing my outfit, I dread to think what I'm going to be wearing.
Training
The links on my training page show the September dates for my Software Engineering for Data Scientists class - if you fill in my training notification form I'll happily send you a 10% discount code valid for this year. In a few months I'll be running:
- Software Engineering for Data Scientists (September, dates on the training page) - increase your speed of delivery by modularising, running code reviews, testing for increased confidence and preparing for production from early on; we'll also discuss types and how much testing you need.
Recent package updates from PyPI
This is a random sample from a set of popular projects that have been updated very recently.
- pandera 0.20.3 A light-weight and flexible data validation and testing tool for statistical data objects.
- hypothesis 6.110.1 A library for property-based testing
- ruff 0.5.7 An extremely fast Python linter and code formatter, written in Rust.
- plotly 5.23.0 An open-source, interactive data visualization library for Python
- numpy 2.0.1 Fundamental package for array computing in Python
- sktime 0.31.0 A unified framework for machine learning with time series
- lightgbm 4.5.0 LightGBM Python Package
- polars 1.4.1 Blazingly fast DataFrame library
- pymc 5.16.2 Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with PyTensor
- pytest 8.3.2 pytest: simple powerful testing with Python
Footnotes
See recent issues of this newsletter for a dive back in time. Subscribe via the NotANumber site.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you're growing your team then reply to this and we can add a relevant job here. This list has 1,700+ subscribers. Your first job listing is free and it'll go to all 1,700+ subscribers 3 times over 6 weeks; subsequent posts are charged.
Python AI Engineer at Qualis Flow Ltd
We are seeking a talented Python Engineer who is eager to contribute to building a sustainable future. If you are passionate about sustainability, believe that cutting-edge technology can address tangible issues, value radical transparency and unstoppable tenacity, and encourage collaboration and curiosity within your team, this opportunity is tailor-made for you.
- Rate: up to £60k
- Location: Remote with London Office once every 2 weeks
- Contact: sam.joseph@qualisflow.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at ANNA
The role is focused on helping the product team protect customers from financial crime. The first focus will be APP (Authorised Push Payment) fraud prevention.
You will assess the system in place, evaluate the performance of models offered by vendors, and create a roadmap for system improvements to:
- uncover patterns and irregularities in data through statistical tools that could indicate fraud;
- use predictive modelling to spot possible fraudulent transactions and behaviours;
- develop concise reports explaining findings, risks, and recommended actions;
- team up with other departments to improve overall security and fraud detection.
Responsibilities
- Defining and evaluating key metrics for the AI part of our product and identifying levers to improve them
- Developing pipelines for data annotation using internal assessors as well as crowdsourcing platforms
- Implementing appropriate Machine Learning algorithms and deployable models
- Using BI tools to monitor key product metrics and performance of ML models
- Rate:
- Location: London, Cardiff or Remote
- Contact: Andrei Smirnov, Head of Data Science, asmirnov@anna.money (please mention this list when you get in touch)
- Side reading: link
Machine Learning Engineer - Lantern AI (Private Equity tech)
We are seeking a machine learning engineer who will work on projects to reconcile data between dissimilar sources, build copilots for accountants, and prepare industry benchmarks. This role requires both data science expertise and the ability to produce production-quality Python code.
- Rate: Up to £120k
- Location: London Liverpool Street - hybrid 1 day per week
- Contact: https://lantern.bamboohr.com/careers/56?source=aWQ9MTY%3D (please mention this list when you get in touch)
- Side reading: link
Data Scientist (FTC - 12 Months)
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
This post is ideally suited to someone who is keen to break into the world of data science, with excellent statistical, technical, and interpersonal skills. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £39,604.00 - £45,411.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
This post is ideally suited to someone with data science experience who wants to be hands on in a data role, is a curious flexible thinker, with excellent statistical, technical, and interpersonal skills. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £46,597.00 - £53,209.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link
Principal Data Scientist - MOPAC
We’re setting up a Data Science team at MOPAC. We’ll be building the capabilities as we go: establishing analytical best practice; setting up the infrastructure; and demonstrating data science potential. You’ll be unlocking knowledge into what causes the overall trends of crime in London and compiling the code to derive meaning from survey and consultation data…and a huge number of other things!
The Principal Data Scientist role is ideally suited to someone with management experience, with excellent data science knowledge to lead the way on our journey into data science. If you're passionate about using data for the benefit of all Londoners, apply today!
- Rate: £55,009.00 - £62,860.00 per annum
- Location: Remote (One day a month in Union Street, London)
- Contact: anthony.duguay@mopac.london.gov.uk (please mention this list when you get in touch)
- Side reading: link