Interview Part 2 with Kaggle Competition Expert, more on the State of Higher Performance Python
Thoughts
Did you know that 2iQresearch are hiring for a Senior Dev and a Quant Dev? See their ad and many more below.
We held PyDataGlobal a week back, with over 2,000 attendees! The schedule was great; videos are available to attendees now and will be public in a couple of months.
I ran an Executives at PyData session to discuss pressing issues for team leaders. We discussed how a cross-functional embedded DS team was generally more productive than an isolated team. A thornier topic was "how do we build trust that open source tools will be around in 10 years - we can't buy a service contract on sklearn!". Hopefully NumFOCUS will have a growing part to play in this trust issue.
I had the pleasure of being interviewed by Douglas Squirrel and Jeff on their TroubleShootingAgile podcast. We talked for 20 minutes about ways to "de-silo-ify a DS team" and I shared some client experiences. De-silo-ification is super important! Their episodes focus on making agile teams ship - they're solid and worth a listen.
On the 30th I get to chat with members of the Dask team about the state of higher performance Python. You can join us live to ask questions. I'm particularly keen to ask how other teams migrate from Pandas to Dask.
Interview with Kaggle Expert Mani Sarkar - Part 2
In the last newsletter I shared the first part of an interview with Kaggle competition expert Mani Sarkar. Mani talked about knowledge acquisition and shared the tools he uses to investigate data. Now Mani talks about how he gains trust in his models and his favourite modelling tools, and shares some tips. If you find this interview useful, please share it on social media - there's a lot of good information in here.
How do you go about gaining trust in your model?
When I have understood the problem statement and digested some of the discussions and notebooks, I decide it's time for my base model, and I have a couple of approaches to creating it. Previously I used my wrapper library from the last issue, which linked to other libraries for ML training, validation, scoring etc., but these days I have a new wrapper library that uses pycaret under the hood. So far it does many things and is, in my view, a better way forward. There is also the pytorch tabular library, which does similar things but applies Deep Learning to your data and can build models from image, columnar, numeric and text data - I haven't explored it yet but it's on my list of things to investigate.
pycaret takes care of the cross-validation steps, offers different types of CV methods, and lets you write your own custom processes and attach them to any of its own, including the cross-validation methods. This is a very important step to make sure we have a fair and more correct insight into our data and its capabilities.
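For concreteness, here's a minimal sketch of what such a pycaret baseline with cross-validation might look like - the file name, target column and fold settings are illustrative assumptions, not Mani's actual pipeline:

```python
# A minimal pycaret classification baseline - file and column names
# are illustrative assumptions, not Mani's actual pipeline.
import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_csv("train.csv")  # hypothetical Kaggle training file

# setup() handles preprocessing and cross-validation configuration;
# fold_strategy accepts built-in names or a custom CV splitter object.
setup(
    data=df,
    target="target",                 # assumed target column name
    fold=5,                          # 5-fold cross-validation
    fold_strategy="stratifiedkfold",
    session_id=42,                   # fixed seed for reproducibility
)

# Train a suite of candidate models and rank them by CV score
best_model = compare_models()
print(best_model)
```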
There is also the check-if-data-is-correct step that I do more and more these days - before I build my baseline model, or sometimes after it, it all depends. Kaggle (and also real-world) datasets can contain incorrect, bad or dubious data, either by mistake or deliberately. Such data can lead to incorrect polarity for positive and negative examples, and to models giving confusing or less accurate results, plainly because they learned from incorrect data.
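As an illustration, a few cheap sanity checks can catch this kind of problem early - the column names and label values below are assumptions for the sketch, not a prescribed method:

```python
# A sketch of pre-baseline data sanity checks - column names and
# expected label values are assumptions for illustration.
import pandas as pd

df = pd.read_csv("train.csv")

# Identical feature rows mapped to different labels are a classic
# source of "incorrect polarity" in the training data.
feature_cols = [c for c in df.columns if c != "target"]
conflicts = df.groupby(feature_cols)["target"].nunique()
print("feature rows with conflicting labels:", (conflicts > 1).sum())

# Simple null and label-range checks before any modelling
print(df.isna().mean().sort_values(ascending=False).head())
assert df["target"].isin([0, 1]).all(), "unexpected label values"
```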
There is some amount of work and analysis needed to separate the wheat from the chaff. Once we are through with it, we have better insight into the data, the models it can support, why we are getting the results we are getting, and the upper bound on those results - how much higher can the score get, and if not, why not?
Diagnosing predictions is something I do a lot. I have what I call a submission analysis system, which compares a baseline submission to others, including those from ensemble models or from other competitors. This lets me see which ones are doing better and helps me understand why. I also backtrack from model to data (like we would with a variant of TDD: Test Driven Development) - I'm still working on this idea, but it does help me understand TP, FP, TN and FN in classification problems, and the equivalents for regression problems.
Diagnosing predictions can be done with both training data (seen) and test data (unseen). It can be a bit tricky with the latter, though not as impossible as it may seem - the conclusions drawn from unseen data just carry a lower level of assurance.
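A toy version of such a submission analysis might look like the following - the file and column names are hypothetical, not Mani's actual system:

```python
# A hypothetical sketch of submission analysis: compare a baseline
# submission against a candidate to see where they disagree.
import pandas as pd

base = pd.read_csv("submission_baseline.csv")   # columns: id, prediction
cand = pd.read_csv("submission_candidate.csv")

merged = base.merge(cand, on="id", suffixes=("_base", "_cand"))

# For class-label submissions, direct disagreement is informative;
# for probabilistic submissions you'd compare thresholded values
# or look at correlation instead.
merged["disagree"] = merged["prediction_base"] != merged["prediction_cand"]
print(f"{merged['disagree'].mean():.1%} of rows differ between submissions")

# The disagreement rows point at the slices of data where an
# ensemble could gain (or lose) relative to the baseline.
print(merged[merged["disagree"]].head())
```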
What's your favourite machine learning tool and why?
Wow! This is a tough one to answer. I like Dataiku DSS (DSS stands for Data Science Studio) - it's flexible and programmable. But libraries or packages like scikit-learn are more or less a complete package for many ML tasks, and then we have pytorch, which is also a very useful library, growing and covering wider domains in the space. It's hard to have a favourite, hence I have been working on my wrapper library, which brings the best of breed into one library; I can use a high-level DSL to talk to it, get the ML and DS processes working, and move forward with my tasks.
There's one more tool to know about, called Valohai. I like their work - very developer friendly. I wrote some blog posts about using their service from your CLI. You can do all your heavy-duty training right from your keyboard even if your machine doesn't have high ML-grade specifications.
If there's one tip you could give to someone who is early in their Kaggle journey, what would it be?
Don't try to rig the system or take advantage of data leakage in Kaggle competitions; this only dampens our learning opportunities, even if it may give us a temporary sense of success.
Focus on learning more than on winning, medals, ranks or scores. Focus on sharing and collaboration, learning from others' work and giving credit to it. Take notes and share them. Leave the place better than it was before you got there. All these principles will help in every walk of life, not just Kaggle.
Don't just do what the herd does; also try out things that the herd or run-of-the-mill approaches do not - be different and unique in your approaches. See the best tips that come from the grandmasters themselves.
NOTE from Ian - that video is very good; there's a nice set of tips shared, including re-running XGBoost with different seeds, building an optimal model per cross-validation fold to get a number of high-quality models, generally using XGBoost, and augmenting your data to expand the training set.
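For concreteness, the seed-averaging tip might look something like this sketch - file and column names are assumptions, not the grandmasters' exact setup:

```python
# A sketch of the "re-run XGBoost with different seeds" tip:
# average predictions across models that differ only in random seed.
# File and column names are assumptions for illustration.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
X, y = train.drop(columns=["id", "target"]), train["target"]
X_test = test.drop(columns=["id"])

preds = []
for seed in range(5):
    model = XGBClassifier(n_estimators=500, random_state=seed)
    model.fit(X, y)
    preds.append(model.predict_proba(X_test)[:, 1])

# Seed-averaged probabilities are usually more stable than any
# single run's predictions.
final = np.mean(preds, axis=0)
```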
Footnotes
See recent issues of this newsletter for a dive back in time.
About Ian Ozsvald - author of High Performance Python (2nd edition), trainer for Higher Performance Python, Successful Data Science Projects and Software Engineering for Data Scientists, team coach and strategic advisor. I'm also on Twitter, LinkedIn and GitHub.
Now some jobs…
Jobs are provided by readers; if you're growing your team then reply to this email and we can add a relevant job here. This list has 1,400+ subscribers. Your first job listing is free and it'll go to all 1,400+ subscribers three times over six weeks; subsequent posts are charged.
Senior Python Developer (Lisbon) - 2iqresearch
We are looking for an experienced Python Developer with a strong background in Finance to join us as one of our first engineers in the core team.
You will play a key role in designing and maintaining analytics/predictions and visualizations for our new data platform, “Alpha Terminal.” It bundles 2iQ’s data and analytics into one easy-to-use product, offering fundamental investors a range of powerful insights.
Responsibilities: Work with the Quant and Product teams to design, build and manage critical infrastructure while automating everything with code. Initially, this role will be based in our Lisbon office; however, there is the potential for flexible working arrangements in the future. The role may suit an individual looking for a change of scenery or a better work-life balance.
Requirements:
- Experience in a DevOps or software engineering role
- Strong background with Linux, K8s and Docker (or other container tech)
- High proficiency in a language such as Python, Java, or Go
Nice to have:
- Cloud or Big Data experience (Elastic, Aerospike, ClickHouse, KDB+, …)
- Experience with message buses
- Spark and/or Dask knowledge
- Rate: 80-90k
- Location: Lisbon, Portugal
- Contact: jobs@2iqresearch.com (please mention this list when you get in touch)
Quantitative Developer – Python (Lisbon) - 2iqresearch
We are seeking a highly talented Quantitative Developer with a solid background in Python to join our platform analytics team. In this role, you will help implement, support and run the hybrid compute infrastructure that manages all research and production workloads.
You will work closely with the Quant and Product teams to support and develop the code running in our production systems. These systems are the building blocks of the “Alpha Terminal”, a tool for fundamental investors to explore the market. You will also build and optimise data analytics services and integrate the data to support the quantitative team. Adapting research prototypes of models to the production environment is also a key responsibility of this role. The role is based in our Lisbon office; however, flexible working arrangements as well as a hybrid-model transition period are available for all candidates.
Requirements:
- Experience in numerical Python and SQL
- Working knowledge of the Pandas / NumPy libraries
- Dask and/or Spark knowledge
- CI/CD knowledge
Nice to have:
- Docker (or other containerization) knowledge
- Cloud or Big Data experience (Parquet, PyArrow, Aerospike, ClickHouse, KDB+, …)
- Knowledge of AI/ML libraries (TensorFlow, PyTorch, scikit-learn, …)
- Rate:
- Location: Lisbon, Portugal
- Contact: jobs@2iqresearch.com (please mention this list when you get in touch)
Product Analyst at JW Player
Over half a billion videos are watched across millions of websites on a JW Player video player every day. Our product teams leverage data coming from our player to measure success, prioritize our next steps, and envision new possibilities for the thousands of video publishers we serve daily across the web. We iterate quickly, conduct frequent experiments as part of product development, and seek to be data driven in everything we do.
As a Product Analyst on the JW Player Data Science & Product Analytics team, you will work closely with product managers, engineers, and data scientists to develop insights that inform product decisions and strategy. Your findings will impact the next generation of JW Player products, from our flagship video player and video platform to our video recommendations service and other data products. You’ll play a critical role in improving these products and guiding our future development efforts.
- Rate:
- Location: Remote within the United States
- Contact: olga@jwplayer.com (please mention this list when you get in touch)
- Side reading: link
Senior Data Scientist at JW Player
JW Player powers billions of video plays every week across a wide spanning web of broadcasters and video publishers with a diverse set of audiences and content types. Leveraging the vast stream of data sent by our flagship player, the Data Science team works in close collaboration with adjacent teams to improve our existing products, drive sound decision making, and develop new data products that bring value to our customers in both the video publishing and video advertising spaces. We iterate quickly, conduct frequent experiments, and seek to be data driven in everything we do.
As a Senior Data Scientist at JW Player, you will be joining a collaborative, creative, multidisciplinary team of scientists, engineers, and data analysts responsible for research and development, product analytics, and running production machine learning models that make tens of millions of predictions every day.
- Rate:
- Location: Remote within the United States
- Contact: olga@jwplayer.com (please mention this list when you get in touch)
- Side reading: link, link, link
Research Advocate - Rasa
At Rasa we're hiring for a bunch of engineering roles. We're a friendly, remote company with many interesting problems to solve. We're building open-source tools that are used globally to build virtual assistants. Want to invest in developer experience, non-English NLP and scalable machine learning? Then there's a lot to do!
Feel free to reach out to Vincent @fishnets88 if you have any questions.
- Rate: https://rasa.com/careers/#jobs
- Location: EU Remote
- Contact: vincentwarmerdam@gmail.com (please mention this list when you get in touch)
- Side reading: link, link
Senior Software Engineer (Full Stack) at Carbon Re
Carbon Re is an AI research and development company dedicated to removing Gigatons of CO2 (equivalent) from humanity’s emissions each year. We aim to do so by optimizing production processes, redesigning manufacturing systems, developing new control processes, and accelerating the development of new climate-friendly materials and systems. Carbon Re is an equal opportunity employer. We are still a small team and are committed to growing in an inclusive manner.
- Rate:
- Location: London Bridge, London
- Contact: careers@carbonre.tech (please mention this list when you get in touch)
- Side reading: link, link
Principal Data Scientist at National Grid
As Principal Data Scientist, your key role will be to establish, define and implement data science solutions in order to deliver business value by making the optimal decisions to ensure efficient and cost-effective performance. You will build data science tools, providing business experts throughout Gas Transmission (GT) with the technology and expertise to unlock and exploit the information we hold to support the effective running of the business.
- Rate:
- Location: Warwick
- Contact: adnan.fiaz@nationalgrid.com (please mention this list when you get in touch)
- Side reading: link
Data Scientist at Ripjar, Permanent, Remote
We're looking for experienced, highly motivated Data Scientists to support the research and development of Ripjar's analytics and data products. You will carry out data analysis tasks to develop Ripjar’s understanding of relevant data and will develop, train and evaluate machine learning models that can be integrated into Ripjar's software products and data processing pipelines.
You will have a strong technical and theoretical background, with a strong understanding of statistics and statistical models. You will be proficient in at least one programming language, preferably Python. You will have a good understanding of machine learning and large-scale data analysis, and will be comfortable working with complex data at scale.
- Rate: £50,000 - £75,000
- Location: Cheltenham, Bristol or Remote
- Contact: anthony.birleybrown@ripjar.com 07498 778 597 (please mention this list when you get in touch)
- Side reading: link
Data Scientist @ Good With
As the founding team data scientist, you'll develop Good With's intelligent data analysis and recommendation engines, supporting voice and natural language interaction with users.
Python and open source technologies are the overarching strategic choice for the data processing, analysis, machine learning and recommendation engines.
You’ll work at the heart of a dynamic, multidisciplinary agile team to develop a platform and infrastructure connecting a voice-enabled intelligent mobile app, financial OpenBanking data sources, state-of-the-art intelligent analytics and a real-time recommendation engine to deliver personalised financial guidance to young and vulnerable adults.
As a founding member, you’ll get shares in an innovative business, supported by Innovate UK and Oxford Innovation, with ambitions and roadmap to scale internationally.
Supported by Advisors: Cambridge / FinHealthTech, Paypal/Venmo & Robinhood Brand Exec, Fintech4Good CTO & cxpartners CEO.
Working with: EPIC e-health programme for financial wellbeing & ICO Sandbox for ‘user always owns data’ approaches.
- Rate: £50-65K + Shares in the company
- Location: Flexible, remote working. Cornwall HQ
- Contact: gabriela@goodwith.co (please mention this list when you get in touch)
- Side reading: link
Researcher in Surrogate-Based Optimisation
The Computational Optimisation Group has a two-year research opening (either pre- or post-doctoral) in surrogate-based optimisation. The role intersects computational optimisation, machine learning, and open-source software.
- Rate: £41,593- £49,210 (postdoctoral); £36,694 - £39,888 (predoctoral)
- Location: South Kensington, London
- Contact: r.misener@imperial.ac.uk (please mention this list when you get in touch)
- Side reading: link, link, link
Senior Python Engineer @ Semantic Partners
Senior Python Engineer - Knowledge Graph project for a major European Bank. Semantic Partners are seeking several skilled engineers with the following skillset: Python, Django, Flask etc., RESTful APIs, CI/CD, containerisation, Docker, Kubernetes, NoSQL, BDD.
You'll be joining a project team focusing on building a Knowledge Graph so an interest in Graph technologies and any experience of specific triple store systems would be a big plus, but more important is a desire to get into semantic engineering.
- Rate: 500-650/day
- Location: Remote
- Contact: dan.collier@semanticpartners.com (please mention this list when you get in touch)
- Side reading: link