The Anvil

June 28, 2021

How good will your forecast be?

Hi All,

It’s been a long time - I’ve been busy with client work and other non-Forecast Forge projects. This email contains two things:

  1. A case study on forecasting for PR and PPC with BJ Enoch
  2. How to estimate how good your forecast is going to be; this is very important when managing expectations for your bosses and clients.

PR and PPC forecasting with BJ Enoch

You can watch a video of my conversation with BJ at https://www.youtube.com/watch?v=fNKUqX4BSqs

We are within 1.5% of the forecasted values for the month so far. And that gives us a high degree of confidence to use Forecast Forge as our primary method of evaluating spend and conversions and revenue.
— BJ Enoch

BJ Enoch is the director of digital marketing at The CE Shop, which provides career education for realtors. He has also used Forecast Forge in his previous roles as a Director of SEO and as a Digital Director at an agency.

He has been a Forecast Forge customer for a little under six months now and has been using it for a variety of things, from forecasting PR metrics to PPC and everything in between!

We knew that we would have an article featuring us going out in ESPN on [e.g.] Tuesday and we had enough historical data that we were able to forecast how much traffic and how many leads we would get from it. And that would tell us how many staff we'd need to handle the demand.
— BJ Enoch

BJ’s first use for Forecast Forge was for a sports marketing group. He was able to forecast the traffic and lead flow that would be generated from their PR activity.

The client had a long history of getting themselves mentioned in places like ESPN so BJ was able to use this historical data to predict the impact upcoming articles would have. This was important to the client because it enabled them to better prepare staffing levels to handle the increased demand.

To do this for yourself you will need a database of historical press mentions for your client or business so that you can match the dates up with business data like traffic, conversions or revenue. Not all mentions are created equal so some kind of estimate of the “impact” of a press piece (e.g. reach, audience etc.) will help to make your forecast better.
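
For anyone who prefers to see the shape of that data work in code, here is a minimal sketch in Python/pandas of building such a regressor column; the file names, column names and the choice of publication reach as the impact score are illustrative assumptions, not details from BJ’s setup.

    import pandas as pd

    # Hypothetical inputs: one file of daily sessions and one file of press mentions
    # with an estimated "impact" score (here, publication reach in millions).
    sessions = pd.read_csv("daily_sessions.csv", parse_dates=["date"])   # columns: date, sessions
    mentions = pd.read_csv("press_mentions.csv", parse_dates=["date"])   # columns: date, publication, reach_millions

    # Sum the impact of all mentions on each day so the regressor is one number per date
    impact = mentions.groupby("date")["reach_millions"].sum().rename("press_impact")

    # Join onto the full date range and treat days with no coverage as zero impact
    regressor = (
        sessions.set_index("date")
        .join(impact)
        .fillna({"press_impact": 0})
        .reset_index()
    )

    # One row per day with columns date, sessions, press_impact; the press_impact
    # column is what you would paste in as a regressor alongside your dates.
    regressor.to_csv("press_regressor.csv", index=False)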

BJ has also used Forecast Forge to predict the return from PPC campaigns. He started off with a simple forecasting model using the date and amount of media spend to predict the conversion value. The results were not great.

But one of the advantages of using Forecast Forge is that if the forecast looks wrong or seems to be missing something then you can dive right in and start making improvements. BJ started to add more regressors to the forecast and things improved. Things improved a lot.

We had to get a little creative with how we come up with those regressor values. Now our forecasted values are within 1.5% of the actuals.
— BJ Enoch
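
Forecast Forge does all of this inside Google Sheets, but as a rough sketch of the same idea in code, this is what a spend regressor looks like using the open-source Prophet library as a stand-in for the forecasting engine; the file, column names and 90-day horizon are made up for illustration.

    import pandas as pd
    from prophet import Prophet

    # Hypothetical daily history: date, media spend and conversion value
    df = pd.read_csv("ppc_history.csv", parse_dates=["date"])
    df = df.rename(columns={"date": "ds", "conversion_value": "y"})

    # A model that only knows the date...
    m_simple = Prophet()
    m_simple.fit(df[["ds", "y"]])

    # ...versus one that is also given media spend as an extra regressor
    m_spend = Prophet()
    m_spend.add_regressor("spend")
    m_spend.fit(df[["ds", "y", "spend"]])

    # To forecast the future you must also supply spend values for the future dates;
    # here the historical mean is used as a crude placeholder for planned spend
    future = m_spend.make_future_dataframe(periods=90)
    future = future.merge(df[["ds", "spend"]], on="ds", how="left")
    future["spend"] = future["spend"].fillna(df["spend"].mean())

    forecast = m_spend.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())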

The regressors BJ used were, for the most part, very specific to the business sector he was forecasting for. One regressor with more general application helped the forecast deal with events that usually happen every year but didn’t in 2020 because of Covid.

I have written previously about forecasting the effect of lockdowns but BJ’s situation was significantly more challenging because different US states have had different lockdown rules at different times.

To solve this problem BJ collated lockdown dates from multiple sources so that he could use a regressor column for lockdown rules in each state. This is a great example of the kind of data work that will add value for years to come and that, once it has been done, can be used in many different places.
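
A minimal sketch of what that collation might produce, assuming you have a table of lockdown start and end dates per state and a rough estimate of each state’s share of the audience (the file, shares and date range below are illustrative, not BJ’s actual data):

    import pandas as pd

    # Hypothetical table with one row per state lockdown period: state, start, end
    lockdowns = pd.read_csv("state_lockdowns.csv", parse_dates=["start", "end"])

    # Rough share of the business's audience in each state (illustrative numbers)
    audience_share = {"CA": 0.20, "TX": 0.15, "NY": 0.12, "FL": 0.10}

    dates = pd.date_range("2020-01-01", "2021-06-30", freq="D")
    regressor = pd.Series(0.0, index=dates, name="share_under_lockdown")

    # For each lockdown period, add that state's audience share to every day it covers
    for row in lockdowns.itertuples():
        regressor.loc[row.start : row.end] += audience_share.get(row.state, 0.0)

    # One value per day: the estimated share of the audience under lockdown rules,
    # ready to paste in as a regressor column next to your dates
    regressor.to_csv("lockdown_regressor.csv", header=True)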

How good is my forecast?

One of the challenging things about using a machine learning system like Forecast Forge is learning how much you can trust the results. Obviously the system isn’t infallible, so it is important to have an idea of where it can go wrong, and how badly, when you are talking through a forecast with your boss or clients.

One way that people familiarise themselves with Forecast Forge is to run a few backtests. A backtest is where you make a forecast for a period that has already happened so that you can compare the forecast against the actual values.

For example you might make a forecast for the first six months of 2021 using data from 2020 and earlier. Then you can compare what the forecast said would happen against the real data.

         Use data from this period      To predict here
  ________________________________________~~~~~~~~~~~~~
  |------------|------------|------------|------------|--------????|?????????
2016         2017         2018         2019         2020       Then use the same
                                                               methodology here
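
If you wanted to run the same kind of backtest in code rather than in the sheet, a minimal sketch using the open-source Prophet library as a stand-in might look like this; the data file is hypothetical and MAPE is just one possible error metric.

    import pandas as pd
    from prophet import Prophet

    # Hypothetical daily history with columns ds (date) and y (e.g. sessions)
    df = pd.read_csv("daily_sessions.csv", parse_dates=["ds"])

    # Pretend it is 1 January 2021: train on everything up to the end of 2020
    train = df[df["ds"] <= "2020-12-31"]
    actuals = df[(df["ds"] > "2020-12-31") & (df["ds"] <= "2021-06-30")]

    model = Prophet()
    model.fit(train)

    # Forecast the first six months of 2021 and compare against what really happened
    future = model.make_future_dataframe(periods=len(actuals))
    forecast = model.predict(future)

    comparison = forecast[["ds", "yhat"]].merge(actuals, on="ds")
    mape = (abs(comparison["y"] - comparison["yhat"]) / comparison["y"]).mean() * 100
    print(f"Backtest MAPE: {mape:.1f}%")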

Sometimes when you do this you will see the forecast do something dumb and you will think “stupid forecast. That obviously can’t be right because we launched the new range then” or something like that - the reason will probably be very specific to your industry or website. If you can convert your reasoning here into a regressor column then you can use this to make the forecast better.

[Image: backtesting several different forecasting methods with different transforms and regressors]

Once you start doing this there is a temptation to keep adding and adjusting regressor columns to make your backtest fit as well as possible.

Figuring these things out and seeing your error metrics improve is one of the most fun things about data science (and have no doubt that, even if you’re doing it in a spreadsheet, this is data science) but it also opens the door to the bad practice of overfitting. If you overfit then the forecast will perform a lot better on a backtest than it will in the future.

An extreme example of this is if you were trying to make a revenue forecast and, for your backtest, used the actual revenue as a regressor column. In this case Forecast Forge would learn to use that column to make an extremely good prediction! But when it came to making a real forecast you would have to use a different method to fill in the future revenue values for the regressor column, and then either your real forecast would be rubbish or you would have figured out a way to forecast revenue without using Forecast Forge - good for you!

There are a few different techniques you can use to help avoid this trap and to estimate how good your forecast will be in the real world outside of a backtest.

1. Separate test and validation

The idea here is that you test out different regressors and transforms to see which performs well on your test data. Then, once you have picked the best, you run a final test on validation data to decide whether or not to proceed with using the forecast.

For example, if you want to make a 6 month forecast you could use data up until July 2020 to make a forecast through to December 2020. July to December 2020 is your test data, where you can play around with different regressors and transforms to see what gives you the best result.

Then use the same method with data through to December 2020 to make a forecast for January 2021 to June 2021. This is your validation data, where you check that your forecasting method gives good enough results on out of sample data; it is also your “best guess” estimate of how well the forecast will perform on unseen data. For this reason you must not peek at the validation data before the final test. If you do, then it isn’t really validation but just a poorly used form of test data.
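
Here is a minimal sketch of that discipline in code, with a naive placeholder standing in for whatever combination of regressors and transforms you are actually evaluating; the file name is hypothetical and the dates follow the example above.

    import numpy as np
    import pandas as pd

    def fit_and_forecast(train, horizon_days):
        # Stand-in for whatever method you are evaluating; here it is just a naive
        # forecast that repeats the last 365 days of history. Your chosen regressors
        # and transforms would replace this.
        last_year = train["y"].iloc[-365:].to_numpy()
        reps = int(np.ceil(horizon_days / len(last_year)))
        return np.tile(last_year, reps)[:horizon_days]

    def mape(actual, predicted):
        return float(np.mean(np.abs(actual - predicted) / actual) * 100)

    df = pd.read_csv("daily_revenue.csv", parse_dates=["ds"])  # hypothetical: columns ds, y

    # Test period: iterate on regressors and transforms against July-December 2020
    train = df[df["ds"] <= "2020-06-30"]
    test = df[(df["ds"] > "2020-06-30") & (df["ds"] <= "2020-12-31")]
    print("Test MAPE:", mape(test["y"].to_numpy(), fit_and_forecast(train, len(test))))

    # Validation period: look at this ONCE, after the method is frozen, to get an
    # honest estimate of how the forecast will do on unseen data
    train_full = df[df["ds"] <= "2020-12-31"]
    valid = df[(df["ds"] > "2020-12-31") & (df["ds"] <= "2021-06-30")]
    print("Validation MAPE:", mape(valid["y"].to_numpy(), fit_and_forecast(train_full, len(valid))))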

This is a good method but it does have two main flaws:

  1. For longer forecasts you end up needing a lot of training data. For example, if you want to make a 12 month forecast then you’d use June 2020 to June 2021 as your validation data, June 2019 to June 2020 as your test data, and then you’d still need at least a couple of years’ training data before that to use when figuring out the best model. So, in this example, you’d need good quality data going back to at least June 2017, and June 2016 would be even better. This is a long time to have a consistent, non-broken analytics implementation; according to surveys done by Dipesh Shah, less than 20% of businesses have this data (see a case study on how he uses Forecast Forge with one of his ecommerce clients).
  2. At the moment people care a lot about how well their forecasts model things like the effect of the covid-19 pandemic. But it is likely that all or most of the pandemic will occur only within your testing and validation periods so the model will perform very poorly during testing because there is no pandemic data during the training period. There is no way around this because the validation period has to be after the test period and the test period has to come after the training data.

2. Make models for similar things at the same time

If you have to make forecasts for a variety of similar things (e.g. similar brands in one market or the same brand in different countries) then you can use some of them for testing and improving your methodology and some of them for validating how well you expect it to perform on unseen data.

For example, if you are forecasting for five similar brands (e.g. they are different faces of the same conglomerate) you can find the forecasting method that works best with three of them and then save the final two for validation, checking how well your method is likely to perform on unseen data.
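
In code the discipline looks something like the sketch below; the brand names and error numbers are invented, and in practice each number would come from a real backtest like the one earlier in this email.

    # Hypothetical backtest errors (MAPE, %) for each candidate method on the three
    # test brands; in practice these numbers come from running real backtests.
    backtest_errors = {
        "baseline":                {"brand_a": 12.4, "brand_b": 9.8, "brand_c": 14.1},
        "with_spend_regressor":    {"brand_a": 6.2,  "brand_b": 5.9, "brand_c": 7.4},
        "with_lockdown_regressor": {"brand_a": 8.1,  "brand_b": 7.3, "brand_c": 9.0},
    }

    # Pick the method with the lowest average error across the test brands
    best_method = min(
        backtest_errors,
        key=lambda m: sum(backtest_errors[m].values()) / len(backtest_errors[m]),
    )

    # Only now run the chosen method on the two held-back validation brands, once,
    # and report that error as your estimate of real-world performance
    validation_brands = ["brand_d", "brand_e"]
    print(f"Evaluate {best_method!r} on {validation_brands} for the final check")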

This method only works well if the things you are forecasting are similar enough that the same method will work well for all of them. For example if you try this with brands that are completely different then the performance of a forecast for brand A will not tell you very much about how it will perform for brand C.

Whether things are “similar enough” for this technique to work is quite a tough question to answer. In many ways the typical use case for Forecast Forge makes this even harder: if you were trying to forecast 1000 metrics then you’d code the model in Python or R, and it would be obvious that you wouldn’t have time to customise each model, so you’d just use whichever method performed best on average. But part of the point of Forecast Forge is to make it easier for people to customise their forecasts, and Forecast Forge users tend to be making fewer than ten forecasts at once, so the temptation to customise and overfit forecasts is a lot harder to avoid.

Both of these techniques rely on the patterns and features that the Forecast Forge algorithm learned from the past continuing into the future. As we’ve all seen in 2020, sometimes crazy things happen, and then your forecast will be rubbish unless you knew about the craziness in advance.
