Forecast monthly values over time

  • I've been reading through the many "forecast" keyword articles and forum posts here to see if I can find anything that might help.

    I'm trying to forecast a value over time, with a simple two-column table (YearMonth, Value), using just SQL code. I've found a couple of queries, including the awesome article by Mark Wojciechowicz about Forecasting with SQL (http://www.sqlservercentral.com/articles/T-SQL/69334/), but the forecasts I've come across and managed to test don't seem to be accurate enough, and I don't understand the code well enough to tweak anything (I wouldn't even know where to start!).

    A word on the data I'm working with: it follows a definite trend for each month across the year (if you chart it, it's pretty clear). Generally speaking, the overall values increase year over year, but the trend for each month (increasing in certain months and decreasing in others) remains pretty consistent. I've run various forecasting queries I've come across against this historical data to see if they could produce accurate forecasts, but they often end up a few thousand off from the actual values.

    The test data:

    DECLARE @test_tbl TABLE (
        [YearMonth] [int] NULL,
        [TripCount] [int] NULL
    );

    INSERT INTO @test_tbl (YearMonth, TripCount)
    VALUES  (20160101, 72291),
            (20160201, 73976),
            (20160301, 81585),
            (20160401, 77522),
            (20160501, 77872),
            (20160601, 75993),
            (20160701, 71395),
            (20160801, 77687),
            (20160901, 73784),
            (20161001, 77384),
            (20161101, 73313),
            (20161201, 69423),
            (20170101, 72828),
            (20170201, 66676),
            (20170301, 80455),
            (20170401, 74544),
            (20170501, 80017),
            (20170601, 76022),
            (20170701, 72148),
            (20170801, 77558),
            (20170901, 70504),
            (20171001, 80011),
            (20171101, 75109),
            (20171201, 70697),
            (20180101, 78972),
            (20180201, 72280),
            (20180301, 80601),
            (20180401, 79107),
            (20180501, 84227),
            (20180601, 76772),
            (20180701, 77206),
            (20180801, 81608),
            (20180901, 74658);

    Ultimately what I'm trying to achieve is just to forecast the next month's number, not 3 or 6 or 12 months into the future. I'm wondering if there's a specific type of forecast algorithm I should be searching for to suit this type of data, or if there are some pointers on, say, Mark's script that I could use to tweak it for more accurate results.

    Anyone have any ideas out there? Any help would be much appreciated.
    Thanks

    Well, you don't have a lot of data, so I wouldn't expect anything remotely resembling accuracy. The predictive power of statistics relies on truly large numbers of observations, and this dataset is nowhere near that large, so any predictive power it has may be quite limited. Don't expect any algorithm to do very well with just 2 3/4 years of data; short-term trends could easily make any predictions particularly inaccurate.

    However, that said, one can still use statistics to at least build a model. Take the differences between a given month and the next, and average those for each month of the year. That gives you a MOM (month-over-month) average difference you can use, and here's some code to do just that:
    CREATE TABLE #test_tbl (
        YearMonth date NOT NULL PRIMARY KEY CLUSTERED,
        TripCount int NULL
    );
    INSERT INTO #test_tbl (YearMonth, TripCount)
    VALUES  ('2016-01-01', 72291),
            ('2016-02-01', 73976),
            ('2016-03-01', 81585),
            ('2016-04-01', 77522),
            ('2016-05-01', 77872),
            ('2016-06-01', 75993),
            ('2016-07-01', 71395),
            ('2016-08-01', 77687),
            ('2016-09-01', 73784),
            ('2016-10-01', 77384),
            ('2016-11-01', 73313),
            ('2016-12-01', 69423),
            ('2017-01-01', 72828),
            ('2017-02-01', 66676),
            ('2017-03-01', 80455),
            ('2017-04-01', 74544),
            ('2017-05-01', 80017),
            ('2017-06-01', 76022),
            ('2017-07-01', 72148),
            ('2017-08-01', 77558),
            ('2017-09-01', 70504),
            ('2017-10-01', 80011),
            ('2017-11-01', 75109),
            ('2017-12-01', 70697),
            ('2018-01-01', 78972),
            ('2018-02-01', 72280),
            ('2018-03-01', 80601),
            ('2018-04-01', 79107),
            ('2018-05-01', 84227),
            ('2018-06-01', 76772),
            ('2018-07-01', 77206),
            ('2018-08-01', 81608),
            ('2018-09-01', 74658);

    WITH MOM_DIFFS AS (
        SELECT
            TT.YearMonth,
            DATENAME(month, TT.YearMonth) AS Month_Name,
            MONTH(TT.YearMonth) AS MonthNumber,
            -- change from the immediately preceding month (NULL for the first row)
            TT.TripCount - LAG(TT.TripCount, 1, NULL) OVER (ORDER BY TT.YearMonth) AS MonthDiff
        FROM #test_tbl AS TT
    )
    SELECT
        MD.MonthNumber,
        MD.Month_Name,
        AVG(MD.MonthDiff) AS AVG_Diff_MOM   -- average change into this month, across years
    FROM MOM_DIFFS AS MD
    WHERE MD.MonthDiff IS NOT NULL
    GROUP BY
        MD.MonthNumber,
        MD.Month_Name
    ORDER BY
        MD.MonthNumber;

    DROP TABLE #test_tbl;

    Don't expect much accuracy. But it is at least a method with some logic behind it, even if the statistical validity might not be there.
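
    To apply those numbers (anticipating the obvious follow-up): under this model, next month's forecast is simply the last actual plus the average change into the month being forecast. A minimal sketch, to be run before the DROP TABLE above:

    -- Sketch: forecast the month after the last observed one as
    -- last actual + average historical change into that month.
    WITH MOM_DIFFS AS (
        SELECT
            MONTH(YearMonth) AS MonthNumber,
            TripCount - LAG(TripCount) OVER (ORDER BY YearMonth) AS MonthDiff
        FROM #test_tbl
    ),
    MOM_AVG AS (
        SELECT MonthNumber, AVG(MonthDiff) AS AVG_Diff_MOM
        FROM MOM_DIFFS
        WHERE MonthDiff IS NOT NULL
        GROUP BY MonthNumber
    ),
    LAST_ACTUAL AS (
        SELECT TOP (1) YearMonth, TripCount
        FROM #test_tbl
        ORDER BY YearMonth DESC
    )
    SELECT
        DATEADD(month, 1, LA.YearMonth) AS ForecastMonth,
        LA.TripCount + MA.AVG_Diff_MOM  AS ForecastTripCount
    FROM LAST_ACTUAL AS LA
    JOIN MOM_AVG AS MA
      ON MA.MonthNumber = MONTH(DATEADD(month, 1, LA.YearMonth));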

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Thank you, Steve.  How much of a dataset would be best to have for something like this? Also, would it be better if I looked at this on a day by day basis to predict counts per day (of week)?  Obviously I have a much larger dataset per day but I could go back even farther in years as well.

    What I've been doing to estimate the overall monthly numbers is just taking the day-of-week average over the last rolling 45 days (sketched below). It's not very accurate, since each month clearly alternates up and down, so the previous month isn't a great predictor of the next month's numbers (which is why I thought I should look into forecasting).

    Also, with these results, how do I apply them to predict the following month's numbers? Do I add the next month's AVG_Diff_MOM to the previous month's monthly count?
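
    For reference, that rolling day-of-week average looks something like this minimal sketch, assuming a hypothetical daily table #daily_trips (TripDate date, TripCount int):

    -- Sketch only: average TripCount per day of week over the last
    -- 45 days, from a hypothetical #daily_trips table.
    DECLARE @Cutoff date = DATEADD(day, -45, CAST(GETDATE() AS date));

    SELECT
        DATENAME(weekday, DT.TripDate) AS DayOfWeek,
        AVG(DT.TripCount)              AS Avg_TripCount
    FROM #daily_trips AS DT
    WHERE DT.TripDate >= @Cutoff
    GROUP BY DATENAME(weekday, DT.TripDate)
    ORDER BY MIN(DATEPART(weekday, DT.TripDate));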

  • tacy.highland - Thursday, October 11, 2018 8:44 AM


    If I were standing right in front of you, I'd be asking how many years of daily data you have. I'd also be asking what the major factors are that affect the numbers on a daily, weekly, monthly, quarterly, and annual basis. Statistics, in order to be useful in forecasting, have to demonstrate some kind of pattern, and the fact that you already perceive such a pattern is a good thing. Your key is to create a model for the kind of variation you usually see. I presumed there was some kind of monthly pattern, solely because you mentioned monthly, and thus an extremely simple model is to just average the differences between any given month and the next, and then use that as a guide based on the pattern repeating annually. The idea was simply to get you thinking about a model. If an obvious pattern emerges when you look at daily data, then you may have a potentially more accurate model based on the day of the week, or whatever period in which the pattern becomes visible.

    Calendar tables are often darn handy for use with such models, as long as the table can clearly indicate exactly what period a given observation belongs to, beyond merely the specific date. For example, if you were to use two weeks as a period, you can have 26 of those during a given year, and if you don't really care about years because that's too long a timeframe, then you can just start numbering the two-week periods; every day in a given two-week period gets the same number in the calendar table. You'll want to choose a "0 date" as a basis for your model.

    Once the model is conceptually accurate, then you build your data structure. Get your co-workers to understand the basis for the model, ask them to think about it for a couple of days, then get together and try to poke holes in it. Get everyone's feedback, even if it's stuff you also poke holes in, and get it written down and summarized into categories. Then analyze that in a document that either justifies the model as is, or as modified based on the analysis. It could be extremely valuable to have that document hanging around when, a few years down the road, someone new comes aboard and needs to understand the model and no one really has time to "s'plain, Lucy"....
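
    As a rough illustration of that period numbering, here's a minimal sketch, assuming a hypothetical #calendar table (CalendarDate date) and an arbitrarily chosen "0 date":

    -- Sketch: integer division by 14 gives every day in the same
    -- two-week period the same PeriodNumber, counted from @ZeroDate.
    -- #calendar and the choice of @ZeroDate are assumptions, not a standard.
    DECLARE @ZeroDate date = '2016-01-04';  -- e.g., a Monday chosen as the basis

    SELECT
        C.CalendarDate,
        DATEDIFF(day, @ZeroDate, C.CalendarDate) / 14 AS PeriodNumber
    FROM #calendar AS C
    WHERE C.CalendarDate >= @ZeroDate
    ORDER BY C.CalendarDate;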

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • tacy.highland - Thursday, October 11, 2018 8:44 AM


    Almost forgot to answer your question about quantity. You need a model first, because how much data the model requires depends on how often your perceived pattern repeats itself. The question then is how much daily data you have, whether there are any known gaps in it, whether it's 100% accurate or something less, etc.
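
    Checking for gaps is easy enough with LAG(); a minimal sketch, against the same hypothetical #daily_trips table:

    -- Sketch: list any gaps of more than one day in the daily data.
    WITH ORDERED AS (
        SELECT
            TripDate,
            LAG(TripDate) OVER (ORDER BY TripDate) AS PrevDate
        FROM #daily_trips
    )
    SELECT
        PrevDate,
        TripDate,
        DATEDIFF(day, PrevDate, TripDate) - 1 AS MissingDays
    FROM ORDERED
    WHERE DATEDIFF(day, PrevDate, TripDate) > 1
    ORDER BY TripDate;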

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Wow, thanks Steve.  Very detailed explanation there. 

    I would say the data has a pattern week over week, with each day of the week typically having a trend compared to the other days. There also happen to be patterns across the year (by month), as this data is affected by seasons and other factors. This of course excludes holidays and bad-weather days (which I've always wondered about: how does that work with a model, when those days can't be included or else they'll skew the numbers, yet models need uninterrupted data?). If I have counts by day of week and week number for several years, I'm guessing that would suffice here.

    However, I have no idea how to build a model based on the data I have, (I don't know R or ML), so while I have plenty of data to work with I'm at a loss as to where to start.  You mentioned that you build "a model based on the kind of variation you usually see".  How exactly does one do that?  Do you know of any resources which speak to this specifically?  (Math has never been my strong suit so I've been somewhat reluctant to dive into true statistics.)

    Again, thanks for your insight and help on this.

  • tacy.highland - Friday, October 12, 2018 10:02 AM


    Typically, you can create a Calendar table that identifies holidays, and you can always use ROW_NUMBER() to order all the remaining rows in that table with a business-day number within a given year (sketched at the end of this post). However, as you are typically going to have the same holidays year over year, even a holiday-affected week isn't "outside of normal", because it's going to happen every year. The problem with excluding holidays isn't that it's necessarily going to skew the stats; it's that not including them is where the skew comes from. Holidays happen every year, and exclusion would distort any true patterns in your data, so I wouldn't be a fan of that idea. Don't worry about holidays: they're a normal, every-year occurrence, and messing with the calendar for that purpose is going to be much harder than not doing so.

    That said, resources to help you build a model are not something you easily find with a simple web search. You need to agree on some basics for how you are going to measure your data, and the first thing to do is to write down what has actually been observed in the existing data, including any queries used to surface the claimed pattern and any supporting documents, such as Excel spreadsheets. You very much want other folks to have input on the overall process. Your objective is to first determine exactly what needs to be measured, and why; there needs to be a tangible business goal that can be achieved by making the measurements, and you need some idea of what the benefit of measuring the data is likely to be. Having that information, you then decide on the units of measure, including the time periods over which the data will be measured.

    The difficulty in finding much help online is two-fold. One is the inherent economic value of this kind of analysis, which leads IT pros to want the SSAS, cube, or data-analytics skill sets, which can pay particularly well. Two is that no two sets of data are going to get the same treatment. Often, two different companies in the same industry sector will take entirely different views of what amounts to the same kind of data, and neither will necessarily be right... it will just be their perspective.

    Perhaps most important of all, don't be dead set on finding a pattern in your data. There's no guarantee that a pattern exists simply because you have a lot of data. Sometimes there's simply no story to be told, and you really MUST be prepared to accept that. It doesn't necessarily mean the analysis was a complete waste of time; you needed to do the analysis in order to know whether there was a pattern or not. Finally, be sure the pattern in your data is actually "statistically significant" (if you're not familiar with that term, look it up on Wikipedia). I can't tell you how often I've seen a BI team spend months tracking down some 10,000 rows of "outliers" and forget to check how many rows would be needed for statistical significance; they had billions of rows, and needed considerably more than 10,000 outliers to have any kind of useful pattern. Such teams often never talk to the folks closest to the equipment taking the measurements to see whether any manual logs show operational outages that might account for the deviations. You have to be extremely thorough with this and cover every angle.

    Helping someone build such a model is not a forum-help kind of task; there's just way too much involved. Don't be afraid to shop the market for consultants who can assist.
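
    Going back to the calendar-table point above, a minimal sketch of that business-day numbering, assuming a hypothetical #calendar table with an IsHoliday flag:

    -- Sketch: number business days within each year, skipping weekends
    -- and holidays. #calendar(CalendarDate, IsHoliday) is assumed.
    SELECT
        C.CalendarDate,
        ROW_NUMBER() OVER (
            PARTITION BY YEAR(C.CalendarDate)
            ORDER BY C.CalendarDate
        ) AS BusinessDayOfYear
    FROM #calendar AS C
    WHERE C.IsHoliday = 0
      AND DATENAME(weekday, C.CalendarDate) NOT IN ('Saturday', 'Sunday')
    ORDER BY C.CalendarDate;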

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Thank you for all the info, Steve.  I'll be checking out how we'll be approaching this moving forward.

    Much appreciated.

  • tacy.highland - Monday, October 15, 2018 1:55 PM


    Glad I could help out.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)
