AGR Forecast Engine

Introduction

This manual describes the statistical techniques and strategies implemented in the Forecasting Module. It is not necessary to fully understand, or even read, this manual in order to produce accurate forecasts with the product; it is organized so that users with little statistical knowledge and statistical super users alike can find what they are looking for. In an ever-changing business environment it is essential to keep the forecasting process under ongoing review. Automatic forecasting enables your company to reduce the time spent selecting forecasting methods, as the Forecasting Module dynamically selects the most appropriate forecasting model for the sales pattern of each product. When the sales pattern changes, the forecast changes with it.

The forecasting methods used in the Forecasting module can be grouped into the following five categories:

  1. Simple moving average

  2. Exponential smoothing

  3. Croston’s intermittent demand

  4. Discrete distributions

  5. Box-Jenkins (ARIMA)

The automatic selection is determined by which method gives the least forecasting error. Keep in mind that every available forecasting model will produce errors, since while the world keeps changing the future remains unpredictable. To estimate the lowest forecasting error, and thus the best-fitting forecasting model, the Forecasting Module performs a repeated search through the historic data to minimize the sum of squared errors.

Expert Selection

Expert selection allows the Forecasting Module to select an appropriate univariate forecasting technique automatically.

Expert selection operates as follows. If the data set is very short, Forecasting Module defaults to simple moving average. Otherwise Forecasting Module examines the data for the applicability of the intermittent or discrete forecast models. Although the forecasts produced from such models are just straight horizontal lines, they often provide forecasts superior to those from exponential smoothing for low-volume, messy data.

If neither of these models is applicable to the data, the choice is narrowed down to the different forms of exponential smoothing and Box-Jenkins models.

Forecasting Module next runs a series of tests on the data and applies a rule-based logic that may lead to a model selection based on data characteristics. If the rule-based logic does not lead to a definitive answer, Forecasting Module performs an out-of-sample test to choose between an exponential smoothing model and a Box-Jenkins model.

Simple Methods

Forecasting Module supports three variants of the n-term simple moving average, which we symbolize as SMA(n). The essence of SMA(n) is to estimate the current level S of the series as the average of the last n observations. The level of the series is defined as the value that the observation would take if it were not obscured by noise.

The forecast for time t+m from the forecast base t is simply a horizontal line at the level S_t.

Confidence limits for SMA(n) are determined by assuming that the true underlying process is a random walk with observation error. SMA(n) has one purpose—to decrease the effect of noise on the estimated true value of the series. It cannot pick up the effects of seasonality or trending. Thus its capabilities are very similar to those of simple exponential smoothing, except that the model has no parameters that need to be fitted to the data.

SMA(n) should be used only when the historical data record is so short and so noisy that it is meaningless to try to extract patterns from the data or even to estimate a smoothing weight. In any other circumstance, one of the exponential smoothing models will outperform SMA(n).

Forecasting Module offers three versions of SMA(n)—Automatic, Moving average and Random walk. Automatic determines the number of terms n in the moving average by determining the n that minimizes error over the historic sample. Moving average lets the user set n. Random walk sets n to 1, so that the forecast consists of the last observed value.
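As an illustration of the Automatic option, the following sketch (not the product's internal code) picks n by minimizing the sum of squared one-step errors over the historic sample and then issues the flat SMA(n) forecast. The sample series and the maximum window of 12 terms are arbitrary assumptions.

```python
import numpy as np

def sma_forecast(history, n):
    """Forecast = average of the last n observations (a horizontal line)."""
    return float(np.mean(history[-n:]))

def fit_sma_automatic(history, max_n=12):
    """Pick the n that minimizes the in-sample sum of squared one-step errors."""
    history = np.asarray(history, dtype=float)
    best_n, best_sse = 1, np.inf
    for n in range(1, min(max_n, len(history) - 1) + 1):
        # one-step-ahead errors: forecast each point from the n points before it
        errors = [history[t] - np.mean(history[t - n:t])
                  for t in range(n, len(history))]
        sse = float(np.sum(np.square(errors)))
        if sse < best_sse:
            best_n, best_sse = n, sse
    return best_n

history = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]   # illustrative data
n = fit_sma_automatic(history)          # Automatic: data-driven n
print(n, sma_forecast(history, n))      # Random walk would correspond to n = 1
```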

Exponential Smoothing

Exponential smoothing is the most widely applicable of the univariate time series methods to business data. In the absence of information to the contrary, it is probably the best choice for the typical user.

Although exponential smoothing was first developed over thirty years ago, it is still very much a hot topic in research circles. If anything, its reputation as a robust, easy to understand methodology has increased in recent years, often at the expense of Box-Jenkins. The main reason for this is that Box-Jenkins models are built upon the abstract statistical concept of autocorrelation, while exponential smoothing models are built upon clear-cut features like level, trend, and seasonality.

Exponential smoothing models are therefore less likely to be influenced by purely statistical quirks in the data. Harvey [1984, 1990] has extended the exponential smoothing approach in his development of so-called structural models. Structural model forecasts are generated from a Kalman filter built upon a formal statistical model involving the same features as exponential smoothing: level, trend and seasonality. We now recognize exponential smoothing for what it really is: approximate Kalman filters fitted directly to the data.

This establishes a framework for extending the basic exponential smoothing methodology.

You will see two such extensions in the methodological descriptions below.

  1. Proportional error models extend exponential smoothing to the case where errors tend to be proportional to the level of the data. The majority of business data seem to exhibit this trait.

  2. Event adjustment models extend exponential smoothing to include the estimation of, and adjustment for, promotional or other non-periodic events.

Conceptual Overview

Exponential smoothing is based upon a structural model of time series data. We assume that the time series process manifests some or all of the following structural components.

  • Level: The level of a time series is a smooth, slowly changing, non-seasonal process underlying the observations. We cannot measure the level directly because it is obscured by seasonality, promotional events and irregularity (noise). It must be estimated from the data.

  • Local Trend: The local trend is the smooth, slowly changing rate of change of the level. We call it local to emphasize the fact that at each point in time it undergoes a small but unpredictable change. Forecasts are based on the local trend at the end of the historic data, not the overall global trend. We cannot measure the trend directly. It must be estimated from the data.

  • Seasonal Effects: Additive or multiplicative seasonal indexes represent periodic patterns in the time series, like the annual patterns in retail sales. Like the level and the trend, seasonal indexes must be estimated from the data. They are assumed to undergo small changes at each point in time.

  • Event Effects: Promotional events influence sales in much the same way as seasonality but they are not usually periodic. Additive or multiplicative event indexes are estimated from the data in much the same way as seasonal indexes. They are assumed to undergo small changes at each point in time.

  • Random Events: The level, local trend, seasonal and event indexes are all stochastic; that is, their values change unpredictably from point to point in time. These changes are caused by unpredictable events, like the amount by which a company’s actual profit or loss differs from what was expected. These are often called random shocks.

  • Noise: All of the features described so far are components of an ongoing historical process. Our measurements of the process, however, are usually corrupted by noise or measurement error. For instance, chewing gum shipments or chewing gum orders are noisy measurements of chewing gum consumption.

Three of these features—level, random events and noise—are present in every exponential smoothing model. The remaining three—local trend, seasonal indexes and event effects—may be present or absent.

We identify a model by determining which of these features should be included to describe the data properly.

Originally, exponential smoothing models were built informally on these features, with little attention paid to the underlying statistical model. Exponential smoothing equations were merely plausible means of estimating time series features and extrapolating them. There was no way to estimate confidence limits properly, since they depend upon the underlying statistical model.

Some software developers responded to the need for confidence limits with little or no theoretical justification. While the point estimates from such software have been good, the confidence limits have been nearly unusable.

Forecasting Module takes a more modern approach to exponential smoothing. Each variant of exponential smoothing is based upon a formal statistical model which also serves as a basis for computation of confidence limits. The actual smoothing equations are based upon the Kalman filter for the formal statistical model. Of course, all of this is under the hood, and you need not know the details.

Models of the Exponential Smoothing Family

Here we will provide an overview - without equations - of the models that make up the exponential smoothing family.

Every exponential smoothing model involves at least the following three components.

  1. Level

  2. Random events

  3. Noise

Simple exponential smoothing involves only these components. The data are assumed to consist of the level, slowly and erratically changing as random events impact it, and corrupted by noise. Simple exponential smoothing cannot capture the effects of seasonality or trending.

The remaining components (trend, seasonal indexes, and event indexes) are optional. They model features that may or may not be present in the data. The trend can enter in four ways: untrended, linear, damped or exponential.

The forecasts from an untrended model are flat, except perhaps for the effects of seasonal or event indexes. The forecasts from a linear trend model extrapolate the last estimate of the trend without limit. The forecasts eventually become positively or negatively infinite.

The forecasts from a damped trend begin almost linearly but die off exponentially until they reach a constant level. This may be appropriate for data influenced by business cycles. Damped trend models produce forecasts that remain finite.

The forecasts from an exponential trend begin almost linearly but increase as a percentage of themselves. This explosive growth model should only be used when the data are truly growing exponentially.

The Holt model includes a linear trend but does not accommodate seasonal or event effects. The level of the data changes systematically because of the trend. It is also impacted by random events. The trend varies randomly from point to point as it too is impacted by random events. Observations are obscured by noise.

Seasonal indexes can enter in three ways: none, additive or multiplicative. If the indexes are multiplicative, the seasonal adjustment is made by multiplying the index into the deseasonalized series. Thus, the effect is proportional to the level of the time series. December sales are adjusted upwards by 20% if the seasonal index is 1.2. This is the most common form of seasonality, but it applies only to positive, ratio-scale data.

If the indexes are additive, the seasonal adjustment is made by adding the index onto the deseasonalized series. Thus, the effect is independent of the level of the time series. December sales are adjusted upwards by 1000 if the seasonal index is 1000.

The multiplicative (additive) Winters exponential smoothing model extracts the level, trend, and multiplicative (additive) seasonal indexes. The underlying non-seasonal model is the same as Holt.

Event indexes can also enter in three different ways: none, additive or multiplicative. The adjustments are analogous to those for seasonal indexes. The difference is that the adjustment is made each time a certain event occurs rather than tying the adjustment to the calendar. Event index models extend the Holt-Winters family of exponential smoothing models, which includes only the four trend options and three seasonality options, or twelve models in all. The following figure portrays the forecast profiles of these twelve models.

Forecast profiles of Exponential Smoothing Models (Gardner [1985])

These forecast profiles are created by extrapolating the level, trend and seasonality index estimates from the end of the historic data. They depict the underlying patterns of the data as these patterns exist at the end of the data. They do not and cannot include the effects of future random events or noise, so they are much smoother than the actual future will turn out to be.

Exponential smoothing works as its name suggests. It extracts the level, trend and seasonal indexes by constructing smoothed estimates of these features, weighting recent data more heavily. It adapts to changing structure but minimizes the effects of outliers and noise.

The degree of smoothing depends upon parameters that must be fitted to the data. The level, trend, seasonal index and event index estimations require one parameter each. If the trend is damped (or exponential), the damping (or growth) constant must also be estimated. The total number of parameters that must be fitted to the data depends on the components of the model.

Implementation of Exponential Smoothing in Forecasting Module

This section presents some details about the Forecasting Module implementation of exponential smoothing.

Model selection

To select a smoothing model automatically, Forecasting Module tries all of the candidate models and chooses the one that minimizes the Bayesian information criterion (BIC). The BIC is a goodness-of-fit criterion that penalizes more complex models, i.e., those that require fitting more parameters to the data. Research has shown that this leads to the model that is likely to forecast most accurately (Koehler and Murphree [1986]). To determine the candidate models, Forecasting Module applies the following rules (a brief sketch of the BIC comparison follows the list):

  1. Automatic model selection does not consider exponential trend models due to their ability to grow explosively in the forecast period.

  2. If there are fewer than 5 data points, Forecasting Module does not attempt to fit a Holt-Winters model to the data. A simple moving average model, which does not require parameter estimation, is substituted.

  3. If there is less than two years' worth of data, Forecasting Module does not consider seasonal models.

  4. If the data contain negatives or zeroes, multiplicative index models are not considered.

  5. For items that are strictly seasonal, i.e. only sold in a season, as shown in the image below, a form of exponential smoothing referred to as an NA-Constant Level model or a “salt” model is used. The model omits the trend term and the level smoothing weight is constrained to a small value. This leads to a model that enforces a constant level, and the seasonal component models the departures from that constant level. The model works particularly well for data that exhibit a “selling season”. Two conditions need to apply for this model to be used. Firstly, the item must be categorized as seasonal by the forecast engine (label shown in the Item Card) and secondly the zero ratio of the item must be larger than 60%, e.g. if 24 months of history are used to calculate a monthly forecast, at least 15 months must have no sale at all, which gives a zero ratio of 62.5%.
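The sketch below illustrates the selection rule itself: compute a BIC for each already-fitted candidate and keep the minimizer. The BIC formula shown is the common Gaussian log-likelihood form, not necessarily the Module's own scaled variant (described under Model Statistics), and the candidate error sums are made-up numbers.

```python
import numpy as np

def bic(sse, n_obs, n_params):
    """Gaussian-error BIC: rewards fit (via SSE) and penalizes extra parameters."""
    return n_obs * np.log(sse / n_obs) + n_params * np.log(n_obs)

# candidates: name -> (sum of squared fitted errors, number of fitted parameters)
candidates = {
    "simple smoothing": (2400.0, 1),
    "Holt": (2250.0, 2),
    "additive Winters": (2100.0, 3),
}
n_obs = 36
scores = {name: bic(sse, n_obs, k) for name, (sse, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```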

Parameter Optimization

To estimate model parameters, the program uses an iterative search (simplex method) to minimize the sum of squared errors over the historic data. The search begins at default values set by the program. Theoretically, the search could yield a local, rather than the global, minimum. In practice, the authors know of almost no instances where this has occurred or where the algorithm has failed to converge.
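The following sketch shows the same idea with SciPy's Nelder-Mead (simplex) routine, here applied to the single smoothing weight of simple exponential smoothing. The series, the starting value of 0.3, and the clipping of the weight to (0, 1) are illustrative choices, not the Module's defaults.

```python
import numpy as np
from scipy.optimize import minimize

def sse_of_alpha(alpha, y):
    """In-sample sum of squared one-step errors for a given smoothing weight."""
    alpha = float(np.clip(alpha, 0.001, 0.999))   # keep the search in range
    level, sse = y[0], 0.0
    for obs in y[1:]:
        error = obs - level              # one-step-ahead fitted error
        sse += error ** 2
        level = alpha * obs + (1 - alpha) * level
    return sse

y = np.array([10., 12., 11., 13., 15., 14., 16., 18., 17., 19.])
result = minimize(lambda a: sse_of_alpha(a[0], y), x0=[0.3], method="Nelder-Mead")
print(result.x[0], result.fun)           # fitted weight and its minimized SSE
```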

Confidence Limits

Forecasting Module outputs lower and upper confidence limits for exponential smoothing forecasts.

The confidence limits for non-seasonal and additive seasonal models are computed by making the assumption that the underlying probability model is the specific Box-Jenkins model for which the exponential smoothing model is known to be optimal (see Yar and Chatfield [1990]). The confidence limits for multiplicative seasonal models are computed as described by Chatfield and Yar [1991]. The error standard deviation is assumed to be proportional either (1) to the corresponding seasonal index or (2) to the corresponding seasonal index and the current estimate of the level.

For the non-seasonal models, the error standard deviation is assumed either (1) constant or (2) proportional to the current estimate of the level. For the additive seasonal models, it is assumed either (1) constant or (2) proportional to the current estimate of the seasonalized level. In each case, Forecasting Module decides which option to use by determining which fits the historical data more closely. These confidence limits are useful guides to expected model performance, but they are not perfect, since the actual underlying probability model of the data is not known. Their usefulness for multiple-step forecasts deteriorates when the historical errors appear to be correlated. Notice that the Chatfield-Yar confidence limits differ somewhat from those based on the underlying Box-Jenkins models.

Statistical Description of Exponential Smoothing

Each of the smoothing techniques uses recursive equations to obtain smoothed values for model components. Simple smoothing uses one equation (level), Holt uses two (level and trend), Winters uses three (level, trend and seasonal). Event index models require an additional equation. Each equation is controlled by a smoothing parameter. When this parameter is large (close to one), the equation weights the most recent observations heavily, i.e., the smoothing process is highly adaptive. When the parameter is small (close to zero), the weights decline only slowly as they reach back into the past, i.e., the smoothing process is not highly adaptive.

The following table defines the notation that will be used in the detailed discussion of exponential smoothing. It is adapted from that of Gardner [1985].

m: Forecast lead time
p: Number of periods per year
Y_t: Observed value at time t
S_t: Smoothed level at end of time t
T_t: Smoothed trend at end of time t
I_t: Smoothed seasonal index at end of time t
J_t: Smoothed event index at end of time t
α: Smoothing parameter for level of series
γ: Smoothing parameter for trend
δ: Smoothing parameter for seasonal indexes
λ: Smoothing parameter for event indexes
φ: Damped/exponential trend constant

The Forecast module output calls α the level parameter, γ the trend parameter, δ the seasonal parameter, λ the event parameter and φ the decay/growth constant.

General Additive Index Model

There are twelve exponential smoothing models, so it would not be practical or interesting to discuss each individually. We will instead discuss the most fully featured model and how it relates to simpler models.

The most complex additive index model involves the level S_t, the trend T_t, the seasonal index I_t and the event index J_t. The trend is assumed to decay at the rate φ. The observations Y_t are assumed to be composed of these components as follows.

Y_t = S_t + I_t + J_t + e_t

The components S_t, I_t and J_t in this equation are the true values for the level, seasonal and event indexes at time t. However, they cannot be observed directly but, rather, must be estimated from the data. This is done by using the following recursive equations, which comprise an approximate Kalman filter for the underlying model. The italicized symbols now refer to estimates of the true values.

The symbol Ĩ_t refers to the most up-to-date prior estimate of the seasonal index for the month (quarter, week) that occurs at time t. If t refers to December 1993, then this estimate will have been last updated in December 1992. The symbol J̃_t refers to the most up-to-date prior estimate for an event of the type that occurs at time t.

These equations update the prior estimates S_{t-1}, T_{t-1}, Ĩ_t and J̃_t to reflect the last observation. The posterior estimates are the quantities on the left-hand side of the equations: S_t, T_t, I_t and J_t. All the simpler additive models are, in a sense, contained in these equations. If there is no event at time t, or if event indexes are not wanted, then J_t is set to 0 and the last equation is discarded.

These equations involve a decaying trend. In this case the decay constant φ is usually a little less than one. To convert the model to a linear trend model, just set φ to 1.0. This is equivalent to erasing it from the equations. To convert the model to an exponential trend model, just set φ to a value greater than 1.0.

If seasonal indexes are not wanted, discard the third equation and set I_t to 0 elsewhere. If a trend is not wanted, discard the second equation and set T_t to 0 elsewhere. These equations clearly show how exponential smoothing actually works.

Let us look carefully at the first. The quantity Y_t − Ĩ_t − J̃_t represents the current observation, adjusted for seasonal and event effects by subtracting off their last available prior estimates. The adjustment yields an estimate of the current level. The quantity S_{t-1} + φT_{t-1} represents the forecast of the current level S_t based on information available previous to the last observation. The first term, based on the current observation, is weighted by α and the second, based on previous information, is weighted by (1 − α).

Each smoothed estimate of the level is computed as a weighted average of the current observation and past data. The weights decrease in an exponential pattern. The rate of decrease depends on the size of the smoothing weight α , which thus controls relative sensitivities to newer and older historic data. The larger the value of the smoothing parameter, the more emphasis on recent observations and the less on distant.

The parameters α, γ, φ, δ and λ are fitted to the data by finding the values that minimize the sum of squared forecast errors for the historic data. To compute the sum of squared errors for trial values of the parameters, the following steps are performed.

This procedure is iterated with new trial values of the parameters until the values that minimize the sum of squared errors are found. The trial parameter values are determined by the simplex procedure, an especially stable algorithm for nonlinear minimization. Once the parameters have been estimated by fitting to the data, the model is used to compute the forecasts. The equation for the forecast of Y_{T+m} from the forecast base Y_T (the last historic data point) is as follows.

General Multiplicative Index Model

The general multiplicative model looks almost the same as the additive, except that multiplication and division replace addition and subtraction. The multiplicative equations are as follows.

Simpler models are obtained from these equations in much the same way that they are for the additive case. If there is no event at time t, or if event indexes are not wanted, then J_t is set to 1.0 and the last equation is discarded. These equations involve a decaying trend. In this case the decay constant φ is usually a little less than one. To convert the model to a linear trend model, set φ equal to 1.0 or simply remove all references to φ. If seasonal indexes are not wanted, discard the third equation and set I_t to 1.0 elsewhere. If a trend is not wanted, discard the second equation and set T_t to 0 elsewhere.

Now that the full additive and multiplicative smoothing equations have been presented, we will examine some of the simpler models that they contain as special cases.

Simple Exponential Smoothing

The simple exponential smoothing model is used for data that are untrended, non-seasonal and not driven by promotional events. We can get its equation from either the general additive or general multiplicative model by discarding the last three equations and eliminating the seasonal and event indexes from the first. We are left with the following.

S_t = α Y_t + (1 − α) S_{t-1}    (1)

Notice that when α = 1.0 the equation becomes S_t = Y_t, i.e., there is no “memory” whatsoever of previous values. The forecasts from this model would simply be the last historic point. On the other hand, if the parameter is very small, then a large number of data points receive nearly equal weights, i.e., the memory is long.

The other exponential smoothing models use additional smoothing parameters in equations for smoothed values of trend and seasonality, as well as level. These have the same interpretation. The larger the parameter, the more adaptive the model to that particular time series component. Equation (1) shows how the smoothed level of the series is updated when a new observation becomes available. The m step forecast using observations up to and including the time t is given by

Ŷ_t(m) = S_t    (2)

i.e., the current smoothed level is extended as the forecast into the indefinite future. Clearly, simple exponential smoothing is not appropriate for data that exhibit extended trends.
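A bare-bones rendering of equations (1) and (2), with the level initialized to the first observation (real implementations treat initialization and parameter fitting more carefully):

```python
def simple_exponential_smoothing(y, alpha):
    level = y[0]                                      # simple initialization
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level     # equation (1)
    return level

def ses_forecast(level, m):
    return [level] * m                                # equation (2): horizontal line

y = [20, 22, 21, 25, 24, 27, 26]
level = simple_exponential_smoothing(y, alpha=0.4)
print(ses_forecast(level, m=3))
```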

Holt Exponential Smoothing

Holt’s [1957] exponential smoothing model uses a smoothed estimate of the trend as well as the level to produce forecasts. The forecasting equation is

Ŷ_t(m) = S_t + m T_t    (3)

The current smoothed level is added to the linearly extended current smoothed trend as the forecast into the indefinite future.

The smoothing equations are

S_t = α Y_t + (1 − α)(S_{t-1} + T_{t-1})    (4)

T_t = γ(S_t − S_{t-1}) + (1 − γ) T_{t-1}    (5)

where the symbols were defined previously. Equation (4) shows how the updated value of the smoothed level is computed as the weighted average of new data (first term) and the best estimate of the new level based on old data (second term). In much the same way, equation (5) combines old and new estimates of the one-period change of the smoothed level, thus defining the current linear (local) trend.
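Equations (3) to (5) translate almost directly into code. In the sketch below the initial level and trend are taken from the first two observations, which is a simplification rather than the Module's initialization:

```python
def holt(y, alpha, gamma):
    level, trend = y[0], y[1] - y[0]      # simplistic initialization
    for obs in y[2:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)          # equation (4)
        trend = gamma * (level - prev_level) + (1 - gamma) * trend   # equation (5)
    return level, trend

def holt_forecast(level, trend, m):
    return [level + k * trend for k in range(1, m + 1)]              # equation (3)

y = [10, 12, 13, 15, 18, 20, 23]
level, trend = holt(y, alpha=0.5, gamma=0.3)
print(holt_forecast(level, trend, m=4))
```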

Multiplicative Winters

In multiplicative Winters, it is assumed that each observation is the product of a deseasonalized value and a seasonal index for that particular month or quarter. The deseasonalized values are assumed to be described by the Holt model. The Winters model involves three smoothing parameters to be used in the level, trend and seasonal index smoothing equations. The forecasting equation for the multiplicative Winters model is

Ŷ_t(m) = (S_t + m T_t) I_{t-p+m}    (6)

i.e., the forecast is computed as in the Holt model, then multiplied by the seasonal index of the corresponding period. The smoothing equations are obtained from the general multiplicative equations by setting φ to 1 and discarding the parts that involve event indexes.

The level smoothing equation (7) is similar to equation (4) for the Holt model, except that the latest measurement is deseasonalized by dividing by the seasonal index calculated one year before. The trend smoothing equations of the two models are identical. The seasonal index is estimated as the ratio of the current observation to the current smoothed level, averaged with the previous value for that particular period.
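For experimentation outside the Forecasting Module, the open-source statsmodels library offers a comparable multiplicative Winters model. The sketch below assumes quarterly data and uses statsmodels' own parameter fitting, so results will not match the Module exactly:

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# three years of quarterly data with a rising level and a repeating seasonal shape
y = np.array([30, 45, 60, 38, 33, 50, 66, 41, 36, 55, 72, 45], dtype=float)

# additive (Holt) trend with multiplicative seasonal indexes, as described above
model = ExponentialSmoothing(y, trend="add", seasonal="mul", seasonal_periods=4)
fit = model.fit()                     # smoothing parameters chosen by minimizing squared error
print(fit.forecast(4))                # one year (four quarters) ahead
```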

Additive Winters

In additive Winters, it is assumed that each observation is the sum of a deseasonalized value and a seasonal index. The deseasonalized values are assumed to be described by the Holt model. The equations for additive Winters are nearly identical to those of multiplicative Winters, except that deseasonalization requires subtraction instead of division. The forecasting equation for the additive Winters model is

Ŷ_t(m) = S_t + m T_t + I_{t-p+m}

The smoothing equations are obtained from the general additive equations by setting φ to 1 and discarding the event indexes.

Croston’s Intermittent Demand Model

Much product data, especially for lower-volume items, are intermittent in nature. For many, or even most, periods there is no demand at all. For periods where there is some demand, the demand size is randomly distributed, independently or nearly independently of the demand interval.

This might be the case for spare parts that are usually ordered in batches to replenish downstream inventories. The Poisson and negative binomial distributions do not usually fit such data well because they link the zeroes and non-zeroes as part of the same distribution.

Croston [1972] proposed that such data be modeled as a two-step process. He assumed that the demand intervals were independently and identically distributed (iid) geometric. This is equivalent to assuming that the probability that a non-zero demand occurs in any given period is iid Bernoulli, as though by the flip of an unfair coin. He further assumed that the distribution of successive demand sizes was iid normal. The alternative model for data this messy is usually simple exponential smoothing, which yields horizontal forecasts at a level fitted adaptively to the data.

Willemain et al. [1994] examined the performance of a variant of the Croston model relative to that of exponential smoothing, and found it markedly superior in forecast accuracy, both for simulated and real data, even in the presence of autocorrelation and cross-correlation between demand size and demand interval.

The variation that Willemain et al. introduced was the substitution of the lognormal distribution for the normal distribution of successive order sizes. This is sensible for most data because the probability of a non-positive demand size is zero. However, it cannot be applied to demand data that sometimes go negative, as they do when a company registers returns as negative demand.

Implementation

Two basic models are implemented in Forecasting Module: the Croston model as originally formulated and the Willemain variant. The Willemain variant is always selected unless there are occasional negatives in the historic data. If numerous data points are negative, the Croston model is tagged as inappropriate, as it also is when the vast majority of the historic periods have a non-zero demand.


The quantities that must be estimated from the data include the following.

  1. Probability of a demand in a given period, adaptively estimated to reflect conditions at the end of the historic data via simple exponential smoothing of the demand interval. The smoothing parameter is optimized, as is that for the mean order size, by minimizing the sum of squared fitting errors.

  2. Mean order size, adaptively estimated in the same way.

  3. Standard deviation of the demand size, estimated globally over the historic data set. The forecasts are computed as the product of demand probability and demand size.

All three of the estimated quantities are used to compute the overall distribution function, from which the confidence intervals are computed.
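The sketch below captures the core Croston logic: smooth the non-zero demand sizes and the intervals between them separately, then forecast their ratio (demand probability times mean size). The smoothing weight of 0.1 and the initialization are placeholders, and the Willemain lognormal refinement and the confidence limits are omitted:

```python
def croston(y, alpha=0.1):
    """Croston-style point forecast for an intermittent demand series."""
    demands = [(i, v) for i, v in enumerate(y) if v != 0]
    if not demands:
        return 0.0
    size = demands[0][1]                 # smoothed demand size
    interval = demands[0][0] + 1         # smoothed interval between demands
    last_i = demands[0][0]
    for i, v in demands[1:]:
        size = alpha * v + (1 - alpha) * size
        interval = alpha * (i - last_i) + (1 - alpha) * interval
        last_i = i
    # forecast = (probability of demand) * (mean demand size) = size / interval
    return size / interval

y = [0, 0, 4, 0, 0, 0, 6, 0, 3, 0, 0, 5]
print(croston(y))
```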

Discrete distributions

Most statistical forecasting models are based on interval data, i.e., data for which zero has no special meaning. Forecasts and data can be negative as well as positive, and the interval from zero to one is statistically equivalent to the interval from 100 to 101. Very little business data are interval in nature but, for the most part, interval data forecast models still perform well.

But there are exceptions. For instance the data might consist entirely of zeroes and small integers. Infrequently used spare parts often fall into this class. The forecasts from simple exponential smoothing for such items may be perfectly reasonable and useful, but the confidence limits are usually unusable. This is due to the confidence limits from a standard model being symmetric. They do not take into account that sales of these types of items cannot go negative but might become very large. The discrete distributions forecast model produces the same point forecasts but produces much more accurate confidence limits.

Forecasting Module tries two different discrete distributions to fit the data: the Poisson distribution and the negative binomial distribution. Forecasting Module selects the distribution that fits the data better and uses that distribution to compute the forecasts.

Poisson Distribution

The Poisson distribution ranges over the integers {0,1,2,...}. It applies to such processes as the number of customers per minute who arrive in a queue, the number of auto accidents per month on a given road, or sales of a particular spare part per month. The probability that exactly x events occur is given by the following formula.

f(x) = e^(−λ) λ^x / x!

The Poisson distribution has a single parameter λ that equals both the mean number of events per unit of time, and the variance around the mean. This parameter λ is a positive real number.

Forecasting Module chooses the Poisson distribution when the ratio of the sample mean to the sample variance is near unity. It is likely that the mean number of events per unit of time is actually changing over time. Therefore we must estimate λ as a time series in its own right. It has been shown by Harvey [1989] that this is optimally done via simple exponential smoothing. The current estimate of the level is also an estimate of the current value of λ.

Therefore Forecasting Module performs the following steps.

  1. Use simple exponential smoothing to estimate and forecast λ. The forecasts are equal to the value of λ at the end of the series.

  2. Use the final value of λ to determine, from the equation for the distribution, the probabilities of 0, 1, 2, ... events per unit of time. These in turn are used to compute integer confidence limits.

The advantage of using a discrete distribution is not an improved point forecast but improved confidence limits, and the availability of a formula to compute the probability of zero events, one event, etc.
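A compact sketch of these two steps, with an assumed smoothing weight of 0.2 and 95% limits taken from SciPy's Poisson distribution:

```python
from scipy.stats import poisson

def smooth_lambda(counts, alpha=0.2):
    """Estimate the current Poisson rate by simple exponential smoothing."""
    lam = counts[0]
    for c in counts[1:]:
        lam = alpha * c + (1 - alpha) * lam
    return lam

counts = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]          # illustrative monthly demands
lam = smooth_lambda(counts)
lower, upper = poisson.ppf(0.025, lam), poisson.ppf(0.975, lam)
print(lam, poisson.pmf(0, lam), (lower, upper))   # point forecast, P(zero), integer limits
```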

Negative Binomial Distribution

The variance of many integer series runs higher than can be modeled by the Poisson distribution, where the ratio of the variance to the mean is unity. For instance, the mechanical failures that require a certain spare part may be accurately modeled by a Poisson distribution, but orders for the part may not reflect current failures. The parts may be inventoried by an intermediate distributor, or the end user may buy more than is needed immediately. The result is that, while the mean of the parts orders matches the mean of the failures, the orders are more variable.

The negative binomial distribution allows us to model such data. The negative binomial distribution ranges over the integers {0,1,2,...}. The probability that exactly x events occur is given by the following formula.

f(x) = C(x + y − 1, x) p^y (1 − p)^x

where C(n, k) denotes the binomial coefficient.

This is the formula for the probability of x failures before the y'th success in a sequence of Bernoulli trials in which the probability of success at each trial is p. In Forecasting Module, we regard the negative binomial distribution in a more empirical way. It is flexible enough to model many discrete business series that are not modeled well by the Poisson distribution. The parameters to be fitted to the data consist of y, an integer which assumes values in the range {1,2,3,...}, and p, which lies in the interval from 0.0 to 1.0. These two parameters are fitted to the data by using the facts that the mean of the distribution is y(1 − p)/p while its variance is y(1 − p)/p². Thus the ratio of the mean to the variance is p.

The mean of the series is estimated via simple exponential smoothing, as with the Poisson distribution. We assume that the ratio of the variance to the mean is a constant, which we also estimate from the data.

This gives us the distribution of x at the end of the historic data. The point forecasts equal the final estimate of the mean. The confidence limits are computed by using the formula for f(x) as a function of x and the two fitted parameters.
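The sketch below fits the two parameters by the method of moments just described and reads confidence limits off SciPy's negative binomial distribution, whose (n, p) parameterization has mean n(1 − p)/p and variance n(1 − p)/p². A global sample mean is used in place of the smoothed mean, and the data are invented:

```python
import numpy as np
from scipy.stats import nbinom

counts = np.array([0, 2, 0, 5, 1, 0, 4, 0, 3, 6, 0, 2])
mean, var = counts.mean(), counts.var(ddof=1)
assert var > mean, "negative binomial needs overdispersion; otherwise use Poisson"

p = mean / var                        # ratio of the mean to the variance
n = mean * p / (1 - p)                # method-of-moments counterpart of y
lower, upper = nbinom.ppf(0.025, n, p), nbinom.ppf(0.975, n, p)
print(mean, (lower, upper), nbinom.pmf(0, n, p))   # forecast, limits, P(zero)
```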

Box-Jenkins Statistical Models

Box-Jenkins is a powerful forecasting technique which, for appropriate data, often outperforms exponential smoothing. Traditionally, however, Box-Jenkins models have been difficult and time consuming to build. This has kept them from widespread acceptance in the business community. However, automatic algorithms such as those found in Forecasting Module, now allow forecasters to build Box-Jenkins models quickly and easily.

As a result, they are now candidates for more widespread use. In the largest empirical studies of forecasting accuracy to date (Makridakis [1982], Makridakis [2000]), exponential smoothing outperformed Box-Jenkins overall, but in many specific cases, Box-Jenkins outperformed exponential smoothing. Ideally, a forecaster would switch between Box-Jenkins and exponential smoothing models, depending on the properties of the data. This is precisely what the Forecasting Module expert system is designed to do.

Box-Jenkins models are built directly on the autocorrelation function (ACF) of a time series variable. Therefore, a prerequisite for Box-Jenkins is that the data should possess a reasonably stable autocorrelation function. If the autocorrelations are not stable, or the data are too short (say, fewer than 40 points) to permit reasonably accurate estimates of the autocorrelations, then exponential smoothing is the better choice. This avoids the principal pitfall of Box-Jenkins: fitting a complex model to chance historic correlations or outliers. Univariate Box-Jenkins cannot exploit leading indicators or explanatory variables. If these are important, then a multivariate method such as dynamic regression is a better choice.

Forecasting Module implements the univariate ARIMA (AutoRegressive Integrated Moving Average) procedure described by Box and Jenkins [1976]. The models can be identified completely automatically by the program, or the user can interactively build a model, or test variations on the model selected by the Forecasting Module expert system.

The program supports the multiplicative seasonal model described by Box and Jenkins. This section is intended to provide a background to the statistical methodology used in the program. Those who would like a more thorough coverage should consult Box and Jenkins’ classic theoretical textbook or Applied Statistical Forecasting [Goodrich 1989].

Implementation of Box-Jenkins in Forecasting Module

Automatic identification

The program begins by performing a range-mean test to determine whether the log or square root transform should be applied to the data. Next the program determines the simple and seasonal differencing necessary to render the data stationary. It uses an adaptation of the Augmented Dickey-Fuller test (see Goodrich [1989]). Then it computes approximate values for the parameters of a group of candidate models.

Forecasting Module tests each model, and selects the one that minimizes the BIC criterion. Exhaustive fitting and examination of all low order ARIMA models would take an inordinate amount of computer time. Forecasting Module actually overfits a state space model, and uses it to generate approximate Box-Jenkins models quickly. Sometimes this method misses the minimum BIC by a slight amount, but it virtually never selects a bad model.

Business Forecast Systems has compared its Automatic Box-Jenkins models with the published results from the M-competition, where an expert spent 20 minutes to identify each ARIMA model manually. Forecasting Module outperformed the Box-Jenkins expert at every forecast horizon.

Business Forecast Systems recommends that you use Automatic identification routinely.

Initialization

Forecasting Module uses the method of back-forecasting to initialize Box-Jenkins models. This technique is described in Box and Jenkins [1976].

Parameter estimation

Forecasting Module uses the method of unconditional least squares to obtain final parameter estimates. If necessary, the parameters are adjusted to ensure stationarity or invertibility.

Constant term

By default, Forecasting Module uses a constant term only when an ARIMA model does not involve differencing. This is to avoid imposition of deterministic trends, which can lead to large forecast errors at longer horizons.

Box-Jenkins Background

Two statistical concepts are pivotal for understanding Box-Jenkins modeling and dynamic regression: stationarity and autocorrelation.

Stationarity

A time series process is stationary (wide sense) when it remains in statistical equilibrium with unchanging mean, unchanging variance and unchanging autocorrelations. A stationary process can be represented as an optimal autoregressive moving average (ARMA) model.

Unfortunately, most business and economic time series are not stationary. There are many forms of nonstationarity, but the following forms are especially important.

Nonstationarity in the Mean

The mean is not constant but drifts slowly, without consistent direction. The time series is trended or cyclical. The trend is not constant but slowly drifts up and down.

Nonstationarity in the Variance

The time series is heteroscedastic, i.e. the variance of observations around the mean is changing. One treats these cases by transforming the data to stationarity. Nonstationarity in the mean is removed by differencing. Nonstationarity in the variance is removed by applying a Box-Cox power transform.
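As a small illustration, the sketch below stabilizes the variance of an artificial series with a Box-Cox transform (SciPy estimates the power) and then removes the drift in the mean with a first difference:

```python
import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(4)
t = np.arange(1, 61)
# trended series whose noise grows with the level (heteroscedastic)
y = np.exp(0.05 * t) * (1.0 + 0.1 * rng.normal(size=t.size))

stabilized, lam = boxcox(y)        # variance-stabilizing power transform, lambda fitted
differenced = np.diff(stabilized)  # first difference removes the drift in the mean
print(lam, differenced[:5])
```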

Autocorrelation function

According to ARIMA statistical theory, a time series can be described by the joint normal probability distribution of its observations Y_1, Y_2, ..., Y_N. This distribution is characterized by the vector of means and the autocovariance function. The autocovariance of Y_t and its value Y_{t+m} at a time m periods later is defined by

γ_m = cov(Y_t, Y_{t+m}) = E[(Y_t − µ)(Y_{t+m} − µ)]

where the operator E denotes statistical expectation, cov denotes the covariance, and µ is the expected value of Y_t. Notice that the autocovariance function is a function of the time separation m, not the absolute times. This is an implicit assumption that the autocovariance function does not depend on the time origin t. In other words, the time series is stationary. If it is not, then its autocovariance function is not defined. Notice that γ_0 is the same as the variance σ²_Y.

The autocorrelation function is computed by dividing each term of the autocovariance function by the variance σ²_Y:

ρ_m = γ_m / σ²_Y

The autocovariance function is a theoretical construct describing a statistical distribution. In practice, we can only obtain estimates of the true values. The generally accepted formula is

c_m = (1/N) Σ_{t=1}^{N−m} (Y_t − Ȳ)(Y_{t+m} − Ȳ)

where Ȳ is the sample mean. The sample autocorrelation function is then given by

r_m = c_m / c_0

The sampling error of this estimate can be large, especially when the autocorrelations are themselves substantial. The estimates are also highly intercorrelated. Because of this, one must use caution in labeling particular correlations significant by visual examination of the sample autocorrelation function.

The sample autocorrelation function displayed in Forecasting Module includes dashed lines at the 2σ limits, where σ is the approximate standard error of the sample autocorrelation coefficient, computed via the Bartlett [1946] approximation.

The rate at which σ expands depends on the sample values of lower order autocorrelations.
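The following sketch computes a sample autocorrelation function and the corresponding 2σ limits from Bartlett's approximation, var(r_k) ≈ (1 + 2·Σ_{j<k} r_j²)/N; the random-walk-style test series is only an example:

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1 ... r_max_lag."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = np.sum(y * y)
    return [np.sum(y[k:] * y[:-k]) / denom for k in range(1, max_lag + 1)]

def bartlett_limits(acf, n_obs):
    """Approximate +/- 2 sigma band for each lag (Bartlett [1946])."""
    limits, cum = [], 0.0
    for r in acf:
        sigma = np.sqrt((1.0 + 2.0 * cum) / n_obs)
        limits.append(2.0 * sigma)
        cum += r * r                     # lower-order correlations widen the band
    return limits

y = np.random.default_rng(0).normal(size=60).cumsum()   # random-walk-like series
acf = sample_acf(y, max_lag=6)
print(list(zip(np.round(acf, 2), np.round(bartlett_limits(acf, len(y)), 2))))
```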

Description of the ARIMA Model

Box-Jenkins models the autocorrelation function of a stationary time series with the minimum possible number of parameters. Since the Box-Jenkins dynamic model includes features (moving average terms) that dynamic regression does not, Box-Jenkins theoretically will produce the optimum univariate dynamic model. Therefore, even when a dynamic regression model might ultimately be selected, a preliminary Box-Jenkins analysis provides a useful benchmark for model dynamics.

Since the procedure is quick and automatic, this puts very little analytic burden on the user. The Box-Jenkins model uses a combination of autoregressive (AR), integration (I) and moving average (MA) terms in the general AutoRegressive Integrated Moving Average (ARIMA) model.

This family of models can represent the correlation structure of a univariate time series with the minimum number of parameters to be fitted. Thus these models are very efficient statistically and can produce high-performance forecasts. The notation we will use is consistent with that used for exponential smoothing.

N: Number of historic data points
m: Forecast lead time (horizon)
p: Number of periods in a year
Y_t: Observed value at time t
∇: Differencing operator
∇_s: Seasonal differencing operator
B: Backward shift operator
φ_i: Autoregressive coefficient (lag i); displayed as a[i] in Forecasting Module
φ(B): Autoregressive polynomial of order p
Φ_i: Seasonal autoregressive coefficient (lag i); displayed as A[i] in Forecasting Module
Φ(B^s): Seasonal autoregressive polynomial of order p_s
θ_i: Moving average coefficient (lag i); displayed as b[i] in Forecasting Module
θ(B): Moving average polynomial of order q
Θ_i: Seasonal moving average coefficient (lag i); displayed as B[i] in Forecasting Module
Θ(B^s): Seasonal moving average polynomial of order q_s
Ŷ_t(m): Forecast for time t+m from origin t
e_t: One-step forecast error, Y_t − Ŷ_t
ε_t: Normally independently distributed random shock

Differencing

If a time series is not stationary in the mean, then it must first be transformed to stationarity by the use of differencing transforms. To describe differencing transforms we use the backward shift operator B, defined as follows.

B Y_t = Y_{t-1}

This operator will be used in our discussion of ARMA processes. For instance, the differencing operator is defined as follows.

∇ Y_t = Y_t − Y_{t-1} = (1 − B) Y_t

Autoregressive Processes

The AR(p) model is specified by the equation

Y_t = φ_1 Y_{t-1} + φ_2 Y_{t-2} + ... + φ_p Y_{t-p} + ε_t    (1)

in which the dependent variable appears to be regressed on its own lagged values. This equation can also be represented in terms of the backward shift operator B as

(1 − φ_1 B − φ_2 B² − ... − φ_p B^p) Y_t = ε_t    (2)

and, if we adopt the notation φ(B) for the polynomial in B, it can be written succinctly in the form

φ(B) Y_t = ε_t    (3)

Moving Average Processes

The moving average process MA(q) is given by

Y_t = ε_t − θ_1 ε_{t-1} − θ_2 ε_{t-2} − ... − θ_q ε_{t-q}    (4)

or, alternatively, in the polynomial form

Y_t = θ(B) ε_t    (5)

The pure moving average process MA(q) is virtually never observed in real-world data. It describes the unlikely process whose autocorrelations are nonzero for q lags, and zero thereafter.

Moving average terms are, in practice, used only in conjunction with differencing or autoregressive terms. In that case, they are invaluable. They induce data smoothing just like that of exponential smoothing.

ARMA and ARIMA Processes

The AutoRegressive Moving Average process ARMA(p,q) combines the features of the AR(p) and MA(q) processes. In polynomial form, it is given by

φ(B) Y_t = θ(B) ε_t    (6)

Thus the AR(p) process is the same as ARMA(p,0) and the MA(q) process is the same as ARMA(0,q). Any stationary time series can be modeled as an ARMA(p,q) process. Any time series that can be made stationary by differencing d times can be modeled as an ARIMA(p,d,q) process. The ARIMA(p,d,q) model is given by the following equation.

φ(B) (1 − B)^d Y_t = θ(B) ε_t    (7a)

This is the most general non-seasonal Box-Jenkins model.
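For readers who want to experiment with the model family directly, the sketch below fits an ARIMA(1,1,1) with the statsmodels library. The Module's automatic identification and unconditional-least-squares estimation differ internally, so this is an illustration of the model class rather than of the Module itself:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=80)) + 50       # an illustrative nonstationary series

model = ARIMA(y, order=(1, 1, 1))             # AR(1), one difference, MA(1)
fit = model.fit()
print(fit.params)                             # estimated AR/MA coefficients
print(fit.forecast(steps=6))                  # six periods ahead
```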

Deterministic Trends

By default, Forecasting Module does not include a constant term in an ARIMA model except when the model does not involve differencing. If you dictate that a constant term be used, the equation for the model takes the form shown in equation (7b).

φ(B) (1 − B)^d Y_t = θ(B) ε_t + c    (7b)

The effect of a constant term is to introduce deterministic trending into your model, in addition to its other properties. If you have differenced once, the trend is linear; if you have differenced twice, it is quadratic. This is usually undesirable because it extrapolates the global trend of the historic data indefinitely into the future, even when the current trend is slight. This usually produces poor forecast accuracy for longer horizons. Business Forecast Systems has confirmed this effect by testing over the 111 Makridakis time series.

Seasonal Models

Equation (7a) is adequate to model many seasonal series, provided that the polynomials “reach back” one or more seasonal periods. This means that either p or q (or both) must equal or exceed the seasonal period s. Since all intervening terms would also be included, such a model is not parsimonious, i.e., it would contain unnecessary coefficients to be estimated. This is often damaging to the predictive validity of the model. On the other hand, we might consider a seasonal version of equation (7a) in which the backward shift operator B is replaced by its seasonal counterpart B^s.

The resulting equation is

Φ(B^s) (1 − B^s)^D Y_t = Θ(B^s) ε_t    (8)

where the polynomials Φ and Θ are of orders P and Q respectively. This is the ARIMA(P,D,Q)_s model. It relates the observation in a given period to those of the same period in previous years, but not to observations in more recent periods. The most general seasonal model includes both seasonal and simple ARIMA models at once. The following equation describes the multiplicative seasonal ARIMA model.

φ(B) Φ(B^s) (1 − B)^d (1 − B^s)^D Y_t = θ(B) Θ(B^s) ε_t    (9)

It is usually symbolized as ARIMA(p,d,q)(P,D,Q).

Selecting Model Orders

The hardest part of Box-Jenkins modeling is deciding which ARIMA(p,d,q) model fits the data best, i.e. in identifying the degree of differencing d, the AR order p, and the MA order q.

Much of Box and Jenkins [1976] is devoted to this so-called “identification” problem. The Forecasting Module expert system, in fact, identifies the model automatically. Therefore, the remainder of this section, in which the original Box-Jenkins procedure is presented, is inessential.

The original Box-Jenkins procedure depends upon graphical and numerical analysis of the autocorrelation function and the partial autocorrelation function. It is a pattern recognition procedure that requires skill and patience to learn. We will discuss only the non-seasonal case.

Degree of Differencing

The identification procedure begins by determining the degree of differencing d that is required to make the original data Y_t stationary. This is done through examination of the autocorrelation function r_k. The first few lags of the autocorrelation function of the raw data Y_t are inspected; if these die out relatively quickly, then no differencing is required, i.e. d = 0. If not, then the original data are replaced by the first difference ∇Y_t and the process is repeated. If the autocorrelation function of the differenced data dies out quickly, d = 1. If not, the data are differenced a second time to obtain ∇²Y_t.

This process is repeated until, for some d, the autocorrelation function of the multiply differenced data does die out quickly. In practice, d is rarely greater than 2.

Once the degree of differencing is determined, the remainder of the analysis deals with the stationary series ∇^d Y_t. If d is zero, these are the original data.

Autoregressive Order

The autoregressive order p is determined by inspection of the sample partial autocorrelation function φ̂_kk. We will motivate the odd notation and the definition of this function through a thought experiment.

Suppose that the process is thought to be purely autoregressive (q = 0). Then a rational strategy to determine p would be to compute a regression of Y_t on its first lag, then on its first two lags, and so on until the last lag introduced into the regression turns out not to be statistically significant.

This is determined by a statistical test on φ̂_kk, which is defined as the coefficient of the k'th lag in a regression of Y_t on Y_{t-1}, Y_{t-2}, ..., Y_{t-k}, i.e. the k'th coefficient of the k'th regression. Actually, a fast recursive algorithm is used instead of performing so many regressions. A graph is presented of the first forty-eight lags so that the AR order can be determined. If the process is ARIMA(p,d,0), then the partial autocorrelation function dies abruptly after p lags.

Moving Average Order

A pure moving average process ARIMA(0,d,q) exhibits the same behavior in the autocorrelation function that the autoregressive process ARIMA(p,d,0) does in the partial autocorrelation function. In other words, if the process is ARIMA(0,d,q), then the sequence r_k is large for k ≤ q and near zero for k > q. Thus the autocorrelation function is used for MA processes in the same manner as the partial autocorrelation function for AR processes. The functions r_k and φ̂_kk also exhibit similar behavior for ARIMA(p,d,0) and ARIMA(0,d,q) processes, respectively. Instead of abruptly cutting off at p and q, respectively, these functions tail off smoothly in exponential decay or exponentially damped sine waves.

By examining both functions, the forecaster can determine the orders of pure AR and MA processes. Mixed processes ARIMA(p,d,q) are more complex. Neither the partial autocorrelation function nor the autocorrelation function abruptly dies out.

Instead, the autocorrelations remain large for k ≤ q and die out exponentially thereafter. The partial autocorrelations remain large for k ≤ p and behave similarly thereafter.

Manual identification of mixed ARIMA processes is often very difficult. There are two severe problems with this procedure for order identification.

  1. First, even when the data really do fit an ARIMA process, the sample autocorrelations used to identify the process can be very different from the theoretical ones due to sampling variation.

  2. Second, the actual data usually contain outliers and other unmodelable features that can significantly distort the autocorrelation and partial autocorrelation functions.

It is our judgment that the Box-Jenkins procedure should be used only as the very roughest guide.

We recommend that you fit an automatic Box-Jenkins model first. Then, if you suspect that you can find a better model, you can try variations of the automatic model. You can use the BIC criterion to make a final decision.

Note that the Forecasting Module automatic identification method has bested human experts in several academic studies.

Curve fitting

Curve fitting is generally used to model the global trend of a data set. Although curve fitting is not as sophisticated as some of the other Forecasting Module forecasting methodologies, it can still be quite useful. Unlike some of the other methods, curve fitting may be used with short time series (the suggested minimum length is ten data points). Also, the program provides a quick and easy way to identify the general form of the curve your data are following.

Be aware, however, that curve fitting methods do not accommodate or project seasonal patterns in a series.

The curve fitting routine supports four types of curves—straight line, quadratic, exponential and growth (s-curve). You can let Forecasting Module choose the form of the curve or select it yourself. The automatic option tries the four curves and selects the one that minimizes the BIC for the historic series. The equations for each curve are shown below (t = time). All of the coefficients of the model are selected to minimize the sum of the squared errors.
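The sketch below mimics the idea: fit candidate curves by least squares and compare them with a BIC-style score. Only the line, quadratic and exponential forms are included, the exponential is written as a·e^(bt), and the product's growth (s-curve) equation is omitted because its exact form is not given here:

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.arange(1, 13, dtype=float)
y = 5.0 + 1.8 * t + np.random.default_rng(2).normal(scale=1.0, size=t.size)

curves = {
    "line":        (lambda t, a, b: a + b * t,              [1.0, 1.0]),
    "quadratic":   (lambda t, a, b, c: a + b * t + c * t**2, [1.0, 1.0, 0.0]),
    "exponential": (lambda t, a, b: a * np.exp(b * t),       [1.0, 0.1]),
}

def bic(sse, n_obs, k):
    """BIC-style score: goodness of fit penalized for extra coefficients."""
    return n_obs * np.log(sse / n_obs) + k * np.log(n_obs)

scores = {}
for name, (f, p0) in curves.items():
    params, _ = curve_fit(f, t, y, p0=p0, maxfev=10000)   # least-squares fit
    sse = float(np.sum((y - f(t, *params)) ** 2))
    scores[name] = bic(sse, t.size, len(params))
print(min(scores, key=scores.get), scores)
```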

Model Statistics

Within-sample statistics are displayed each time a model is fitted to the data. Out-of-sample statistics are displayed whenever a hold-out sample is used. Each statistic is listed below.

Sample size: The number of historical data points used to fit the model. Operations that discard data points (e.g., differencing, inclusion of lagged variables) can reduce this statistic.

Number of parameters: The number of fitted parameters (coefficients) in the model.

Mean: The sample mean (average) of the historical data.

Standard deviation: A measurement of the dispersion of the historical data around its mean.

R-square: R-square is the fraction of variance explained by the model.

Adjusted R-square: The adjusted R-square is identical to the R-square except that it is adjusted for the number of parameters (k) in the model.

Durbin-Watson: The Durbin-Watson d-statistic is used to test for correlation of adjacent fitted errors, i.e. for first-lag autocorrelation. If T is the number of sample points and e_t is the fitted error at point t, then d is computed as follows.

d = Σ_{t=2}^{T} (e_t − e_{t-1})² / Σ_{t=1}^{T} e_t²

While the d-statistic is easy to compute, it is hard to interpret.

The null hypothesis is that the first-lag autocorrelation is zero. One looks up the Durbin-Watson bounds dL and dU for sample size T and significance α in a table. The null is accepted if d > dU and rejected if d < dL; if d falls between dL and dU, the test is inconclusive.

Q statistic: The statistic is a weighted sum of squared autocorrelations, so it is zero only when every autocorrelation is zero. The more autocorrelation, the greater the size of Q. The weights are selected to make Q approximately χ²(L − n), i.e. chi-square with L − n degrees of freedom.
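Both diagnostics can be computed directly from a vector of fitted errors, as the sketch below shows; the Q statistic is computed in its commonly used Ljung-Box form, which may differ in detail from the Module's weighting:

```python
import numpy as np

def durbin_watson(e):
    """d = sum of squared first differences of the errors over their sum of squares."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def ljung_box_q(e, n_lags):
    """Weighted sum of squared residual autocorrelations (Ljung-Box form)."""
    e = np.asarray(e, dtype=float)
    T = e.size
    e = e - e.mean()
    denom = np.sum(e ** 2)
    q = 0.0
    for k in range(1, n_lags + 1):
        r_k = np.sum(e[k:] * e[:-k]) / denom
        q += r_k ** 2 / (T - k)          # weights chosen so Q is ~ chi-square
    return T * (T + 2) * q

errors = np.random.default_rng(3).normal(size=48)   # illustrative fitted errors
print(durbin_watson(errors), ljung_box_q(errors, n_lags=12))
```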

Forecast error: The standard forecast error is the root mean square of the fitted errors adjusted for the number of parameters (k) in the model. It is used to compute the confidence limits of the forecasts, but, realistically, it is usually an overly optimistic estimate of true out-of-sample error.

BIC: The AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion) are the two order estimation criteria in most common use. A specific model is selected from a model family by finding the model that minimizes the AIC or BIC. Either statistic rewards goodness-of-fit, as measured by the mean square error s, and penalizes for complexity, i.e. the number of parameters n. Koehler and Murphree [1986] showed that, for series from the M-competition, the BIC leads to better out-of-sample forecast performance and, for this reason, Forecasting Module uses and displays the BIC.

There are several equivalent versions of the BIC, related to each other by simple transforms. Forecasting Module uses a version scaled in the same units as the standard forecast error, where T represents the sample size, so it can very loosely be interpreted as an estimate of out-of-sample forecast error. The BIC can be used to compare different models from the same family fitted to the same data. Because it is scaled to the standard forecast error, it is meaningless as an absolute criterion of merit.
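For illustration, one common way to express the Schwarz BIC in the same units as the forecast error is s · T^(n/(2T)). The sketch below uses that form as an assumption, since the module's exact equation is not reproduced here.

```python
import numpy as np

def scaled_bic(residuals, n_params):
    """Forecast-error-scaled BIC (assumed form: s * T**(n / (2*T)))."""
    e = np.asarray(residuals, dtype=float)
    T = len(e)
    s = np.sqrt(np.sum(e ** 2) / T)      # residual standard deviation (ML estimate)
    return s * T ** (n_params / (2 * T))
```

Because the penalty term grows with the parameter count n, two models fitted to the same data can be ranked by this number even when their raw fit errors are similar.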

MAPE: The MAPE (Mean Absolute Percentage Error) is used to measure within-sample goodness-of-fit and out-of-sample forecast performance. It is calculated as the average of the unsigned percentage errors.

RMSE: The RMSE (Root Mean Square Error) is used to measure within-sample goodness-of-fit. It is calculated as the square root of the average of the squared errors.

MAD: The MAD (Mean Absolute Deviation) is used to measure within-sample goodness-of-fit and out-of-sample forecast performance. It is calculated as the average of the unsigned errors.

GMRAE: The GMRAE (Geometric Mean Relative Absolute Error) is used to measure out-of-sample forecast performance. It is calculated using the relative error between the naïve model (random walk) and the currently selected model. A GMRAE of 0.54 indicates that the size of the current model’s error is only 54% of the size of the error generated using the naïve model for the same data set.
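A minimal sketch of these accuracy measures, assuming the usual textbook definitions (function and variable names are illustrative):

```python
import numpy as np

def error_metrics(actual, forecast):
    """MAPE, RMSE and MAD for paired actuals and forecasts."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    e = a - f
    mape = np.mean(np.abs(e / a)) * 100.0   # assumes no zero actuals
    rmse = np.sqrt(np.mean(e ** 2))
    mad = np.mean(np.abs(e))
    return mape, rmse, mad

def gmrae(actual, forecast, naive_forecast):
    """Geometric mean of |model error| / |naive (random walk) error|."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    nf = np.asarray(naive_forecast, dtype=float)
    rel = np.abs(a - f) / np.abs(a - nf)    # assumes the naive errors are nonzero
    return float(np.exp(np.mean(np.log(rel))))
```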

Safety Stocks

Outlier Detection

An outlier is an unusually large or small data point that falls outside of the expected range and can have a significant impact on the forecast.

The process of screening the historical data for outliers and replacing them with more typical values prior to generating forecasts is referred to as outlier detection and correction.

Outlier correction can often improve the forecast. However, if the outlier is not truly severe, correcting for it may do more harm than good. When outliers are corrected, the history is rewritten to be smoother than it was, which changes the forecasts and narrows the confidence limits. This can result in poor forecasts and unrealistic confidence limits when outlier detection is not necessary.

Outlier detection should be performed sparingly, and it should be reviewed to determine whether a correction is appropriate.

An automated algorithm detects and corrects outliers as follows:

  1. Each time a forecast is calculated, the fitted errors are generated and their standard deviation is calculated.

  2. If the size of the largest error exceeds the outlier threshold, the point is flagged as an outlier and the historic value for the period is replaced with the fitted value.

  3. The procedure is repeated using the corrected history until either no outliers are detected or the specified maximum number of iterations is reached.

The sensitivity of the outlier detection algorithm is defined by a standard deviation threshold, which is set to two.

This means that if a given fitted error exceeds two standard deviations and it is the largest error detected during the current iteration, it is flagged as an outlier.

The maximum number of iterations permitted during outlier detection is three. Since only one outlier is identified per iteration, this also defines the maximum number of outliers that can be detected for a given item.
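Putting the steps together, a sketch of the loop might look like the following; fit_model is a hypothetical stand-in for refitting the forecast model and returning its fitted values.

```python
import numpy as np

def correct_outliers(history, fit_model, threshold=2.0, max_iter=3):
    """Iterative outlier detection and correction, following the steps above.

    fit_model is a hypothetical callable: it takes the (possibly corrected)
    history and returns the fitted values for each period."""
    y = np.asarray(history, dtype=float).copy()
    for _ in range(max_iter):
        fitted = fit_model(y)                    # refit the model each pass
        errors = y - fitted
        sigma = errors.std()
        worst = int(np.argmax(np.abs(errors)))
        if abs(errors[worst]) <= threshold * sigma:
            break                                # no outlier found this iteration
        y[worst] = fitted[worst]                 # replace the outlier with the fitted value
    return y
```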

Safety stock

Forecasting Module generates safety stock calculations in addition to point forecasts and confidence limits.

This capability is most often used in setting inventories which are replenished only at certain variable or fixed intervals. Inventory control analysis requires the manager to balance inventory holding costs, reorder costs and other factors to determine economic order sizes and reorder points to maintain a desired service level at minimum cost.

The analysis must take into account the lead time between placing an order and placing the units in stock.

Although such analyses can become very complex, most of them require answering a question similar to the following: How much stock do I need at each point in time to maintain a service level of 95% if the reorder lead time is six weeks? At each point in time, the manager needs enough stock so that total sales for the next six weeks will exceed the stock level only five percent of the time.

It is easy to compute median sales over the six-week period: simply add the forecasts for the six weeks. The difficulty lies in computing the probability that sales will exceed the cumulative forecast by some given amount.

To determine this mathematically is a complex problem that depends upon the details of the statistical forecast model.

Most MRP systems use a very crude approximation to solve this problem, and really must, since they know nothing about the underlying statistical forecast model. The difficulty of the calculation lies in taking into account the serial correlations of sales from point to point over the reorder cycle.

Forecasting Module is unique in providing a rigorous statistical solution to this problem. It does so by converting the model to an equivalent but different form called the Wold representation. This is the key to determining the statistical distribution of the cumulative forecast. Consult Wold [1938] for details of the computation.

There is one caveat: the computation assumes that the statistical distribution of future sales is in fact correctly captured by Forecasting Module. This is never absolutely true, so the safety stocks will be in error to the extent that the Forecasting Module model does not capture the true model.

Forecasting Module provides the safety stock in the form of the excess stock needed, above and beyond the cumulative forecast, to maintain the service level specified for the upper confidence limit percentile.

The safety stocks are output for each lead time up to and including the forecast horizon. Thus, to determine the stock required for a six-week lead time, you would add the forecasts for the next six weeks and add on the six-week safety stock as well.
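For intuition, here is a rough sketch of the idea: once a model is expressed in its Wold (moving-average) form with weights psi, the variance of the cumulative forecast over a lead time follows directly, and the safety stock is the corresponding upper percentile in excess of the cumulative forecast. The formula and the normality assumption below are illustrative, not the module's exact computation.

```python
import numpy as np
from scipy.stats import norm

def cumulative_safety_stock(psi, sigma, lead_time, service_level=0.975):
    """Safety stock for the cumulative forecast over `lead_time` periods.

    psi: Wold (MA-infinity) weights of the fitted model, with psi[0] = 1.
    sigma: standard deviation of the one-step forecast error.
    Assumes normally distributed errors; a sketch under those assumptions."""
    psi = np.asarray(psi, dtype=float)
    var = 0.0
    for h in range(1, lead_time + 1):
        var += np.sum(psi[:h]) ** 2      # cumulative psi weights enter squared
    var *= sigma ** 2
    return norm.ppf(service_level) * np.sqrt(var)
```

Because the serial correlation of the forecast errors is encoded in the psi weights, this calculation automatically accounts for the point-to-point correlation over the reorder cycle described above.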

Consider the following example: The forecast for each week represents the mean of all possible futures. Put another way, it is equally likely, according to the model, that the actual value will be above or below the forecast.

The upper and lower confidence limits provide information about the spread around the forecast for a given period. The 97.5 percent upper confidence limit for week 40 is 770. Thus, according to the model, actual sales for week 40 should fall at or below 770, 97.5 percent of the time.

The safety stock takes into account the lead time. It is defined as the amount in addition to the forecast that you will need to meet demand at the specified percentage for a specified number of periods. In the above example, in week 35 (lead time equals one week) you would need the forecast for the first week (738) plus the one-week safety stock (143) to meet demand for the one-week period (notice that for a one-week horizon the safety stock equals the upper limit minus the forecast).

To cover a two-week period, you would need the forecast for the first week, plus the forecast for the second week, plus the two-week safety stock (738+747+209). To calculate the stock needed to cover a six-week period, you would sum the forecasts for the six weeks and add the six-week safety stock (738+747+736+664+690+642+403).
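Using the numbers from the example above (weekly forecasts for weeks 35 to 40 and the one-, two- and six-week safety stocks), the calculation can be written out as follows:

```python
# Weekly forecasts from the example (weeks 35-40) and safety stocks by lead time.
forecasts = [738, 747, 736, 664, 690, 642]
safety = {1: 143, 2: 209, 6: 403}

def stock_required(lead_time):
    """Cumulative forecast over the lead time plus the lead-time safety stock."""
    return sum(forecasts[:lead_time]) + safety[lead_time]

print(stock_required(1))   # 738 + 143
print(stock_required(2))   # 738 + 747 + 209
print(stock_required(6))   # sum of the six weekly forecasts + 403
```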

Methodology of Automatic Forecasting

This chapter describes special methodological considerations that apply to automatic forecasting of hundreds or thousands of items.

Introduction

Much of the forecasting literature (and much of the available software) concentrates on forecasting time series in an R&D environment. The literature envisions the forecaster as intensely interested in just one or two complex time series, perhaps for long forecast horizons, and often with highly significant consequences. Forecasting Module is based on the fairly scanty published research and upon BFS research and experience.

The forecasting methods that succeed best are relatively simple ones. Product data is often so volatile that more complex models, no matter how well they fit the historical data, yield inferior forecasts. When historic records are relatively long and not very noisy, there is a substantial payoff in matching the forecast model to each data set individually. On the other hand, many business time series are extremely noisy.

The information in an individual historic record may not be sufficient to choose the best method reliably. Frequently, method A may appear to be superior to method B at one time, and inferior to it at another. In these cases, overall accuracy may be improved by selecting a model at the group level. This can be done by using the Forecasting Module out-of-sample evaluation procedure. Forecasting performance and goodness-of-fit of a model to the historic data are certainly related, but the relationship is much looser than one would expect.

Forecasting Module includes extensions (discrete distributions, Croston’s intermittent data model and multiplicative error model) to the standard confidence limit methodology to make confidence limits more accurate. Nevertheless, confidence limits are only rough guides to real out-of-sample forecast performance. You can evaluate real out-of-sample performance by using the Forecasting Module evaluation methodology.

Classification of Time Series

From the earliest days of forecasting systems for production and inventory control, it has been recognized that there are essentially three types of product time series. This stratification is based mainly on sales or production volume.

Type A series are very high volume. These series are usually fairly regular, so statistical forecasting methods like those in Forecasting Module perform well. However, these high volume items are also of great importance to the firm, and the consequences of forecast error can be significant. Thus, if there are not too many of them, it is wise to examine them interactively, and to make judgmental adjustments as appropriate.

Type B series are of medium volume. Ordinarily, these series can be forecasted fairly accurately by the methods in Forecasting Module. Since these items are not individually as crucial to the firm, they lend themselves well to automatic forecasting. Human intervention is usually required only when the forecasting software marks them as exceptional.

Type C series are of lowest volume, and may include as many as 50% of the total. Many of these series will be mostly zeroes, with occasional small sales and, more rarely, a large sale. The percent error of forecasts of Type C series is often quite large, but the consequence of error is usually small. When automatic forecasting first emerged, Type C series were not usually forecasted at all. Instead, a default forecast (say zero or one) was used, to save computer time (then a scarce resource). Now that computation is cheap, methods like those in Forecasting Module are likely to provide increased accuracy.

Any of these groups can include rogue series, i.e., series that are so irregular as to be virtually unforecastable. Obviously, most rogue series are of Type C, so that their practical significance is usually not high. However, their influence on forecast performance evaluations can be great, since evaluations usually depend upon percent error, not absolute error.

The meaningfulness of forecast performance evaluation, discussed below, will be greatly enhanced if the evaluation is based upon a classification of series, and the identification of rogue series.

Incorporation of Additional Information

It will often be appropriate to adjust forecasts from Forecasting Module judgmentally. This may be done to include information about orders that have been received, or which are expected; the effects of promotions; knowledge about external economic trends; product mix changes, etc.

These are particular cases where the information available to the user is greater than that available to Forecasting Module.

In other cases, i.e., when the additional information possessed by the manager is not clear, the effect of subjective intervention by the user may be counterproductive. Studies have shown that, although most managers believe that their subjective assessment is superior to quantitative projection, that is not always the case.

As Fildes [1990] has put it, manager intervention is a “mixed blessing.” We advise the user to be cautious. In any case, user intervention is time consuming.

In those research studies where subjective intervention provided improved accuracy, the managers typically used graphical analysis and fairly intensive consideration as their tools. Quick “eyeball” adjustments are not likely to contribute much to accuracy.

Forecasting Module offers a forecast adjustment facility to permit subjective intervention by the user. Both individual and group level forecasts can be adjusted in a variety of ways.

Selection of Forecasting Method

Forecasting Module offers the user five basic forecasting methods, and a wide range of variations on these methods. The methods that are included were selected from among those few univariate forecasting models that are supported by the research, and which can be automated without excessive computational burden. Although true multivariate methods are not feasible in automatic systems, Forecasting Module does support event modeling which can accommodate promotional schedules and business interruptions.

The five basic methods are simple moving average, exponential smoothing, Croston’s intermittent demand, discrete distributions and Box-Jenkins (ARIMA). Simple moving average is included only for use on very short data series, where it is infeasible to fit more complex models to the data. The Croston’s model is designed for data with numerous zeros. Discrete distributions (negative binomial and Poisson distributions) are for use on data whose values are small integers. The other two “methods,” exponential smoothing and ARIMA, are not actually single methods but, rather, families of methods.

The member methods differ mainly in their accommodation of structural characteristics in the data like trend, seasonality and random noise. Choosing a method thus involves two steps: choosing a family and choosing a method from within the family. Empirical research studies such as the M-Competition have shown that there is no one single forecasting methodology that is most accurate in all cases. Fildes [1990] has shown that improvements in accuracy of 20% or more can be obtained by selecting the method that is most appropriate for a given data set. Thus the task of selecting a method is crucial for accuracy.

Features such as data length, stability, trending and seasonality will often lead the experienced forecaster to favor one method over another. These factors may help the experienced forecaster form a “hunch” about the best method for a particular group of data. However, as Fildes [1990] has demonstrated, the best and most reliable way to fit the method to the data is through testing.

That is why Forecasting Module includes an expert selection algorithm to automatically select a method for each series and a facility for testing over a hold-out sample.
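As a simple illustration of hold-out testing, the sketch below withholds the last few observations, forecasts them with a candidate method, and measures out-of-sample MAPE; fit_and_forecast is a hypothetical stand-in for any of the forecasting methods described above.

```python
import numpy as np

def holdout_mape(series, fit_and_forecast, holdout=6):
    """Fit on all but the last `holdout` points and measure MAPE on the rest.

    fit_and_forecast is a hypothetical callable: (history, horizon) -> forecasts."""
    history, actual = series[:-holdout], series[-holdout:]
    forecast = fit_and_forecast(history, holdout)
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((a - f) / a)) * 100.0)   # assumes no zero actuals

# Run this for each candidate method on the same hold-out sample and keep the
# method with the lowest out-of-sample MAPE (an assumed selection rule for illustration).
```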

Forecasting Module’s expert selection is easy to use and works extremely well. The only disadvantage is that the algorithm is time-consuming.

The following sections describe these processes in greater depth. However, they do not cover the statistical foundations of the forecast methods themselves.

They concentrate instead on implementation as automatic methods, and on implications for use of Forecasting Module. Background material on the five methodologies (simple moving average, exponential smoothing, Croston’s intermittent demand, discrete distributions and Box-Jenkins) is presented in the next chapter (glossary).

Glossary

This glossary contains definitions of the technical terms used in the main body of the text. A particular definition may involve terms that are defined elsewhere in the glossary.

ACF (Autocorrelation Function): The ACF consists of the autocorrelations for lags 1, 2, 3, ... N. Forecasting Module displays the ACF as a correlogram, i.e. a bar chart of the autocorrelations arranged by lag.

ARIMA model (AutoRegressive Integrated Moving Average): A family of sophisticated statistical models used by Box and Jenkins to describe the autocorrelations of time series data. The symbol ARIMA(p,d,q) indicates a model involving p autoregressive terms and q moving average terms, applied to data that have been differenced d times. The Box-Jenkins technique involves (1) identification of a particular ARIMA model to represent the historic data; (2) estimation of the ARIMA model coefficients; (3) statistical validation of the model; and (4) preparation of forecasts.

Autocorrelation: The correlation of a variable and itself N periods later, and hence a measure of predictability.

Base: The forecast base is the time point from which forecasts are prepared.

BIC (Bayes Information Criterion): A model selection criterion proposed by Schwarz [1978]. Within a model family (e.g. exponential smoothing or Box-Jenkins), the model that minimizes the BIC is likely to provide the most accurate forecasts. Since models with many parameters often fit the historical data well, but forecast poorly, the BIC balances a reward for goodness-of-fit with a penalty for model complexity.

If your current model yields the lowest BIC out of the models you have tested, Forecasting Module marks it with “Best thus far.”

Box-Cox Power Transform: Logarithmic or power transform of the data. Used to reduce or eliminate dependence of the local range of a time series on its local mean.

Box-Jenkins: Strictly speaking, the statistical technique developed by Box and Jenkins to fit ARIMA models to time series data. More loosely, the term refers to the ARIMA models themselves.

Confidence limits: A forecast is generally produced along with its upper and lower confidence limits. Each confidence limit is associated with a certain percentile. If the upper confidence limit is calculated for 97.5% and the lower for 2.5%, then actual values should fall above the upper confidence limit 2.5% of the time, and below the lower confidence limit 2.5% of the time. These are often called the 95% confidence limits to indicate that the actual value should fall inside the confidence band 95% of the time. In practice, confidence limits tend to overstate accuracy. You can set the confidence limit percentiles in Configure.

Dependent variable: The variable you want to forecast. Strictly speaking this term only applies to regression modeling, where there are independent variables as well, but it is sometimes convenient to use it for the variable in univariate models as well.

Differencing: To difference a time series variable is to replace each value (except for the first) by its difference from the previous value. The seasonal difference replaces each value (except for those in the first year) by its difference from the value one year previously.

Durbin-Watson Test: This statistic checks for autocorrelation in the first lag of the residual errors. It should be about 2.0 for a perfect model. Forecasting Module computes the Durbin-Watson d-statistic, which is, strictly speaking, applicable only for regressions that include a constant intercept term but do not include lagged dependent variables.

Exogenous Variable: An exogenous variable is an explanatory variable that can be treated as a time series of ordinary numbers. Practically speaking, independent variable means the same thing.

Exponential Smoothing: A robust forecasting method that extrapolates smoothed estimates of the level, trend, and seasonality of a time series.

Fit Set: The historic data set used to fit the parameters of a model, and as the base of extrapolation for the forecasts.

Forecast Error: Standard error of the within-sample forecasts, computed by running the forecast model through the historic data. Used as an estimate of the one-step forecast error.

Forecast Horizon: Number of periods you wish to forecast.

Forecast Scenario: A forecast scenario extends the historic series of independent variables into the future. Dynamic regression forecasts are dependent on the forecast scenario.

Lag: The time difference between a time series value and a previous value from the same series.

Ljung-Box Test: Checks for autocorrelation in the first several lags of the residual errors. If the Ljung-Box test is significant for a correlational model (Box-Jenkins or dynamic regression) then the model needs improvement. The test is significant if its probability is > .99, in which case it is marked with two asterisks in the standard diagnostic output.

Local Level: See local mean.

Local Mean: The average level of a time series in the general neighborhood of a given point in time. Sometimes called the local level.

Local Trend: The average rate of increase of a time series in the general neighborhood of a given point in time.

MAD (Mean Absolute Deviation): This measure of goodness-of-fit is calculated as the average of the absolute values of the errors. It is an important statistic in rolling simulation analysis.

MAPE (Mean Absolute Percentage Error): A statistic used to measure within-sample goodness-of-fit and out-of-sample forecast performance. It is calculated as the average of the unsigned percentage errors.

Model: A forecasting model is an equation, or set of equations, that the forecaster uses to represent and extrapolate features in the data.

Model Complexity: Model complexity is measured by the number of parameters that must be fitted to the historic data. Overfitting, i.e., using too many parameters, leads to models that forecast poorly. The BIC can help to find the model that properly trades off goodness-of-fit in the historic fitting set against model complexity.

Multivariate: Involving more than one variable at a time. Dynamic regression is a multivariate technique.

Residual Error: The difference between a predicted value and a true value in the fitting set, i.e. the fitted error.

Robust: A robust method is insensitive to moderate deviations from the underlying statistical assumptions.

RMSE (Root Mean Square Error): A statistic that is used as an indication of model fit. It is calculated by taking the square root of the average of the squared residual errors.

Seasonality: Periodic patterns of behavior of the series. For instance, retail sales exhibit seasonality of period 12 months. Usually the forecaster must take seasonality explicitly into account during the model fitting process.

Stochastic: A process is said to be stochastic when its future cannot be predicted exactly from its past. In a stochastic process, new uncertainty enters at each point in time.

Univariate: Involving only one variable at a time. Exponential smoothing and Box-Jenkins are univariate techniques.

Bibliography

J. S. Armstrong [1983] Relative Accuracy of Judgmental and Extrapolative Methods in Forecasting Annual Earnings, Journal of Forecasting, 2, pp. 437-447.
J. S. Armstrong [1985] Long Range Forecasting: From Crystal Ball to Computer (2nd ed.), New York: Wiley.

J. S. Armstrong [2001] Principles of Forecasting: A Handbook for Researchers and Practitioners, Norwell, MA: Kluwer Academic Publishers.
R. Ashley [1983] On the Usefulness of Macroeconomic Forecasts as Inputs to Forecasting Models, Journal of Forecasting, 2, pp. 211-223.
R. Ashley [1988] On the Relative Worth of Recent Macroeconomic Forecasts, International Journal of Forecasting, 4, pp. 363-376.
B. L. Bowerman and R. T. O'Connell [1979] Time Series and Forecasting: An Applied Approach, Boston: Duxbury Press.
G. E. P. Box and G. M. Jenkins [1976] Time Series Analysis: Forecasting and Control, Revised Edition, San Francisco: Holden-Day.
R. G. Brown [1963] Smoothing, Forecasting and Prediction of Discrete Time Series, Englewood Cliffs, NJ: Prentice-Hall.
D. W. Bunn and A. I. Vassilopoulos [1993] Using Group Seasonal Indices in Multi-item Short-term Forecasting, International Journal of Forecasting, 9, 4, pp. 517-526.

Caceci and Cacheris [1984] Fitting Curves to Data, Byte Magazine, May 1984.
C. Chatfield [1978] The Holt-Winters Forecasting Procedure, Applied Statistics, 27, pp. 264-279.
C. Chatfield and M. Yar [1988] Holt-Winters Forecasting: Some Practical Issues, The Statistician, 37, pp. 129-140.
C. Chatfield and M. Yar [1991] Prediction Intervals for the Holt-Winters Forecasting Procedure, International Journal of Forecasting, 6, 1, pp. 127-137.
G. C. Chow [1960] Tests of Equality between Subsets of Coefficients in Two Linear Regressions, Econometrica, pp. 591-605.
D. Cochrane and G. H. Orcutt [1949] Application of Least Squares Regression to Relationships Containing Autocorrelated Error Terms, Journal of the American Statistical Association, 44, pp. 32-61.
J. D. Croston [1972] Forecasting and Stock Control for Intermittent Demands, Operational Research Quarterly, 23(3), pp. 289-303.
E. B. Dagum [1982] Revisions of Time Varying Seasonal Filters, Journal of Forecasting, 1, 2, pp. 173-187.
D. A. Dickey, W. R. Bell, and R. B. Miller [1986] Unit Roots in Time Series Models: Tests and Implications, The American Statistician, 40, pp. 12-26.
D. A. Dickey and W. A. Fuller [1979] Distribution of the Estimators for Autoregressive Time Series with a Unit Root, Journal of the American Statistical Association, 74, pp. 427-431.

D. A. Dickey and W. A. Fuller [1981] The Likelihood Ratio Statistics for Autoregressive Time Series with a Unit Root, Econometrica, 49, pp. 1057-1072.
R. F. Engle [1984] Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, in Handbook of Econometrics, Vol. II, eds. Z. Griliches and M. D. Intriligator, Amsterdam: North-Holland.
R. Fildes [1979] Quantitative Forecasting - The State of the Art: Extrapolative Methods, Journal of the Operational Research Society, 30, pp. 691-710.
R. Fildes and S. Howell [1979] On Selecting a Forecasting Model, in Forecasting, Studies in the Management Sciences, Vol. 12, S. Makridakis and S. C. Wheelwright (eds.), Amsterdam: North-Holland.
A. D. Flowers and M. H. Finegan [1988] A Comparison of Microcomputer Forecasting Packages, ORSA/TIMS.

W. A. Fuller [1976] Introduction to Statistical Time Series, New York: Wiley.
E. S. Gardner, Jr. [1983] Automatic Monitoring of Forecast Errors, Journal of Forecasting.
E. S. Gardner, Jr. [1985] Exponential Smoothing: The State of the Art, Journal of Forecasting, 4, 1, pp. 1-38.

R. L. Goodrich [1989] Applied Statistical Forecasting, Belmont, MA: Business Forecast Systems, Inc.

C. W. J. Granger [1989] Combining Forecasts - Twenty Years Later, Journal of Forecasting, 8, pp. 167-173.
C. W. J. Granger and G. McCollister [1978] Comparison of Forecasts of Selected Series by Adaptive, Box-Jenkins, and State Space Methods, ORSA/TIMS Meeting, Los Angeles, Summer 1978.
C. W. J. Granger and P. Newbold [1977] Forecasting Economic Time Series, New York: Academic Press.

C. W. J. Granger and R. Ramanathan [1984] Improved Methods of Combining Forecasts, Journal of Forecasting, 3, pp. 197-204.
A. C. Harvey [1981] Time Series Models, Oxford: Philip Allan.
G. Hill and R. Fildes [1984] The Accuracy of Extrapolation Methods: An Automatic Box-Jenkins Package SIFT, Journal of Forecasting, 3, pp. 319-323.
C. C. Holt [1957] Forecasting Trends and Seasonals by Exponentially Weighted Moving Averages, O.N.R. Memorandum No. 52, Carnegie Institute of Technology (now Carnegie-Mellon University).

A. B. Koehler and B. L. Bowerman [1989] Prediction Intervals for ARIMA Models, Ninth International Symposium on Forecasting, Vancouver.
A. B. Koehler and E. S. Murphree [1986] A Comparison of the AIC and BIC on Empirical Data, Sixth International Symposium on Forecasting, Paris.

J. Ledolter and B. Abraham [1984] Some Comments on the Initialization of Exponential Smoothing, Journal of Forecasting, 3, pp. 79-84.
R. Lewandowski [1982] Sales Forecasting by FORSYS, Journal of Forecasting, 1, pp. 205-214.

G. Libert [1984] The M-Competition with a Fully Automatic Box-Jenkins Procedure, Journal of Forecasting, 3, pp. 325-328.
G. M. Ljung and G. E. P. Box [1978] On a Measure of Lack of Fit in Time Series Models, Biometrika, 65, pp. 297-303.

E. J. Lusk and J. S. Neves [1984] A Comparative ARIMA Analysis of the 111 Series of the Makridakis Competition, Journal of Forecasting, 3, pp. 329-332.


S. Makridakis and M. Hibon [1979] Accuracy of Forecasting: An Empirical Investigation (with discussion), Journal of the Royal Statistical Society, A 142, Part 2, pp. 97-145. Reprinted in Makridakis et al. [1982].

S. Makridakis and M. Hibon [1989] Exponential Smoothing: The Effect of Initial Values and Loss Functions on Post-sample Forecasting Accuracy, Ninth International Symposium on Forecasting, Vancouver.
S. Makridakis and M. Hibon [2000] The M3-Competition: Results, Conclusions and Implications, International Journal of Forecasting, 16, 4, pp. 451-476.
S. Makridakis and S. C. Wheelwright [1979] Interactive Forecasting, Second Edition, San Francisco: Holden-Day.
S. Makridakis, S. C. Wheelwright and R. J. Hyndman [1998] Forecasting Methods and Applications, Third Edition, New York: Wiley.
S. Makridakis et al. [1982] The Accuracy of Extrapolation (Time Series) Methods: Results of a Forecasting Competition, Journal of Forecasting, 2, pp. 111-153.
S. Makridakis et al. [1984] The Forecasting Accuracy of Major Time Series Methods, Chichester: Wiley.
E. McKenzie [1986] Error Analysis for Winters' Additive Seasonal Forecasting System, International Journal of Forecasting, 2, pp. 373-382.
D. C. Montgomery and L. A. Johnson [1976] Forecasting and Time Series Analysis, New York: McGraw-Hill.
J. S. Morrison [1995] Life-Cycle Approach to New Product Forecasting, The Journal of Business Forecasting, 14, 2, pp. 3-5.
H. L. Nelson and C. W. J. Granger [1979] Experience with Using the Box-Cox Transformation when Forecasting Economic Time Series, Journal of Econometrics, 10, pp. 57-69.
C. R. Nelson and H. Kang [1981] Spurious Periodicity in Inappropriately Detrended Time Series, Econometrica, 49, pp. 741-751.

C. R. Nelson and H. Kang [1983] Pitfalls in the Use of Time as an Explanatory Variable in Regression, Journal of Business and Economic Statistics, 2, pp. 73-82.
P. Newbold [1983] ARIMA Model Building and the Time Series Analysis Approach to Forecasting, Journal of Forecasting, 2, 1, pp. 23-35.
P. Newbold and T. Bos [1990] Introductory Business Forecasting, Cincinnati: South-Western.

A. Pankratz and U. Dudley [1987] Forecasts of Power-transformed Series, Journal of Forecasting, 6, pp. 239-248.
A. Pankratz [1991] Forecasting with Dynamic Regression Models, New York: Wiley.
R. Ramanathan [1992] Introductory Econometrics with Applications, Second Edition, Fort Worth: Harcourt Brace Jovanovich College Publishers.
G. Schwarz [1978] Estimating the Dimension of a Model, Annals of Statistics, 6, 2, pp. 461-464.
J. Shiskin, A. Young, and J. Musgrave [1967] The X-11 Variant of the Census Method II Seasonal Adjustment Program, Technical Paper No. 15, Bureau of the Census.

P. A. Texter and J. K. Ord [1989] Forecasting Using Automatic Identification Procedures, International Journal of Forecasting, 5, 2, pp. 209-215.
H. Theil [1971] Principles of Econometrics, New York: Wiley.
N. T. Thomopoulos [1980] Applied Forecasting Methods, Englewood Cliffs, NJ: Prentice-Hall.
S. C. Wheelwright and S. Makridakis [1985] Forecasting Methods for Management, New York: Wiley.
T. R. Willemain, C. N. Smart, J. H. Shockor and P. A. DeSautels [1994] Forecasting Intermittent Demand in Manufacturing: A Comparative Evaluation of Croston's Method, International Journal of Forecasting, 10, 4, pp. 529-538.
G. T. Wilson [1979] Some Efficient Computational Procedures for High Order ARMA Models, Journal of Statistical Computation and Simulation, 8, pp. 301-309.

P. R. Winters [1960] Forecasting Sales by Exponentially Weighted Moving Averages, Management Science, 6, pp. 324.
H. O. Wold [1938] A Study in the Analysis of Stationary Time Series, Uppsala: Almqvist and Wiksell.

M. Yar and C. Chatfield [1991] Prediction Intervals for Multiplicative Holt-Winters, International Journal of Forecasting, 7, 1, pp. 31-37.


G. U. Yule [1927] On a Method of Investigating Periodicities in Disturbed Series, with Special Reference to Wolfer's Sunspot Numbers, Phil. Trans., A226, pp. 267.

