How to Forecast using Regression Analysis
Introduction
Regression is the study of relationships among variables,
a principal purpose of which is to predict or estimate the value of one
variable from known or assumed values of other variables related to it.
Variables of Interest: To make predictions
or estimates we must identify the effective predictors of the variable
of interest: which variables are important indicators and can be measured
at the least cost, which carry only a little information, and which are
redundant.
Predicting the Future: Predicting a
change over time or extrapolating from present conditions to future conditions
is not the function of regression analysis. To make estimates of the future,
use time series analysis.
Experiment: Begin with a hypothesis
about how several variables might be related to another variable and the
form of the relationship.
Types of Analysis
Simple Linear Regression: A regression
using only one predictor is called a simple regression.
Multiple Regression: Where there are
two or more predictors, multiple regression analysis is employed.
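Both cases reduce to the same least-squares machinery. As a sketch in Python with NumPy (the data below are made up purely for illustration), a simple regression fits one predictor plus an intercept, and a multiple regression simply adds columns to the design matrix:

```python
import numpy as np

# Hypothetical sample: y measured alongside two candidate predictors x1, x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 4.9, 9.2, 10.8, 15.1, 16.9])

# Simple regression: one predictor (x1) plus an intercept column of ones.
X_simple = np.column_stack([np.ones_like(x1), x1])
coef_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression: both predictors plus the intercept column.
X_multi = np.column_stack([np.ones_like(x1), x1, x2])
coef_multi, *_ = np.linalg.lstsq(X_multi, y, rcond=None)

print("simple   a, b:", coef_simple)
print("multiple a, b1, b2:", coef_multi)
```

Moving from simple to multiple regression changes only the width of the design matrix, not the fitting procedure.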
Data: Since it is usually unrealistic
to obtain information on an entire population, a sample, which is a subset
of the population, is usually selected. The sample may be either randomly
selected, or a researcher may choose the x-values based on the capability
of the equipment utilized in the experiment or on the experimental design.
Where the x-values are preselected, usually only limited inferences can
be drawn, depending upon the particular values chosen. When both x and
y are randomly drawn, inferences can generally be drawn over the range
of values in the sample.
Scatter Diagram: A graphical representation
of the pairs of data called a scatter diagram can be drawn to gain an
overall view of the problem. Is there an apparent relationship? Direct?
Inverse? If the points lie within a band described by parallel lines we
can say there is a linear relationship between the pair of x and y values.
If the rate of change is generally not constant, then the relationship
is curvilinear.
The Model: If we have determined there
is a linear relationship between x and y, we want a linear equation stating
y as a function of x in the form Y = a + bX + e, where a is the intercept,
b is the slope, and e is the error term accounting for variables that affect
y but are not included as predictors, and/or otherwise unpredictable and
uncontrollable factors.
Least Squares Method: To predict the
mean y-value for a given x-value, we need a line which passes through
the mean value of both x and y and which minimizes the sum of the squared
distances between each of the points and the predictive line. Such an approach
should result in a line which we can call a "best fit" to the sample
data. The least squares method achieves this result by minimizing the
average squared deviation between the sample y points and the
estimated line. A procedure is used for finding the values of a and b
which reduces to the solution of simultaneous linear equations. Shortcut
formulas have been developed as an alternative to the solution of simultaneous
equations.
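The shortcut formulas can be sketched as follows in Python; the data are hypothetical and serve only to show the arithmetic:

```python
# Shortcut least-squares formulas for a simple regression Y = a + bX.
# Hypothetical data chosen only to illustrate the computation.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2);  a = mean(y) - b*mean(x)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(f"fitted line: y = {a:.3f} + {b:.3f} x")
```

Note that the fitted line passes through the point of means (mean x, mean y), exactly as the text requires.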
Solution Methods: Techniques of matrix
algebra can be manually employed to solve simultaneous linear equations.
When performing manual computations, this technique is especially useful
when there are more than two equations in two unknowns.
Several well-known computer packages are widely available and can be utilized
to relieve the user of the computational problem, all of which can be
used to solve both linear and polynomial equations: the BMD packages (Biomedical
Computer Programs) from UCLA; SPSS (Statistical Package for the Social
Sciences) developed by the University of Chicago; and SAS (Statistical
Analysis System). Another package that is also available is IMSL, the
International Mathematical and Statistical Libraries, which contains a
great variety of standard mathematical and statistical calculations. All
of these software packages use matrix algebra to solve simultaneous equations.
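A minimal sketch of that matrix-algebra approach, using NumPy and hypothetical data: the normal equations (X'X)b = X'y are solved directly for the intercept and slope.

```python
import numpy as np

# Solve the normal equations (X'X) beta = X'y with matrix algebra,
# as the statistical packages do internally. Hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

X = np.column_stack([np.ones_like(x), x])   # design matrix: [1, x]
beta = np.linalg.solve(X.T @ X, X.T @ y)    # beta = [a, b]
print("a, b =", beta)
```

The same call handles multiple regression unchanged; extra predictors just add columns to X and rows to the simultaneous equations.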
Use and Interpretation of the Regression Equation:
The equation developed can be used to predict an average y-value over the
range of the sample data. Such forecasts are good for short to medium
ranges; extrapolating far outside the sample range is risky.
Measuring Error in Estimations: The
scatter or variability about the mean value can be measured by calculating
the variance, the average squared deviation of the values around the mean.
The standard error of estimate is derived from this value by taking the
square root. This value is interpreted as the average amount that actual
values differ from the estimated mean.
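The calculation can be sketched as follows, assuming a line y = 0.09 + 1.97x already fitted by least squares to the hypothetical data shown:

```python
# Standard error of estimate for a fitted line y = a + b*x.
# Data and coefficients are hypothetical (from a least-squares fit).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = 0.09, 1.97
n = len(x)

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(r * r for r in residuals)   # error sum of squares
see = (sse / (n - 2)) ** 0.5          # n - 2: two coefficients estimated
print(f"standard error of estimate = {see:.4f}")
```

The divisor n - 2 reflects the two coefficients (a and b) estimated from the data; the result is read as the typical distance of an actual y from the estimated mean.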
Confidence Intervals: Interval estimates
can be calculated to obtain a measure of the confidence we have in our
estimates that a relationship exists. These calculations are made using
t-distribution tables. From these calculations we can derive confidence
bands, a pair of nonparallel curves narrowest at the mean values, which
express our confidence in varying degrees of the band of values surrounding
the regression equation.
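A sketch of a 95% interval for the mean y at a chosen x-value, with the critical value taken from a t-distribution table (3.182 for n - 2 = 3 degrees of freedom); the data and fitted values are hypothetical:

```python
import math

# 95% confidence interval for the mean y at x0, via a t-table value.
# Hypothetical data; a, b, and see come from a least-squares fit.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b, see = 0.09, 1.97, 0.1742
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

t = 3.182                      # t-table: 0.025 in each tail, 3 df
x0 = 4.0
y_hat = a + b * x0
half_width = t * see * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
print(f"mean y at x0={x0}: {y_hat:.2f} +/- {half_width:.3f}")
```

Because the half-width grows with (x0 - x_bar) squared, the limits trace out the nonparallel bands described above, narrowest at the mean of x.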
Assessment: How confident can we be
that a relationship actually exists? The strength of that relationship
can be assessed by statistical tests of hypotheses, such as the null
hypothesis that the slope is zero, which are carried out using t-distribution,
R-squared, and F-distribution tables. These calculations give rise to the
standard error of the regression coefficient, an estimate of the amount
that the regression coefficient b will vary from sample to sample of the
same size from the same population. An Analysis of Variance (ANOVA) table
can be generated which summarizes the different components of variation.
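A minimal sketch of the slope t-test and R-squared, again with hypothetical data and fitted coefficients; the t statistic would be compared against a t-table with n - 2 degrees of freedom:

```python
import math

# t statistic for the null hypothesis b = 0, plus R-squared.
# Hypothetical data; a, b, see come from a least-squares fit.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b, see = 0.09, 1.97, 0.1742
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
se_b = see / math.sqrt(sxx)   # standard error of the coefficient b
t_stat = b / se_b             # compare with t-table, n - 2 df

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst     # fraction of variation explained
print(f"t = {t_stat:.1f}, R^2 = {r_squared:.4f}")
```

The two sums of squares (error and total) are exactly the components an ANOVA table summarizes.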
When you want to compare models of different size (different
numbers of independent variables and/or different sample sizes) you must
use the Adjusted R-squared, because the usual R-squared tends to grow
with the number of independent variables.
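A minimal sketch of the standard adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - k - 1), with hypothetical values:

```python
# Adjusted R-squared penalizes extra predictors so that models of
# different size can be compared. Input values are hypothetical.
def adjusted_r_squared(r2, n, k):
    """n observations, k predictors (not counting the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A larger model with a slightly higher raw R-squared can still lose
# on the adjusted measure:
print(adjusted_r_squared(0.90, 20, 2))   # smaller model
print(adjusted_r_squared(0.91, 20, 6))   # larger model
```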
The Standard Error of Estimate (i.e., the square root of the error mean
square) is a good indicator of the "quality" of a prediction model, since
it "adjusts" the error sum of squares for the number of predictors in
the model. The error mean square (EMS) is computed as follows:
EMS = Error Sum of Squares / (N - Number of Linearly Independent
Predictors)
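A small worked example of this formula (all values hypothetical):

```python
# EMS = Error Sum of Squares / (N - number of linearly independent
# predictors), here with a fitted intercept and slope (p = 2).
sse = 0.091       # error sum of squares (hypothetical)
n = 5             # sample size
p = 2             # intercept and slope
ems = sse / (n - p)
see = ems ** 0.5  # standard error of estimate
print(ems, see)
```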
If one keeps adding useless predictors to a model, the
EMS will become less and less stable. R-squared is also influenced by
the range of your dependent variable, so if two models have the same residual
mean square but one model has a much narrower range of values for the
dependent variable, that model will have a lower R-squared. This difference
can be misleading, since both models will do equally well for prediction
purposes.
A considerable portion of the output of the computer
programs previously mentioned is devoted to a description of the tests
of significance of the regression.