How to Forecast using Regression Analysis

Introduction

Regression is the study of relationships among variables. A principal purpose is to predict, or estimate, the value of one variable from known or assumed values of other variables related to it.

Variables of Interest: To make predictions or estimates we must identify the effective predictors of the variable of interest: which variables are important indicators and can be measured at the least cost, which carry only a little information, and which are redundant.

Predicting the Future: Predicting a change over time or extrapolating from present conditions to future conditions is not the function of regression analysis. To make estimates of the future, use time series analysis.

Experiment: Begin with a hypothesis about how several variables might be related to another variable and the form of the relationship.

Types of Analysis

Simple Linear Regression: A regression using only one predictor is called a simple regression.

Multiple Regression: Where there are two or more predictors, multiple regression analysis is employed.

Data: Since it is usually unrealistic to obtain information on an entire population, a sample, which is a subset of the population, is usually selected. The sample may be randomly selected, or the researcher may choose the x-values based on the capability of the equipment used in the experiment or on the experimental design. Where the x-values are preselected, usually only limited inferences can be drawn, depending upon the particular values chosen. When both x and y are randomly drawn, inferences can generally be drawn over the range of values in the sample.

Scatter Diagram: A graphical representation of the pairs of data, called a scatter diagram, can be drawn to gain an overall view of the problem. Is there an apparent relationship? Direct? Inverse? If the points lie within a band described by parallel lines, we can say there is a linear relationship between x and y. If the rate of change is generally not constant, the relationship is curvilinear.
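
As an illustration, here is a minimal Python sketch that draws a scatter diagram with matplotlib; the (x, y) pairs are invented for illustration, not data from any real experiment:

    import matplotlib.pyplot as plt

    # Hypothetical sample of (x, y) pairs.
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]

    plt.scatter(x, y)                        # one point per observation
    plt.xlabel("x (predictor)")
    plt.ylabel("y (variable of interest)")
    plt.title("Scatter diagram")
    plt.show()

If the points hug a straight band, a linear model is a reasonable starting hypothesis; a bowed pattern suggests a curvilinear relationship.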

The Model: If we have determined there is a linear relationship between x and y, we want a linear equation stating y as a function of x in the form y = a + bx + e, where a is the intercept, b is the slope, and e is the error term accounting for variables that affect y but are not included as predictors, as well as otherwise unpredictable and uncontrollable factors.
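
To make the role of the error term concrete, the following toy simulation generates data from this model; the intercept, slope, and normal error distribution are all arbitrary choices for illustration:

    import random

    random.seed(1)
    a, b = 2.0, 1.5                      # intercept and slope, chosen arbitrarily
    # e ~ Normal(0, 1) stands in for omitted predictors and noise.
    data = [(x, a + b * x + random.gauss(0, 1.0)) for x in range(1, 11)]
    for x, y in data:
        print(f"x = {x}, y = {y:.2f}")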

Least Squares Method: To predict the mean y-value for a given x-value, we need a line which passes through the mean values of both x and y and which minimizes the sum of the squared vertical deviations between each of the points and the predictive line. Such an approach should result in a line which we can call a "best fit" to the sample data. The least squares method achieves this result by minimizing the average squared deviation between the sample y-values and the estimated line. The procedure used for finding the values of a and b reduces to the solution of simultaneous linear equations; shortcut formulas have been developed as an alternative to solving those equations directly.
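
A minimal sketch of the shortcut formulas in Python, using invented data: the slope is b = Sxy/Sxx and the intercept is a = y-bar - b*x-bar, which forces the line through the point of means:

    def least_squares(xs, ys):
        """Fit y = a + b*x by the shortcut least squares formulas."""
        n = len(xs)
        x_bar = sum(xs) / n
        y_bar = sum(ys) / n
        s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        s_xx = sum((x - x_bar) ** 2 for x in xs)
        b = s_xy / s_xx               # slope
        a = y_bar - b * x_bar         # intercept: line passes through (x_bar, y_bar)
        return a, b

    a, b = least_squares([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
    print(f"y = {a:.3f} + {b:.3f} x")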

Solution Methods: Techniques of matrix algebra can be employed manually to solve simultaneous linear equations. When performing manual computations, this technique is especially useful when there are more than two equations in two unknowns.
Several well-known computer packages are widely available and relieve the user of the computational burden; all of them can fit both linear and polynomial equations: the BMD packages (Biomedical Computer Programs) from UCLA; SPSS (Statistical Package for the Social Sciences), developed at the University of Chicago; and SAS (Statistical Analysis System). Another available package is IMSL (International Mathematical and Statistical Libraries), which contains a great variety of standard mathematical and statistical routines. All of these software packages use matrix algebra to solve the simultaneous equations.
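
To see what such packages do internally, here is a sketch in Python with NumPy (a modern stand-in for the libraries named above) that forms and solves the normal equations by matrix algebra for a hypothetical two-predictor model:

    import numpy as np

    # Design matrix: a column of ones (intercept) plus two predictors.
    X = np.array([[1, 2.0, 1.1],
                  [1, 3.0, 0.9],
                  [1, 5.0, 2.3],
                  [1, 7.0, 2.9],
                  [1, 9.0, 4.1]])
    y = np.array([5.1, 6.0, 9.8, 12.2, 15.3])

    # Normal equations (X'X) beta = X'y, solved by matrix algebra.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)    # [intercept, b1, b2]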

Use and Interpretation of the Regression Equation: The equation developed can be used to predict an average value over the range of the sample data. Estimates are most reliable for x-values within or close to that range; extrapolating well beyond it is not supported by the fitted model.
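
A small sketch of this caution, with purely illustrative coefficients and sample range:

    def predict_mean(a, b, x_new, x_min, x_max):
        """Estimate the mean y at x_new from the fitted line y = a + b*x."""
        if not (x_min <= x_new <= x_max):
            print("Warning: x_new is outside the sample range; extrapolating.")
        return a + b * x_new

    print(predict_mean(0.13, 2.0, x_new=3.5, x_min=1, x_max=5))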

Measuring Error in Estimations: The scatter or variability of the actual y-values about the estimated means given by the regression line can be measured by the average squared deviation of those values from the line. The standard error of estimate is derived from this quantity by taking the square root, and is interpreted as the typical amount by which actual values differ from the estimated mean.
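
A minimal sketch of the computation for simple regression, dividing by n - 2 because two coefficients (a and b) are estimated from the sample:

    import math

    def standard_error_of_estimate(xs, ys, a, b):
        """Typical deviation of actual y-values from the fitted line."""
        sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
        return math.sqrt(sse / (len(xs) - 2))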

Confidence Intervals: Interval estimates can be calculated to obtain a measure of the confidence we have in our estimates and in the existence of a relationship. These calculations are made using t-distribution tables. From them we can derive confidence bands: a pair of non-parallel curves, narrowest at the mean values, which express at varying levels of confidence the band of values surrounding the regression line.
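
A sketch of an interval for the mean response at a chosen x-value, assuming SciPy is available for the t-distribution; the (x0 - x_bar)^2 term is what makes the band narrowest at the mean of x:

    import math
    from scipy import stats

    def mean_response_ci(xs, ys, a, b, x0, level=0.95):
        """Confidence interval for the mean of y at x0."""
        n = len(xs)
        x_bar = sum(xs) / n
        s_xx = sum((x - x_bar) ** 2 for x in xs)
        see = math.sqrt(sum((y - (a + b * x)) ** 2
                            for x, y in zip(xs, ys)) / (n - 2))
        t = stats.t.ppf(1 - (1 - level) / 2, df=n - 2)
        half = t * see * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
        y_hat = a + b * x0
        return y_hat - half, y_hat + half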

Assessment: How confident can we be that a relationship actually exists? The strength of the relationship can be assessed by statistical tests of hypotheses, such as the test of the null hypothesis that the true slope is zero, carried out using t-distribution and F-distribution tables along with R-squared. These calculations give rise to the standard error of the regression coefficient, an estimate of the amount by which the coefficient b would vary from sample to sample of the same size drawn from the same population. An Analysis of Variance (ANOVA) table can be generated which summarizes the different components of variation.
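
As a sketch of one such test, the following function computes the standard error of the slope and the t-statistic for the null hypothesis that the true slope is zero (again assuming SciPy for the t-distribution):

    import math
    from scipy import stats

    def slope_t_test(xs, ys, a, b):
        """Test H0: slope = 0 against H1: slope != 0."""
        n = len(xs)
        x_bar = sum(xs) / n
        s_xx = sum((x - x_bar) ** 2 for x in xs)
        see = math.sqrt(sum((y - (a + b * x)) ** 2
                            for x, y in zip(xs, ys)) / (n - 2))
        se_b = see / math.sqrt(s_xx)   # standard error of the regression coefficient
        t_stat = b / se_b
        p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
        return t_stat, p_value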

When you want to compare models of different size (different numbers of independent variables and/or different sample sizes), you must use the adjusted R-squared, because the usual R-squared tends to grow with the number of independent variables.
The standard error of estimate (i.e., the square root of the error mean square) is a good indicator of the "quality" of a prediction model, since the error mean square (EMS) "adjusts" the error sum of squares for the number of predictors in the model as follows:

EMS = Error Sum of Squares/(N - Number of Linearly Independent Predictors)

If one keeps adding useless predictors to a model, the EMS will become less and less stable. R-squared is also influenced by the range of your dependent variable: if two models have the same residual mean square but one is fit to a much narrower range of values of the dependent variable, that model will show the lower R-squared, even though the two models do equally well for prediction purposes.
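
The following sketch computes both measures from a model's fitted values, using the usual adjustment 1 - (1 - R^2)(n - 1)/(n - k - 1) for a model with k predictors:

    def r_squared_measures(ys, y_hats, k):
        """Return (R-squared, adjusted R-squared) for a model with k predictors."""
        n = len(ys)
        y_bar = sum(ys) / n
        sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
        sst = sum((y - y_bar) ** 2 for y in ys)
        r2 = 1 - sse / sst
        # The adjustment penalizes extra predictors, making models of
        # different size comparable.
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
        return r2, adj_r2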

A considerable portion of the output of the computer programs previously mentioned is devoted to a description of the tests of significance of the regression.

