Regression analysis

Find relationships for prediction.

Regression is a generic term for all methods attempting to fit a model to observed data in order to quantify the relationship between two groups of variables. The fitted model may then be used either to merely describe the relationship between the two groups of variables, or to predict new values.

Simple Linear Regression (SLR)

Simple linear regression is a method that enables you to determine the relationship between a continuous process output (Y) and one factor (X). The relationship is typically expressed as a mathematical equation such as Y = b + mX.

Suppose we believe that the value of y tends to increase or decrease in a linear manner as x increases. Then we could select a model relating y to x by drawing a line which is well fitted to a given data set. Such a deterministic model – one that does not allow for errors of prediction – might be adequate if all of the data points fell on the fitted line. However, you can see that this idealistic situation will not occur for the data of Tables 11.1 and 11.2. No matter how you draw a line through the points in Figure 11.2 and Figure 11.3, at least some of the points will deviate substantially from the fitted line.

The solution to the preceding problem is to construct a probabilistic model relating y to x – one that acknowledges the random variation of the data points about a line. One type of probabilistic model, the simple linear regression model, makes the assumption that the mean value of y for a given value of x graphs as a straight line, and that points deviate about this line of means by a random amount equal to e, i.e.

y = A + B x + e,

where A and B are unknown parameters of the deterministic (nonrandom) portion of the model.
If we suppose that the points deviate above or below the line of means with expected value E(e) = 0, then the mean value of y is

E(y) = A + B x.

Therefore, the mean value of y for a given value of x, represented by the symbol E(y), graphs as a straight line with y-intercept A and slope B.
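The least-squares estimates of A and B can be computed directly from the closed-form formulas. The data below are illustrative (hypothetical, not taken from the text's Table 11.1 or 11.2):

```python
import numpy as np

# Hypothetical data: y tends to increase linearly with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Least-squares estimates of slope B and intercept A in y = A + B*x:
#   B = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   A = y_bar - B * x_bar
x_bar, y_bar = x.mean(), y.mean()
B = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
A = y_bar - B * x_bar

print(A, B)  # fitted intercept and slope of the line of means
```

The fitted line A + B*x estimates the line of means E(y) = A + B x; individual observations still scatter about it by the random amount e.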

Multiple Linear Regression (MLR)

This procedure performs linear regression on the selected dataset. It fits a linear model of the form

Y = b0 + b1X1 + b2X2 + … + bkXk + e

where Y is the dependent variable (response), X1, X2, …, Xk are the independent variables (predictors), and e is a random error term. The coefficients b0, b1, b2, …, bk are known as the regression coefficients, and must be estimated from the data. The multiple linear regression algorithm in XLMiner chooses the regression coefficients so as to minimize the sum of squared differences between predicted and actual values.
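The same least-squares fit can be sketched in a few lines of numpy, independent of any particular tool. Here the data are hypothetical and constructed so that Y = 1 + X1 + 2*X2 exactly:

```python
import numpy as np

# Hypothetical data: two predictors X1, X2 and a response Y.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# Prepend a column of ones so the intercept b0 is estimated too.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: choose b to minimize ||Y - X_design @ b||^2.
b, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
print(b)  # [b0, b1, b2] -> approximately [1, 1, 2]
```

Because the response was generated without noise, the recovered coefficients match the generating model; with real data there would be a nonzero residual.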

Linear regression is performed either to predict the response variable based on the predictor variables, or to study the relationship between the response variable and the predictor variables. For example, using linear regression, the crime rate of a state can be explained as a function of other demographic factors such as population, education, and male-to-female ratio.

Linear Regression Model

Linear regression is a statistical procedure for predicting the value of a dependent variable from an independent variable when the relationship between the variables can be described with a linear model.

A linear regression equation can be written as Yp= mX + b, where Yp is the predicted value of the dependent variable, m is the slope of the regression line, and b is the Y-intercept of the regression line.

In statistics, linear regression is a method of estimating the conditional expected value of one variable y given the values of some other variable or variables x. The variable of interest, y, is conventionally called the “dependent variable”. The terms “endogenous variable” and “output variable” are also used. The other variables x are called the “independent variables”. The terms “exogenous variables” and “input variables” are also used. The dependent and independent variables may be scalars or vectors. If the independent variable is a vector, one speaks of multiple linear regression.

Statement of the linear regression model

A linear regression model is typically stated in the form y = α + βx + ε.

The right-hand side may take other forms, but generally comprises a linear combination of the parameters, here denoted α and β. The term ε represents the unpredicted or unexplained variation in the dependent variable; it is conventionally called the “error” whether it is really a measurement error or not. The error term is conventionally assumed to have expected value equal to zero, since a nonzero expected value could be absorbed into α. Note the distinction between errors and residuals: an error is the deviation of an observation from the unobservable true line, while a residual is its deviation from the fitted line. It is also assumed that ε is independent of x.
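The error/residual distinction can be made concrete: residuals come from the fitted line, and when the model includes an intercept, least-squares residuals sum to exactly zero (up to rounding), whereas the unobservable errors ε only have expected value zero. A small sketch with simulated data:

```python
import numpy as np

# Simulate data from a known line y = 2 + 0.5*x plus random errors.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)

# Fit y = alpha + beta*x by least squares.
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

# Residuals: deviations from the *fitted* line. With an intercept in the
# model they sum to zero by construction, unlike the true errors.
residuals = y - (alpha + beta * x)
print(residuals.sum())  # ~0, up to floating-point rounding
```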

A useful alternative to linear regression is robust regression, in which mean absolute error is minimized instead of the mean squared error minimized by ordinary linear regression. Robust regression is computationally much more intensive than linear regression and is somewhat more difficult to implement as well.

The term “robust regression” is also commonly used for linear regression with robust (Huber–White) standard errors, i.e. relaxing the assumption of homoskedasticity.
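The first sense of robust regression, minimizing absolute rather than squared error, can be sketched with iteratively reweighted least squares: reweighting each squared residual by 1/|residual| makes the weighted squared-error objective mimic absolute error. The data below are hypothetical, with one gross outlier to show why the absolute-error fit is more robust:

```python
import numpy as np

# Hypothetical data: points near y = x, plus one gross outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 4.1, 5.0, 20.0])  # last point is an outlier

X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares (minimizes mean squared error).
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Least absolute deviations via iteratively reweighted least squares:
# each pass solves a weighted least-squares problem with weights 1/|r_i|.
b = b_ols.copy()
for _ in range(100):
    r = y - X @ b
    w = 1.0 / np.maximum(np.abs(r), 1e-8)  # clamp to avoid division by zero
    sw = np.sqrt(w)
    b, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)

print(b_ols)  # OLS slope pulled up by the outlier
print(b)      # absolute-error slope stays close to the true slope ~1
```

The squared-error fit is dragged toward the outlier because large residuals are penalized quadratically; the absolute-error fit penalizes them only linearly.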

An equivalent formulation, which explicitly shows the linear regression as a model of conditional expectation, is E(y | x) = α + βx, with the conditional distribution of y given x essentially the same as the distribution of the error term. A linear regression model need not be affine, let alone linear, in the independent variables x.
