
# Multivariate Analysis [MVA] - Types of Methods

The Unscrambler® combines statistical analysis techniques and “multivariate mapping” for easier data interpretation. This means users can effortlessly perform powerful statistical analysis while easily communicating results to colleagues.

## The methods of analysis used by Unscrambler® include:

Principal Component Analysis (PCA)

PCA is a bilinear modeling method which gives an interpretable overview of the main information in a multidimensional data table.

The information carried by the original variables is projected onto a smaller number of underlying (“latent”) variables called principal components. The first principal component covers as much of the variation in the data as possible. The second principal component is orthogonal to the first and covers as much of the remaining variation as possible, and so on.
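As a rough illustration of this projection, the sketch below computes principal components with a singular value decomposition in NumPy. It is not The Unscrambler®'s implementation; the function name `pca` and the synthetic data are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x variables) onto its first principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                            # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # sample coordinates on the components
    loadings = Vt[:n_components].T                     # contribution of each variable
    explained = (s ** 2) / np.sum(s ** 2)              # share of total variation per component
    return scores, loadings, explained[:n_components]

# Synthetic table: three variables driven by one hidden factor plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=(50, 1))
X = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(50, 3))

scores, loadings, explained = pca(X, n_components=2)
print(explained)  # the first component should carry almost all of the variation
```

The loadings come out orthogonal to each other, which matches the description of each component covering variation the previous ones left unexplained.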

By plotting the principal components, one can view interrelationships between different variables, and detect and interpret sample patterns, groupings, similarities or differences.

Large data tables usually contain a large amount of information, which is partly hidden because the data are too complex to be easily interpreted.

Principal Component Analysis is a projection method that helps you visualize all the information contained in a data table.

PCA helps you find out in what respect one sample is different from another, which variables contribute most to this difference, and whether those variables contribute in the same way (i.e. are correlated) or independently from each other. It also enables you to detect sample patterns, like any particular grouping.

Finally, it quantifies the amount of useful information - as opposed to noise or meaningless variation - contained in the data.

It is important that you understand PCA, since it is a very useful method in itself, and forms the basis for several classification (SIMCA) and regression (PLS/PCR) methods.

Figure: The Unscrambler® running a PCA analysis. The PCA Overview displays 4 plots produced by the PCA analysis.

Regression Analysis

Regression is a generic term for all methods attempting to fit a model to observed data in order to quantify the relationship between two groups of variables. The fitted model may then be used either to merely describe the relationship between the two groups of variables, or to predict new values.

General Notation and Definitions

The two data matrices involved in regression are usually denoted X and Y, and the purpose of regression is to build a model Y = f(X). Such a model tries to explain, or predict, the variations in the Y-variable(s) from the variations in the X-variable(s). The link between X and Y is achieved through a common set of samples for which both X- and Y-values have been collected.

Names for X and Y

The X- and Y-variables can be denoted with a variety of terms, according to the particular context (or culture). The most common ones are listed in the table below:

Usual names for X- and Y-variables

| Context | X | Y |
|---|---|---|
| General | Predictors | Responses |
| Multiple Linear Regression (MLR) | Independent Variables | Dependent Variables |
| Designed Data | Factors, Design Variables | Responses |
| Spectroscopy | Spectra | Constituents |

Univariate vs. Multivariate Regression

Univariate regression uses a single predictor, which is often not sufficient to model a property precisely. Multivariate regression takes into account several predictive variables simultaneously, thus modeling the property of interest with more accuracy.

How and why to use Regression?

Building a regression model involves collecting predictor and response values for common samples, and then fitting a predefined mathematical relationship to the collected data.

For example, in analytical chemistry, spectroscopic measurements are made on solutions with known concentrations of a given compound. Regression is then used to relate concentration to spectrum.

Once you have built a regression model, you can predict the unknown concentration for new samples, using the spectroscopic measurements as predictors. The advantage is obvious if the concentration is difficult or expensive to measure directly.

More generally, classical indications for regression as a predictive tool could be the following:

• Whenever you wish to use cheap, easy-to-perform measurements as a substitute for more expensive or time-consuming ones
• When you want to build a response surface model from the results of an experimental design, i.e. to describe precisely the response levels according to the values of a few controlled factors

Multiple Linear Regression (MLR)

MLR is a method for relating the variations in a response variable (Y-variable) to the variations of several predictors (X-variables), with explanatory or predictive purposes.

An important assumption for the method is that the X-variables are linearly independent, i.e. that no linear relationship exists between the X-variables. When the X-variables carry common information, problems can arise due to exact or approximate collinearity.

Multiple Linear Regression (MLR) is a well-known statistical method based on ordinary least squares regression.

Solving the least squares problem involves a matrix inversion, which leads to collinearity problems if the variables are not linearly independent. Incidentally, this is the reason why the predictors are called independent variables in MLR; the ability to vary independently of each other is a crucial requirement for variables used as predictors with this method. MLR also requires more samples than predictors; otherwise the matrix cannot be inverted.

The Unscrambler® uses Singular Value Decomposition to find the MLR solution. No missing values are accepted.
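To make the collinearity issue concrete, here is a minimal NumPy sketch of an MLR fit. It relies on `np.linalg.lstsq`, which solves the least squares problem with an SVD-based routine; it is not claimed to match The Unscrambler®'s own code. The helper name `fit_mlr`, the tolerance and the data are assumptions for this example.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept (hypothetical helper).

    Returns the coefficients (intercept first) and a flag telling whether
    the predictors were found to be linearly dependent.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])           # add the intercept column
    # Singular values below rcond * (largest singular value) are treated as zero.
    coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=1e-10)
    collinear = bool(rank < A.shape[1])
    return coef, collinear

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]                 # noise-free response for clarity

coef, collinear = fit_mlr(X, y)

# Adding a predictor that is an exact multiple of another one breaks independence.
X_bad = np.column_stack([X, 2.0 * X[:, 0]])
_, collinear_bad = fit_mlr(X_bad, y)
print(coef, collinear, collinear_bad)
```

With independent predictors the coefficients are recovered; with the redundant column the design matrix loses rank, which is the situation the text warns about.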

Partial Least Squares Regression (PLSR)

Partial Least Squares Regression is a bilinear modeling method where information in the original X-data is projected onto a small number of underlying (“latent”) variables called PLS components. The Y-data are actively used in estimating the “latent” variables to ensure that the first components are those that are most relevant for predicting the Y-variables. Interpretation of the relationship between X-data and Y-data is then simplified as this relationship is concentrated on the smallest possible number of components.

By plotting the first PLS components one can view main associations between X-variables and Y-variables, and also interrelationships within X-data and within Y-data.

Figures: PLS2 Regression running on The Unscrambler®; a PLS2 Regression overview; Response Surface from PLS2 Regression.
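The way the Y-data steer the latent variables can be sketched for a single response (PLS1) with the classical NIPALS-style deflation scheme. This is a teaching sketch under that assumption, not the algorithm The Unscrambler® necessarily runs; `pls1_fit` and the data are made up for the example.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 regression; returns coefficients b and intercept b0 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                      # direction in X most covariant with y
        w /= np.linalg.norm(w)
        t = Xk @ w                         # scores of the new PLS component
        tt = t @ t
        p = Xk.T @ t / tt                  # X-loadings
        q_a = (yk @ t) / tt                # y-loading
        Xk = Xk - np.outer(t, p)           # remove what this component explains
        yk = yk - q_a * t
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # coefficients in terms of the original X
    return b, y_mean - x_mean @ b

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=30)            # correlated predictors
y = 3.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=30)

b1, b0_1 = pls1_fit(X, y, n_components=1)
b3, b0_3 = pls1_fit(X, y, n_components=3)
```

When as many components are kept as there are independent X-variables, the PLS1 coefficients coincide with the ordinary least squares solution; the interest of the method lies in stopping earlier.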

3-Way PLS Regression

3-Way PLS Regression is a method for relating the variations in one or several response variables (Y-variables) arranged in a 2-D table to the variations of several predictors arranged in a 3-D table (Primary and Secondary X-variables), with explanatory or predictive purposes.

Principal Component Regression (PCR)

PCR is a method for relating the variations in a response variable (Y-variable) to the variations of several predictors (X-variables), with explanatory or predictive purposes.

This method performs particularly well when the various X-variables express common information, i.e. when there is a large amount of correlation, or even collinearity.

Principal Component Regression is a two-step method. First, a Principal Component Analysis is carried out on the X-variables. The principal components are then used as predictors in a Multiple Linear Regression.
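The two steps can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions (centered data, a single response); `pcr_fit` and the collinear test data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal Component Regression; returns coefficients b and intercept b0."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Step 1: PCA on the centered X-variables.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Xc @ loadings
    # Step 2: MLR of the centered response on the principal component scores.
    q, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    b = loadings @ q                        # back to coefficients for the original X
    return b, y_mean - x_mean @ b

# Strongly collinear predictors: the third column is nearly the sum of the first two.
rng = np.random.default_rng(3)
t = rng.normal(size=(40, 2))
X = np.column_stack([t[:, 0], t[:, 1], t[:, 0] + t[:, 1]]) + 0.01 * rng.normal(size=(40, 3))
y = 2.0 * t[:, 0] - t[:, 1] + 0.01 * rng.normal(size=40)

b, b0 = pcr_fit(X, y, n_components=2)
y_hat = X @ b + b0
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```

Because the scores are uncorrelated by construction, the regression in step 2 is not affected by the collinearity present in the original X-variables.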

Classification

Contrary to regression, which predicts the values of one or several quantitative variables, classification is useful when the response is a category variable that can be interpreted in terms of several classes to which a sample may belong.

The main goal of classification is to reliably assign new samples to existing classes (in a given population).

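Note that classification is not the same as clustering: classification assigns samples to classes that are defined in advance, whereas clustering looks for groupings in the data without prior knowledge of any classes.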

You can also use classification results as a diagnostic tool:

1. To distinguish among the most important variables to keep in a model (variables that “characterize” the population)
2. Or to find outliers (samples that are not typical of the population)

Examples of classification problems are:

• Predicting whether a product meets quality requirements, where the result is simply “Yes” or “No” (i.e. binary response)
• Modeling various close species of plants or animals according to their easily observable characteristics, so as to be able to decide whether new individuals belong to one of the modeled species
• Modeling various diseases according to a set of easily observable symptoms, clinical signs or biological parameters, so as to help future diagnosis of those diseases

Types of Classification

The SIMCA classification is based on making a PCA model for each class in the training set. Unknown samples are then compared to the class models and assigned to classes, according to their analogy to the training samples.

Solving a classification problem requires two steps:

1. Modeling: Build one separate model for each class
2. Classifying New Samples: Fit each sample to each model and decide whether the sample belongs to the corresponding class

The modeling stage implies that you have identified enough samples as members of each class to be able to build a reliable model. It also requires enough variables to describe the samples accurately.

The actual classification stage uses significance tests, where the decisions are based on statistical tests performed on the object-to-model distances.
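The two stages can be sketched as follows. This is a deliberately simplified stand-in: a real SIMCA decision rests on a statistical test of the object-to-model distance, whereas this sketch accepts a sample when its distance is within a fixed multiple (`cutoff`, an arbitrary assumption) of the typical training distance. All function names and data are hypothetical.

```python
import numpy as np

def fit_class_model(X, n_components):
    """Stage 1: one PCA model per class (mean, loadings, typical residual distance)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    E = Xc - Xc @ P @ P.T                               # part of X the model leaves unexplained
    s0 = np.sqrt(np.mean(np.sum(E ** 2, axis=1)))       # typical training distance to the model
    return mean, P, s0

def distance_to_model(x, model):
    """Object-to-model distance: norm of the residual after projection."""
    mean, P, _ = model
    xc = np.asarray(x, dtype=float) - mean
    return np.linalg.norm(xc - xc @ P @ P.T)

def classify(x, models, cutoff=3.0):
    """Stage 2: list every class whose model the sample fits (may be none or several)."""
    return [name for name, m in models.items()
            if distance_to_model(x, m) <= cutoff * m[2]]

rng = np.random.default_rng(4)
XA = np.outer(rng.normal(size=40), [1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(40, 3))
XB = (np.outer(rng.normal(size=40), [0.0, 1.0, 0.0]) + np.array([0.0, 0.0, 5.0])
      + 0.05 * rng.normal(size=(40, 3)))
models = {"A": fit_class_model(XA, 1), "B": fit_class_model(XB, 1)}

print(classify([2.0, 0.0, 0.0], models))    # lies along class A's direction
print(classify([0.0, -1.0, 5.0], models))   # lies along class B's direction
print(classify([0.0, 0.0, 2.5], models))    # far from both class models
```

Since each class has its own model, a sample can be accepted by one class, by several, or by none, which is what makes the approach usable for outlier detection.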

PLS Discriminant Analysis (PLS-DA)

PLS-DA is a classification method based on modeling the differences between several classes with PLS.

If there are only two classes to separate, the PLS model uses one response variable, which codes for class membership as follows: -1 for members of one class, +1 for members of the other one. The PLS1 algorithm is then used.

If there are three classes or more, PLS2 is used, with one response variable (-1/+1 or 0/1, which is equivalent) coding for each class.
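The response coding described above can be sketched as a small helper that turns class labels into the Y-data handed to PLS1 or PLS2. The function name `encode_classes` and the example labels are assumptions; only the -1/+1 coding comes from the text.

```python
import numpy as np

def encode_classes(labels):
    """Code class membership as -1/+1 responses for PLS-DA (hypothetical helper).

    Two classes  -> one response vector (for PLS1).
    More classes -> one response column per class (for PLS2).
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)                          # sorted distinct class names
    if len(classes) == 2:
        return classes, np.where(labels == classes[1], 1.0, -1.0)
    Y = np.where(labels[:, None] == classes[None, :], 1.0, -1.0)
    return classes, Y

classes2, y2 = encode_classes(["good", "bad", "good"])
classes3, Y3 = encode_classes(["a", "b", "c", "a"])
print(classes2, y2)
print(classes3)
print(Y3)
```

In the multi-class case each row of the response table has exactly one +1, in the column of the class the sample belongs to.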

ANOVA

Analysis of variance (ANOVA) is based on breaking down the variations of a response into several parts that can be compared to each other for significance testing.

To test the significance of a given effect, you have to compare the variance of the response accounted for by the effect to the residual variance, which summarizes experimental error. If the “structured” variance (due to the effect) is no larger than the “random” variance (error), the effect can be considered negligible. If it is significantly larger than the error, it is regarded as significant.

In practice, this is achieved through a series of successive computations, with results traditionally displayed as a table. The elements listed hereafter define the columns of the ANOVA table, and there is one row for each source of variation:

1. First, several sources of variation are defined. For instance, if the purpose of the model is to study the main effects of all design variables, each design variable is a source of variation. Experimental error is also a source of variation.
2. Each source of variation has a limited number of independent ways to cause variation in the data. This number is called the number of degrees of freedom (DF).
3. Response variation associated with a specific source is measured by a sum of squares (SS).
4. Response variance associated with the same source is then computed by dividing the sum of squares by the number of degrees of freedom. This ratio is called the mean square (MS).
5. Once mean squares have been determined for all sources of variation, F-ratios associated with every tested effect are computed as the ratio of MS (effect) to MS (error). These ratios, which compare structured variance to residual variance, have a statistical distribution which is used for significance testing. The higher the ratio, the more important the effect.
6. Under the null hypothesis (i.e., that the true value of an effect is zero), the F-ratio follows a Fisher (F) distribution. This makes it possible to estimate the probability of getting such a high F-ratio under the null hypothesis. This probability is called the p-value; the smaller the p-value, the stronger the evidence that the observed effect is not due to chance. Usually, an effect is declared significant if the p-value is below 0.05 (significance at the 5% level). Other classical thresholds are 0.01 and 0.001.
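The sequence of computations above can be traced on a small example. The sketch below builds the DF, SS, MS and F-ratio entries for a single effect (one design variable at three levels) in plain Python; the function name and the data are invented for illustration. The p-value step is left out because it needs an F-distribution routine (for instance `scipy.stats.f.sf`, if SciPy is available).

```python
def one_way_anova(groups):
    """Build the ANOVA rows for one effect and the error (hypothetical helper).

    groups maps each level of the design variable to its list of responses.
    """
    values = [v for g in groups.values() for v in g]
    n_total, n_groups = len(values), len(groups)
    grand_mean = sum(values) / n_total

    # Sum of squares: variation between level means (effect) and within levels (error).
    ss_effect = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ss_error = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)

    # Degrees of freedom, then mean squares = SS / DF.
    df_effect, df_error = n_groups - 1, n_total - n_groups
    ms_effect, ms_error = ss_effect / df_effect, ss_error / df_error

    return {
        "effect": {"DF": df_effect, "SS": ss_effect, "MS": ms_effect, "F": ms_effect / ms_error},
        "error": {"DF": df_error, "SS": ss_error, "MS": ms_error},
    }

table = one_way_anova({
    "low": [4.0, 5.0, 6.0],
    "mid": [6.0, 7.0, 8.0],
    "high": [9.0, 10.0, 11.0],
})
for source, row in table.items():
    print(source, row)
```

For this data the effect has SS = 38 on 2 degrees of freedom and the error has SS = 6 on 6 degrees of freedom, giving an F-ratio of 19.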

The outlined sequence of computations applies to all cases of ANOVA. Those can be the following:

• Summary ANOVA: ANOVA on the global model. The purpose is to test the global significance of the whole model before studying the individual effects
• Linear ANOVA: Each main effect is studied separately
• Linear with Interactions ANOVA: Each main effect and each 2-factor interaction is studied separately
• Quadratic ANOVA: Each main effect, each 2-factor interaction and each quadratic effect is studied separately

ANOVA for Linear Response Surfaces

The ANOVA table for a linear response surface includes a few additional features compared to the ANOVA table for analysis of effects.

Two new columns are included in the main section showing the individual effects:

• b-coefficients: The values of the regression coefficients are displayed for each effect of the model
• Standard Error of the b-coefficients: Each regression coefficient is estimated with a certain precision, measured as a standard error

ANOVA for Quadratic Response Surfaces

The ANOVA table for a quadratic response surface includes one new column and one new section:

Min/Max/Saddle: Since the purpose of a quadratic model often is to find out where the optimum is, the minimum or maximum value inside the experimental range is computed, and the design variable values that produce this extreme are displayed as an additional column for the rows where linear effects are tested. Sometimes the extreme is a minimum in one direction of the surface, and a maximum in another direction; such a point is called a saddle point, and it is listed in the same column.

Model Check: This new section of the table checks the significance of the linear (main effects only) and quadratic (interactions and squares) parts of the model. If the quadratic part is not significant, the quadratic model is too sophisticated and you should try a linear model instead, which will describe your surface more economically and efficiently.

For linear models with interactions, the model check (linear only vs. interactions) is included, but not min/max/saddle.
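The idea behind the Min/Max/Saddle column can be sketched with a little linear algebra. Assume a fitted quadratic model written as y = b0 + b·x + xᵀBx, where B is symmetric, holds the squared-term coefficients on its diagonal and half of each interaction coefficient off the diagonal. This notation, the function name and the coefficients below are assumptions for the example, and the sketch does not check whether the point falls inside the experimental range.

```python
import numpy as np

def stationary_point(b_linear, B_quadratic):
    """Locate and classify the stationary point of y = b0 + b.x + x'Bx."""
    b = np.asarray(b_linear, dtype=float)
    B = np.asarray(B_quadratic, dtype=float)
    # Gradient b + 2Bx = 0 gives the stationary point (B must be non-singular).
    x_s = np.linalg.solve(2.0 * B, -b)
    eig = np.linalg.eigvalsh(B)              # curvature along the principal directions
    if np.all(eig > 0):
        kind = "minimum"
    elif np.all(eig < 0):
        kind = "maximum"
    else:
        kind = "saddle"                      # curvature changes sign between directions
    return x_s, kind

# y = 10 + 2*x1 + 4*x2 - x1^2 - 2*x2^2
x_max, kind_max = stationary_point([2.0, 4.0], [[-1.0, 0.0], [0.0, -2.0]])

# y = 10 + 2*x1 + 4*x2 - x1^2 + 2*x2^2
x_sad, kind_sad = stationary_point([2.0, 4.0], [[-1.0, 0.0], [0.0, 2.0]])

print(x_max, kind_max)
print(x_sad, kind_sad)
```

The first surface curves downward in every direction and has a maximum; the second curves down along x1 and up along x2, so its stationary point is a saddle.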