The Unscrambler® combines statistical analysis techniques and “multivariate mapping” for easier data interpretation. This means users can effortlessly perform powerful statistical analysis while easily communicating results to colleagues.
PCA is a bilinear modeling method which gives an interpretable overview of the main information in a multidimensional data table.
The information carried by the original variables is projected onto a smaller number of underlying (“latent”) variables called principal components. The first principal component covers as much of the variation in the data as possible. The second principal component is orthogonal to the first and covers as much of the remaining variation as possible, and so on.
By plotting the principal components, one can view interrelationships between different variables, and detect and interpret sample patterns, groupings, similarities or differences.
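As an illustration only, the projection described above can be sketched with a singular value decomposition of the mean-centered data table. The helper name `pca`, the simulated data, and the choice of NumPy are assumptions made for this example; they are not taken from The Unscrambler®.

```python
import numpy as np

def pca(X, n_components):
    """Hypothetical helper: PCA by SVD of the mean-centered data table."""
    Xc = X - X.mean(axis=0)                            # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # sample coordinates on the PCs
    loadings = Vt[:n_components].T                     # variable contributions to the PCs
    explained = s**2 / np.sum(s**2)                    # fraction of variance per PC
    return scores, loadings, explained[:n_components]

# Simulated table: 50 samples, 3 variables driven by one latent factor plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(50, 3))

scores, loadings, explained = pca(X, n_components=2)
print(explained)   # the first component should carry almost all of the variation
```

Plotting the columns of `scores` against each other gives a sample map, and plotting the columns of `loadings` gives the corresponding variable map.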
Large data tables usually contain a large amount of information, which is partly hidden because the data are too complex to be easily interpreted.
Principal Component Analysis is a projection method that helps you visualize all the information contained in a data table.
PCA helps you find out in what respect one sample is different from another, which variables contribute most to this difference, and whether those variables contribute in the same way (i.e. are correlated) or independently from each other. It also enables you to detect sample patterns, like any particular grouping.
Finally, it quantifies the amount of useful information (as opposed to noise or meaningless variation) contained in the data.
It is important that you understand PCA, since it is a very useful method in itself, and forms the basis for several classification (SIMCA) and regression (PLS/PCR) methods.
The Unscrambler® running a PCA analysis
The PCA Overview displaying 4 plots produced by the PCA Analysis.
Regression is a generic term for all methods attempting to fit a model to observed data in order to quantify the relationship between two groups of variables. The fitted model may then be used either to merely describe the relationship between the two groups of variables, or to predict new values.
General Notation and Definitions
The two data matrices involved in regression are usually denoted X and Y, and the purpose of regression is to build a model Y = f(X). Such a model tries to explain, or predict, the variations in the Y-variable(s) from the variations in the X-variable(s). The link between X and Y is achieved through a common set of samples for which both X- and Y-values have been collected.
Names for X and Y
The X- and Y-variables can be denoted with a variety of terms, according to the particular context (or culture). The most common ones are listed in the table below:
Usual names for X- and Y-variables
Context | X | Y |
General | Predictors | Responses |
Multiple Linear Regression (MLR) | Independent Variables | Dependent Variables |
Designed Data | Factors, Design Variables | Responses |
Spectroscopy | Spectra | Constituents |
Univariate regression uses a single predictor, which is often not sufficient to model a property precisely. Multivariate regression takes into account several predictive variables simultaneously, thus modeling the property of interest with more accuracy.
How and why to use Regression?
Building a regression model involves collecting predictor and response values for common samples, and then fitting a predefined mathematical relationship to the collected data.
For example, in analytical chemistry, spectroscopic measurements are made on solutions with known concentrations of a given compound. Regression is then used to relate concentration to spectrum.
Once you have built a regression model, you can predict the unknown concentration for new samples, using the spectroscopic measurements as predictors. The advantage is obvious if the concentration is difficult or expensive to measure directly.
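The calibrate-then-predict workflow described above might look like the following minimal sketch. All numbers are simulated, the variable names are hypothetical, and a plain least-squares fit stands in for whichever regression method is actually chosen.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated calibration set: 20 samples, 3 predictors, 1 known response
X_cal = rng.uniform(0.0, 1.0, size=(20, 3))
y_cal = 0.3 + X_cal @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=20)

# Fit y = b0 + X b by least squares (the column of ones models the offset b0)
A_cal = np.column_stack([np.ones(len(X_cal)), X_cal])
coef, *_ = np.linalg.lstsq(A_cal, y_cal, rcond=None)

# Predict the response for new samples from their predictor values alone
X_new = rng.uniform(0.0, 1.0, size=(5, 3))
y_pred = np.column_stack([np.ones(len(X_new)), X_new]) @ coef
print(y_pred)
```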
More generally, classical indications for regression as a predictive tool could be the following:
MLR is a method for relating the variations in a response variable (Y-variable) to the variations of several predictors (X-variables), with explanatory or predictive purposes.
An important assumption for the method is that the X-variables are linearly independent, i.e. that no linear relationship exists between the X-variables. When the X-variables carry common information, problems can arise due to exact or approximate collinearity.
Multiple Linear Regression (MLR) is a well-known statistical method based on ordinary least squares regression.
Estimating the regression coefficients involves a matrix inversion, which leads to collinearity problems if the variables are not linearly independent. Incidentally, this is the reason why the predictors are called independent variables in MLR; the ability to vary independently of each other is a crucial requirement for variables used as predictors with this method. MLR also requires more samples than predictors; otherwise the matrix cannot be inverted.
The Unscrambler® uses Singular Value Decomposition to find the MLR solution. No missing values are accepted.
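To make the link between SVD, least squares, and collinearity concrete, here is a small sketch. It is not The Unscrambler®'s implementation; the helper `mlr_svd` and the simulated data are assumptions for illustration. The ratio of the largest to the smallest singular value (the condition number) shows how close X is to being collinear.

```python
import numpy as np

def mlr_svd(X, y):
    """Hypothetical helper: least-squares coefficients from the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    b = Vt.T @ ((U.T @ y) / s)     # b = V diag(1/s) U^T y
    return b, s

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))                 # 30 samples, 3 independent predictors
y = X @ np.array([1.0, 2.0, 3.0])
b, s = mlr_svd(X, y)

# Nearly collinear case: the third predictor almost duplicates the first
X_bad = X.copy()
X_bad[:, 2] = X_bad[:, 0] + 1e-8 * rng.normal(size=30)
_, s_bad = mlr_svd(X_bad, y)

cond_good = s[0] / s[-1]           # small: predictors vary independently
cond_bad = s_bad[0] / s_bad[-1]    # huge: coefficients become unstable
print(b, cond_good, cond_bad)
```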
Partial Least Squares Regression is a bilinear modeling method where information in the original X-data is projected onto a small number of underlying (“latent”) variables called PLS components. The Y-data are actively used in estimating the “latent” variables to ensure that the first components are those that are most relevant for predicting the Y-variables. Interpretation of the relationship between X-data and Y-data is then simplified as this relationship is concentrated on the smallest possible number of components.
By plotting the first PLS components one can view main associations between X-variables and Y-variables, and also interrelationships within X-data and within Y-data.
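For a single response, the way Y is used when estimating the latent variables can be sketched as follows. This is a textbook-style PLS1 routine written for illustration under simulated data; the function name `pls1_fit` is hypothetical and the code is not claimed to match the algorithm used in The Unscrambler®.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Hypothetical helper: PLS1 regression (one response) with deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # weights: X-direction most covariant with y
        w /= np.linalg.norm(w)
        t = Xk @ w                     # scores of the samples on this component
        tt = t @ t
        p = Xk.T @ t / tt              # X-loadings
        c = yk @ t / tt                # y-loading
        Xk = Xk - np.outer(t, p)       # remove the modeled part from X
        yk = yk - c * t                # remove the modeled part from y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # coefficients in terms of the original X
    b0 = y_mean - x_mean @ b
    return b0, b

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = 1.5 + X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=40)
b0, b = pls1_fit(X, y, n_components=3)
print(b0, b)
```

With as many components as X-variables (and a full-rank X), the PLS1 coefficients coincide with the ordinary least-squares solution; using fewer components gives the compressed model that is usually of interest.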
PLS2 Regression running on The Unscrambler®
A PLS 2 Regression overview
Response Surface from PLS2 Regression
3-Way PLS Regression
A method for relating the variations in one or several response variables (Y-variables) arranged in a 2-D table to the variations of several predictors arranged in a 3-D table (Primary and Secondary X-variables), with explanatory or predictive purposes.
PCR is a method for relating the variations in a response variable (Y-variable) to the variations of several predictors (X-variables), with explanatory or predictive purposes.
This method performs particularly well when the various X-variables express common information, i.e. when there is a large amount of correlation, or even collinearity.
Principal Component Regression is a two-step method. First, a Principal Component Analysis is carried out on the X-variables. The principal components are then used as predictors in a Multiple Linear Regression.
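The two steps can be sketched as below, assuming NumPy and simulated, strongly correlated X-variables. The helper `pcr_fit` is a hypothetical name used only for this example.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Hypothetical helper: PCA on X, then least squares on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Step 1: PCA of the centered X-variables
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T            # loadings
    T = Xc @ P                         # scores, used as predictors
    # Step 2: MLR of y on the scores
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)
    b = P @ q                          # coefficients for the original X-variables
    b0 = y_mean - x_mean @ b
    return b0, b

# Six correlated X-variables generated from two latent factors plus small noise
rng = np.random.default_rng(4)
factors = rng.normal(size=(40, 2))
mixing = np.array([[1.0, 0.8, 0.6, 0.0, 0.2, 0.4],
                   [0.0, 0.3, 0.5, 1.0, 0.9, 0.7]])
X = factors @ mixing + 0.01 * rng.normal(size=(40, 6))
y = factors @ np.array([1.0, -2.0])

b0, b = pcr_fit(X, y, n_components=2)
rmse = np.sqrt(np.mean((b0 + X @ b - y) ** 2))
print(rmse)
```

Because the scores are mutually orthogonal, the regression in step 2 does not suffer from the collinearity that would affect MLR on the original variables.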
Classification
Contrary to regression, which predicts the values of one or several quantitative variables, classification is useful when the response is a category variable that can be interpreted in terms of several classes to which a sample may belong.
The main goal of classification is to reliably assign new samples to existing classes (in a given population).
Note that classification is not the same as clustering.
You can also use classification results as a diagnostic tool:
Examples of such situations are:
The SIMCA classification is based on making a PCA model for each class in the training set. Unknown samples are then compared to the class models and assigned to classes, according to their analogy to the training samples.
Solving a classification problem requires two steps:
The modeling stage implies that you have identified enough samples as members of each class to be able to build a reliable model. It also requires enough variables to describe the samples accurately.
The actual classification stage uses significance tests, where the decisions are based on statistical tests performed on the object-to-model distances.
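The idea of one PCA model per class and an object-to-model distance can be sketched as follows. Note the simplification: this example assigns a sample to the class whose model leaves the smallest residual, whereas the classification stage described above relies on significance tests on those distances. All function names and data are hypothetical.

```python
import numpy as np

def fit_class_model(X, n_components):
    """PCA model of one class: class mean plus loadings."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def residual_distance(x, model):
    """Distance from a sample to the subspace spanned by a class model."""
    mean, P = model
    xc = x - mean
    return np.linalg.norm(xc - P @ (P.T @ xc))

def classify(x, models):
    """Simplified rule: pick the class model with the smallest residual distance."""
    distances = {name: residual_distance(x, m) for name, m in models.items()}
    return min(distances, key=distances.get), distances

# Simulated training set: each class varies along its own direction
rng = np.random.default_rng(5)
class_a = np.outer(2 * rng.normal(size=20), [1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(20, 3))
class_b = np.array([5.0, 5.0, 5.0]) + np.outer(2 * rng.normal(size=20), [0.0, 1.0, 0.0]) \
          + 0.05 * rng.normal(size=(20, 3))
models = {"A": fit_class_model(class_a, 1), "B": fit_class_model(class_b, 1)}

label_1, _ = classify(np.array([2.0, 0.0, 0.0]), models)
label_2, _ = classify(np.array([5.0, 6.0, 5.0]), models)
print(label_1, label_2)
```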
Classification method based on modeling the differences between several classes with PLS.
If there are only two classes to separate, the PLS model uses one response variable, which codes for class membership as follows: -1 for members of one class, +1 for members of the other one. The PLS1 algorithm is then used.
If there are three classes or more, PLS2 is used, with one response variable (-1/+1 or 0/1, which is equivalent) coding for each class.
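The response coding described above can be written out as a short sketch. The function name `code_classes` is hypothetical, and the -1/+1 convention is used throughout (0/1 would work equally well, as noted above).

```python
import numpy as np

def code_classes(labels):
    """Hypothetical helper: build the PLS response(s) that code class membership."""
    labels = np.asarray(labels)
    classes = np.unique(labels)                  # sorted class names
    if len(classes) == 2:
        # Two classes: a single response, -1 for one class and +1 for the other (PLS1)
        return classes, np.where(labels == classes[1], 1.0, -1.0)
    # Three or more classes: one -1/+1 response column per class (PLS2)
    return classes, np.where(labels[:, None] == classes[None, :], 1.0, -1.0)

_, y_two = code_classes(["a", "b", "a"])
classes_three, Y_three = code_classes(["x", "y", "z", "x"])
print(y_two)
print(Y_three)
```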
Analysis of variance (ANOVA) is based on breaking down the variations of a response into several parts that can be compared to each other for significance testing.
To test the significance of a given effect, you have to compare the variance of the response accounted for by the effect to the residual variance, which summarizes experimental error. If the “structured” variance (due to the effect) is no larger than the “random” variance (error), the effect can be considered negligible. If it is significantly larger than the error, it is regarded as significant.
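As a worked illustration of comparing "structured" and "random" variance, the sketch below computes the usual textbook quantities (sums of squares, degrees of freedom, mean squares, and the F-ratio) for a single effect with three levels. The function name and the data are made up for this example, and the p-value lookup is left out.

```python
import numpy as np

def one_way_anova(groups):
    """Hypothetical helper: F-ratio of one effect against the residual variance."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation accounted for by the effect (between-group)
    ss_effect = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Residual variation, summarizing experimental error (within-group)
    ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_effect = len(groups) - 1
    df_error = len(all_values) - len(groups)
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return {"SS": (ss_effect, ss_error), "DF": (df_effect, df_error),
            "MS": (ms_effect, ms_error), "F": ms_effect / ms_error}

table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)   # SS = (26, 6), DF = (2, 6), MS = (13, 1), F = 13
```

An F-ratio close to 1 means the effect explains no more variation than the error does; a large F-ratio, judged against the F distribution with the listed degrees of freedom, indicates a significant effect.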
In practice, this is achieved through a series of successive computations, with results traditionally displayed as a table. The elements listed hereafter define the columns of the ANOVA table, and there is one row for each source of variation:
The outlined sequence of computations applies to all cases of ANOVA. Those can be the following:
The ANOVA table for a linear response surface includes a few additional features compared to the ANOVA table for analysis of effects.
Two new columns are included into the main section showing the individual effects:
The ANOVA table for a quadratic response surface includes one new column and one new section:
Min/Max/Saddle: Since the purpose of a quadratic model often is to find out where the optimum is, the minimum or maximum value inside the experimental range is computed, and the design variable values that produce this extreme are displayed as an additional column for the rows where linear effects are tested. Sometimes the extreme is a minimum in one direction of the surface, and a maximum in another direction; such a point is called a saddle point, and it is listed in the same column.
Model Check: This new section of the table checks the significance of the linear (main effects only) and quadratic (interactions and squares) parts of the model. If the quadratic part is not significant, the quadratic model is too sophisticated and you should try a linear model instead, which will describe your surface more economically and efficiently.
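How a Min/Max/Saddle entry could be derived from a fitted quadratic model is sketched below, assuming the model is written as y = b0 + b'x + x'Bx, where B is symmetric and holds the square coefficients on its diagonal and half of each interaction coefficient off the diagonal. The function name is hypothetical, and the check that the point lies inside the experimental range is omitted.

```python
import numpy as np

def stationary_point(b, B):
    """Hypothetical helper: locate and classify the stationary point of b0 + b'x + x'Bx."""
    b = np.asarray(b, dtype=float)
    B = np.asarray(B, dtype=float)
    x_star = np.linalg.solve(2.0 * B, -b)    # gradient b + 2Bx = 0
    eig = np.linalg.eigvalsh(B)              # curvature along the principal directions
    if np.all(eig > 0):
        kind = "minimum"
    elif np.all(eig < 0):
        kind = "maximum"
    else:
        kind = "saddle"                      # mixed signs (a singular B is not handled here)
    return x_star, kind

# y = 10 + 4*x1 + 2*x2 - 2*x1^2 - x2^2 has a maximum at (1, 1)
x_max, kind_max = stationary_point([4.0, 2.0], [[-2.0, 0.0], [0.0, -1.0]])
# y = x1^2 - x2^2 has a saddle point at the origin
x_sad, kind_sad = stationary_point([0.0, 0.0], [[1.0, 0.0], [0.0, -1.0]])
print(x_max, kind_max, x_sad, kind_sad)
```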