In PLS, classification is performed with the SIMCA (Soft Independent Modeling of Class Analogy) approach, which identifies local models for possible groups and predicts a probable class membership for new observations. The approach first runs a global principal component analysis (PCA) or PLS regression (depending on the available data structure) on the whole dataset in order to identify groups of observations. Local models are then estimated for each class. Finally, new observations are assigned to one of the established class models on the basis of their best fit to the respective model.
This approach forces the composition of the classes to remain the one initially chosen on the basis of the global model, and it computes the distance of each observation from the model with respect to the explanatory variables. To compute the class membership probabilities, it refers to a distribution of this distance whose shape and degrees of freedom have not yet been fully clarified or demonstrated.
In SIMCA, a PCA is performed on each class in the data set, and a sufficient number of principal components is retained to account for most of the variation within each class. Hence, a principal component model represents each class in the data set. The number of principal components retained usually differs from class to class. Deciding how many principal components to retain for each class is important: retaining too few can distort the signal or information content that the model holds about the class, whereas retaining too many lowers the signal-to-noise ratio.
A procedure called cross-validation allows the model size to be determined directly from the data. To perform cross-validation, segments of the data are omitted during the PCA. Using one, two, three, etc., principal components, the omitted data are predicted and compared to the actual values. This procedure is repeated until every data element has been left out once. The principal component model that yields the minimum prediction error for the omitted data is retained. Hence, cross-validation can be used to find the number of principal components necessary to describe the signal in the data while ensuring a high signal-to-noise ratio by excluding the so-called secondary, or noise-laden, principal components from the class model. The variance explained by the class model is called the modeled variance and describes the signal, whereas the noise in the data is described by the residual variance, that is, the variance not accounted for by the model.
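The cross-validation idea can be illustrated with a minimal NumPy sketch. It is one possible scheme, not necessarily the one used by any particular SIMCA implementation: rows are left out in segments, and each variable of a left-out row is predicted from scores estimated without that variable. The function name `press_per_component`, the number of segments, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def press_per_component(X, max_components, n_segments=5):
    """Prediction error sum of squares (PRESS) for 1..max_components PCs."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    press = np.zeros(max_components)
    for held_out in np.array_split(np.arange(n), n_segments):
        train = np.delete(X, held_out, axis=0)
        mean = train.mean(axis=0)
        # Loadings come from the SVD of the centered training rows only
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        test = X[held_out] - mean
        for a in range(1, max_components + 1):
            P = Vt[:a].T                      # p x a loading matrix
            for j in range(p):
                keep = np.arange(p) != j
                # Estimate scores without variable j, then predict variable j
                T, *_ = np.linalg.lstsq(P[keep], test[:, keep].T, rcond=None)
                predicted = P[j] @ T
                press[a - 1] += np.sum((test[:, j] - predicted) ** 2)
    return press

# Synthetic class with two underlying factors plus noise (illustrative only)
rng = np.random.default_rng(0)
scores = rng.normal(size=(30, 2)) * [3.0, 1.0]
loadings = rng.normal(size=(2, 6))
X = scores @ loadings + 0.1 * rng.normal(size=(30, 6))

press = press_per_component(X, max_components=4)
n_retained = int(np.argmin(press)) + 1        # model size with the smallest PRESS
```

Leaving out one variable at a time matters here: if a left-out row were simply projected onto the loadings using all of its variables, the reconstruction error would keep shrinking as components are added and no minimum would appear.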
By comparing the residual variance of an unknown to the average residual variance of the samples that make up the class, it is possible to obtain a direct measure of the similarity of the unknown to the class. This comparison is also a measure of the goodness of fit of the sample to a particular principal component model.
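A minimal sketch of this comparison is given here, assuming the commonly quoted degrees-of-freedom convention in which the class residual variance is divided by (n - A - 1)(p - A) and the residual variance of an unknown by (p - A), where n is the number of class samples, p the number of variables, and A the number of retained components. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_class_model(X, n_components):
    """Fit a PCA class model and store its pooled residual variance."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                   # p x A loadings
    E = Xc - Xc @ P @ P.T                     # residuals of the class samples
    # Assumed convention: (n - A - 1)(p - A) degrees of freedom
    s0_sq = np.sum(E ** 2) / ((n - n_components - 1) * (p - n_components))
    return {"mean": mean, "loadings": P, "s0_sq": s0_sq}

def residual_variance_ratio(model, x):
    """Residual variance of one unknown divided by the class residual variance."""
    P = model["loadings"]
    p, a = P.shape
    d = np.asarray(x, dtype=float) - model["mean"]
    e = d - P @ (P.T @ d)                     # part of x the model cannot explain
    return (np.sum(e ** 2) / (p - a)) / model["s0_sq"]

# Synthetic class lying close to a line in three dimensions (illustrative only)
rng = np.random.default_rng(1)
direction = np.ones(3) / np.sqrt(3)
t = rng.normal(size=(20, 1))
X_class = t * direction + 0.05 * rng.normal(size=(20, 3))

model = fit_class_model(X_class, n_components=1)
near = residual_variance_ratio(model, model["mean"] + 0.5 * direction)
far = residual_variance_ratio(model, model["mean"] + np.array([1.0, -1.0, 0.0]))
```

A ratio well below 1 (as for `near`) indicates a sample that fits the class model at least as well as the training samples, while a large ratio (as for `far`) indicates a poor fit.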
An attractive feature of SIMCA is that it performs a principal component mapping of the data. Hence, samples that may be described by spectra or chromatograms are mapped onto a much lower dimensional subspace for classification. If a sample is similar to the other samples in the class, it will lie near them in the principal component map defined by the samples representing that class.
Another advantage of SIMCA is that an unknown is assigned only to a class for which it has a high probability of membership. If the residual variance of a sample exceeds the upper limit for every modeled class in the data set, the sample is not assigned to any of the classes, because it is either an outlier or comes from a class that is not represented in the data set.
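The assignment rule can be sketched in a few lines of plain Python. The function name `assign_classes`, the class labels, the ratio values, and the limit of 3.0 are all hypothetical; in practice the upper limit would come from whatever critical value the analyst adopts for the residual variance ratio.

```python
def assign_classes(ratios, limit):
    """Return the classes whose residual variance ratio is within the limit, best fit first."""
    accepted = [name for name, ratio in ratios.items() if ratio <= limit]
    return sorted(accepted, key=lambda name: ratios[name])

# Hypothetical ratios of one unknown against three class models
fits_one = assign_classes({"A": 0.8, "B": 12.5, "C": 40.1}, limit=3.0)

# The same rule when the unknown exceeds the limit for every class
fits_none = assign_classes({"A": 9.7, "B": 12.5, "C": 40.1}, limit=3.0)
```

An empty result corresponds to the case described above: the sample is left unassigned because it is an outlier or belongs to a class that was not modeled.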
Finally, SIMCA is sensitive to the quality of the data used to generate the principal component models. As a result, there are diagnostics to assess the quality of the data, such as the modeling power and the discriminatory power. The modeling power describes how well a variable helps the principal components to model variation, and discriminatory power describes how well the variable helps the principal components to classify the samples in the data set. Variables with low modeling power and low discriminatory power are usually deleted from the data because they contribute only noise to the principal component models.
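As an illustration of the first diagnostic, the sketch below computes a simplified modeling power for each variable as one minus the ratio of its residual standard deviation to its raw standard deviation. This is an assumption about the definition: degrees-of-freedom corrections used in some formulations are omitted, and the function name `modeling_power` and the synthetic data are hypothetical.

```python
import numpy as np

def modeling_power(X, n_components):
    """Simplified modeling power per variable: 1 - sqrt(residual SS / total SS)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    E = Xc - Xc @ P @ P.T                     # residuals after the PCA model
    return 1.0 - np.sqrt(np.sum(E ** 2, axis=0) / np.sum(Xc ** 2, axis=0))

# Two variables share one underlying factor; the third is pure noise (illustrative only)
rng = np.random.default_rng(2)
t = rng.normal(size=50)
X = np.column_stack([
    t + 0.05 * rng.normal(size=50),
    t + 0.05 * rng.normal(size=50),
    0.05 * rng.normal(size=50),
])

mp = modeling_power(X, n_components=1)
```

Under this definition a value near 1 marks a variable that the class model describes well, and a value near 0 marks a variable that contributes mostly noise, such as the third column here.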
SIMCA can work with as few as 10 samples per class, and there is no restriction on the number of measurement variables. This is an important consideration because the number of measurement variables often exceeds the number of samples in chemical studies. Most standard discrimination techniques would break down in these situations because of problems arising from collinearity and chance classification.