K-Means is a commonly used clustering technique. The analysis starts with a collection of samples and attempts to group them into ‘k’ clusters based on a chosen distance measure. The main steps of the K-Means clustering algorithm are given below.
1. The algorithm is initiated by creating ‘k’ clusters. The given sample set is first randomly distributed among these ‘k’ clusters.
2. Next, the distance between each sample and the centroid of its cluster is calculated.
3. Each sample is then moved to the cluster (k′) whose centroid is the shortest distance away from that sample.
As a first step of the cluster analysis, the user decides on the number of clusters ‘k’. This parameter takes integer values with a lower bound of 1 (in practice, 2 is the smallest relevant number of clusters) and an upper bound equal to the total number of samples.
The K-Means algorithm is repeated a number of times, each time starting from a different random set of initial clusters, and the best solution is retained.
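The steps above, including the random restarts, can be sketched as follows (a minimal illustration, not the product's implementation; the function name, Euclidean distance choice, and inertia-based selection of the best run are assumptions for the sketch):

```python
import math
import random

def kmeans(samples, k, n_restarts=5, n_iters=100, seed=0):
    """Basic K-Means with random restarts; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        # Start each restart from k randomly chosen samples as initial centroids.
        centroids = rng.sample(samples, k)
        for _ in range(n_iters):
            # Assign each sample to the cluster with the closest centroid.
            labels = [min(range(k), key=lambda j: math.dist(s, centroids[j]))
                      for s in samples]
            # Recompute each centroid as the mean of its members.
            new_centroids = []
            for j in range(k):
                members = [s for s, lab in zip(samples, labels) if lab == j]
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                    if members else centroids[j])
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
        # Keep the restart with the smallest total within-cluster squared distance.
        inertia = sum(math.dist(s, centroids[lab]) ** 2
                      for s, lab in zip(samples, labels))
        if best is None or inertia < best[2]:
            best = (labels, centroids, inertia)
    return best
```

On a toy data set with two well-separated groups, any of the random restarts recovers the natural split; the restarts matter on harder data, where a single run can get stuck in a poor local optimum.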
The following distance types can be used for clustering.
The Euclidean distance is the most usual, “natural” and intuitive way of computing a distance between two samples. It takes into account the difference between two samples directly, based on the magnitude of changes in the sample levels. This distance type is usually used for data sets that are suitably normalized or without any particular distribution problem.
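A minimal sketch of this distance in Python (the function name is illustrative):

```python
import math

def euclidean(a, b):
    """Straight-line ("as the crow flies") distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```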
The Manhattan distance, also known as city-block distance, is especially relevant for discrete data sets. While the Euclidean distance corresponds to the length of the shortest path between two samples (i.e. “as the crow flies”), the Manhattan distance is the sum of distances along each dimension (i.e. “walking round the block”).
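A corresponding sketch, summing the absolute differences along each dimension (function name illustrative):

```python
def manhattan(a, b):
    """City-block distance: sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7, illustrating the “crow flies” versus “round the block” contrast.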
This distance is based on the Pearson correlation coefficient, which is calculated from the sample values and their standard deviations. The correlation coefficient ‘r’ takes values from –1 (perfect negative correlation) to +1 (perfect positive correlation).
Effectively, the Pearson distance dp is computed as dp = 1 − r and lies between 0 (when the correlation coefficient is +1, i.e. the two samples are most similar) and 2 (when the correlation coefficient is −1).
Note that the data are centered by subtracting the mean, and scaled by dividing by the standard deviation.
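The definition dp = 1 − r, with r computed on the mean-centred, standard-deviation-scaled data, can be sketched as (function name illustrative; ties to constant vectors are not handled):

```python
import math

def pearson_distance(a, b):
    """Pearson distance dp = 1 - r, with r the Pearson correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    # Centre by the mean, scale by the standard deviation (as described above).
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1 - cov / sd
```

Perfectly correlated samples give a distance of 0 and perfectly anti-correlated samples a distance of 2, matching the bounds stated above.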
In this distance, the absolute value of the Pearson correlation coefficient is used; hence the corresponding distance lies between 0 and 1, just like the correlation coefficient. The equation for the absolute Pearson distance da is:
da = 1 − |r|
Taking the absolute value gives equal weight to positive and negative correlations, so anti-correlated samples will be clustered together.
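The same computation with the absolute value applied, da = 1 − |r| (function name illustrative):

```python
import math

def absolute_pearson_distance(a, b):
    """Absolute Pearson distance da = 1 - |r|; lies between 0 and 1."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1 - abs(cov / sd)
```

Both perfectly correlated and perfectly anti-correlated samples now get distance 0, which is why anti-correlated samples end up in the same cluster.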
This is the same as the Pearson distance, except that the sample means are set to zero in the expression for the un-centered correlation. The un-centered correlation coefficient lies between –1 and +1; hence the distance lies between 0 and 2.
This is the same as the absolute Pearson distance, except that the sample means are set to zero in the expression for the un-centered correlation. The absolute un-centered correlation coefficient lies between 0 and +1; hence the distance lies between 0 and 1.
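Setting the means to zero reduces the correlation to a normalized dot product; both un-centred variants can be sketched together (function names illustrative):

```python
import math

def uncentered_distance(a, b):
    """Un-centred Pearson distance: 1 - r_u, where r_u is the correlation
    with the sample means set to zero; lies between 0 and 2."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1 - num / den

def absolute_uncentered_distance(a, b):
    """Absolute un-centred distance: 1 - |r_u|; lies between 0 and 1."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1 - abs(num) / den
```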
This non-parametric distance measurement is particularly useful for data sets containing samples with large deviations (outliers), since it is less sensitive to them than the parametric distances above.
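Assuming the non-parametric measure referred to here is a rank-correlation (Spearman-style) distance — an assumption, as the text does not name it — it can be sketched by replacing each value with its rank and then applying the Pearson distance to the ranks (ties are not handled in this sketch):

```python
import math

def rank_correlation_distance(a, b):
    """Sketch of a Spearman-style distance: 1 - (Pearson r of the ranks).
    Using ranks instead of raw values blunts the influence of outliers."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sd = math.sqrt(sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb))
    return 1 - cov / sd
```

Note that an extreme outlier (e.g. 1000 in place of 40) does not change the ranks, so the distance is unaffected — the robustness property mentioned above.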