Statistics and Data Mining
Statistics and Data Mining in the Analysis of Massive Data Sets
By James Kolsky, June 1997
Researchers are challenged with drawing meaningful conclusions from massive data sets in a timely manner. One catch-phrase being bandied about with some regularity is "Data Mining." In fact, several software packages now have "Data Mining" options, implying new methodologies that can be used to solve some of the problems associated with massive data sets. The phrase "data mining" has been applied to a host of procedures over the years, many with poor connotations, especially among statisticians. The current use of the phrase seems to have originated in the computer science field. A definition that has been repeated often is "a process that deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large data bases, particularly the discovery of optimal clusters and interesting irregularities." Not surprisingly, this sounds suspiciously like the role of statistical analysis. Indeed, many of these data mining techniques are similar to basic statistical methods of exploratory data analysis (EDA) and data visualization that have been used for years. More importantly, the statistical issues that have plagued statisticians and non-statisticians alike in their analysis efforts have not, in any way, been resolved by the use of Data Mining software. Consequently, Data Mining software, rather than being a panacea, may add little to the toolboxes of researchers who are already familiar with, and have access to, basic statistical tools.
So what are these issues? First and foremost, massive data sets are still collections of data, and it is important to understand how the data was collected. Any conclusions from the analysis will only be as good as the original data. Researchers saddled with "bad" data face a severe disadvantage and, in extreme cases, not even sophisticated statistical techniques can address the objectives to any degree of satisfaction. One cause of bad data is the poor specification of objectives. For instance, poorly worded or vague objectives can lead to the collection of data that answers the wrong question. Poor data collection methods can also create biases in the data or result in data that is not representative of the population that was to be sampled. Other problems include data that has been aggregated over important variables and data sets with large amounts of missing data. Most of these problems can be avoided by spending time, prior to the data collection stage, carefully outlining the objectives and performing short pilot studies. Pilot studies are good, cost-effective tools for identifying problems in protocols and design methodologies.
Extremely large data sets are usually quite complex, frequently containing scores of variables, many of which can only be described by non-linear relationships. Numerous variables may also interact with each other. These issues combine to make many statistical procedures, such as Analysis of Variance or regression analysis, difficult to use. Care must also be taken that data with many variables is not "over analyzed." No matter how large the data set is originally, if it is cut into enough segments, significant differences will be found between groups simply by chance.
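The "over analysis" problem can be made concrete with a small simulation. The sketch below is illustrative only: it assumes a 5% significance level, independent comparisons, and a large-sample z-test, none of which come from the article, and the function names are hypothetical. It draws pairs of groups from the same population, so every "significant" difference it reports is a false alarm.

```python
import random
import statistics
from math import erf, sqrt


def two_sided_p(sample_a, sample_b):
    """Approximate two-sided p-value from a large-sample z-test on two means."""
    standard_error = sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    z = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    # Standard normal CDF expressed through the error function.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - normal_cdf)


def family_wise_error_rate(alpha, n_comparisons):
    """Chance of at least one false alarm across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


def count_false_positives(n_segments=200, n_per_group=100, alpha=0.05, seed=1):
    """Compare pairs of groups drawn from the SAME population (assumed N(0, 1))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_segments):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(group_a, group_b) < alpha:
            hits += 1
    return hits


false_positives = count_false_positives()
print(false_positives, "of 200 comparisons look significant by chance alone")
print("Chance of at least one false alarm in 20 comparisons:",
      round(family_wise_error_rate(0.05, 20), 2))
```

Under these assumptions, about one comparison in twenty is flagged even though no real difference exists, and with only 20 segments the chance of at least one false alarm is already about 0.64. Lowering the per-comparison significance level as the number of segments grows is one common response.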
EDA and data visualization techniques, though largely descriptive, remain the primary starting points for identifying relationships in the data. Such techniques include box plots and histograms of individual variables and scatter plots of pairs of variables. These graphical representations can lead to a reduction in the number of variables that must be addressed by highlighting strong trends or patterns in the data. More graphical displays, however, need to be developed that can better describe higher dimensional patterns in the data.
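The numbers behind a box plot are easy to compute directly. The following is a minimal sketch, assuming Python 3.8 or later for `statistics.quantiles`; the function names and the 1.5 x IQR outlier rule (the usual box plot convention, not something the article specifies) are illustrative choices.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, median, q3, ordered[-1]


def box_plot_outliers(values):
    """Flag points beyond 1.5 interquartile ranges from the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Made-up values for illustration only.
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(five_number_summary(observations))
print(box_plot_outliers(observations))
```

Running such a summary on each variable before any modeling is a quick way to spot skewed distributions and suspicious values that deserve a closer look.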
Alternatively, multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables. By reducing the dimension of the data to a few clusters, it may be possible to use standard statistical tools for all subsequent analysis.
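As a rough illustration of grouping related variables, the sketch below links any two variables whose absolute Pearson correlation meets a threshold and treats each connected group as a cluster. This is a deliberately simple stand-in for a full cluster analysis, not the method of any particular package; the 0.8 threshold, the variable names, and the data are all made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (constant columns not handled)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def cluster_variables(columns, threshold=0.8):
    """Group variable names linked by |correlation| >= threshold (single linkage)."""
    names = list(columns)
    cluster_of = {name: {name} for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                merged = cluster_of[first] | cluster_of[second]
                # Point every member at the merged group so links are transitive.
                for member in merged:
                    cluster_of[member] = merged
    clusters = []
    for name in names:
        if cluster_of[name] not in clusters:
            clusters.append(cluster_of[name])
    return [sorted(cluster) for cluster in clusters]


# Hypothetical survey variables, five respondents each.
survey = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59, 63, 67, 71, 75],
    "income": [30, 90, 20, 70, 40],
    "spend": [28, 85, 25, 66, 45],
}
print(cluster_variables(survey))
```

Here four variables collapse into two groups, so later analysis could work with one representative per group. Correlation only captures linear association, which is a real limitation given the non-linear relationships common in large data sets.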
Constraints on software packages may prevent the standard analysis of data sets with massive numbers of observations. The amount of data to be analyzed can be reduced by sampling the data base. Software packages can then be used on the sampled data. Simple random samples, where each observation has the same probability of selection, are simple, commonly used plans. However, these assume that the data base to be sampled is homogeneous. If there are clusters of data within the data base, a simple random sample will not be an effective tool and subsequent conclusions may be biased. In these cases, other sampling plans should be examined.
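One alternative plan is a stratified sample, which samples each known group separately so that small groups cannot be missed. The sketch below contrasts the two plans on a made-up data base of 1,000 records; the segment names, the 2% sampling fraction, and the function names are assumptions for illustration.

```python
import random
from collections import defaultdict


def simple_random_sample(records, size, rng):
    """Every record has the same chance of selection."""
    return rng.sample(records, size)


def stratified_sample(records, key, fraction, rng):
    """Sample the same fraction from each group, with at least one record per group."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample


# Hypothetical data base: one large segment and one small one.
records = (
    [{"id": i, "segment": "retail"} for i in range(950)]
    + [{"id": 950 + i, "segment": "wholesale"} for i in range(50)]
)

rng = random.Random(7)
srs = simple_random_sample(records, 20, rng)
stratified = stratified_sample(records, lambda r: r["segment"], 0.02, rng)

print("wholesale records in simple random sample:",
      sum(r["segment"] == "wholesale" for r in srs))
print("wholesale records in stratified sample:",
      sum(r["segment"] == "wholesale" for r in stratified))
```

With these proportions, a simple random sample of 20 has roughly a one-in-three chance of containing no wholesale records at all, while the stratified plan always includes at least one. A stratified plan does require knowing the groups in advance, which is another reason to explore the data before sampling it.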
While large data sets introduce additional complications to their analysis, researchers should not disregard the basic statistical concepts that have served so well when analyzing smaller data sets. Data collection methods should reflect overall objectives, and initial analysis should consist of EDA and data visualization techniques. Once a thorough understanding of the data has been gained, more complicated methods, such as cluster analysis or data base sampling, can be attempted.
Reprinted with permission of the American Marketing Association (Marketing News: to be published)