My general area of research is data and knowledge engineering, with a focus on developing effective and efficient data analysis techniques for emerging data intensive applications. Specifically, I am interested in researching and developing data mining algorithms from a data perspective. In the past several years, I have been studying, researching, and practicing various data analysis methods. A key problem that puzzled me is the complexity and diversity caused by the intrinsic characteristics of data. Indeed, real-world data can have varying levels of noise, distributions, densities, types, and dimensions to name a few. While an algorithm may be appropriate for one type of data, it may not be effective for another type. In fact, although it is important to develop novel data analysis techniques, it is also critical to provide a clear understanding of the strengths and weaknesses of these techniques with respect to different types of real-world data. In my view, the relationship between a data mining researcher and data can be analogous to the relationship between a medical doctor and patients. Data analysis techniques can be analogous to medical devices and medicines. It is easy to understand that different treatment methods should be used for different patients. The same is true for the use of data mining algorithms. In addition, my research reveals that it is also possible to exploit special characteristics of certain data distributions for developing suitable data analysis tools.
The above thoughts cast the light on the major body of my research work, which involves the following three aspects. First, we have been working on the problem of association analysis - the problem of efficiently finding groups of strongly-related data objects. A unique perspective of our work is to exploit traditional statistical correlation measures instead of using the association measures in the support-confidence framework. Also, a novel pruning strategy by exploiting data distributions was developed in this research. Second, we are working towards better understanding clustering algorithms and clustering validation measures from a data distribution perspective. Finally, for data sets with imbalanced class distributions, we have exploited local clustering techniques for decomposing complex concepts in large classes and developed the COG method for rare class analysis. A detailed discussion of these research work is included below.
A large body of association mining work in data mining was motivated by the difficulty of efficiently identifying the highly correlated objects in huge data volumes. This has led to the use of alternative interest measures, e.g. support and confidence, despite the lack of precise relationship between these new interest measures and statistical correlation measures. However, this approach is not effective for discovering potentially interesting patterns at low levels of support. Also, it tends to generate too many spurious patterns involving objects which are poorly correlated. There is a clear desire for characterizing the relationship between statistical correlation measures and new association measures towards the development of a science of data mining. Along this line, we provided a precise relation between Pearson's correlation coefficient and the support measure. We also proposed an efficient algorithm called TAPER to identify highly-correlated pairs of objects by contributing two algorithmic ideas: the monotonic upper bound of Pearson's correlation coefficient and novel pruning of candidates based on the ordering of object-pairs containing a common object. This work was published in IEEE Transactions on Knowledge and Data Engineering (TKDE) in 2006. In addition, we proposed an efficient algorithm, called TOP-COP, to find the top-k strongly correlated pairs. In this work, we developed a diagonal traversal strategy by exploiting a two-dimensional monotone property of an upper bound of Phi correlation coefficient. This work was published in INFORMS Journal on Computing (JOC) in 2008.
While TAPER and TOP-COP can efficiently identify highly correlated pair objects, they have difficulty in identifying highly-correlated objects beyond pairs. Therefore, we introduced a framework for mining hyperclique patterns, which contain objects that are highly affiliated with each other; that is, every pair of objects within a hyperclique pattern is guaranteed to have an uncentered Pearson's correlation coefficient above a certain level. In this framework, an objective measure called h-confidence is developed to find hyperclique patterns. An algorithm called a hyperclique miner is also proposed to exploit both cross-support and anti-monotone properties of the h-confidence measure for the efficient discovery of hyperclique patterns. A key observation in this work is that itemsets involving items with substantially different support levels are usually not the candidates for hyperclique patterns, since the pair wise correlations of items with substantially different support are extremely poor. Based on this observation, the concept of the cross-support property is formally defined. This property can help efficiently eliminate patterns involving items with substantially different support levels, and thus help to identify hyperclique patterns efficiently. This correlation computing technique is particularly effective for the Zipf-like data distribution. In the real world, Zipf-like distributions have been observed in a variety of application domains, including commercial retail data, Web click streams, and telecommunication data. This work was published in the journal Data Mining and Knowledge Discovery in 2006.
Finally, we have been making a significant effort on the development of an incremental solution for volatile correlation computation. Indeed, the sizes of real-world data sets are growing at an extraordinary rate, and these data are often dynamic and need to be continually updated. With such large and growing data sets, new research efforts are expected to develop an incremental solution, called CHECKPOINT, for volatile correlation computing. To that end, we proposed an innovative idea by setting a checkpoint to establish a computation buffer, which can be used to significantly reduce the correlation computing cost in dynamic data and has the advantage of compacting the use of memory space. To the best of our knowledge, this is the first real-time solution for volatile correlation computing. The result of this work has been accepted for publication as a regular paper in the proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008).
Clustering and Validation Measures
As a major effort towards understanding clustering algorithms from a data perspective, we presented an analysis of the strength and weakness of K-means algorithms with respect to different data distributions. Indeed, our results show that a skewed data distribution can make a significant impact on the performance of K-means as well as the effectiveness of some external clustering validation measures, such as Entropy and Variation of Information (VI). Also, our recent work shows that all proximity functions that fit K-means clustering can be generalized as K-means distance, which can be derived by a differentiable convex function. This is a big step towards the understanding of K-means algorithms from a data distribution perspective. Such an understanding is critical for guiding the practical use and further development of K-means clustering algorithms. These results have been published in the proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2006) and IEEE International Conference on Data Mining (ICDM-2007) respectively.
In addition, for the case that the KL-divergence is used as the proximity function for K-means, there is a remaining challenge to deal with high-dimensional sparse data. In fact, it is possible that the centroids contain many zero-value features for high-dimensional sparse data. This leads to infinite KL divergence values, which create a dilemma in assigning objects to the centroids during the iteration process of K-means. To solve this dilemma, we propose a Summation-based Incremental Learning (SAIL) method for INFO-K-means clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of the KL-divergence by the computation of the Shannon entropy. This can avoid the zero-value dilemma caused by the use of the KL-divergence. With SAIL as a booster, the clustering performance of K-means can be significantly improved. Also, SAIL leads to quick convergence and a robust clustering performance on high dimensional sparse data. This work has been accepted for publication in the proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008).
Clustering validation is a long standing challenge in the clustering literature. While many validation measures have been developed for evaluating the performance of clustering algorithms, these measures often provide inconsistent information about the clustering performance and the suitable measures to use in practice remain unknown. To that end, we provided an organized study of 16 external validation measures for K-means clustering. Specifically, we revealed the importance of measure normalization in the evaluation of the clustering performance on data with imbalanced class distributions. We also provided normalization solutions for several measures. In addition, we summarized the major properties of these external measures. These properties can serve as the guidance for the selection of validation measures in different application scenarios. Finally, we revealed the interrelationships among these measures. By mathematical transformation, we show that some validation measures are equivalent. Also, some measures have consistent validation performances. Most importantly, we provide a guide line to select the most suitable validation measures for K-means clustering. After carefully profiling these validation measures, we believe it is suitable to use the normalized van Dongen criterion (VDn) if we do not have any prior knowledge of data, since VDn has a simple computation form, satisfies mathematically sound properties, and can measure well on the data with imbalanced class distributions. However, for the case that the clustering performance is hard to distinguish, we may want to use the normalized Variation of Information VIn instead, since VIn has high sensitivity on detecting the clustering change.
Rare Class Analysis
Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attentions in the literature. However, rare class analysis remains a critical challenge, because there is no natural way for handling imbalanced class distributions. To fill this crucial void, we developed a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. Then, we apply traditional learning algorithms, such as Support Vector Machines (SVMs), for classification. Along this line, we explore key properties of local clustering for a better understanding of the effect of COG on rare class analysis. Indeed, COG produces significantly higher prediction accuracies on rare classes than state-of-the-art methods and the COG scheme can greatly improve the computational performance of SVMs. Furthermore, we found that COG can also improve the performances of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, as two case studies, we have applied COG for two real-world applications: credit card fraud detection and network intrusion detection. This work has been published in the proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007).