Overview of Research Projects
- Hyperclique Pattern Discovery (Data
Mining)
A hyperclique pattern is a type of association pattern containing objects that are highly affiliated with each other; that is, every pair of objects within a hyperclique pattern is
guaranteed to have the cosine similarity (uncentered correlation coefficient) above
certain level. Discovering such patterns is extremely useful for a variety of applications such as identifying low-support-high-confidence association rules, rare event detection, and protein functional module extraction. A key idea central to the discovery of
hyperclique patterns is the use of an association measure called h-confidence. The
h-confidence measure possesses an anti-monotonic property that allows us to
incorporate the measure directly into the mining process, rather than using it during
post-processing. Furthermore, the h-confidence measure has another important property, called the cross-support property, which allows dramatically pruning of the exponential
search space by eliminating patterns involving items from different levels of support. Due
to these two properties, hyperclique patterns can be found even when the support
threshold is set to zero. A summary of the preliminary work has been published in
the Third IEEE International Conference on Data Mining (ICDM 2003) and the extended
version of this work has been submitted to the journal - Data Mining and Knowledge
Discovery.
- Protein Functional Module Extraction
(Bioinformatics)
Proteins usually do not act isolated in a cell but function within complicated
cellular pathways, interacting with other proteins either in pairs or as components of larger
complexes. While many protein complexes have been identified by large-scale experimental studies, due to a
large number of false-positive interactions existing in current protein complexes, it is still difficult to obtain an accurate
understanding of functional modules, which encompass groups of proteins involved in common elementary
biological function. In this project, we present a hyperclique pattern discovery approach
for extracting functional modules (hyperclique patterns) from protein complexes. The
analysis of hyperclique patterns using the Gene Ontology suggests that proteins within the
same hyperclique pattern more likely perform the same function and participate in the
same biological process. More interestingly, the 3-D structural view of proteins within a
hyperclique pattern reveals that these proteins physically interact with each other. In
addition, we observe that several hyperclique patterns corresponding to different
functions can participate in the same protein complex as independent modules; and a
hyperclique pattern can be involved in different complexes performing different
higher-order biological functions, although the pattern corresponds to a specific
elementary biological function. Finally, the results also indicate that our method can
facilitate the functional annotation of uncharacterized proteins. (More detailed results are
available at the project web
site. A summary of the preliminary work has been accepted for
publication and oral presentation in the Pacific Symposium on Biocomputing
(PSB 2005).
- Spatial Co-location Patterns (Spatial Databases/Spatial Data
Mining)
Given a collection of Boolean spatial features, the co-location pattern
discovery process finds the subsets of features frequently located together.
Co-location patterns are important for many application domains involving large geographical datasets such as
location-based E-services, epidemiology, and NASA's climatologic project. With no
notion of transaction in continuous geographic space, traditional measures and mining
algorithms are difficulty to apply. Instead, we proposed the concept of user-specified
neighborhoods in place of transactions to specify groups of items. New interest
measures were also proposed which are robust in the face of potentially infinite
overlapping neighborhoods. The full version of this work has been accepted for publication in
the IEEE Transaction on Knowledge and Data Engineering (TKDE). In addition, we proposed a novel measure called the maximal participation index and
developed efficient algorithms to find high-confidence-low-prevalence co-location rules.
This work has been published in {\it the ACM Symposium on Applied Computing (ACM
SAC), 2003} and the extended version of this paper has been submitted to the GeoInformatica, An International Journal on Advances of Computer Science for Geographic Information
Systems. Finally, we are continuing working on this project and extending the concept of co-locations from point features to extended spatial objects (e.g. polygons and line strings). This work results in the paper "A Framework for
Discovering Co-location Patterns in Data Sets with Extended Spatial Objects", which has been published in
the fourth SIAM International Conference on Data Mining (SDM'04).
- Pattern Preserving Clustering (Data
Mining)
This project proposes a new approach for clustering---pattern preserving
clustering---which produces more easily interpretable and usable clusters. This
approach is motivated by the following observation: while there are usually strong patterns in the
data---patterns that may be key for the analysis and description of the data---these
patterns are often split among different clusters by current clustering approaches. This is,
perhaps, not surprising, since clustering algorithms have no built in knowledge of these
patterns and may often have goals that are in conflict with preserving patterns, e.g.,
minimize the distance of points to their nearest cluster centroid. Also, patterns are
typically overlapping, i.e., may involve some of the same objects, and if the
clustering algorithm produces disjoint clusters, then some patterns must be split when the objects
are clustered. In this project we describe a technique for pattern preserving clustering that
first finds patterns composed of tightly connected groups of objects or attributes and then, starting from these patterns,
performs agglomerative clustering using the Group Average (UPGMA) technique.
We present the results of some experiments on document data that compare our
approach, HIerarchical Clustering with PAttern Preservation (HICAP), to two
other clustering techniques: bisecting K-means and traditional UPGMA. These results show that, despite the extra constraint of pattern preservation, HICAP
has performance very much like traditional UPGMA with respect to the cluster
evaluation criteria of entropy and F-measure. More importantly, we also illustrate
how patterns, if preserved, can aid cluster interpretation. A summary of the
preliminary work has been published in the fourth SIAM International Conference on Data Mining (SDM
2004).
- Server Based Agent for Web Usage Mining (Web
Mining)
To perform web usage analysis, unique user sessions must be identified. There
are heuristics that can help identify user sessions on web logs from the traditional
web server log mechanism; however the identified user sessions using heuristic are
often incomplete and inaccurate. Instead, we designed and implemented a server-based
agent to capture user session explicitly at server end and construct a new web log,
which is more suitable for web usage mining tasks.
- Privacy Leakage in Databases via Semi-supervised Learning (Database
Security)
In multi-relational databases, a view, which is a context- and content-dependent subset of one or more tables (or other views), is often used to preserve privacy by hiding sensitive
information. However, recent developments in data mining present a new challenge for
database security even when traditional database security techniques, such as data
access control via a view, are employed. This project presents a data mining framework
using semi-supervised learning that demonstrates the potential for privacy leakage in
multi-relational databases. Many different types of semi-supervised learning techniques,
such as K-nearest neighbor (KNN) methods, can be used to demonstrate privacy
leakage. However, we also introduce a new approach to semi-supervised learning,
hyperclique pattern based semi-supervised learning (HPSL), which differs from
traditional semi-supervised learning approaches in that it considers the similarity
among groups of objects instead of only pairs of objects. Our experimental
results show that both the KNN and HPSL methods have the ability to compromise
database security, although HPSL is better at this privacy violation (has higher accuracy) than KNN methods.
- TAPER: All-Pairs Correlation Computing
(Databases/Data Mining/Statistical Computing)
With the wide spread use of statistical techniques for data analysis, it is expected that many such techniques will be made available in a database environment where users can apply the techniques more flexibly, efficiently, easily, and with minimal mathematical
assumptions. The motivation of this project is directed towards developing such
techniques. More specifically, given a user-specified minimum correlation threshold
q and
a market basket database with N items and T transactions, all-strong-pairs correlation query finds all item pairs with correlations above the threshold
q. However,
when the number of items and transactions are large, the computation cost of this query
can be very high. In this project, we identify an upper bound of Pearson's correlation
coefficient for binary variables. This upper bound is not only much cheaper to compute
than Pearson's correlation coefficient but also exhibits a special monotone property
which allows pruning of many item pairs even without computing their upper bounds. A
Two-step All-strong-Pairs corrElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost
model which shows that the computation savings from pruning is independent or
improves when the number of items is increased in data sets with common Zipf or linear
rank-support distributions. Experimental results from synthetic and real data sets exhibit
similar trends and show that TAPER can be several orders of magnitude faster than
brute-force alternatives. A summary of the preliminary work has been accepted for publication in
the Tenth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD
2004) and the extended version of this work has been submitted to IEEE Transactions on Knowledge and Data Engineering
(TKDE).
- Systematic Development and Exploration of Data Cleaning Techniques for Enhancing Data Analysis (Data
Mining/Bioinformatics)
Data cleaning is an essential tool for ensuring the reliable analysis of noisy data sets. Most existing data cleaning methods primarily focus on the detection and/or correction of low-level data errors that result from an imperfect data gathering process. However, the presence of data objects that are irrelevant or only weakly relevant can significantly hinder data analysis, and these objects should also be considered as noise. Thus, there is a need for a systematic development and evaluation of data cleaning techniques that also address data sets with irrelevant or weakly relevant objects. This is particularly important for data sets with large amounts of noise. To that end, this paper studies the performance of four techniques intended for data cleaning to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering based, and an approach based on the Local Outlier Factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). We also present a framework for measuring data cleaning performance in terms of its impact on the subsequent data analysis. We use this framework to evaluate the four techniques we described in terms of their impact on clustering and association analysis for document and gene expression data sets. Our experimental results show that using HCleaner generally provides better clustering performance and significantly higher quality association rules as compared to the outlier based data cleaning alternatives. The other techniques sometimes performed as well or slightly better for clustering, but their performance was not as consistent. For instance, the clustering based technique had good performance only when the number of clusters specified matched the actual number of classes in the data. However, this limitation significantly restricts the usefulness of this method. This work has been submitted to
IEEE Transactions on Knowledge and Data Engineering (TKDE), Special Issue on Intelligent Data
Preparation.
