EMAN ABDU
Room 4421
Cluster analysis is an active area of research with applications in various fields including information retrieval, social sciences, bioinformatics, object recognition, and image segmentation (Jain et al. 1999). However, most algorithms are intended for numerical (continuous) data where proximity among data objects is naturally defined by virtue of their numerical properties. Although these algorithms can be used on categorical data, they are not designed to handle data properties typically found in this data type such as high dimensionality and lack of inherent relationships among attribute values. During the past decade, several algorithms have been designed for categorical data such as K-modes (Huang 1998), STIRR (Gibson et al, 1998), CACTUS (Ganti and Ramakrishnan, 1999), ROCK (Guha et al, 2000), COOLCAT (Barbara et al, 2002), LIMBO (Andritsos et al, 2004), CLICKS (Zaki et al, 2005, 2007), and others. These algorithms exploit inter or intra relationships among attributes through data summaries such as frequency, data values co-occurrences, information entropy and links among data items. In this proposal, we focus on using data summaries and spectral analysis to detect clustering structure in categorical data. Spectral techniques provide a relaxed solution to the discrete clustering problem which has been shown to be NP-hard (Drineas 2004). Formulating the clustering problem as a graph partitioning problem and then finding the minimum normalized cut leads to a solution based on eigenvectors of the similarity matrix (i.e. Laplacian matrix). Spectral methods have been used in various algorithms and have been shown to find non-linearly separable clusters. Equally important, spectral analysis encompasses techniques for handling high-dimensional data since input data is projected into a lower-dimensional space where all computation/comparisons can be performed. Our approach is to extend spectral techniques to data summaries which are relatively less expensive to compute than object similarity matrix for very large datasets. Our goal is to combine the benefits of spectral analysis with the relative low cost of computing data summaries. We have developed three algorithms that use spectral techniques and data summaries to cluster categorical data. Our test results on standard datasets show that our algorithms are competitive with current spectral and non-spectral algorithms for categorical data.
PROFESSOR BILAL KHAN, MENTOR, JOHN JAY OF CRIMINAL JUSTICE
PROFESSOR PING JI, JOHN JAY OF CRIMINAL JUSTICE
PROFESSOR SPIRIDON BAKIRAS, JOHN JAY OF CRIMINAL JUSTICE