Title
HYBRID TECHNIQUES FOR PERFORMANCE ENHANCEMENT OF DOCUMENT CLUSTERING
Publisher
AIN SHAMS UNIVERSITY. FACULTY OF COMPUTER & INFORMATION SCIENCES. Department of Information Systems
Author
Moussa, Sherin Mohamed Mahmoud
Publication Date
2003
Number of Pages
224 p.
Contents
Only 14 pages are available for public view

Abstract

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional, sparse vectors; a few thousand dimensions is typical. Practical approaches to clustering such document vectors use an iterative procedure (e.g., k-means or Expectation-Maximization) that is known to be especially sensitive to its initial starting conditions: the number of clusters k and the initial centroids.
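As a purely illustrative aside (not part of the thesis), the sketch below shows this sensitivity on a toy term-count matrix: a plain Lloyd's k-means, with its initial centroids drawn at random, can converge to different partitions and sum-of-squares values depending on the draw.

```python
import numpy as np

def kmeans(X, k, seed, iters=50):
    """Plain Lloyd's k-means with Euclidean distance.
    Illustrative only; initial centroids are documents drawn at random."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each document vector to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster goes empty
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, inertia

# Toy document-term matrix: rows are documents, columns are word counts.
X = np.array([[5, 0, 1, 0], [4, 1, 0, 0], [0, 0, 5, 4],
              [1, 0, 4, 5], [0, 5, 0, 1], [1, 4, 0, 0]], dtype=float)

# Different random initializations can yield different partitions and inertia,
# which is exactly the sensitivity the hybrid algorithm is designed to remove.
for seed in (0, 1):
    labels, inertia = kmeans(X, k=3, seed=seed)
    print(seed, labels, round(inertia, 2))
```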
In this thesis, we introduce “The Hybrid Clustering Algorithm”, which determines these initial conditions automatically based on the clustering accuracy required for the resulting clusters. The hybrid algorithm combines the agglomerative hierarchical approach with the k-means approach to produce k disjoint clusters. The textual, unstructured nature of documents makes this task considerably more difficult than for other kinds of data sets. The proposed algorithm adapts a sampling-based pruning strategy to simplify the hierarchical clustering. It can be applied to any text document data set whose entries can be embedded in a high-dimensional space under a Minkowski distance function (e.g., Euclidean or Manhattan), in which every document is a vector of real numbers.
We present the results of an experimental study of the proposed algorithm on a data set obtained from the National Science Foundation (NSF). The experiments study the effect of several parameters on the computed initial starting conditions and on the clustering accuracy of the resulting clusters with respect to the data set's taxonomy. For the initial conditions, the Manhattan distance function was found to give almost the same results as the Euclidean distance function; we therefore recommend the Manhattan distance function, since it has lower computational complexity. As for clustering accuracy, the Manhattan distance function is also recommended for its simplicity, since the range of accuracy obtained with the Euclidean distance function is essentially the same as with the Manhattan distance function, especially at high pruning thresholds.
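For a concrete picture of the hierarchical-then-k-means idea, the following is a minimal sketch under several assumptions, not the thesis's algorithm: it substitutes SciPy's average-linkage clustering for the sampling-based pruning strategy, and the sample size, dendrogram cut distance, word-count data, and function names are all invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def hybrid_init(X, sample_size, cut_distance, metric="cityblock", seed=0):
    """Estimate k and seed centroids from an agglomerative pass over a sample.
    metric='cityblock' is the Manhattan distance; 'euclidean' works the same way."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    Z = linkage(sample, method="average", metric=metric)        # bottom-up merges
    groups = fcluster(Z, t=cut_distance, criterion="distance")  # cut the dendrogram
    k = int(groups.max())
    centroids = np.array([sample[groups == j].mean(axis=0) for j in range(1, k + 1)])
    return k, centroids

def kmeans_refine(X, centroids, metric="cityblock", iters=20):
    """k-means pass over the full data set, seeded with the hierarchical centroids."""
    for _ in range(iters):
        labels = cdist(X, centroids, metric=metric).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(len(centroids))])
    return labels, centroids

# Synthetic word-count vectors drawn from three topic profiles (a stand-in for
# the NSF abstracts used in the thesis).
rng = np.random.default_rng(42)
profiles = [[6, 5, 0, 0, 1, 0], [0, 1, 6, 5, 0, 1], [1, 0, 0, 1, 5, 6]]
X = np.vstack([rng.poisson(p, size=(20, 6)) for p in profiles]).astype(float)

k, seeds = hybrid_init(X, sample_size=30, cut_distance=15.0)
labels, _ = kmeans_refine(X, seeds)
print("estimated k =", k)
```

Cutting the dendrogram at a distance threshold yields both k and the seed centroids in one step; passing metric='euclidean' instead of 'cityblock' exercises the other Minkowski case compared in the abstract.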