Title
Knowledge Mining Approach for Web Documents /
Author
Negm, Noha Eed Mohamed Abd El-Hammed.
Preparation committee
Researcher / Noha Eed Mohamed Abd El-Hammed Negm
Supervisor / محمد عبد البديع محمد سالم
Examiner / Roumen Kountchev
Examiner / محمد امين عبد الواحد
Subject
XML (Document markup language) Web sites- Design.
Publication date
2011.
Number of pages
212 p.
Language
English
Degree
Doctorate
Specialization
Computational Mathematics
Publisher
Date of approval
2/4/2014
Place of approval
Menoufia University - Faculty of Science - Department of Computer Science
Index
Only 14 pages are available for public view

Abstract

A knowledge mining approach that uses association rules to cluster web documents with high performance is presented. The approach was developed through three successive systems. The first system, CWDHFT, clusters web documents based on frequent termsets, which provides significant dimensionality reduction; it also improves the mining process and the execution time. The second system, ARWDC, extends CWDHFT to cluster web documents based on association rules, which have the important advantage of higher descriptive power in document clustering. A Weighted Score measure is introduced to improve the accuracy and efficiency of the clustering process by removing the overlap between partitions. Finally, the SKMC system combines domain knowledge with three features (term frequency, inverse document frequency, and association rules) to cluster medical documents more effectively. It allows overlapping clusters so that a document is not confined to a single cluster. Experimental results on three datasets (Reuters dataset, Reuters News, and MEDLINE abstracts) show that the MTHFT algorithm improves the mining process by reducing the scanning and computational cost for large document collections. In MTHFT, time is spent building a hash table only once. After the hash table is saved, no further time is spent generating new frequent termsets at a different minimum support threshold. The CWDHFT and ARWDC systems outperform the Bisecting K-means and FIHC algorithms in terms of accuracy. They achieve higher F-measure values because they use a better model for text documents. Cluster labels in ARWDC are more informative than in CWDHFT, since association rules are more useful than frequent termsets for constructing accurate descriptions of the clusters.
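The build-once, query-many idea described for the hash table can be illustrated with a minimal Python sketch. This is not the thesis's MTHFT algorithm, whose details are not given in the abstract. It assumes a plain dictionary that maps each termset to the set of documents containing it, and the names `build_termset_table`, `frequent_termsets`, `rule_confidence`, and `max_size` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_termset_table(docs, max_size=2):
    """Scan the documents once and map each termset (up to max_size terms)
    to the ids of the documents that contain it. Assumed layout, not MTHFT."""
    table = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        unique = sorted(set(terms))
        for size in range(1, max_size + 1):
            for termset in combinations(unique, size):
                table[termset].add(doc_id)
    return table


def frequent_termsets(table, n_docs, min_support):
    """Read frequent termsets from the saved table; the documents are not rescanned."""
    return {
        termset: len(ids) / n_docs
        for termset, ids in table.items()
        if len(ids) / n_docs >= min_support
    }


def rule_confidence(table, antecedent, consequent):
    """Confidence of the association rule antecedent -> consequent."""
    base = table.get(tuple(sorted(antecedent)), set())
    both = table.get(tuple(sorted(set(antecedent) | set(consequent))), set())
    return len(both) / len(base) if base else 0.0


# Toy corpus (illustrative data only).
docs = [
    ["web", "mining", "cluster"],
    ["web", "cluster"],
    ["web", "mining"],
    ["medical", "cluster"],
]
table = build_termset_table(docs)

# Two thresholds are answered from the same table.
high = frequent_termsets(table, len(docs), 0.75)
low = frequent_termsets(table, len(docs), 0.5)
conf = rule_confidence(table, ["web"], ["cluster"])
```

Enumerating every termset up front is only practical for small `max_size`; the sketch is meant to show why changing the minimum support threshold needs no new pass over the documents.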
Across the evaluations carried out, SKMC achieved higher F-measure values than the other systems for clustering documents in the biomedical domain. In addition, allowing overlap between partitions increased the ability of the clustering algorithm to place similar documents in the same partition and improved the performance of SKMC by 10%.
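The abstract does not state how the F-measure is computed. A common definition in the document clustering literature takes, for each reference class, the best F-score over all clusters and weights it by class size. The Python sketch below assumes that definition; `overall_f_measure` and the toy data are hypothetical. Clusters are represented as sets, so a document may belong to more than one cluster.

```python
def overall_f_measure(classes, clusters):
    """Class-size-weighted best-match F-measure.

    classes and clusters map a label to a set of document ids.
    Assumed definition; the thesis may use a different variant.
    """
    n_docs = sum(len(members) for members in classes.values())
    total = 0.0
    for members in classes.values():
        best = 0.0
        for cluster in clusters.values():
            overlap = len(members & cluster)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)
            recall = overlap / len(members)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(members) / n_docs * best
    return total


# Toy example (illustrative data only).
classes = {"a": {0, 1, 2}, "b": {3, 4}}
clusters = {"c1": {0, 1}, "c2": {2, 3, 4}}
score = overall_f_measure(classes, clusters)
perfect = overall_f_measure(classes, {"c1": {0, 1, 2}, "c2": {3, 4}})
```

A clustering that reproduces the reference classes exactly scores 1.0 under this definition, and lower values indicate that some class has no closely matching cluster.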