Author: Aboul-Fetouh, Sara Mohamed Mosaad Mostafa./ Title: Improving The Efficiency of Implementing K-means Algorithm on Different Big Data Platforms /

Title

Improving The Efficiency of Implementing K-means Algorithm on Different Big Data Platforms /

Author

Aboul-Fetouh, Sara Mohamed Mosaad Mostafa.

Contributors

Researcher / Sara Mohamed Mosaad Mostafa Aboul-Fetouh

Supervisor / Yehia Mostafa Kamal Helmy

Supervisor / Manal Abdel-Kader Abdel-Fattah

Subject

Information systems. Computers.

Publication date

2020.

Number of pages

1 vol. (various paging)

Language

English

Degree

Doctorate

Specialization

Information Systems

Date of approval

5/2/2020

Place of approval

Helwan University - Faculty of Computers and Information - Information Systems

Only 14 pages are available for public view

Abstract

Research Objectives: Enhancing cluster stability by addressing the following issues:
•Deciding the number of clusters that leads to high stability.
•Initializing the cluster centroids in a rational way, so that they start near the right ones.
•Reducing the clustering runtime by saving many iterations on large datasets such as big data.
Research Methodology:
The research follows both qualitative and quantitative methodologies. Determining the optimal number of clusters beforehand enhances the quality of the formed clusters, and selecting the initial cluster centroids in a rational way reduces both the cluster building time and the number of iterations needed to reach convergence. Before implementing k-means, the methods of operation in custom code, Spark, and Flink were examined to show how the proposed modification overcomes the issues found and enhances the performance of k-means in different operating environments. Two datasets were used to test the performance and quality of both the standard and the proposed k-means algorithms. The two algorithms were compared in terms of the number of iterations needed to reach convergence, the execution runtime, and the Davies-Bouldin cluster validity index.
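The Davies-Bouldin index named above can be sketched in a few lines. This is a minimal illustration in plain Python, not the thesis's implementation: the function name `davies_bouldin` and the toy data in the usage note are assumptions, and the sketch uses the common Euclidean form of the index (mean distance to the centroid as the cluster scatter).

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard clustering; lower values mean
    more compact, better-separated clusters."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid and scatter (mean distance to the centroid) per cluster.
    centroids, scatter = {}, {}
    for label, members in clusters.items():
        centroid = tuple(sum(axis) / len(members) for axis in zip(*members))
        centroids[label] = centroid
        scatter[label] = sum(math.dist(m, centroid) for m in members) / len(members)

    # For each cluster, keep its worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters.
    # Note: two clusters with identical centroids would divide by zero.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

As a usage example, for the four points `(0, 0), (0, 2), (10, 0), (10, 2)`, the labeling `[0, 0, 1, 1]` groups nearby points and scores lower (better) than the labeling `[0, 1, 0, 1]`, which pairs distant points. Comparing the index across candidate values of k is one common way to judge a choice of cluster count.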
Research Scope:
The research scope covers enhancements to the K-means algorithm that address the stability issue, namely determining the number of clusters and selecting the initial centroids, along with minimizing the execution runtime by minimizing the number of iterations. It attempts to implement the proposed K-means algorithm on two big data platforms: Apache Spark, an open-source distributed general-purpose cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance; and Apache Flink, an open-source stream-processing framework whose core is a distributed streaming dataflow engine that executes arbitrary dataflow programs in a data-parallel and pipelined manner.
Results:
The proposed algorithm was compared with the clustering algorithms provided in the Spark MLlib library and Flink MLib. Experiments were run on the Fisher Iris dataset and the Mall Customer Analysis dataset to address two orthogonal domains: the number of clusters and the initialization scheme. First, the proposed algorithm was checked for determining the optimal number of clusters. Of the two datasets, one is not big enough to be considered big data, but it is a popular dataset with well-known results and was used to ensure that the proposed algorithm produces the right output; the other was considered big data. The data source in both cases was HDFS. The runtime of k-means is affected by many dimensions. To understand how they affect performance, the execution runtime of the proposed stable k-means algorithm and of the standard k-means in both MLlib and MLib was measured in seconds using getrusage(), along with the number of iterations each algorithm took to reach convergence as the number of clusters varied; these measures assess the quality of the centroid initialization. The Davies-Bouldin cluster validity index, in turn, was used to assess the quality of the number of clusters chosen.
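The two measurements described above, CPU time from getrusage() and iterations to convergence, can be illustrated with a small single-machine sketch. It assumes a Unix system, where Python's standard `resource` module wraps getrusage(). The helper names `cpu_seconds`, `kmeans`, and `timed_kmeans` are hypothetical, and the code is a plain Lloyd's k-means, not the thesis's proposed algorithm or the Spark and Flink implementations.

```python
import math
import resource  # Unix-only wrapper around getrusage()


def cpu_seconds():
    """User + system CPU time consumed by this process, in seconds."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's k-means; returns (centroids, labels, iterations used)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # converged: no centroid moved
            return centroids, labels, iteration
        centroids = new_centroids
    return centroids, labels, max_iter


def timed_kmeans(points, centroids):
    """Run k-means and report the CPU seconds it consumed."""
    start = cpu_seconds()
    result = kmeans(points, centroids)
    return result, cpu_seconds() - start
```

On the toy points `(0, 0), (0, 2), (10, 0), (10, 2)`, starting from centroids `(0, 1)` and `(10, 1)` converges in the first iteration, while starting from `(0, 0)` and `(0, 2)` needs a second one and settles on a worse partition. This small effect is the kind that the iteration counts in the experiments are meant to capture.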