Title: Data Quality Enhancement Based on Dependability Rules

Title

Data Quality Enhancement Based on Dependability Rules

Prepared by

Researcher / أسماء سعد عبد الجواد عبده

Supervisor / حاتم محمد سيد أحمد

Examiner / أبوالعلا عطيفى حسنين

Supervisor / أشرف بهجات السيسى

Subject

Mathematical statistics - Data processing - Quality control. Data mining.

Publication date

2016.

Number of pages

92 p.

Language

English

Degree

Master's

Specialization

Information Systems

Date of approval

16/6/2016

Place of approval

Menoufia University - Faculty of Computers and Information - Department of Information Systems

Abstract

Today, data plays a major role in people’s daily activities, and its value depends on its quality. Business and scientific organizations need to assure the quality of their data to gain competitive advantage and to use that data more efficiently. In data quality research, enhancing data quality is still a big challenge, especially in large databases. Since databases often suffer from data inconsistency, these issues should be resolved in the data cleaning process. Data cleaning remains a long-standing research challenge in the database community. Unfortunately, little basic research directly related to error detection and data cleansing has been conducted in the information systems and computer science communities. The scope of this work is data preprocessing, i.e., the data cleaning process, for inconsistent databases.
Data mining techniques, which focus on analyzing and processing massive amounts of data into meaningful knowledge, can be reused efficiently in data cleaning to clean and correct erroneous data records, which improves data quality.
This thesis focuses on enhancing data quality in databases. It provides a systematic and comparative description of the research issues related to improving data quality. We make a comparative study of existing traditional data cleaning techniques to identify their advantages and drawbacks. Then, we propose three techniques to overcome the problems in current techniques.
The main contributions of this thesis are techniques for enhancing the data cleaning process. More specifically:
The first contribution is the ICCFD_Miner technique, which generates dependable data cleaning rules based mainly on frequent closed patterns.
The second contribution is the MICCFD_Miner technique, which improves the performance of ICCFD_Miner in generating data cleaning rules. The improvement comes from extracting maximal frequent patterns and their associated generators, instead of closed frequent patterns, as an effective search strategy that reduces the size of the search space.
The third contribution is the T_Repair algorithm, which takes the generated rules and an input tuple (t) of a relation, validates the data against the rules, and repairs the errors found in the given dataset.
The proposed techniques yield a promising method that aims to move the database to an optimal, consistent state and to overcome the drawbacks of traditional data cleaning techniques.
Healthcare is one of several information systems application domains that suffer from inconsistent and dirty data. Healthcare applications are critical, and managing healthcare environments efficiently improves patient care. Ensuring high-quality patient records is very important in healthcare data management because critical decisions are based on patient status as recorded in medical records. The proposed techniques are validated against several datasets from healthcare environments. They provide clinicians with an automated framework for enhancing the quality of electronic medical records. A variety of experiments are conducted to test the performance of the proposed techniques, including accuracy, response time, storage space, and database scalability.
Experimental results demonstrate that the proposed techniques achieve significant improvements over existing approaches in terms of efficiency, accuracy, and scalability on both real-life and synthetic datasets.
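
The abstract describes validating an input tuple against generated cleaning rules and repairing the errors found, but it does not give the rule format or the steps of T_Repair. The Python sketch below is therefore only an illustration of the general idea, not the thesis's algorithm. It assumes a hypothetical constant rule of the form "if these attributes hold these constants, then this attribute must hold that constant". The names `ConstantRule`, `applies`, `repair`, and `RULES`, as well as the sample attributes and values, are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstantRule:
    """Hypothetical rule format (an assumption, not taken from the thesis)."""
    condition: tuple  # ((attribute, constant), ...) that must all match
    attribute: str    # attribute constrained by the rule
    value: str        # constant the attribute must hold when the condition matches


def applies(rule, record):
    """True when every (attribute, constant) pair of the condition matches."""
    return all(record.get(attr) == const for attr, const in rule.condition)


def repair(record, rules):
    """Return a repaired copy of `record` plus the list of changes made.

    Single pass in rule order, with no conflict resolution between rules;
    a real cleaning algorithm would need to handle both.
    """
    fixed = dict(record)  # leave the caller's record untouched
    changes = []
    for rule in rules:
        if applies(rule, fixed) and fixed.get(rule.attribute) != rule.value:
            changes.append((rule.attribute, fixed.get(rule.attribute), rule.value))
            fixed[rule.attribute] = rule.value
    return fixed, changes


# Made-up rules over a made-up patient-record schema.
RULES = [
    ConstantRule((("ward", "W3"),), "department", "Cardiology"),
    ConstantRule((("department", "Cardiology"), ("shift", "night")), "on_call", "Dr. A"),
]

if __name__ == "__main__":
    dirty = {"ward": "W3", "department": "Oncology", "shift": "night", "on_call": "Dr. B"}
    fixed, changes = repair(dirty, RULES)
    print(fixed)
    print(changes)
```

In this sketch a violation is always resolved by overwriting the constrained attribute, which trusts the rule over the data. Whether that is appropriate depends on how dependable the mined rules are, which is the concern the first two contributions address.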