Author: Rashad, Shaymaa Mohammed. / Title: Mining Gene Expression Databases for Association Rules

Title

Mining Gene Expression Databases for Association Rules

Author

Rashad, Shaymaa Mohammed.

Subject

Database.

Number of pages

1 vol. (various pagings)

Index

Only 14 of 112 pages are available for public view.

Abstract

With the development of bioinformatics, many new research areas have emerged. One of the most important research problems is mining gene expression datasets. Mining such data helps in understanding the response to treatments and provides deep insight into the nature of many diseases, which can lead to new drugs. Association rule algorithms have proven to be the most suitable among mining techniques for analyzing gene expression data, owing to the nature of the data and the requirements of biologists. However, efficient mining of association rules from these data is nontrivial. The high dimensionality of these datasets poses a great challenge to the processing efficiency and memory requirements of most existing frequent itemset mining algorithms.
The aim of this study is to give biologists a software tool for exploring the causes and the different impacts of breast cancer. The proposed Breast Cancer Associations (BCA) system mines breast cancer data to find significant and important association rules. The data undergoes two processes in BCA. In the first, the input data is preprocessed in four steps: data filtration, normalization, discretization, and data adaptation. The second is the mining process, which uses a proposed algorithm called Row Intersection Support Starting (RISS). RISS uses a proposed bottom-up search strategy for row-enumeration-based mining: it starts the row enumeration tree from the minimum support threshold and uses a vertical data format called RowSet. The last stage in BCA produces the association rules based on a user-defined minimum confidence threshold. BCA is followed by a post-processing system (PPS), which filters important and interesting biological rules through a set of operators: rule, confidence, mixed, and experimental operators.
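The ideas named above (discretization, a vertical row-set layout, support by row intersection, and a minimum confidence threshold) can be illustrated with a minimal sketch. This is not the RISS algorithm or the BCA implementation, which the abstract does not specify in detail. The expression values, the threshold of 1.5, and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Toy expression matrix: rows are samples, columns are genes (values are invented).
EXPRESSION = {
    "s1": {"g1": 2.5, "g2": -1.8, "g3": 0.1},
    "s2": {"g1": 2.1, "g2": -2.2, "g3": 1.9},
    "s3": {"g1": 0.2, "g2": -1.6, "g3": 2.4},
    "s4": {"g1": 1.9, "g2": 0.3, "g3": 2.0},
}


def discretize(value, threshold=1.5):
    """Map an expression value to 'up', 'down', or None (no marked change).

    The symmetric threshold is an assumption, not taken from the thesis.
    """
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return None


def build_rowsets(matrix, threshold=1.5):
    """Vertical format: map each item (gene, state) to the set of rows containing it."""
    rowsets = {}
    for row_id, genes in matrix.items():
        for gene, value in genes.items():
            state = discretize(value, threshold)
            if state is not None:
                rowsets.setdefault((gene, state), set()).add(row_id)
    return rowsets


def support(itemset, rowsets):
    """Support count of a non-empty itemset: size of the intersection of its row sets."""
    rows = set.intersection(*(rowsets[item] for item in itemset))
    return len(rows)


def rules_from_pairs(rowsets, min_support=2, min_confidence=0.6):
    """Enumerate rules lhs -> rhs over item pairs that meet both thresholds."""
    rules = []
    for a, b in combinations(sorted(rowsets), 2):
        pair_support = len(rowsets[a] & rowsets[b])
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / len(rowsets[lhs])
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


rowsets = build_rowsets(EXPRESSION)
for lhs, rhs, sup, conf in rules_from_pairs(rowsets):
    print(f"{lhs} -> {rhs}  support={sup}  confidence={conf:.2f}")
```

Storing a row set per item means the support of any itemset is obtained by set intersection alone, without rescanning the matrix. This is one reason vertical formats are attractive when a dataset has few rows (samples) and very many columns (genes).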
A comparative experiment is conducted between RISS and dCHARM to compare the performance of RISS with that of a known algorithm. dCHARM uses the diffset data structure to discover closed frequent itemsets and is one of the best-known algorithms applied to gene expression data. RISS is found to outperform dCHARM in terms of time and search space. To build confidence in BCA, fifteen experiments were conducted with different parameters. These experiments are needed because there is no rule of thumb for choosing the parameter values.
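For readers unfamiliar with the diffset idea mentioned above, the following minimal sketch shows how support can be tracked through set differences instead of full transaction-id sets. It illustrates only the general diffset arithmetic, not dCHARM itself or its closed-itemset checks. The toy transaction-id sets and the function name are hypothetical.

```python
# Toy transaction-id sets (tidsets) for three items; the values are invented.
TIDSETS = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 5},
    "C": {2, 3, 4, 5},
}


def diffset(tidset_prefix, tidset_item):
    """d(PX) = t(P) - t(X): transactions that contain prefix P but not item X."""
    return tidset_prefix - tidset_item


# Level 1: diffsets relative to the prefix "A".
support_a = len(TIDSETS["A"])
d_ab = diffset(TIDSETS["A"], TIDSETS["B"])
d_ac = diffset(TIDSETS["A"], TIDSETS["C"])

# support(PX) = support(P) - |d(PX)|
support_ab = support_a - len(d_ab)
support_ac = support_a - len(d_ac)

# Level 2: d(PXY) = d(PY) - d(PX), so the full tidsets are no longer needed.
d_abc = d_ac - d_ab
support_abc = support_ab - len(d_abc)

# Cross-check against direct intersection of the tidsets.
assert support_ab == len(TIDSETS["A"] & TIDSETS["B"])
assert support_abc == len(TIDSETS["A"] & TIDSETS["B"] & TIDSETS["C"])
print(support_ab, support_ac, support_abc)
```

On dense data, diffsets are typically much smaller than the tidsets they replace, which is the usual motivation for using them.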
The results of the experiments are recorded and compared. They show that BCA is able to discover significant and important rules in a computationally efficient manner. The merits of BCA are that it reduces the amount of processed data, speeds up the mining process, focuses on certain data values, improves performance in terms of processing time, memory space, and search space, indicates the state of each gene, uses a simple data format, and uses simple and rational pruning techniques. The merits of PPS are that it reduces the number of rules to be examined and aids biologists in identifying unique associations.
The biological results comprise commonly known associations, rare ones, and new unique ones. The unique ones require further biological investigation.