Abstract Molecular biology provides much of the available biological information, obtained by analyzing biochemical molecules: deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and proteins. This information helps molecular biologists solve many problems, but extracting it is not easy. Even when the information is available, its quantity and complexity make it difficult for biologists to derive useful knowledge from it. For this reason, biology, informatics, and mathematics have been merged into a new discipline, bioinformatics, for managing, analyzing, and manipulating biological data. One of the main tasks of bioinformatics is the computational analysis of sequence databases. Biological sequence classification is extremely useful for many problems in bioinformatics, based on the principle that sequences with similar structures also have similar functions. Several well-studied, more specific problems exist. One example is the taxonomic problem, in which the phylogenetic group of an organism is determined from some of its biological sequences; this problem is the main concern of this thesis. Three families of taxonomy methods are investigated: alignment-based methods, alignment-free methods, and machine learning methods, especially deep learning (DL). DL methods have achieved excellent results on a variety of problems in different fields, especially in big data processing, and have recently been applied in bioinformatics. We are mainly concerned with DL methods for classifying bacterial DNA sequences into taxonomic ranks. This thesis aims to build a classification model based on the bacterial 16S rRNA (ribosomal RNA) sequence that classifies bacteria by phylum, class, order, family, and genus.
The performance of different DL models is evaluated in terms of accuracy and F1 score. Three DL-based methods for bacterial classification are proposed and implemented to improve classification accuracy. The first is based on convolutional neural networks (CNNs) with different downsampling schemes. The proposed downsampling is an improved form of the existing pooling layer, intended to yield better classification accuracy; two-dimensional discrete transforms (2D DTs) and two-dimensional random projection (2D RP) are applied for downsampling. The second method depends on bidirectional long short-term memory (BLSTM). The choice of suitable pre-processing stages for the BLSTM classifier has a significant impact on accuracy. It is shown that the frequency chaos game representation (FCGR), an image representation, can capture details of DNA sequences that are not apparent in a text representation such as one-hot encoding. The third method is a hybrid model that combines CNN and BLSTM. Two kinds of experiments are conducted. In the first, classification is based on the whole sequence. In the second, only a small sequence fragment is used. The methods are thus evaluated on two datasets of different sizes, and they can achieve high accuracy. The effect of the three proposed methods on the efficiency of bacterial classification is studied. The results show that DNA sequences encoded by FCGR at k=6 are the best choice for the pre-processing step. The combination of the discrete wavelet transform (DWT) with RP performs better than the other pooling layers. Moreover, the CNN is more suitable than the BLSTM, as it achieves the best trade-off between accuracy and the processing time required by the classifier.
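To illustrate the FCGR pre-processing step mentioned above, the following is a minimal sketch of how a DNA sequence could be turned into a 2^k x 2^k frequency image (64 x 64 at k=6). It is not the thesis implementation: the function name `fcgr`, the corner assignment for A, C, G, and T, the handling of ambiguous bases, and the normalization to relative frequencies are all assumptions made for this example, and published FCGR conventions differ on these points.

```python
import numpy as np

# Assumed corner assignment (x_bit, y_bit) for each base; conventions vary.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k=6):
    """Return a (2**k, 2**k) matrix of relative k-mer frequencies.

    Each k-mer is mapped to one cell of the chaos game grid. The last
    base of the k-mer selects the coarsest quadrant, as in the chaos
    game representation, and earlier bases refine the position.
    """
    size = 2 ** k
    matrix = np.zeros((size, size), dtype=float)
    seq = sequence.upper()

    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: k-mers containing ambiguous bases (e.g. N) are skipped.
        if any(base not in CORNERS for base in kmer):
            continue
        x = y = 0
        # Reversed so that the last base becomes the most significant bit.
        for base in reversed(kmer):
            bit_x, bit_y = CORNERS[base]
            x = (x << 1) | bit_x
            y = (y << 1) | bit_y
        matrix[y, x] += 1

    total = matrix.sum()
    if total > 0:
        matrix /= total  # counts -> relative frequencies
    return matrix
```

Calling `fcgr(seq, k=6)` on a 16S rRNA sequence would give a 64 x 64 array that can be fed to a CNN as a single-channel image, whereas one-hot encoding keeps the sequence as a length-by-4 matrix without any k-mer frequency information.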