Author: Abd alhalem, Samia Mohammed Mohammed. / Title: Classification of DNA Sequences based on Deep Learning


Title

Classification of DNA Sequences based on Deep Learning

Author

Abd alhalem, Samia Mohammed Mohammed.

Committee

Researcher / سامية محمد محمد عبد الحليم

Supervisor / محمد عبد العظيم محمد

Supervisor / حسام الدين حسين أحمد

Supervisor / السيد محمود الربيعي

Supervisor / نبيل عبد الواحد إسماعيل

Subject

Telecommunication. Artificial intelligence. DNA Computing. Machine learning.

Publication date

2021.

Number of pages

96 p.

Language

English

Degree

Doctorate

Specialization

Electrical and Electronic Engineering

Date of approval

14/9/2021

Place of approval

Menoufia University - Faculty of Electronic Engineering - Electronics and Electrical Communications Engineering

Index

Only 14 pages are available for public view

Abstract

Molecular biology provides much of the available biological information.
Such information comes from the analysis of biochemical molecules
(deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein) and
helps molecular biologists to solve several problems. Extracting this kind
of information is not easy. Moreover, even when the information is available,
biologists cannot easily retrieve useful knowledge from it because of the quantity
and complexity of the data. For this reason, merging biology, informatics, and
mathematics to manage, analyze, and manipulate biological data helps to
achieve the goals of the new discipline called bioinformatics. One of the main
tasks of bioinformatics is the computational analysis of sequence databases.
Biological sequence classification is extremely useful for many problems in
bioinformatics, based on the principle that sequences with similar structures
also have similar functions. Several more specific problems are well studied.
One example is the taxonomic problem, in which the phylogenetic group of an
organism is identified from some of its biological sequences; this is
the main concern of this thesis. Three families of taxonomy methods are
investigated: alignment-based methods, alignment-free methods, and machine
learning methods, especially deep learning (DL) methods.
DL methods have achieved excellent results on a variety of
problems in different fields, especially in big data processing, and have
recently been adopted in bioinformatics. This thesis is mainly concerned with DL
methods for classifying bacterial DNA sequences into taxonomic
ranks. It aims to build a classification model based on the bacterial 16S
ribosomal RNA (rRNA) sequence that classifies bacteria by phylum, class, order,
family, and genus. The performance of different DL models is evaluated in terms
of accuracy and F1 score. Three DL-based methods for bacterial classification are
proposed and implemented to improve classification accuracy. The first is
based on convolutional neural networks (CNNs) with different downsampling layers.
The proposed downsampling is an improved form of the existing pooling layer,
designed to obtain better classification accuracy; two-dimensional discrete
transforms (2D DTs) and two-dimensional random projection (2D RP) are applied for
downsampling. The second method depends on bidirectional long short-term memory
(BLSTM). The choice of suitable pre-processing stages for the BLSTM classifier
has a significant impact on accuracy. It is shown that frequency chaos game
representation (FCGR), an image representation, can capture details of DNA
sequences that are not apparent in a text representation such as one-hot encoding.
The third method is a hybrid model that combines the CNN and the BLSTM. Two kinds of
experiments are conducted. In the first, classification is based on the whole
sequence. In the second, only a small sequence fragment is used. The methods are
thus evaluated on two datasets of different sizes.
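
The two sequence representations mentioned above, one-hot encoding and FCGR, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `one_hot` and `fcgr` are hypothetical, the corner layout in `CORNERS` is only one of several conventions found in the literature, and skipping k-mers that contain ambiguous bases is an assumption made for simplicity.

```python
# Corner assignment for the chaos game. Conventions differ between papers;
# this layout is an assumption made for illustration.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def one_hot(seq):
    """Encode each base as a 4-element indicator vector in A, C, G, T order.

    Bases outside A/C/G/T (for example N) become an all-zero vector.
    """
    order = "ACGT"
    return [[1 if base == ch else 0 for ch in order] for base in seq.upper()]


def fcgr(seq, k=6):
    """Return a 2^k x 2^k matrix of k-mer counts laid out as a CGR image."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # assumption: ignore k-mers with ambiguous bases
        x = y = 0
        # Later bases decide the coarser quadrant, so they take the higher bits.
        for i, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            x |= cx << i
            y |= cy << i
        grid[y][x] += 1
    return grid


if __name__ == "__main__":
    image = fcgr("ACGTACGTTGCAACGT", k=6)
    print(len(image), len(image[0]))  # image side length is 2**k
```

With k = 6 every sequence, whatever its length, maps to a 64 x 64 count image, which is what makes the representation convenient as input to an image-style classifier; the one-hot matrix instead grows with the sequence length.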
These methods can give high accuracy. The effect of the three proposed methods on
the efficiency of bacterial classification is studied. The results show that
encoding DNA sequences with FCGR at k = 6 is the best choice for the pre-processing
step. The combination of the discrete wavelet transform (DWT) with RP performs
better than the other pooling layers. Moreover, the CNN is more suitable than the
BLSTM, as it achieves the best trade-off between accuracy and the processing time
required by the classifier.
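
As a rough illustration of transform-based downsampling in place of a pooling layer, the sketch below applies a one-level 2D Haar wavelet approximation followed by a 2D random projection to a small feature map. The abstract does not say which wavelet is used or how the DWT and RP are combined, so the Haar filter, the cascade order, the Gaussian projection matrices, and the names `haar_ll`, `random_projection_2d`, and `downsample` are all assumptions.

```python
import random


def haar_ll(image):
    """One-level 2D Haar approximation (LL) band; halves both dimensions.

    With orthonormal Haar filters, each LL coefficient is the sum of a
    2x2 block divided by 2. Assumes even height and width.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 2.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def random_projection_2d(image, out_rows, out_cols, seed=0):
    """Project an m x n matrix to out_rows x out_cols as L @ X @ R.

    L and R hold Gaussian entries scaled so that the squared Frobenius
    norm is preserved in expectation.
    """
    rng = random.Random(seed)
    rows, cols = len(image), len(image[0])
    left = [[rng.gauss(0, 1) / out_rows ** 0.5 for _ in range(rows)]
            for _ in range(out_rows)]
    right = [[rng.gauss(0, 1) / out_cols ** 0.5 for _ in range(out_cols)]
             for _ in range(cols)]
    return matmul(matmul(left, image), right)


def downsample(image, out_rows, out_cols, seed=0):
    """Assumed cascade: Haar LL band first, then 2D random projection."""
    return random_projection_2d(haar_ll(image), out_rows, out_cols, seed)


if __name__ == "__main__":
    feature_map = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
    reduced = downsample(feature_map, out_rows=2, out_cols=2)
    print(len(reduced), len(reduced[0]))
```

Unlike max or average pooling, both steps are linear maps over the whole feature map; in a trained network the projection matrices would normally be fixed once rather than redrawn on every call, which is why the sketch takes an explicit `seed`.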