Title
Performance Enhancement of the Analysis, Prediction of Protein Coding Regions and Recognition of DNA Sequence
Author
Dessouky, Ahmed Moawad Ibrahim.
Preparation committee
Researcher / أحمد معوض إبراهيم دسوقي
Supervisor / هشام فتحي علي حامد
Supervisor / فتحي السيد عبد السميع
Supervisor / جرجس منصور سلامة
Subject
Electrical engineering.
Publication date
2022.
Number of pages
146 p.
Language
English
Degree
Doctorate
Specialization
Electrical and Electronic Engineering
Date of approval
1/1/2022
Place of approval
Minia University - Faculty of Engineering - Electrical Engineering
Abstract

This thesis studies the statistical characteristics of original and mathematically modeled deoxyribonucleic acid (DNA) sequences, with and without viruses. It has two parts. The first presents a comprehensive analysis and evaluation of mathematical modeling of DNA sequences: the proposed mathematical models are examined in order to derive closed-form formulas for the original DNA data sequences using different methods. The second part presents a proposed hybrid approach for exon prediction, based on a digital band-pass filter used in conjunction with non-parametric and parametric spectral estimation methods, the Discrete Wavelet Transform (DWT), and the Discrete Cosine Transform (DCT).
Model accuracy is studied in two main stages, for actual and mathematically modeled DNA sequences with and without viruses. The first stage uses the Root Mean Squared Error (RMSE) and the correlation coefficient (R) to compare the actual sequence with every candidate model form and to select the most representative DNA mathematical models. The second stage examines the selected models in depth using further statistical parameters: energy, entropy, standard deviation, variance, mean, range, Mean Absolute Deviation (MAD), skewness, and kurtosis. The modeling methods considered are polynomial, exponential, Gaussian, and Sum of Sines (SoS). Accuracy is sought by maximizing the correlation coefficient and minimizing the RMSE after the modeling process, and the resulting models are expressed as closed-form mathematical formulas.
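The two evaluation stages described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the nucleotide-to-number mapping `NUC_TO_INT`, the function names, and the choice of a cubic polynomial as the example model are all assumptions made for illustration, since the abstract does not specify them. Entropy is computed here as the Shannon entropy of the empirical value distribution, which is one of several possible definitions.

```python
import numpy as np

# Hypothetical numeric encoding; the abstract does not state which mapping is used.
NUC_TO_INT = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}

def encode(seq):
    """Map a DNA string to a numeric array using the assumed mapping."""
    return np.array([NUC_TO_INT[ch] for ch in seq.upper()], dtype=float)

def rmse(actual, model):
    """Root mean squared error between actual and modeled values."""
    actual = np.asarray(actual, dtype=float)
    model = np.asarray(model, dtype=float)
    return float(np.sqrt(np.mean((actual - model) ** 2)))

def corr_coef(actual, model):
    """Pearson correlation coefficient R between actual and modeled values."""
    return float(np.corrcoef(actual, model)[0, 1])

def descriptors(x):
    """Second-stage statistics for a non-constant numeric sequence."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    std = x.std()  # population standard deviation
    centered = x - mean
    # Shannon entropy (bits) of the empirical distribution of values.
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return {
        "energy": float(np.sum(x ** 2)),
        "entropy": float(-np.sum(p * np.log2(p))),
        "std": float(std),
        "variance": float(x.var()),
        "mean": float(mean),
        "range": float(x.max() - x.min()),
        "mad": float(np.mean(np.abs(centered))),
        "skewness": float(np.mean(centered ** 3) / std ** 3),
        "kurtosis": float(np.mean(centered ** 4) / std ** 4),
    }

if __name__ == "__main__":
    # Stage 1: fit an example model (cubic polynomial) and score it.
    actual = encode("ATGCGTACGTTAGCATGGCA")
    n = np.arange(actual.size)
    model = np.polyval(np.polyfit(n, actual, deg=3), n)
    print("RMSE:", rmse(actual, model), "R:", corr_coef(actual, model))
    # Stage 2: compare descriptors of the actual and modeled sequences.
    print(descriptors(actual))
    print(descriptors(model))
```

Under this scheme, a candidate model would be preferred when its RMSE is lower and its R is closer to 1 than those of the alternatives.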
In addition, a hybrid approach to DNA sequence analysis is proposed and discussed, based on digital band-pass filtering combined with non-parametric and parametric spectral estimation techniques. Used in conjunction with the discrete wavelet transform (Haar and Daubechies wavelets) and the discrete cosine transform, the hybrid method relies on different spectral estimation techniques to improve the prediction of protein-coding regions (exons) in DNA sequences. It enhances the detectability of the peak in the exon region that is used to find genes, which increases the accuracy of exon prediction. The non-parametric power spectrum estimation techniques used in this thesis are the periodogram, Bartlett, Welch, and Blackman-Tukey methods. The parametric techniques are Burg, Covariance, Modified Covariance, Yule-Walker, Multiple Signal Classification (MUSIC), and Autoregressive Moving Average (ARMA). These techniques improve the analysis of DNA sequences, enable the extraction of useful information about them, and can be used with the proposed approach to locate the coding regions (exons) in DNA sequences.
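As a rough illustration of the band-pass filtering stage only, the sketch below applies a second-order resonator centered at the angular frequency 2π/3 to binary indicator sequences of the four nucleotides and sums the squared outputs. It relies on the well-known period-3 property of protein-coding regions. The indicator mapping, the filter design, the pole radius `r`, and the smoothing length are assumptions chosen for illustration; the abstract does not give the thesis's filter, and the spectral estimators, DWT, and DCT stages are not reproduced here.

```python
import numpy as np

def indicator_sequences(seq):
    """Binary indicator sequence for each nucleotide (assumed mapping)."""
    seq = seq.upper()
    return {b: np.array([1.0 if ch == b else 0.0 for ch in seq]) for b in "ACGT"}

def bandpass_period3(x, r=0.95):
    """Second-order resonator: zeros at z = +1 and -1, poles at r*exp(+/- j*2*pi/3).

    Difference equation:
        y[n] = x[n] - x[n-2] + 2*r*cos(w0)*y[n-1] - r**2 * y[n-2]
    """
    w0 = 2.0 * np.pi / 3.0
    a1 = 2.0 * r * np.cos(w0)
    a2 = -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        x_n2 = x[n - 2] if n >= 2 else 0.0
        y_n1 = y[n - 1] if n >= 1 else 0.0
        y_n2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x[n] - x_n2 + a1 * y_n1 + a2 * y_n2
    return y

def period3_power(seq, r=0.95, smooth=31):
    """Smoothed sum of squared filter outputs over the four indicator sequences."""
    total = np.zeros(len(seq))
    for x in indicator_sequences(seq).values():
        total += bandpass_period3(x, r) ** 2
    kernel = np.ones(smooth) / smooth  # moving-average smoothing
    return np.convolve(total, kernel, mode="same")

if __name__ == "__main__":
    # Synthetic example: random bases around an artificial period-3 stretch.
    rng = np.random.default_rng(1)
    flank = "".join(rng.choice(list("ACGT"), size=200))
    power = period3_power(flank + "ATG" * 80 + flank)
    print("mean power, flank:", power[30:170].mean())
    print("mean power, periodic stretch:", power[250:400].mean())
```

A pole radius closer to 1 narrows the pass band and sharpens the peak, at the cost of a longer transient at region boundaries. Positions where the smoothed power rises well above the background would be candidate coding regions under these assumptions.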
Gene prediction results for exon regions, obtained from original and modeled DNA sequences without viruses, show that the proposed SoS model is the most suitable for modeling DNA sequences, whereas the Gaussian model is not appropriate for this task.
Furthermore, this thesis presents closed-form mathematical expressions, obtained with data modeling approaches, for the COVID-19, Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Plasmodium-Malariae viruses. The models selected as best for representing the DNA sequences of these viruses are the Gaussian model with 7 terms and the SoS model with 8 terms. The original DNA sequences and these mathematical models match closely for the COVID-19, MERS, SARS, and Plasmodium-Malariae viruses. In addition, with the proposed SoS model as the best DNA modeling approach, dictionaries can be built for the DNA of the different viruses considered in this thesis. These dictionaries are used for further exploration and classification of the characteristics of these viruses.
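To make the idea of an 8-term SoS closed-form expression concrete, the following sketch fits a model of the assumed form f(x) = a_1 sin(b_1 x + c_1) + ... + a_K sin(b_K x + c_K) to a numeric sequence. The abstract does not give the model form or the fitting procedure, so everything here is an assumption: the frequencies b_i are fixed to the K strongest FFT bins rather than optimized, and the amplitudes and phases are then found by linear least squares. The function names are hypothetical.

```python
import numpy as np

def fit_sum_of_sines(y, n_terms=8):
    """Fit sum_i a_i*sin(b_i*x + c_i) with frequencies taken from FFT peaks."""
    y = np.asarray(y, dtype=float)
    x = np.arange(y.size)
    # Pick the strongest frequency bins (bin 0 acts as a constant term).
    spectrum = np.abs(np.fft.rfft(y))
    bins = np.argsort(spectrum)[::-1][:n_terms]
    b = 2.0 * np.pi * bins / y.size
    k = b.size
    # a*sin(b*x + c) = A*sin(b*x) + B*cos(b*x), which is linear in A and B.
    design = np.hstack([np.sin(np.outer(x, b)), np.cos(np.outer(x, b))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    A, B = coef[:k], coef[k:]
    a = np.hypot(A, B)    # amplitude
    c = np.arctan2(B, A)  # phase, since A = a*cos(c) and B = a*sin(c)
    return a, b, c

def eval_sum_of_sines(params, x):
    """Evaluate the fitted closed-form expression at positions x."""
    a, b, c = params
    x = np.asarray(x, dtype=float)
    return np.sum(a[:, None] * np.sin(b[:, None] * x[None, :] + c[:, None]), axis=0)

if __name__ == "__main__":
    # Hypothetical numeric encoding of a short sequence (A=1, C=2, G=3, T=4).
    mapping = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}
    y = np.array([mapping[ch] for ch in "ATGCGTACGTTAGCATGGCATTGACCGTAAGC"])
    params = fit_sum_of_sines(y, n_terms=8)
    fit = eval_sum_of_sines(params, np.arange(y.size))
    print("RMSE:", float(np.sqrt(np.mean((y - fit) ** 2))))
    for a_i, b_i, c_i in zip(*params):
        print(f"{a_i:.4f} * sin({b_i:.4f} * x + {c_i:.4f})")
```

The printed triples (a_i, b_i, c_i) are the kind of compact parameter set that could be stored per sequence, in the spirit of the dictionaries mentioned above. A full nonlinear fit that also optimizes the frequencies may match the data more closely than this fixed-frequency sketch.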