Author: Hamed, Belal Ahmed Mohammed./ Title: Improving Pattern Matching Algorithms for Biological Sequences /

Title

Improving Pattern Matching Algorithms for Biological Sequences /

Author

Hamed, Belal Ahmed Mohammed.

Preparation Committee

Researcher / Belal Ahmed Mohammed Hamed (بلال أحمد محمد حامد)

Supervisor / طارق عبدالحفيظ عبد الرحمن

Supervisor / عثمان علي صادق

Subject

Algorithms. Application software.

Publication Date

2023.

Number of Pages

124 p.

Language

English

Degree

Master's

Specialization

Computer Science (miscellaneous)

Approval Date

17/5/2023

Granting Institution

Minia University - Faculty of Science - Computer Science

Abstract

Pattern matching is a useful operation at many stages of the computational pipeline, and research in this field has driven the growth and maintenance of biological databases over time. This study compares and analyzes several pattern-matching algorithms in terms of complexity, efficiency, and approach, asking which algorithm is best for a particular DNA sequence, and why. It highlights the algorithms used for the various tasks in which pattern matching is a key component.
The study shows that the Boyer-Moore, Horspool, Zhu-Takaoka, Quick Search, Fast Search, Smith, and Sheik-Sumit-Anindya-Balakrishnan-Sekar (SSABS) algorithms use the bad-character preprocessing function. The Boyer-Moore, SSABS, Thathoo-Virmani-Sai-Balakrishnan-Sekar (TVSBS), and Berry–Ravindran Fast Search algorithms combine two functions in the preprocessing step, which the study reports reduces preprocessing time. Karp-Rabin, Quick Search, SSABS, Berry–Ravindran Fast Search, and Shift-Or are not recommended for long patterns, and Zhu-Takaoka, Fast Search, derivative Boyer-Moore, Raita, and Smith are not recommended for short patterns, because they are time-consuming in those cases. Some algorithms, such as Zhu-Takaoka and Divide and Conquer Pattern Matching, consume a significant amount of time and space during matching and searching, while others, such as derivative Boyer-Moore and Two Sliding Window, conserve both. Divide and Conquer Pattern Matching, Berry–Ravindran Fast Search, and Quick Search are faster than the other algorithms, whereas First-Last Pattern Matching, Processor-Aware Pattern Matching, and Least Frequency Pattern Matching have the highest time complexity.
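As an illustration of the character-based preprocessing shared by the Boyer-Moore family (commonly called the bad-character rule), the following is a minimal sketch of the Horspool variant. It is not taken from the thesis; the function names `bad_character_table` and `horspool_search` are hypothetical and chosen only for this example.

```python
def bad_character_table(pattern: str) -> dict[str, int]:
    """Horspool-style shift table (illustrative, not the thesis's code).

    For every character in the pattern except the last position, store the
    distance from its rightmost occurrence to the end of the pattern.
    """
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text: str, pattern: str) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = bad_character_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the text character aligned with the
        # pattern's last position; characters absent from the table allow a
        # full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_search("GATTACAGATTACA", "GATTACA"))  # [0, 7]
```

With the four-letter DNA alphabet the table has at most four entries, so shifts tend to be short; this is one reason the relative ranking of these algorithms depends on pattern length.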
Pattern matching is essential at many phases of the computational process because it lets users search a database for specific DNA sequences or subsequences. Biological databases are expanding rapidly and are updated regularly, so high-speed pattern-matching algorithms are needed to keep such searches efficient.
As biological data grows exponentially, researchers are striving to improve solutions in many areas of computational bioinformatics, and real-world applications need faster algorithms with a low error rate. This study therefore proposes two pattern-matching algorithms designed to speed up pattern searches in DNA sequences.
The proposed algorithms, Enhanced First-Last Pattern Matching (EFLPM) and Enhanced Processor-Aware Pattern Matching (EPAPM), reduce running time by processing the text at the word level rather than at the character level used in previous studies, particularly for large pattern sizes. Experimental results show that the proposed methods are faster than other algorithms for both short and long patterns. The EFLPM algorithm is 54% faster than the First-Last Pattern Matching method, and the EPAPM algorithm is 39% faster than the Processor-Aware Pattern Matching method.
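The abstract does not describe how the two proposed algorithms are implemented, so the following is only a hedged sketch of what word-level comparison can look like. It assumes a first-and-last-character filter (suggested by the First-Last name, but an assumption here) followed by verification in 8-byte chunks standing in for 64-bit machine words. The names `WORD`, `to_words`, and `word_level_search` are hypothetical.

```python
WORD = 8  # bytes compared per step; an assumed stand-in for a 64-bit word


def to_words(data: bytes) -> list[int]:
    """Pack bytes into WORD-sized integers (the final chunk may be shorter)."""
    return [int.from_bytes(data[i:i + WORD], "big")
            for i in range(0, len(data), WORD)]


def word_level_search(text: str, pattern: str) -> list[int]:
    """Illustrative search: cheap first/last filter, then word-sized checks."""
    t, p = text.encode("ascii"), pattern.encode("ascii")
    m, n = len(p), len(t)
    if m == 0 or m > n:
        return []
    p_words = to_words(p)
    hits = []
    for pos in range(n - m + 1):
        # Filter: skip windows whose first or last character disagrees.
        if t[pos] != p[0] or t[pos + m - 1] != p[m - 1]:
            continue
        # Verify one word at a time instead of one character at a time.
        for k, p_word in enumerate(p_words):
            start = pos + k * WORD
            end = min(start + WORD, pos + m)
            if int.from_bytes(t[start:end], "big") != p_word:
                break
        else:
            hits.append(pos)
    return hits


print(word_level_search("ACGTACGTACGTACGT", "ACGTACGTAC"))  # [0, 4]
```

In Python this shows only the structure of the idea; any speed-up from word-level comparison would be expected in a compiled language where a chunk maps onto a single machine-word comparison.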
Finally, we introduce a simple model for pre-processing and feature extraction of DNA data. The DNA sequences are then classified using machine learning (ML) methods, with a pattern-matching algorithm used to retrieve the matched sequences efficiently. The results show that the linear SVM classifier produces accurate results. We compare the outcomes obtained with the two proposed algorithms (EFLPM and EPAPM) on the same dataset.
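The abstract does not say which features are extracted from the DNA data. One common choice, assumed here purely for illustration, is a normalized k-mer frequency vector that a linear classifier could consume. The function name `kmer_features` and the default `k = 3` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_features(sequence: str, k: int = 3) -> list[float]:
    """Map a DNA sequence to a fixed-length vector of k-mer frequencies.

    Assumption: the alphabet is A, C, G, T; windows containing any other
    symbol (for example N) are ignored.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    # Normalize so sequences of different lengths are comparable.
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


features = kmer_features("ACGTACGTTAGC", k=2)
print(len(features))  # 16
```

Each sequence becomes a vector of length 4^k regardless of its own length, which is the fixed-size input that a linear SVM or a similar classifier requires.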