Title
Mining the Publication Papers via Text Mining /
Author
El-dosouky, Ahmed Saeed Ibrahim.
Committee
Researcher / Ahmed Saeed Ibrahim El-dosouky Hussein
Supervisor / Mostafa Mahmoud Aref
Examiner / Zaki Taha Ahmed Fayed
Examiner / Abeer Mohamed El-Korany Mohamed
Publication date
2022.
Number of pages
81 p.
Language
English
Degree
Master's
Specialization
Computer Science (miscellaneous)
Date of approval
1/1/2022
Place of approval
Ain Shams University - Faculty of Computer and Information Sciences - Department of Computer Science
Index
Only 14 of 81 pages are available for public view

Abstract

Nowadays, the widespread use of electronic devices generates massive amounts of data in different forms, such as text and images. Automated mining processes are used to extract useful knowledge from this content. Text mining analyzes documents to capture key concepts and discover hidden relationships, and mining research papers is one of its most popular applications.
Text mining, also known as Intelligent Text Analysis, Text Data Mining, or Knowledge Discovery in Text (KDT), is the process of extracting structured information from unstructured text using text mining tools. Text appears in almost every kind of document, for instance newspapers, books, publication papers, and magazines, so it is the dominant feature of most documents.
Text is treated as data, and data is either structured or unstructured. Structured data is stored in rows and columns, so it can be easily processed and ordered; phone numbers and addresses are examples. Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Most documents mix text, images, and graphs, so an essential early step is to separate the text from the images. After this, the text is preprocessed to make it ready for mining.
In the preprocessing stage, all words are tokenized and all redundant words are removed, so the document is represented as a list of words. In this thesis, a novel methodology based on deep neural networks is proposed to analyze research paper content and retrieve papers that are similar or related in topic, thus saving researchers time and effort.
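The preprocessing stage described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes that "redundant words" means common stop words, and the `STOP_WORDS` list and the `preprocess` function name are hypothetical.

```python
import re

# Illustrative stop-word list (an assumption; the abstract does not specify one).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "in", "is",
    "of", "on", "that", "the", "to", "with",
}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Keep runs of letters, allowing one internal apostrophe (e.g. "doesn't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]
```

Calling `preprocess` on a sentence returns the document as a list of words, which is the representation the later steps operate on.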
The methodology presented in this thesis consists of four approaches. The first approach uses term frequency to extract the effective keywords from a document (a research paper), that is, the keywords that describe the theme of the publication paper.
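As a rough sketch of keyword extraction by term frequency, the snippet below counts how often each token occurs and keeps the most frequent ones. The function name `top_keywords` and the cutoff `k` are assumptions made for illustration; the abstract does not say how many keywords are kept or how ties are broken.

```python
from collections import Counter


def top_keywords(tokens: list[str], k: int = 5) -> list[str]:
    """Return the k most frequent tokens as candidate keywords."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]
```

The input is expected to be an already preprocessed token list, so that stop words do not dominate the counts.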
The second approach is to retrieve similar papers using the effective keywords of the main paper of interest. These keywords are combined and submitted as a query to a search engine to find papers related to the main paper. Finally, the similarity between the main paper and the retrieved papers is measured. The presented methodology for retrieving similar and related papers achieves a precision of 0.91 and a recall of 0.94.
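The abstract does not name the similarity measure it uses. As one common choice, the sketch below assumes cosine similarity over term-frequency vectors, and it also shows how precision and recall are computed from a set of retrieved papers and a set of relevant papers. The function names and the paper identifiers used with them are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Cosine similarity between the term-frequency vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Precision is the fraction of retrieved papers that are actually relevant, and recall is the fraction of relevant papers that were retrieved.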
The third approach is Named-Entity Recognition (NER), also known as entity identification, entity chunking, or entity extraction. NER identifies the main concepts and entities in documents and then attaches attributes to each entity or assigns it to a predefined category, such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages. This step helps in organizing the data and finding relationships among words.
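The abstract does not describe how entities are recognized. To illustrate the idea of mapping text spans to predefined categories, here is a deliberately simple rule-based sketch that covers only three of the categories listed above. The patterns, labels, and the `tag_entities` name are assumptions; recognizing persons, organizations, and locations would normally need a trained model rather than regular expressions.

```python
import re

# Hypothetical patterns for three easy-to-match categories.
PATTERNS = [
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?\s?%")),
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("YEAR", re.compile(r"\b(?:19|20)\d{2}\b")),
]


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (matched text, category) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # Order by position in the text.
    return [(span, label) for _, span, label in found]
```

Note that the patterns are applied independently, so overlapping matches (for example a year-like number inside a monetary amount) are not resolved in this sketch.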
The fourth approach is document clustering, which is the process of dividing a collection of documents into a number of groups based on the similarity of their contents. This process uses machine learning algorithms and natural language processing techniques. Its goal is to assign each document to a specific cluster.
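The clustering algorithm is not named in the abstract. The sketch below uses a simple greedy "leader" clustering with Jaccard similarity as a stand-in, only to show how documents can be grouped by content similarity without labels. The `threshold` value and both function names are assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared terms divided by all distinct terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(docs: list[set[str]], threshold: float = 0.3) -> list[int]:
    """Assign each document to the first cluster whose leader is similar enough.

    A document that matches no existing leader starts a new cluster and
    becomes its leader. Returns one cluster index per document.
    """
    leaders: list[set[str]] = []
    labels: list[int] = []
    for doc in docs:
        for index, leader in enumerate(leaders):
            if jaccard(doc, leader) >= threshold:
                labels.append(index)
                break
        else:
            # No leader was similar enough: open a new cluster.
            leaders.append(doc)
            labels.append(len(leaders) - 1)
    return labels
```

Each document is represented here as a set of its keywords. The result depends on the order of the documents and on the threshold, which is one reason iterative methods such as k-means are often preferred in practice.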
There are three types of learning: supervised, unsupervised, and semi-supervised. All three can be used to solve classification problems; they differ in whether they use labelled data for training. Clustering is a form of unsupervised learning and does not require labelled data.