Abstract Semantic textual similarity is an important field in Natural Language Processing (NLP). It is useful in a variety of NLP applications, including information retrieval, plagiarism detection, data extraction, and machine translation. Sentence similarity in the Arabic language has not been investigated deeply because of the lack of Arabic language resources. This thesis presents a new Arabic dataset for the sentence similarity task. The dataset can be used to help develop sentence similarity approaches, but its main purpose is to evaluate them. The dataset has been collected from Wikipedia, an intermediate lexicon, and other web resources. This thesis describes the process of collecting the data, filtering and preprocessing the sentence pairs, and reports some statistics about the dataset. The dataset is available for future research in the field. Moreover, this thesis proposes an approach to calculate the semantic similarity between Arabic sentences. The proposed approach uses two types of word embedding to measure the similarity between words: context-independent word embedding (word2vec) and contextual embedding (BERT).
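To make the idea of embedding-based similarity concrete, the following is a minimal sketch and not the method of the thesis. It compares two word vectors with cosine similarity and, as an added assumption, scores a sentence pair by averaging the word vectors of each sentence. The three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration only; in practice they would come from a trained word2vec or BERT model. The function names are hypothetical as well.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# Real vectors would come from a trained word2vec or BERT model.
TOY_EMBEDDINGS = {
    "قط": [0.9, 0.1, 0.0],     # "cat"
    "هر": [0.8, 0.2, 0.1],     # "cat" (synonym)
    "سيارة": [0.0, 0.1, 0.9],  # "car"
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding.

    Averaging is an assumption of this sketch, not necessarily the
    aggregation used in the thesis.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sentence_similarity(tokens_a, tokens_b, embeddings):
    """Cosine similarity between the averaged vectors of two token lists."""
    vec_a = sentence_vector(tokens_a, embeddings)
    vec_b = sentence_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0  # no known words in at least one sentence
    return cosine_similarity(vec_a, vec_b)


if __name__ == "__main__":
    print(sentence_similarity(["قط"], ["هر"], TOY_EMBEDDINGS))
    print(sentence_similarity(["قط"], ["سيارة"], TOY_EMBEDDINGS))
```

With these toy vectors, the synonym pair scores much higher than the unrelated pair. A contextual model such as BERT would produce a different vector for the same word in each sentence, so the lookup table would be replaced by a model call.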