Computing Textual Semantic Similarity for Short Texts using Different Similarity Measures


International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 04 Issue: 05 | May-2017

p-ISSN: 2395-0072

www.irjet.net

Riya A Gandhi1, Dr Vimalkumar B Vaghela2

1Student, Dept. of Computer Engineering, L D College of Engineering, Gujarat, India
2Assistant Professor, Dept. of Computer Engineering, L D College of Engineering, Gujarat, India

Abstract - Measuring the semantic similarity of short texts is a natural language processing task that is widely used in opinion mining, text mining, text summarization, information retrieval, recognizing textual entailment (RTE), etc. Semantic similarity reflects the semantic relation between the meanings of two sentences. Sentence similarity is used to assess how alike two phrases are. This paper presents various methods that compute the similarity between two sentences, together with the performance and importance of each method.

Key Words: Semantic Similarity, RTE, STS, WSD, etc.

1. INTRODUCTION

The importance of sentence semantic similarity measures

Sentence semantic similarity measures are important in natural language research because of their increasing applications in text-related research fields. Semantic similarity methods are classified into three types: corpus-based, ontology-based and hybrid approaches [1].

The first method calculates the similarity of two texts from the syntactic and semantic information that they contain. It uses three similarity functions to derive a generalized text semantic similarity. The first function calculates string similarity and the second calculates semantic similarity. After that, a common word order function incorporates word order information into the method. Finally, string similarity, semantic similarity and common word order similarity are combined and normalized to calculate the overall text similarity. This method is called STS (Semantic Text Similarity).

The ontology-based method is Omiotis, an algorithm based on WordNet and WSD (word-sense disambiguation). Omiotis uses various POS (part-of-speech) tags and semantic relations such as synonymy, antonymy and hyponymy. It extends a Semantic Relatedness (SR) measure between words, which is based on the semantic links between the words according to a word thesaurus, WordNet. In Omiotis, SR at the word level and statistical information at the text level are integrated to give the final SR score between texts.

SyMSS (syntax-based measure for short-text semantic similarity) uses WordNet measures and a parse tree, which it obtains with a grammar parser. It is a newer method that considers syntactic information and uses this information in WSD to reduce word matching and time complexity.

STASIS is the hybrid measure, which combines WordNet-based and corpus-based word similarities. It represents two sentences as two vectors and obtains the semantic similarity between them using the vector space model (VSM). In STASIS, VSM semantic similarity and word order similarity are combined to compute the sentence similarity. STASIS does not perform preprocessing tasks such as the removal of stop words and meaningless words, which can result in an inaccurate similarity score. Omiotis and SyMSS reduce the ambiguity between words using syntactic information, POS and the parse tree respectively, to match words with the same syntactic role.
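The hybrid idea described above, a vector-space semantic score combined with a word order score, can be illustrated with a minimal sketch. Several things here are assumptions that the text does not specify: exact word match stands in for the WordNet-based and corpus-based word similarities, the word order formula and the weight `delta = 0.85` are illustrative choices, and all function names are hypothetical.

```python
import math


def tokenize(sentence):
    # Assumption: lowercase whitespace tokenization, no stop-word removal.
    return sentence.lower().split()


def semantic_vector(words, joint):
    # Assumption: 1.0 if the joint-set word occurs in the sentence, else 0.0.
    # A real system would use a WordNet/corpus word similarity here instead.
    present = set(words)
    return [1.0 if w in present else 0.0 for w in joint]


def order_vector(words, joint):
    # 1-based position of the first occurrence of each joint-set word, 0 if absent.
    first_pos = {}
    for i, w in enumerate(words, start=1):
        first_pos.setdefault(w, i)
    return [float(first_pos.get(w, 0)) for w in joint]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def order_similarity(r1, r2):
    # Assumed formula: 1 - ||r1 - r2|| / ||r1 + r2||.
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


def sentence_similarity(s1, s2, delta=0.85):
    # delta weights the semantic score against the word order score (assumed value).
    w1, w2 = tokenize(s1), tokenize(s2)
    joint = list(dict.fromkeys(w1 + w2))  # joint word set, order preserved
    sem = cosine(semantic_vector(w1, joint), semantic_vector(w2, joint))
    order = order_similarity(order_vector(w1, joint), order_vector(w2, joint))
    return delta * sem + (1 - delta) * order
```

With these assumptions, "dog bites man" and "man bites dog" receive a full semantic score but a reduced word order score, so the combined value falls below that of two identical sentences.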

2. METHODS OF SEMANTIC SIMILARITY

2.1 Corpus-Based Word Similarity and String Similarity (STS)

This method measures the semantic similarity of texts using a corpus-based measure of semantic word similarity. It mainly focuses on measuring the semantic similarity between two sentences or short paragraphs, and it uses semantic and syntactic information to evaluate the similarity of two texts.

1. String Similarity between Words

String similarity is calculated with the longest common subsequence (LCS) measure, with some modification and normalization. Three different modified and normalized versions of LCS are used, and a weighted sum of them is taken. The longest common subsequence ratio (LCSR) divides the length of the longest common subsequence by the length of the longer string. However, it does not take into consideration the length of the shorter string, which sometimes has a noteworthy impact on the similarity score. Therefore the normalized longest common subsequence (NLCS) is computed, which takes into account the lengths of both the shorter and the longer string. For two strings r and s it is

$$\mathrm{NLCS}(r, s) = \frac{\mathrm{length}(\mathrm{LCS}(r, s))^{2}}{\mathrm{length}(r) \times \mathrm{length}(s)} \qquad (1)$$

While in classical LCS the common subsequence need not be consecutive, in database schema matching a consecutive common subsequence is important for a high degree of matching. Therefore the maximal consecutive longest common subsequence starting at character 1, MCLCS1 (Algorithm [2]), and the maximal consecutive longest common subsequence starting at any character n, MCLCSn (Algorithm
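The string measures named in this subsection can be sketched in a few lines of Python. This is an illustrative reading of the prose, not the paper's algorithms: NLCS is taken to be the squared LCS length divided by the product of the two string lengths, MCLCS1 is read as the longest common run starting at the first character of both strings, and MCLCSn as the longest common run starting anywhere. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS length divided by the length of the longer string."""
    longer = max(len(a), len(b))
    return lcs_length(a, b) / longer if longer else 0.0


def nlcs(a, b):
    """Assumed normalization: squared LCS length over the product of both lengths."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) ** 2 / (len(a) * len(b))


def mclcs1(a, b):
    """Longest common consecutive run that starts at character 1 of both strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def mclcsn(a, b):
    """Longest common consecutive run starting at any character."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

For the pair "albastru" and "alabaster", the LCS has length 7, so LCSR is 7/9 while NLCS is 49/72; the run starting at character 1 is "al" and the longest run starting anywhere is "bast". The example shows how the consecutive variants penalize matches that are scattered through the string.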
