
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 06 | Jun 2023

p-ISSN: 2395-0072

www.irjet.net

Performance Analysis and Parallelization of Cosine Similarity of Documents

Bandi Harshavardhan Reddy1, Gopa Laasya Lalitha Priya2, Kodakanti Prashanth3, L Mohana Sundari4

1,2,3 UG Student, School of Computer Science and Engineering, Vellore Institute of Technology, India.
4 Assistant Prof Senior, School of Computer Science and Engineering, Vellore Institute of Technology, India.

---------------------------------------------------------------------------***---------------------------------------------------------------------------

Abstract - Currently, the Internet contains an extensive collection of documents, and search engines utilize web crawlers to retrieve content for queries. The retrieved pages are then ranked based on their relevance to the query using page rank algorithms. Typically, the cosine similarity algorithm is employed to determine the similarity between the retrieved content and the query. However, when the set of retrieved documents is large, applying the conventional cosine similarity algorithm to rank pages becomes difficult. To address this issue, we propose an optimized algorithm that utilizes parallelization to calculate the cosine similarity of documents in large sets. By parallelizing the procedure, we can enhance efficiency and reduce latency by processing a greater number of documents in less time.

Keywords - Web Crawlers, Cosine Similarity, Parallelizing, Efficiency, Optimized
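For reference, the cosine similarity mentioned in the abstract is conventionally defined as below. The symbols are our own and are not taken from the paper: q is assumed to be the query's term vector, d a document's term vector, and n the vocabulary size.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^{2}} \; \sqrt{\sum_{i=1}^{n} d_i^{2}}}
\]
```

For non-negative term-frequency vectors the value lies between 0 (no shared terms) and 1 (identical direction). Each document's score depends only on q and that document, which is why the computation can be split across workers.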

I. INTRODUCTION

The primary goal of the project is to use parallel computing to determine document similarity, simplifying the task and obtaining similarity results with less computational effort. By implementing cosine similarity algorithms in parallel, we aim to enhance the speed and efficiency of document search. Unlike other algorithms, cosine similarity can effectively address certain problems. When a query is processed, numerous relevant documents are retrieved and subsequently require page ranking. However, the existing page ranking algorithm proves inefficient when dealing with a large number of retrieved documents. To overcome this challenge, we have devised a more powerful and efficient algorithm that leverages parallelization to process extensive document sets in significantly less time than the page ranking algorithm.
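The idea described above, scoring each retrieved document against the query independently and then ranking, might be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names (`term_vector`, `cosine_similarity`, `score_document`, `rank_documents`), the whitespace tokenization, the raw term-frequency weighting, and the use of Python's `ProcessPoolExecutor` are all hypothetical choices made for the example.

```python
import math
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def term_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document or query has no direction
    return dot / (norm_a * norm_b)


def score_document(query_vec, doc):
    """Score one document against the query; this is the unit of parallel work."""
    return cosine_similarity(query_vec, term_vector(doc))


def rank_documents(query, docs, workers=1, chunksize=64):
    """Return (score, document) pairs sorted by descending similarity.

    workers <= 1 runs the serial baseline; workers > 1 spreads the
    per-document scoring over a pool of worker processes.
    """
    query_vec = term_vector(query)
    scorer = partial(score_document, query_vec)
    if workers <= 1:
        scores = [scorer(doc) for doc in docs]
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            scores = list(pool.map(scorer, docs, chunksize=chunksize))
    return sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
```

Calling `rank_documents(query, docs, workers=1)` gives a serial baseline against which a run with `workers` set to the number of CPU cores can be timed. On platforms that start worker processes by spawning, the parallel call should be made under an `if __name__ == "__main__":` guard. Process start-up and data transfer have a cost, so a speedup is only plausible for large document sets.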

Impact Factor value: 8.226

II. LITERATURE REVIEW

[1] The rapid global expansion of the Internet has resulted in a massive amount of data being stored on servers. The amount of data produced in the last two years alone surpasses the cumulative data generated in all previous years, primarily because of the extensive adoption of Internet of Things (IoT) devices. This data has emerged as a valuable resource for conducting predictive analysis of forthcoming events. However, the increasing diversity of data types and the speed at which data is generated pose a challenge for data analysis technology. The objective of this study is to examine the interaction between big data files and a range of data mining algorithms, including Naïve Bayes, Support Vector Machines, Linear Discriminant Analysis Algorithm, Artificial Neural Networks, C4.5, C5.0, and K-Nearest Neighbor. Specifically, Twitter comments are analyzed as the input data for these algorithms.

[2] This research paper examines the use of Sharpened Cosine Similarity (SCS) as an alternative to convolutional layers in image classification. While previous studies have reported promising results, a comprehensive empirical analysis of neural network performance using SCS is lacking. The researchers investigate the parameter behavior and potential of SCS as an alternative to convolutions in different CNN architectures, employing CIFAR-10 as a benchmark dataset. The findings indicate that while SCS may not significantly improve accuracy, it has the potential to learn more interpretable representations. Moreover, in specific situations, SCS might provide a marginal improvement in adversarial robustness.

[3] This work proposes a parallelization approach to enhance the performance of the summarization process. By integrating the Nondominated Sorting Genetic Algorithm II (NSGA-II) with MapReduce, the approach achieves enhanced efficiency and quicker extraction of text summaries from multiple documents.
To evaluate its performance, the proposed method is compared to a nonparallelized version of

ISO 9001:2008 Certified Journal

Page 1055