International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 09 Issue: 11 | Nov 2022
p-ISSN: 2395-0072
www.irjet.net
Search Engine Scrapper
Somnath Dudhat1, Dnyaneshwar Nawale2, Atharva Dhotre3, Vinayak Rahate4, Priyanka Halle5, Prof. Priyanka Halle6
1,2,3,4,5 BE Students, Department of Computer Science & Engineering, SKN Sinhgad Institute of Technology And Science, Lonavala, Pune, Maharashtra, India
6 Assistant Professor, Department of Computer Science & Engineering, SKN Sinhgad Institute of Technology And Science, Lonavala, Pune, Maharashtra, India
--------------------------------------------------------------------------***---------------------------------------------------------------------
Abstract – A search engine scraper is a set of processes that allows the user to collect relevant information presented on the World Wide Web (WWW). Search engines, accessed through browsers such as Chrome and Mozilla Firefox, use similar technology. This article covers the processes involved in extracting data from the content of different websites so that users can get relevant and necessary information for their query. In this service, an ML model converts the summarised data into a relevant result.
Key Words: Web Scraping, Web Crawling, Search Engine, Natural Language Processing
1. INTRODUCTION
People often struggle to find information relevant to their query. To make this easier, we are building a search engine scraper that applies web scraping, natural language processing (NLP) and pointwise mutual information (PMI), together with a SERP extraction API, to extract and analyse website content. Our model extracts data from website content and then summarizes the data that is useful to the user. Finally, a machine learning module makes the summarized data grammatically correct, so that the user gets the relevant and essential information for their query in a meaningful form.
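As an illustration of the PMI measure mentioned above, the following minimal sketch computes PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) for word pairs, estimating probabilities from the fraction of documents in which the words occur. The function name pmi_scores, the whitespace tokenization and the document-level co-occurrence window are assumptions made for this example; the paper does not state how PMI is computed in the project.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return base-2 PMI for every word pair that co-occurs in a document.

    Probabilities are estimated as document frequencies (an assumption
    made for this sketch, not a detail taken from the paper).
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        # Each pair is an alphabetically ordered tuple (a, b) with a < b.
        pair_counts.update(combinations(words, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        p_joint = joint / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        scores[(a, b)] = math.log2(p_joint / (p_a * p_b))
    return scores
```

A positive score means the two words appear together more often than they would if they were independent; a score of zero means they co-occur exactly as often as independence predicts.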
Search engine scraping is done in three stages:
1. Extraction: This covers extracting data from the content of different websites.
2. Summarization: This covers summarizing the extracted data that is relevant to the user.
3. Conversion to the relevant result: This covers converting the summarized data into a meaningful form so that the user can understand the context.
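The three stages can be read as a simple pipeline in which each stage consumes the output of the previous one. The sketch below only shows that wiring. The function run_pipeline and its three callable parameters are hypothetical names introduced for illustration, and the stage implementations are left to the caller.

```python
from typing import Callable, List


def run_pipeline(
    query: str,
    extract: Callable[[str], List[str]],
    summarize: Callable[[List[str]], str],
    convert: Callable[[str], str],
) -> str:
    """Chain the three stages: extraction, summarization, conversion."""
    pages = extract(query)      # stage 1: raw text gathered for the query
    summary = summarize(pages)  # stage 2: condensed, relevant text
    return convert(summary)     # stage 3: readable final result
```

Passing the stages in as functions keeps each one replaceable, for example swapping the extraction source without touching the summarizer.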
2. METHODOLOGY
3. LITERATURE SURVEY
1. Extraction of Data:
Information is the most important asset in the world, but retrieving it requires data. Data, the second most important asset, is not accessible to everyone: not everyone can get the data they require, and web scraping emerged to address this. Web scraping has changed how we see the world compared with the time when little data was available, and analysis and retrieval have become much easier.
The scraper gathers publicly available web data from different search engines through SERP data extractor APIs.
2. Summarisation of Data:
The SERP-extracted data is summarized using the NLTK (Natural Language Toolkit) processing libraries: Input document → sentence similarity → weight sentences → select sentences with higher rank.
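The summarization flow above (input document, sentence similarity, sentence weights, selection of the highest-ranked sentences) can be sketched as follows. This is a minimal, standard-library illustration and not the authors' implementation: the regex sentence splitter stands in for an NLTK tokenizer, cosine similarity over word counts is an assumed similarity measure, and the function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (stand-in for NLTK)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Lower-cased word counts for one sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def summarize(text, top_n=2):
    """Return the top_n highest-weighted sentences in their original order."""
    sentences = split_sentences(text)
    vectors = [bag_of_words(s) for s in sentences]
    # Weight of a sentence = sum of its similarity to every other sentence.
    weights = [
        sum(cosine_similarity(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]
```

Sentences that share vocabulary with many other sentences receive higher weights, so off-topic sentences tend to be dropped from the summary.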
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 544