International Research Journal of Engineering and Technology (IRJET)
Volume: 04 Issue: 02 | Feb 2017 | www.irjet.net | e-ISSN: 2395-0056 | p-ISSN: 2395-0072
WEB CRAWLER FOR MINING WEB DATA

S. Amudha, B.Sc., M.Sc., M.Phil.
Assistant Professor, VLB Janakiammal College of Arts and Science, Tamilnadu, India
amudhajaya@gmail.com
ABSTRACT

A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Crawlers underpin full-text search engines and assist users in navigating the web. Web crawling is an important method for collecting data on, and keeping up with, the rapidly expanding Internet, on which users locate resources by following hypertext links. A vast number of web pages are added every day, and existing information is constantly changing, so search engines are used to extract valuable information from the web. The crawler is the principal component of a search engine: a program that visits pages and follows their links in an orderly, automated fashion. This paper gives an overview of the various types of web crawlers and of the crawling policies they follow: selection, re-visit, politeness, and parallelization.

Key Words: Web Crawler, World Wide Web, Search Engine, Hyperlink, Uniform Resource Locator.
1. INTRODUCTION

A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies (a minimal sketch of this seed-and-frontier loop is given at the end of this section). If the crawler is archiving websites, it copies and saves the information as it goes. Such archives are usually stored so that they can be viewed, read and navigated as they were on the live web, but are preserved as 'snapshots'.

The sheer volume of the web implies that a crawler can download only a limited number of pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler reaches them. The number of possible URLs generated by server-side software also makes it difficult for crawlers to avoid retrieving duplicate content: endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer several options to users, specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be reached through 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site (see the worked example at the end of this section). This combinatorial explosion creates a problem for crawlers, which must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Contributing factors to this explosive growth are the widespread use of microcomputers, the increased ease of use of computer packages and, most importantly, the tremendous opportunities the web offers to business. New tools and techniques are crucial for intelligently searching for useful information on the web [10].

Web crawling is an important method for collecting data and keeping up to date with the rapidly expanding Internet. A web crawler is a program which automatically traverses the web by downloading documents and following links from page to page. It is a tool for search engines and other information seekers to gather data for indexing and to keep their databases up to date. All search engines internally use web crawlers to keep their copies of the data fresh. A search engine is divided into different modules; among these, the crawler module is the one on which the search engine relies most, because it helps to provide the best possible results. Crawlers are small programs that 'browse' the web
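The seed-and-frontier loop described at the start of this section can be made concrete with a short sketch. The following Python fragment is only a minimal illustration under assumptions that are not part of this paper: it uses the standard library alone, stops after a fixed page limit, and omits politeness delays, robots.txt handling and re-visit scheduling, all of which a practical crawler would need.

    # Minimal sketch of a crawler: seeds -> frontier -> visit -> extract links.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects href values from anchor tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)      # URLs still to visit (the crawl frontier)
        visited = set()              # URLs already downloaded
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue             # skip pages that fail to download
            visited.add(url)
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)          # resolve relative links
                if absolute.startswith("http") and absolute not in visited:
                    frontier.append(absolute)          # grow the frontier
        return visited

Calling crawl(["https://example.com/"]) returns the set of pages reached before the limit. The crawling policies surveyed later in this paper (selection, re-visit, politeness, parallelization) are, in essence, refinements of how this loop chooses, schedules and shares the URLs in the frontier.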
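The 48-URL gallery example above is simple arithmetic: four sort orders, three thumbnail sizes, two file formats and a two-state user-content toggle give 4 × 3 × 2 × 2 = 48 URL variants for the same gallery. The short illustration below enumerates such URLs; the parameter names and values are hypothetical, and the final canonicalization step shows one common way a crawler can collapse display-only parameters so that it does not fetch essentially identical content 48 times.

    # Hypothetical gallery parameters; only the counts (4 x 3 x 2 x 2 = 48)
    # come from the text above.
    from itertools import product
    from urllib.parse import urlencode

    sort_orders     = ["name", "date", "size", "rating"]   # 4 ways to sort
    thumbnail_sizes = ["small", "medium", "large"]          # 3 thumbnail sizes
    file_formats    = ["jpg", "png"]                        # 2 file formats
    user_content    = ["on", "off"]                         # user-content toggle

    urls = ["/gallery?" + urlencode({"sort": s, "thumb": t, "fmt": f, "user": u})
            for s, t, f, u in product(sort_orders, thumbnail_sizes,
                                      file_formats, user_content)]
    print(len(urls))        # 48 distinct URLs for essentially the same gallery

    # Canonical form: keep only the parameters assumed to change the content
    # (here, hypothetically, just the user-content toggle).
    canonical = {"/gallery?" + urlencode({"user": u}) for u in user_content}
    print(len(canonical))   # 2 URLs the crawler actually needs to fetch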