International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056 | p-ISSN: 2395-0072
Volume: 04 Issue: 03 | Mar-2017 | www.irjet.net
Browser Extension to Remove DUST Using Sequence Alignment and Content Matching

Priyanka Khopkar1, D. S. Bhosal2

1PG Student, Ashokrao Mane Group of Institutions, Vathar
2Associate Professor, Ashokrao Mane Group of Institutions, Vathar
Abstract - If the documents of two URLs are similar, the URLs are called DUST. Detecting near-duplicate documents is likewise complex: their contents are similar, but with small differences. Different URLs with the same content are the source of multiple problems. Most existing methods generate very specific rules, so a large number of rules is needed to detect duplicate URLs; they also cannot detect duplicate URLs across different sites, because candidate rules are derived only from URL pairs within a dup-cluster, and their complexity is proportional to the number of specific rules generated from all clusters. The proposed system uses a URL normalization process to identify DUST and presents a new method that obtains a smaller and more general set of normalization rules through multiple sequence alignment. The method generates rules at an acceptable computational cost even when crawling at large scale, and the contents of a valid URL can then be fetched from its web server for verification.

Key Words: Crawlers, DUST, Uniform Resource Locator (URL).
1. INTRODUCTION

Many URLs on the web serve similar content. Different URLs with similar content are known as DUST. Crawling these duplicate URLs results in a poor user experience and a waste of resources, so detecting DUST is an important task for a search engine. In the proposed method, duplicate URLs are converted into the same canonical form, which web crawlers can use to avoid and remove DUST. The proposed system combines multiple sequence alignment with URL content matching. Multiple sequence alignment is the natural generalization of the pairwise alignment problem to a set of more than two sequences; it requires all sequences to be of the same length, so spaces are inserted at the appropriate places. In the proposed system, multiple sequence alignment is used to obtain a smaller and more general set of rules for avoiding duplicate URLs: it identifies identical patterns across similar strings, which can then be used for deriving normalization rules. More general rules generated by the multiple sequence alignment algorithm remove the duplicate URLs with similar contents.
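To make the idea concrete, the sketch below aligns the token sequences of two duplicate URLs and generalizes the positions where they differ into a wildcard, yielding a candidate normalization rule. This is a minimal illustration under simplifying assumptions: the proposed system aligns many URLs of a dup-cluster at once, whereas this sketch aligns a single pair using Python's difflib, and the example URLs are hypothetical.

```python
# Minimal sketch: derive a candidate normalization rule by aligning the
# token sequences of two duplicate URLs. Illustration only -- the proposed
# system performs multiple sequence alignment over whole dup-clusters,
# while difflib here aligns just one URL pair.
from difflib import SequenceMatcher
import re

def tokenize(url):
    # Split on URL delimiters but keep them, so the pattern can be rebuilt.
    return [t for t in re.split(r'([/?&=.])', url) if t]

def derive_rule(url_a, url_b, wildcard='*'):
    """Replace the tokens where the two URLs diverge with a wildcard,
    producing a more general candidate rule."""
    a, b = tokenize(url_a), tokenize(url_b)
    pattern = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op == 'equal':
            pattern.extend(a[i1:i2])   # shared tokens stay literal
        else:
            pattern.append(wildcard)   # divergent tokens generalize
    return ''.join(pattern)

print(derive_rule('http://site.com/story?id=123&print=yes',
                  'http://site.com/story?id=456&print=yes'))
# -> http://site.com/story?id=*&print=yes
```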
After the URL normalization process, the normalized URLs are sent to URL content matching for further comparison; earlier efforts focused on comparing document content, inspecting the URLs by fetching the corresponding page contents. After the duplicate URLs are removed, a further algorithm is proposed for the novel and interesting problem of extracting top-k lists from the web. Compared to other structured data, top-k lists are cleaner, easier to understand, and more interesting for human consumption, and are therefore an important source for data mining and knowledge discovery. The algorithm automatically extracts over 1.7 million such lists from a web snapshot and also discovers the structure of each list.

The goal of the proposed system is to detect DUST and remove duplicate URLs. This is achieved with two algorithms: one that detects the DUST rules from a list, and one that converts the URLs into the same canonical form. A list consists of a URL and an HTTP code and can be obtained from web server logs. The algorithm that detects DUST rules creates an ordered list of DUST rules from a website. Canonization is the process of converting every URL into a single canonical form, and the detected DUST rules make efficient canonization possible.
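A minimal sketch of canonization is given below, assuming DUST rules are represented as an ordered list of regular-expression rewrites; the three rules shown are hypothetical examples for illustration, not rules produced by the actual system.

```python
# Minimal sketch of canonization: every URL is rewritten by an ordered
# list of DUST rules so that duplicates converge to one canonical form.
# The rules below are hypothetical examples, not output of the system.
import re

RULES = [
    (re.compile(r'^https?://www\.'), 'http://'),        # unify scheme/host variants
    (re.compile(r'[?&](sessionid|print)=[^&]*$'), ''),  # drop an irrelevant trailing parameter
    (re.compile(r'/index\.html?$'), '/'),               # default page -> directory
]

def canonize(url):
    """Apply each rule in order; rule order matters here, e.g. the
    parameter is stripped before the trailing /index.html can match."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url

dust = ['http://www.site.com/story/index.html?print=yes',
        'http://site.com/story/']
print({canonize(u) for u in dust})   # both collapse to {'http://site.com/story/'}
```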
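The URL content matching step described above can likewise be sketched: after normalization, a surviving pair of candidate URLs is verified by fetching both pages and comparing their word shingles. The shingle length and the similarity threshold below are assumed values for illustration, not parameters taken from the paper.

```python
# Minimal sketch of URL content matching: fetch two candidate pages and
# compare their k-word shingle sets with Jaccard similarity. The shingle
# length k=5 and threshold 0.9 are assumptions, not values from the paper.
from urllib.request import urlopen

def fetch(url):
    # Fetch the raw page and decode it to text for comparison.
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='ignore')

def shingles(text, k=5):
    """Set of k-word shingles, a standard near-duplicate fingerprint."""
    words = text.split()
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}

def is_dust(url_a, url_b, threshold=0.9):
    """Near-duplicate test: Jaccard similarity of the two shingle sets."""
    s_a, s_b = shingles(fetch(url_a)), shingles(fetch(url_b))
    if not s_a or not s_b:
        return False
    return len(s_a & s_b) / len(s_a | s_b) >= threshold
```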
2. RELATED WORK

A. Agarwal, H. S. Koppula, K. P. Leela, K. P. Chitrapura, S. Garg, P. Kumar GM, C. Haty, A. Roy and A. Sasturkar [1] propose a set of techniques to mine rules from URLs and use the learnt rules for de-duplication using just the URL strings, without fetching the content explicitly. The rule