Review on an automatic extraction of educational digital objects and metadata from institutional web

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 04 Issue: 02 | Feb-2017

p-ISSN: 2395-0072

www.irjet.net

Review on An Automatic Extraction of Educational Digital Objects and Metadata from Institutional Websites

Kajal K. Nandeshwar1, Prof. Praful B. Sambhare2

1 M.E. IInd year, Dept. of Computer Science, P. R. Pote College of Engg, Amravati, Maharashtra, India
2 Assistant Professor, Dept. of Computer Science, P. R. Pote College of Engg, Amravati, Maharashtra, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - The Web grows constantly and exponentially every day, so gathering relevant information by hand becomes unfeasible. Web mining is the extraction of useful information from web content. Many learning object repositories store high-quality learning materials. Such materials are expensive to create, so it is important to ensure that they are reused. Learning material can be tagged either manually or automatically; manual annotation is a time-consuming and expensive process. Various proposals have been developed for automatic information gathering systems. The first is Agathe, a multi-agent system for gathering information on restricted domains. The next is Crossmarc, a multi-domain system based on multilingual agents for extracting information from web pages. Another is CiteSeerX, a scientific literature digital library and search engine. This paper presents a review aimed at increasing the efficiency of automatic information gathering.

Key Words: EDOs, gathering information, website, repository, automatic, extraction

1. INTRODUCTION

Among other things, the Internet is used as a source of educational information [1]. A learning object repository stores content, assets, or resources together with their metadata records [2]. An educational resource is also called a digital object, learning object, learning resource, digital resource, digital content, reusable learning object, or educational content (McGreal, 2004). An Educational Digital Object (EDO) is any material in digital format that can be used as an educational resource; for example, a scientific publication or an educational material used in a class [6]. The World Wide Web (WWW) is a vast repository of interlinked hypertext documents known as web pages. A hypertext document consists of both content and hyperlinks to related documents. The volume of information on the Web is huge, and it must be used efficiently to satisfy the information needs of users effectively [7]. Various systems are used for automatic information gathering, such as Agathe, Crossmarc, and CiteSeerX. All these works consider documents that have some type of structure, such as calls for papers or scientific papers. In all the cases previously analyzed, only the information contained in the document itself is extracted; information that could be found in linked websites is not explored [6].

2. PRELIMINARIES

2.1 Web crawling

A Web crawler is a program that inspects web pages in a methodical, automated way [8]. One of its common uses is to create a copy of all visited web pages for later processing by a search engine, which indexes the pages to provide fast search. A crawler starts by visiting a list of URLs, identifies the links in those pages, and adds them to the list of URLs to visit, repeating the process according to a given set of rules. In the usual case, the crawler starts from a group of initial URLs, downloads the linked resources, and analyzes them to look for links to new resources, typically HTML pages; this process is repeated until the final conditions are reached. These conditions vary according to the desired crawling policy [6].

2.2 Information Extraction, Retrieval, and Gathering

The main goal of Information Extraction systems is to locate information in natural-language text documents and to produce as output unambiguous, structured tabular data that can be summarized and presented in a uniform way [9]. It is increasingly necessary to extract information from the web for different purposes [6]. An Information Retrieval system retrieves the relevant documents from a larger collection of documents, while an Information Extraction system extracts the relevant information from one or more documents. The two techniques are therefore complementary, and used in combination they can yield powerful tools for text processing [6]. Because of the growth of the web and the heterogeneity of its pages, gathering information is increasingly complicated. A Gathering Information System is responsible for performing the retrieval and extraction of information in well-defined collections. To retrieve relevant information, the gathering should be restricted to specific domains [6].
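The crawling procedure described in Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Agathe, Crossmarc, CiteSeerX, or any other reviewed system. The names `LinkParser`, `crawl`, `fetch`, `allowed_host`, `max_pages`, and the in-memory `PAGES` site are hypothetical. The crawling policy assumed here is breadth-first, restricted to a single host, and stopped by a page limit. The page title stands in for the metadata that a gathering system would extract.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(seeds, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl from the seed URLs, restricted to one host.

    fetch(url) must return HTML text, or None if the page is unavailable.
    It is passed in so that the sketch runs without network access.
    Returns a dict mapping each visited URL to its page title.
    """
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisits
    visited = {}
    # Final conditions: frontier exhausted or page limit reached.
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        visited[url] = parser.title.strip()
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # absolute, no fragment
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory site standing in for an institutional website.
PAGES = {
    "http://example.edu/": (
        '<title>Home</title><a href="/courses">Courses</a>'
        '<a href="http://other.org/">External</a>'
    ),
    "http://example.edu/courses": (
        '<title>Courses</title><a href="/">Home</a>'
        '<a href="notes.html">Notes</a>'
    ),
    "http://example.edu/notes.html": "<title>Lecture notes</title>",
}

result = crawl(["http://example.edu/"], PAGES.get, "example.edu")
```

Restricting the frontier to `allowed_host` is one simple way to keep the gathering inside a specific domain. A real crawler would replace `PAGES.get` with an HTTP client and would also need politeness rules, such as honoring robots.txt, which this sketch omits.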

© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1229
