Document Type

Research Paper

College Affiliation

College of Science and Engineering

Department or Program Affiliation

Computer Science Department

Abstract

Vertical search engines allow users to query for information within a subset of documents relevant to a pre-determined topic (Chakrabarti, 1999). One challenging aspect to deploying a vertical search engine is building a Web crawler that distinguishes relevant documents from non-relevant documents. In this research, we describe and analyze various methods to crawl relevant documents for vertical search engines, and we examine ways to apply these methods to building a local search engine. In a typical crawl cycle for a vertical search engine, the crawler grabs a URL from the URL frontier, downloads content from the URL, and determines the document’s relevancy to the pre-defined topic. If the document is deemed relevant, it is indexed and its links are added to the URL frontier. Two questions are raised in this process: how do we judge a document’s relevance, and how should we prioritize URLs in the frontier in order to reach the best documents first? To determine the relevancy of a document, we may hold on to a set of pre-determined keywords that we attempt to match in a crawled document’s content and metadata. Another possibility is to use relevance feedback, a mechanism where we train the crawler to spot relevant documents by feeding it training data. In order to prioritize links within the URL frontier, we can use a breadth-first crawler where we just index pages one level at a time, bridges which are pages that aren’t crawled but used to gather more links, reinforcement learning where the crawler is rewarded for reaching relevant pages, and decision trees where the priority given to a link depends on the quality of the parent page.

Subject – LCSH

Search engines, Text processing (Computer science), Sorting (Electronic computers), Relevance, Database searching

Genre/Form

Term papers

Publisher

Western Washington University

Date

2008

Rights

Copying of this document in whole or in part is allowable only for scholarly purposes. It is understood, however, that any copying or publication of this document for commercial purposes, or for financial gain, shall not be allowed without the author’s written permission.

Type

Text

Format

application/pdf

Language

English

Share

COinS