1 / 1

Web

Google. User provided/ Discovered Sites. Generate queries. API. Thesaurus. Crawl pages: eliminate old or duplicate ones. Crawl. Search potential pages using broad terms in the thesaurus; many results returned. Clear noisy content (ads, menus, etc) & index text. Extract & Index

savea
Download Presentation

Web

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Google User provided/ Discovered Sites Generate queries API Thesaurus Crawl pages: eliminate old or duplicate ones. Crawl Search potential pages using broad terms in the thesaurus; many results returned. Clear noisy content (ads, menus, etc) & index text. Extract & Index Content Back-End Database Filter irrelevant web pages using text algorithms that analyze page's topic. Also update site list according to overall score. Filter Thesaurus Improve the thesaurus with relevancy (i.e. ranking) feedback. Classifier (News, Technical Docs, etc.) Classify discovered pages. Web Portal Feed the web portal with the new results. Web Retrieved Pages Relevant Pages Figure 2: Visualization of thesaurus in Oracle Text system Figure 3: Part of an interface used to set system parameters Further Research Find ways to automatically update thesaurus. Enrich thesaural relationships for effective relevancy judgment. Take temporal information into account when scoring relevancy. Introduce IE and QA tools into web crawling. Figure 4: Interface used for system evaluation FOCUSED CRAWLING: EXPERIENCES IN A REAL WORLD PROJECT Antonio Badia, Tulay Muezzinoglu, Olfa Nasraoui University of Louisville {abadia, t0muez01,olfa.nasraoui}@louisville.edu Abstract Challenges Approaches Anatomy of the System In this project, we developed a focused web crawler to retrieve web resources on a user-provided topic. Given a thesaurus and a set of URLs by the user, we used Google API to get more seed URLs for the topic. We mainly used Oracle DB to store and index crawled pages, and utilized the provided thesaurus in ranking and filtering the content. We review some of the problems encountered, roughly dividing them into practical or engineering issues and conceptual issues. • There is no perfect search engine: • Mostly limited to keyword based search. • Absence or existence of a word does not necessarily imply irrelevancy or relevancy: ‘vessel’, ‘ready to ship’. • You cannot ask for pages about a topic: ‘rust in ships’. • Millions of results are returned. • Relevancy to a topic: • Difficult to determine. • Utilize thesauri: easy to maintain. • Network structure: • If page p is about a topic t and p links to p’, there is a chance that page p’ is about t. • Start with pages about the desired topic. • Beware of topic drift. • Find the pages linking to a given page to determine the topic: backlinks. • Web is not a fully connected network: allow new source discovery mechanism. • Keep track of good sites and hubs. • Unit of work: • Page may contain several topics: page is too coarse unit. • Entire site may be devoted to a certain topic: page is not an apt unit, but site is. • Useful information: • Extract real content (Clean forms, footers, advertisements, site navigation menus, etc.) • Detect good hubs: No actual content. • Eliminate forums, blogs: no easy way to determine their quality. • Freshness of information: • Very important in many cases, disregarded by most search engines. • Last modified date in http response header: not always available. • For announcements last modified date is not useful. • Page itself may contain time (date) information, but it is very difficult to find - Information Extraction techniques are needed. • Difficulty in knowledge representation • No formal definition of what a topic is. • No consistent assignment of relationships • Automated searches demand for thesauri with relationships that are subcategorized. Goal The National Surface Treatment Center partners with the Navy, DoD Operations, and industry to fight corrosion and solve coating problems. Its web site aims to be the main reference point for people and organizations involved in these problems. The goal is to help the NST Center achieve its objectives for the web portal by gathering relevant information from the web, distributing this information to users and interested parties. • It is impossible to guarantee that all pages returned are relevant (Figure 1): • Quality of content cannot always be guaranteed. • A page may contain noisy content (adds, links, etc.). • Only part of a page may contain relevant data. • Information may be old. • It is impossible to guarantee that all relevant pages are returned (even inspected) (Figure 1): • Millions of pages available, many of them not indexed in any search engine ("hidden web"), • Pages come and go; what is there one day may not be there next. Tools Focused Web Crawler: A program that explores the web trying to get pages on a particular topic, Thesaurus: Series of semantic networks built-up by a collection of keywords connected by various relationships. Written in Java. Web interface with JSP. Many PL/SQL functions to support effective thesaurus usage in Oracle. Incorporated open source packages such as Htmlparser, JTidy, Lucene. Application Interface Figure 1 Highlights of the Algorithm Use Google to increase recall. Initial gathering of pages guided by several keyword searches; each one uses a few top thesaurus terms ("cast a wide net"). Keep a list of good sites, hub pages. Increase weights of the keywords with their depth in the thesaurus when scoring. Variety of keywords are rewarded. Negative word list is constructed: if a word from the list exists, score of page is significantly lowered . Acknowledgements This research is supported by Innovative Productivity, Inc., a nonprofit Kentucky company that runs the National Surface Treatment Center for the U.S. Navy. The work of A. Badia was supported by National Science Foundation CAREER award IIS-0347555. The work of O. Nasraoui was supported by National Science Foundation CAREER Award IIS-0133948. Knowledge Discovery & Web Mining Research / E-commerce Lab Database Lab CECS Department Director: Dr. Olfa Nasraoui Director: Dr. Antonio Badia

More Related