
Chapter 7: Web Content Mining

Building an Intelligent Web: Theory and Practice. Rajendra Akerkar, Pawan Lingras. Chapter 7: Web Content Mining. Presented by: Qi Jia, 12-4-12. University of North Texas, DSCI 5240, Fall 2012, Graduate Presentation, Option A.


Presentation Transcript


  1. Building an Intelligent Web: Theory and Practice. Rajendra Akerkar, Pawan Lingras. Chapter 7: Web Content Mining. Presented by: Qi Jia, 12-4-12. University of North Texas, DSCI 5240, Fall 2012, Graduate Presentation, Option A. Slides modified from the 2008 Jones and Bartlett Publishers, Inc. version.

  2. OUTLINE
     • Introduction
     • Crawlers
     • Queries
     • Search Engine

  3. INTRODUCTION
     1. Web content mining
     2. Uses of web-content mining techniques
     3. Problems with the web data
     4. Two approaches of web-content mining

  4. INTRODUCTION: First Two Topics
     Web content:
     • Textual
     • Audio
     • Video
     • Still images
     • Metadata
     • Hyperlinks
     Uses of web-content mining techniques:
     • Web-content mining techniques are used to discover useful information from content on the web.
     • Some of the web content is generated dynamically using queries to database management systems.
     • Other web content may be hidden from general users.

  5. INTRODUCTION: Problems with the web data
     • Distributed data
     • Large volume
     • Unstructured data
     • Redundant data
     • Quality of data
     • High percentage of volatile data
     • Varied data

  6. INTRODUCTION: Two approaches of web-content mining
     • 1st, agent-based: software agents perform the content mining.
     • 2nd, database-oriented: view the Web data as belonging to a database.

  7. CRAWLERS
     • Web Crawler
     • Context Graph
     • Focused Crawler

  8. CRAWLERS
     Web crawler:
     • A computer program that navigates the hypertext structure of the web.
     • Builds an index by visiting a number of pages and then replaces the current index.
     Crawling process:
     - Begin with a group of URLs
     - Traverse breadth-first or depth-first
     - Extract more URLs
     Context graph:
     - Focused crawling has proposed the use of context graphs, which in turn led to the context focused crawler (CFC).
     - The CFC performs crawling in two steps.
     Numerous crawlers:
     - Problem of redundancy
     - Partition the Web, with one robot per partition
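The crawling process on this slide (start from a group of URLs, traverse breadth-first or depth-first, extract more URLs) can be sketched as follows. This is a minimal illustration, not the book's implementation: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web so the sketch runs without network access, and the names `LinkExtractor`, `extract_links`, and `crawl` are assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
A, B = "http://example.org/a", "http://example.org/b"
C, D = "http://example.org/c", "http://example.org/d"
PAGES = {
    A: f'<a href="{B}">b</a> <a href="{C}">c</a>',
    B: f'<a href="{D}">d</a>',
    C: f'<a href="{A}">a</a>',
    D: "",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def crawl(seeds, breadth_first=True, max_pages=100):
    """Visit pages reachable from the seed URLs and return the visit order."""
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # avoids queueing the same URL twice
    order = []
    while frontier and len(order) < max_pages:
        # A queue (popleft) gives breadth-first; a stack (pop) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in extract_links(PAGES.get(url, "")):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

The only difference between the two traversal orders is which end of the frontier the next URL is taken from; the `seen` set is what keeps the cycle between pages `a` and `c` from looping forever.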

  9. CRAWLERS: Focused Crawler
     Focused crawler:
     • Generally recommended for use due to the large size of the Web
     • Visits pages related to topics of interest
     Two major parts:
     • The focused crawler structure consists of two major parts: the distiller and the hypertext classifier.
     Priority-based structure:
     • The pages that the crawler visits are selected using a priority-based structure, ordered by the priority that the classifier and the distiller assign to each page.
     Documents:
     • Sample documents are identified and classified based on a hierarchical classification tree.
     • These documents are used as the seed documents to begin the focused crawling.
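The priority-based structure described on this slide can be sketched with a heap-backed frontier. This is an illustration under stated assumptions, not the focused crawler from the book: the keyword-overlap `relevance` function is a hypothetical stand-in for the hypertext classifier, the distiller is not modeled at all, and `WEB`, `TOPIC`, and `focused_crawl` are names invented for this example.

```python
import heapq

# Hypothetical toy web: URL -> (page text, outgoing links).
WEB = {
    "seed": ("web mining survey", ["p1", "p2"]),
    "p1": ("web cooking", ["p3"]),
    "p2": ("web content mining crawler", ["p4"]),
    "p3": ("more recipes", []),
    "p4": ("focused crawler mining", []),
}
TOPIC = {"web", "mining", "crawler"}  # assumed topic of interest


def relevance(text):
    """Fraction of words that are topic words (stand-in for a classifier)."""
    words = text.split()
    return sum(w in TOPIC for w in words) / len(words) if words else 0.0


def focused_crawl(seeds, max_pages=10):
    """Visit pages in priority order; return the list of visited URLs."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        visited.append(url)
        score = relevance(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the relevance of the citing page.
                heapq.heappush(frontier, (-score, link))
    return visited
```

In this toy web, `p4` is discovered after `p3` but visited before it, because the page that links to `p4` scored higher than the page that links to `p3`.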

  10. SEARCH ENGINE
     1. Examples of search engines
     2. Components of a search engine
     3. Search engine mechanism
     4. Responsibilities of search engines

  11. SEARCH ENGINE: Examples and Components
     Components:
     • A search engine uses a 'spider' or 'crawler' that crawls the Web hunting for new or updated Web pages to store in an index.
     • Basic components of a search engine:
       - The spider: gathers new or updated information on Internet websites
       - The index: used to store information about several websites
       - The search software: searches through the huge index in an effort to generate an ordered list of useful search results
     Examples:
     Search engine   URL
     AltaVista       www.altavista.com
     Excite          www.excite.com
     Google          www.google.com
     Infoseek        www.infoseek.com
     Lycos           www.lycos.com
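The division of labor between the index and the search software can be sketched with a small inverted index. This is a minimal sketch, not how any of the listed search engines works: `DOCS` is a hypothetical set of pages that a spider is assumed to have gathered already, and ranking by the number of matching query terms is a simplification chosen for this example.

```python
from collections import defaultdict

# Hypothetical pages already gathered by the spider: URL -> page text.
DOCS = {
    "www.example.org/mining": "web content mining discovers useful information",
    "www.example.org/crawl": "a crawler gathers new or updated web pages",
    "www.example.org/cook": "recipes for dinner",
}


def build_index(docs):
    """The index: maps each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """The search software: returns URLs ordered by number of matching terms."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

For example, `search(build_index(DOCS), "web mining")` ranks the mining page ahead of the crawler page, because the former matches both query terms and the latter only one.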

  12. SEARCH ENGINE: Mechanism and Responsibilities
     Search engine mechanism:
     • The generic structure of all search engines is basically the same.
     • However, the search results differ from search engine to search engine for the same search terms.
     Responsibilities of search engines:
     • Document collection
       - choose the documents to be indexed
     • Document indexing
       - represent the content of the selected documents
       - two indices are frequently maintained
     • Searching
       - translate the user's information need into a query
     • Retrieval
     • Document and query management
       - present the outcome
       - virtual collection

  13. QUERIES: Phases of Queries
     • Three-tier process of translating the user's need into a search engine query:
       1st: The user formulates the information need as a question or a list of terms, drawing on their own experience and vocabulary, and enters it into the search engine.
       2nd: The search engine must translate the words, with possible spelling errors, into processing tokens.
       3rd: The search engine must use the processing tokens to search the document database and retrieve the appropriate documents.
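The second level, turning words with possible spelling errors into processing tokens, can be sketched with the standard library's `difflib.get_close_matches`. This is one possible approach, not the method the slides or the book prescribe: the `VOCABULARY` list, the 0.8 similarity cutoff, and the function name `to_processing_tokens` are all assumptions made for this example.

```python
import difflib

# Hypothetical vocabulary of terms known to the search engine's index.
VOCABULARY = ["web", "content", "mining", "crawler", "search", "engine"]


def to_processing_tokens(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Lowercase and split the query, mapping each word to the closest
    vocabulary term when one is similar enough; otherwise keep the word."""
    tokens = []
    for word in query.lower().split():
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        tokens.append(matches[0] if matches else word)
    return tokens
```

Words with no sufficiently close vocabulary term pass through unchanged, so the third level can still try to match them against the document database.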

  14. QUERIES: Types of Queries
     1. Boolean queries: Boolean logic queries connect words in the search using operators such as AND or OR.
     2. Natural language queries: the user frames the information need as a question or a statement.
     3. Thesaurus queries: the user selects the term from a set of terms predetermined by the retrieval system.
     4. Fuzzy queries: fuzzy queries reflect no specificity.
     5. Term searches: the most common type of query on the Web, in which a user provides a few words or phrases for the search.
     6. Probabilistic queries: refer to the way in which the IR system retrieves documents according to relevancy.
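A Boolean query with AND and OR maps naturally onto set intersection and union over an inverted index. The sketch below is illustrative only: `INDEX` is a hypothetical term-to-documents mapping, `boolean_search` is a name invented here, and the query is evaluated strictly left to right with no operator precedence or parentheses, which is a simplification.

```python
# Hypothetical inverted index: term -> set of document identifiers.
INDEX = {
    "web": {"d1", "d2", "d3"},
    "mining": {"d1", "d3"},
    "crawler": {"d2"},
}


def boolean_search(index, query):
    """Evaluate 'term OP term OP term ...' left to right, OP in {AND, OR}."""
    tokens = query.split()
    if not tokens:
        return set()
    # Copy the first postings set so the index itself is never modified.
    result = set(index.get(tokens[0].lower(), set()))
    for op, term in zip(tokens[1::2], tokens[2::2]):
        postings = index.get(term.lower(), set())
        if op == "AND":
            result &= postings   # documents containing both sides
        elif op == "OR":
            result |= postings   # documents containing either side
        else:
            raise ValueError(f"unsupported operator: {op}")
    return result
```

A plain term search with a few words, the most common query type on this slide, corresponds to joining the words with a single implicit operator.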

  15. Thank you for your attention!
