
CS5201 Sem A 2002 2002 C H Lee




    Slide 1:Web Searching

    Web searching paradigms
    Search engine
    Page discovery
    Indexing
    Ranking
    Metasearcher
    Measuring search

    Slide 2:Basic Search Paradigms

    Querying: The user formulates a query and sends it to a search engine (server). The search engine processes the query and sends back a set of pages as its response.
    Browsing: The user visits successive web pages, “navigating” through the Web.
    Interleaved: The user sends an “approximate” query, obtains a set of pages from the search engine, and browses from the page that seems most relevant. While browsing, the user may formulate a “more accurate” query.

    Slide 3:Basic Search Paradigms (Cont’d)

    Web directories: A web page organized as a “table of contents” style directory of main topics.
    An attempt to classify a portion of the Web.
    An alternative to querying/browsing: the user scans the directory, and may also issue a query within a directory.
    Retrieved information is usually of higher relevance (quality). Limiting the domain sharpens searching.

    Slide 4:Search Engine

    A server that processes users’ queries and returns links to relevant pages.
    Main functional components:
      Web crawler (or spider): Crawls the Web to discover pages.
      Indexer: Extracts key-terms (keywords or phrases) from each page and associates them with the link to the page. Produces a data structure for efficient search.
      Query processor: Processes user queries against the index.
      User interface: Solicits users’ queries and presents responses.

    Slide 5:Search Engine Architecture

    [Diagram with the components: Users, Web, User Interface, Web Crawler, Query Processor, Indexer, Index]

    Slide 6:Web Crawler

    Recursive crawl: Start with an initial URL, get the page, and get the other links within the page.
    Different traversal modes: breadth-first, depth-first.
    Multiple crawlers: Partition the Web (say, using domain names) and assign crawlers to different partitions.
    Crawling may load down Web servers. Some servers may restrict access from crawlers.

    Slide 7:Web Crawler (cont’d)

    Revisit frequency: Pages may get updated or deleted.
    Page owner submission: A page is submitted to the search engine site. The crawler may start an exhaustive crawl from that page, or may search only to a limited depth.
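
The traversal modes and the depth limit from the two crawler slides can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: `TOY_WEB`, the `*.example` URLs, and the `crawl` function are hypothetical, and the "web" is an in-memory dictionary rather than pages fetched over HTTP.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch each page over the network and parse out its links.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/y"],
    "a.example/y": [],
    "b.example/": ["a.example/", "b.example/z"],
    "b.example/z": [],
}

def crawl(start_url, web, breadth_first=True, max_depth=None):
    """Return the URLs reachable from start_url, in the order they are visited."""
    frontier = deque([(start_url, 0)])
    visited = set()
    order = []
    while frontier:
        # A queue (FIFO) gives breadth-first order; a stack (LIFO) gives depth-first.
        url, depth = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # limited-depth crawl: do not follow links any further
        links = web.get(url, [])
        # For the stack, push in reverse so the first link on the page is followed first.
        for link in (links if breadth_first else reversed(links)):
            if link not in visited:
                frontier.append((link, depth + 1))
    return order

print(crawl("a.example/", TOY_WEB, breadth_first=True))
print(crawl("a.example/", TOY_WEB, breadth_first=False))
print(crawl("a.example/", TOY_WEB, max_depth=1))
```

Partitioning among multiple crawlers could be layered on top by giving each crawler only the start URLs of its own domain.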

    Slide 8:Indexer

    Scans a page to extract keywords (or key-terms).
    Builds an inverted list (or inverted file): for each keyword, a set of pointers to the pages (actually, links to the pages) where the word appears.
    The pointer to a page also includes a weight: an indication of the relevance of the page with respect to the keyword. A commonly used weight factor is the frequency count.
    Also included is some description of the page: title, size, a few lines containing the keyword, etc.
    Typical storage for 100 million pages:
      50 Gbyte for page (URL) descriptions (at 500 bytes each)
      150 Gbyte for the inverted list
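
A minimal sketch of building such an inverted list, using the frequency count as the weight. The page texts, the `fish.example` URLs, and the function name `build_inverted_list` are made up for illustration. The last lines check the slide's storage estimate for page descriptions.

```python
from collections import Counter, defaultdict

def build_inverted_list(pages):
    """pages: dict of url -> text. Returns keyword -> list of (weight, url) pairs."""
    inverted = defaultdict(list)
    for url, text in pages.items():
        counts = Counter(text.lower().split())  # frequency count per word
        for word, freq in counts.items():
            inverted[word].append((freq, url))
    # Order each keyword's pages by descending weight (ties broken by URL).
    for postings in inverted.values():
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(inverted)

# Hypothetical pages, for illustration only.
PAGES = {
    "fish.example/a": "catfish catfish farm",
    "fish.example/b": "catfish recipe",
}
index = build_inverted_list(PAGES)
print(index["catfish"])

# Storage estimate from the slide: 100 million pages at 500 bytes each.
description_bytes = 100_000_000 * 500
print(description_bytes / 1e9, "Gbyte")
```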

    Slide 9:Inverted List

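    [Figure: a lexicon (or vocabulary) in which each keyword points to a chain of page entries. Each entry holds a weight, a link (URL), a page description, and a pointer to the next page. Example entries: weight 8, www.catf.., “Catfish Institute ..”; weight 5, www.plann...., “..is a good fish ..”.]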

    Slide 10:Inverted List (Cont’d)

    The inverted list is organized to optimize searching. The entries are sorted to allow:
      Binary search: search time ~ log2(N); log2(1 billion) ~ 30.
      Interpolation search: search time ~ log2(log2(N)); log2(log2(1 billion)) ~ 5. Substantial processing overhead.
    The set of pages (links) corresponding to a keyword can be ordered by weight. One commonly used weight is the frequency count: the number of occurrences of the word in the page.
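
The two search strategies and the step counts quoted above can be sketched as follows. The function names and sample data are illustrative. The interpolation search here works on integer keys, and its log2(log2(N)) behaviour is an average case that assumes roughly uniformly distributed keys.

```python
import math
from bisect import bisect_left

def lookup(sorted_lexicon, word):
    """Binary search for word in a sorted list; returns its index or -1."""
    i = bisect_left(sorted_lexicon, word)
    return i if i < len(sorted_lexicon) and sorted_lexicon[i] == word else -1

def interpolation_search(keys, target):
    """Interpolation search over sorted integer keys; returns an index or -1."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            return lo if keys[lo] == target else -1
        # Estimate the position from the target's value, not just the midpoint.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

n = 1_000_000_000
binary_steps = math.log2(n)                    # about 30
interpolation_steps = math.log2(math.log2(n))  # about 5
print(round(binary_steps), round(interpolation_steps))
```

The extra arithmetic per probe is one plausible reading of the "substantial processing overhead" noted on the slide.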

    Slide 11:Indexing a Page

    Scan the page to extract keywords (and/or key-terms).
    Ignore stopwords (the, a, an, and, or, I, you, etc.): the 100 most frequent words make up ~50% of a document.
    Stemming: Replace all variants of a word with the single stem of the word, e.g. communicate: communicates, communicating, communicated, communication, …
    Stopword elimination and stemming reduce inverted list size and improve search speed.
    There are various possibilities for deciding the weight of indexed words or terms: frequency count, appearance in title, etc., and combinations.
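
A toy sketch of stopword elimination and stemming. The stopword set is just the examples from the slide. The suffix list is a deliberately naive assumption chosen to handle the "communicate" example, and it is not a real stemming algorithm such as Porter's.

```python
import re

# Stopwords taken from the slide's examples; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "i", "you"}

# Naive suffix list (an assumption for illustration); checked in this order.
SUFFIXES = ("ion", "ing", "es", "ed", "e", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text):
    """Lowercase, tokenize, drop stopwords, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

variants = ["communicate", "communicates", "communicating",
            "communicated", "communication"]
print({naive_stem(v) for v in variants})
print(extract_keywords("You communicate and the catfish"))
```

All five variants collapse to one index entry, which is how stemming shrinks the inverted list.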

    Slide 12:User Interface - Query Specification

    Basic specification: keywords
      Disjunction (OR), e.g., AltaVista
      Conjunction (AND), e.g., Google
    Advanced query interface
      Boolean operators: AND, OR, etc.
      Phrase match (phrase in quotes, e.g., “tender heart”)
      Proximity, wild cards.
      Filtering by date, internet domain, etc.
      In AltaVista, terms can be specified by importance (separate from the query specification)
      Content: multimedia, .PDF, .PPT files

    Slide 13:Query Processing

    Searching the inverted list: List entries are in sorted order, which significantly reduces search time.
    Query terms processing: Boolean, e.g., OR corresponds to the union of search results, AND to the intersection, etc. May require additional information stored for each page (link): e.g., PROXIMITY requires storing the positions of keywords.
    Ranking: Determining the order (or priority) for presentation to the user.
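
A small sketch of Boolean and proximity processing over a positional index. The index contents, page names, and function names are hypothetical. Each keyword maps to the pages containing it, and each page maps to the word positions where the keyword occurs.

```python
# Hypothetical positional index: keyword -> {page: [word positions]}.
INDEX = {
    "tender": {"p1": [0, 7], "p2": [3]},
    "heart": {"p1": [1], "p3": [2]},
}

def pages_or(index, a, b):
    """OR: union of the page sets of the two keywords."""
    return set(index.get(a, {})) | set(index.get(b, {}))

def pages_and(index, a, b):
    """AND: intersection of the page sets of the two keywords."""
    return set(index.get(a, {})) & set(index.get(b, {}))

def pages_near(index, a, b, max_gap=1):
    """PROXIMITY: pages where a and b occur within max_gap words of each other."""
    hits = set()
    for page in pages_and(index, a, b):
        if any(abs(pa - pb) <= max_gap
               for pa in index[a][page]
               for pb in index[b][page]):
            hits.add(page)
    return hits

print(sorted(pages_or(INDEX, "tender", "heart")))
print(sorted(pages_and(INDEX, "tender", "heart")))
print(sorted(pages_near(INDEX, "tender", "heart")))
```

Without the stored positions, `pages_near` could not be answered from the index alone, which is the extra storage cost the slide mentions.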

    Slide 14:Ranking

    Basic model
      Page (document) modeled as a vector of keyword-weight pairs: $P = \{(kw_1, w_1), (kw_2, w_2), \ldots, (kw_t, w_t)\}$
      Query modeled as a “specification” for the desired page(s) (the ideal answer to the query): $Q = \{(kw_1, u_1), (kw_2, u_2), \ldots, (kw_t, u_t)\}$
      The ranking algorithm calculates a rank value $R(P, Q)$. An example: $R = \sum_{i=1}^{t} w_i u_i$
    Weight is used to rank a page and can be made to depend on:
      Presence of the keyword in the title of the document.
      Frequency/count of the keyword in the document.
      Link popularity (how many other pages point to this one).
      And/or combinations.
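
The example rank function, the sum of the products of page weights and query weights over shared keywords, can be sketched as follows. The page and query weights are invented numbers for illustration.

```python
def rank(page_weights, query_weights):
    """R(P, Q) = sum over keywords of w_i * u_i (missing keywords count as 0)."""
    return sum(w * query_weights.get(kw, 0.0) for kw, w in page_weights.items())

# Hypothetical keyword weights per page (e.g. frequency counts).
PAGE_VECTORS = {
    "p1": {"catfish": 3.0, "farm": 1.0},
    "p2": {"catfish": 1.0, "recipe": 2.0},
}
QUERY = {"catfish": 1.0, "recipe": 0.5}

scores = {page: rank(vec, QUERY) for page, vec in PAGE_VECTORS.items()}
ranked = sorted(scores, key=lambda page: -scores[page])
print(scores, ranked)
```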

    Slide 15:Ranking (Cont’d)

    Example: the PageRank algorithm (used by Google)
      Link popularity is used to help rank a page.
      A link from page A to page B is interpreted as a vote (by A) for B.
      A vote cast by a page that is more “important” has a higher rank (or weight) value and makes the voted-for page more “important”.
      Hence, the rank value of a page is based on the value of the pages that reference it.
      The rank also takes into consideration other, more traditional factors such as keyword frequency counts, etc.
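
The voting idea can be illustrated with a simplified iterative computation. This sketch is not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions that do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to (all targets must be keys)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outgoing links spreads its vote over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share  # a vote weighted by the voter's rank
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
print({page: round(value, 3) for page, value in ranks.items()})
```

Page C collects the most votes and ranks highest. Page A ranks second with only one incoming link, because that link comes from the highly ranked C.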

    Slide 16:Search Engines Differences

    Coverage (number of documents)
    Web crawler algorithms
    Frequency and depth of visits
    Indexing algorithms
    Search interfaces
    Ranking algorithms

    Slide 17:Search Engines Sizes

    [Chart: search engine sizes, with searches/day (millions). Key: AV = AltaVista, EX = Excite, FAST = FAST, GG = Google, Go = Go (Infoseek), INK = Inktomi, NL = Northern Light, WT = WebTop.com. Shaded data for GG and Inktomi includes pages indexed but not visited. Source: SearchEngineWatch.com, Dec. 11, 2001.]

    Slide 18:Metasearcher

    A search server that submits the same query to several search engines and collects the answers.
    Exploits the efforts of many different search engines.
    Saves the user the effort of sending queries to multiple servers.
    A page that is retrieved by multiple search engines is likely to be more relevant.
    Improved coverage.
    Examples: MetaCrawler, SavvySearch.
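
One way to act on the observation that a page returned by several engines is likely more relevant is to order the merged results by how many engines returned each page. The engine names and result lists below are invented, and a real metasearcher would query the engines over the network and might also use each engine's own ranking.

```python
from collections import Counter

def merge_results(results_by_engine):
    """Merge per-engine result lists, most widely returned pages first."""
    votes = Counter()
    for urls in results_by_engine.values():
        votes.update(set(urls))  # each engine counts at most once per page
    # Sort by descending vote count, then by URL for a stable order.
    return sorted(votes, key=lambda url: (-votes[url], url))

# Hypothetical answers from three engines to the same query.
RESULTS = {
    "engine_a": ["x.example", "y.example"],
    "engine_b": ["y.example", "z.example"],
    "engine_c": ["y.example", "x.example"],
}
merged = merge_results(RESULTS)
print(merged)
```

The merged list also covers more pages than any single engine's answer, which is the improved coverage noted above.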

    Slide 19:Measuring Retrieval

    Precision and recall: For each query, the page collection is partitioned by the answer of the search (see diagram):
      A = relevant, retrieved
      B = relevant, not retrieved
      C = not relevant, retrieved
      D = not relevant, not retrieved
    $\text{Precision} = |A| / |A \cup C|$
    $\text{Recall} = |A| / |A \cup B|$
    Precision: can be estimated.
    Recall: difficult to estimate for a large collection such as the Web, where the complete set (which also comprises the pages not retrieved) may not be known.
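
A short worked example of the two measures, using the A, B, C partition defined above. The page identifiers are made up. Because A, B, and C are disjoint, the size of a union is the sum of the sizes.

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of page ids. Returns (precision, recall)."""
    a = len(retrieved & relevant)   # A: relevant and retrieved
    b = len(relevant - retrieved)   # B: relevant but not retrieved
    c = len(retrieved - relevant)   # C: retrieved but not relevant
    precision = a / (a + c) if retrieved else 0.0
    recall = a / (a + b) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 pages retrieved, 3 pages actually relevant.
retrieved = {"p1", "p2", "p3", "p4"}
relevant = {"p1", "p2", "p5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Computing recall needs the full `relevant` set, including pages that were never retrieved, and that set is what is hard to know for the Web.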

    Slide 20:Observations

    Search engines are effective tools for eCommerce
      Enable buyers and sellers to find each other.
      Allow your visitors to search your site.
      You may submit your pages to Web directories.
    Search engine algorithms (especially ranking algorithms) are often proprietary.
    A search engine usually does not cover the Web completely.
    New paradigms should be developed
      In addition to keywords
      Link popularity
      Extension to images, audio, video, etc.

    Slide 21:Web Searching

    List of search engines: http://www.searchenginecolossus.com/
    Search engine resources: http://www.pandia.com/resources/index.html
    Submitting a page/site to a Web directory: http://dmoz.org/add.html
    Adding a search engine to your web site: using KSearch
