http://comet.lehman.cuny.edu/jung/presentation/presentation.html. Introduction to Modern Information Retrieval and Search Engines And Some Research Issues Professor Gwang Jung Department of Mathematics and Computer Science Lehman College, CUNY November 10, Fall 05.
http://comet.lehman.cuny.edu/jung/presentation/presentation.html
Introduction to Modern Information Retrieval and Search Engines And Some Research Issues
Professor Gwang Jung
Department of Mathematics and Computer Science
Lehman College, CUNY
November 10, Fall 05
Intro to IR and SE, research issues
Outline
• Introduction to Information Retrieval
• Introduction to Search Engines (IR Systems for the Web)
• Search Engine Example: Google
• Brief Introduction to Semantic Web
• Useful Tools for IR System Building and Resources for Advanced Research
• Research Issues
Introduction to Information Retrieval
Information Age
IR in General
• Information retrieval (IR) in general deals with the retrieval of structured, semi-structured, and unstructured data (information items) in response to a user query (topic statement).
• User query
  • Structured (e.g., Boolean expression of keywords or terms)
  • Unstructured (e.g., terms, sentence, document)
• In other words, IR is the process of applying algorithms over unstructured, semi-structured, or structured data in order to satisfy a given query.
• Efficiency with respect to: algorithms, query processing, data organization/structure
• Effectiveness with respect to: retrieval results
IR Systems
Formal Definition of IR System
• IRS = (T, D, Q, F, R)
  • T: set of index terms (terms)
  • D: set of documents in a document database
  • Q: set of user queries
  • F: D × Q → R (retrieval function)
  • R: real numbers (RSV: Retrieval Status Value)
• Relevance judgment is given by users.
IRS versus DBMS
IR Systems Focus on Retrieval Effectiveness
• The effective retrieval of relevant information depends on:
  • User task (formulating an effective query for the information need)
  • Indexing
    • IR systems in general adopt index terms to represent documents and queries.
    • Indexing is the process of developing document representations by assigning index terms to documents (information items).
  • Retrieval model (often called IR model) and logical view of documents
    • The logical view of documents (logical representation of documents) depends on the IR model.
Indexing
• The process of developing document representations by assigning descriptors to information items (texts, documents, or multimedia items).
• Descriptors = index terms = terms
• Descriptors also lead users to participate in formulating information requests.
• Two types of index terms:
  • Objective: author name, publisher, date of publication
  • Subjective: keywords selected from the full text
• Two types of indexing methods:
  • Manual: performed by human experts (for very effective IR systems); may use an ontology
  • Automatic: performed by computer hardware and software
Indexing Aims (1)
• Recall: the proportion of relevant items (documents) that are retrieved.
  • R = (# of relevant items retrieved) / (total # of relevant items in the db)
• Precision: the proportion of retrieved documents that are relevant.
  • P = (# of relevant items retrieved) / (total # of items retrieved)
• Effectiveness of indexing is mainly controlled by term specificity.
  • Broader terms may retrieve both useful (relevant) and useless (non-relevant) information items for the user.
  • Narrower (specific) index terms favor precision at the expense of recall.
• Index language (set of well-selected index terms)
  • T = { index term t }
  • Pre-specified (controlled): easy maintenance; poor adaptability
  • Uncontrolled (dynamic): expanded dynamically; taken freely from the texts to be indexed and from the users' queries.
  • Synonymous terms can be added to T using a thesaurus, an electronic dictionary (e.g., WordNet), and/or a knowledge base (e.g., an ontology).
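
The recall and precision definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the document identifiers are assumptions made for the example, not something taken from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result of a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result and relevance judgments.
retrieved = ["D1", "D2", "D5", "D7"]
relevant = ["D2", "D7", "D9"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found, so retrieving more documents could only raise recall while it may lower precision.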
Indexing Aims (2)
• Recall and precision values vary from 0 to 1.
• Average users want to have high recall and high precision.
• In practice, a compromise must be reached (middle point).
[Figure: plot of recall R against precision P, each axis running from 0 to 1.0]
Steps for Indexing
• Objective attributes of a document are extracted (e.g., title, author, URL, structure).
• Grammatical function words (stop words) are in general not considered as index terms (e.g., of, then, this, and, etc.).
• Case folding (case insensitivity) might be performed.
• Stemming might be used.
• The frequency of non-function words is used to specify term importance.
• The term frequency weight fulfils only one of the indexing aims, i.e., recall.
• Terms that occur rarely in the document database can distinguish the documents in which they occur from those in which they do not, which could improve precision.
• Document frequency: the number of documents in the collection in which a term t_j ∈ T occurs.
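
A minimal sketch of these steps, assuming a tiny illustrative stop list and three made-up documents. Stemming is left out, and the weight `tf * log(N / df)` is one common tf*idf variant chosen for the example; the slides do not fix a particular formula.

```python
import math
import re
from collections import Counter

# Illustrative stop list (an assumption, not the list used in the slides).
STOP_WORDS = {"of", "then", "this", "and", "the", "a", "is", "in"}


def index_terms(text):
    """Lower-case the text, split it into tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# Hypothetical document database.
docs = {
    "D1": "The database system and the database index",
    "D2": "Computer science is the science of computation",
    "D3": "A computer database",
}

# Term frequency: occurrences of each term within each document.
tf = {doc_id: Counter(index_terms(text)) for doc_id, text in docs.items()}
# Document frequency: number of documents in which each term occurs.
df = Counter(term for counts in tf.values() for term in counts)


def tf_idf(term, doc_id):
    """Weight that grows with tf and shrinks for terms found in many documents."""
    if df[term] == 0:
        return 0.0
    return tf[doc_id][term] * math.log(len(docs) / df[term])


print(tf["D1"]["database"], df["database"], round(tf_idf("science", "D2"), 3))
```

In this toy collection "science" occurs in a single document, so it receives a higher weight than "database", which occurs in two.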
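Inverted Index File
[Figure: inverted index file. Each index term (computer, database, science, system) is stored with its document frequency df (values shown: 3, 2, 4, 1) and points to inverted index entries of the form (Dj, tfj); the entries shown are (D7, 4), (D1, 3), (D2, 4), and (D5, 2).]
• Inverted index entries may optionally hold postings (the positions of the term in a document).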
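
One way to picture such a file is a dictionary from each term to its entries. The sketch below is an assumption-laden illustration (in-memory dictionaries, whitespace tokenization, invented documents), not the physical layout of any particular system; it keeps the term positions so that tf and df can both be read off the structure.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical documents.
docs = {
    "D1": "database system database",
    "D2": "computer science",
    "D5": "computer system",
}
index = build_inverted_index(docs)

for term in sorted(index):
    # tf is the number of stored positions; df is the number of entries.
    entries = [(doc_id, len(positions)) for doc_id, positions in index[term].items()]
    print(term, "df =", len(index[term]), entries)
```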
Retrieval Models (1)
• Set theoretic IR models
  • Documents are represented by a set of terms.
  • Well-known set theoretic models:
    • Boolean IR model
      • Retrieval function is based on Boolean operations (e.g., and, or, not).
      • Query is formulated in Boolean logic.
    • Fuzzy set IR model
      • Retrieval function is based on fuzzy set operations.
      • Query is formulated in Boolean logic.
    • Rough set IR model
      • Various set operations were examined.
      • Ad hoc Boolean query
• Probabilistic IR model
  • Mainly used for probabilistic index term weighting
  • Provides a mathematical framework for the well-known tf*idf indexing scheme
• Language model based
  • Infers the query concept from a document as the retrieval process
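
As a small illustration of the Boolean IR model, the sketch below treats each term's posting list as a Python set and evaluates a query with set operations. The index contents and the helper names `and_`, `or_`, and `not_` are assumptions made for the example.

```python
# Hypothetical index: term -> set of documents containing the term.
index = {
    "computer": {"D2", "D5", "D7"},
    "database": {"D1", "D7"},
    "science": {"D2"},
}
all_docs = {"D1", "D2", "D5", "D7"}


def and_(a, b):
    return a & b  # documents containing both terms


def or_(a, b):
    return a | b  # documents containing either term


def not_(a):
    return all_docs - a  # documents not containing the term


# Query: (computer AND database) OR (NOT computer)
result = or_(and_(index["computer"], index["database"]), not_(index["computer"]))
print(sorted(result))
```

A document either satisfies the Boolean expression or it does not, so this model returns an unranked set rather than graded retrieval status values.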
Retrieval Models (2)
• Vector space model
  • Queries and documents are represented as weighted vectors.
  • Vectors in the basis are called term vectors and are assumed to be semantically independent.
  • A document (query) is represented as a linear combination of vectors in the generating set.
  • Retrieval function is based on the dot product or cosine measure between document and query vectors.
• Extended Boolean IR model
  • Combines characteristics of the vector space IR model with properties of Boolean algebra.
  • Retrieval function is based on Euclidean distances in an n-dimensional vector space. Distances are measured by using p-norms, where 1 ≤ p ≤ ∞.
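
The two retrieval functions can be sketched as follows. The cosine measure follows directly from the description above. The p-norm functions use the commonly cited extended Boolean formulas with equally weighted query terms; the slide does not give these formulas, so treat them, and all names and weights here, as assumptions.

```python
import math


def cosine(doc, query):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    terms = set(doc) | set(query)
    dot = sum(doc.get(t, 0.0) * query.get(t, 0.0) for t in terms)
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0


def pnorm_or(weights, p):
    """p-norm similarity for an OR query (distance from the all-zero point)."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)


def pnorm_and(weights, p):
    """p-norm similarity for an AND query (distance from the all-one point)."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)


# Hypothetical term weights of one document.
doc = {"computer": 0.8, "database": 0.6}
print(cosine(doc, {"database": 1.0}))

weights = [doc["computer"], doc["database"]]
for p in (1, 2, 50):
    print(p, round(pnorm_or(weights, p), 3), round(pnorm_and(weights, p), 3))
```

With p = 1 the OR and AND scores coincide (both are the average weight, as in a vector space style ranking), while for large p they approach the maximum and minimum weight, which is the strict Boolean behaviour.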
The Retrieval Process
The Retrieval Process in IR System
Introduction to Search Engines (IR Systems for the Web)
World Wide Web History
• 1965: Hypertext
  • Ted Nelson developed the idea of hypertext in 1965.
• Late 1960s
  • Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960s at SRI.
• Early 1970s
  • ARPANET was developed in the early 1970s.
• 1982: Transmission Control Protocol (TCP) and Internet Protocol (IP)
• 1989: WWW
  • Developed by Tim Berners-Lee and others in 1990 at CERN to organize research documents available on the Internet.
  • Combined the idea of documents available by FTP with the idea of hypertext to link documents.
  • Developed the initial HTTP network protocol, URLs, HTML, and the first web server.
Search Engine (Web-based IR System) History
• By the late 1980s many files were available by anonymous FTP.
• In 1990, Alan Emtage of McGill Univ. developed Archie (short for "archives").
  • Assembled lists of files available on many FTP servers.
  • Allowed regular expression search of these file names.
• In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.
• In 1993, early web robots (spiders) were built to collect URLs:
  • Wanderer
  • ALIWEB (Archie-Like Index of the WEB)
  • WWW Worm (indexed URLs and titles for regex search)
• In 1994, Stanford graduate students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
Search Engine History (cont'd)
• In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Washington (it eventually became part of Excite and AOL).
• A few months later, Fuzzy Mauldin, a professor at CMU, developed Lycos with his graduate students.
  • First to use a standard IR system as developed for the DARPA Tipster project.
  • First to index a large set of pages.
• In late 1995, DEC developed AltaVista.
  • Used a large farm of Alpha machines to quickly process large numbers of queries.
  • Supported Boolean operators, phrases, and "reverse pointer" queries.
• In 1998, Google was developed by graduate students Larry Page and Sergey Brin at Stanford U.
  • Uses link analysis to rank documents.
How Do Web SE Work?
• Search engines for the general web
  • Search a database of the full text of web pages selected from billions of web pages.
  • Searching is based on inverted index entries.
• Search engine databases
  • Full text documents are collected by a software robot (also called softbot or spider) that navigates the web to collect pages.
  • The web can be viewed as a graph structure.
  • The navigation can be based on DFS (Depth First Search), BFS (Breadth First Search), or some combined navigation heuristics.
  • How to detect cycles? (research issue)
  • The indexer then builds inverted index entries and stores them in inverted files.
  • If necessary, the inverted files may be compressed.
• Some types of pages and links are excluded from the search engine.
  • These form the invisible Web (possibly many times bigger than the visible Web).
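
The navigation strategies can be sketched on a toy link graph. Everything here is hypothetical: the graph `WEB`, the page names, and the function `crawl`. A real robot would fetch pages over HTTP, and a simple visited set only catches cycles between identical URLs, which is part of why cycle detection is listed as a research issue.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (contains cycles).
WEB = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["D"],
    "D": ["A"],
}


def crawl(seed, strategy="bfs"):
    """Visit every page reachable from seed once; return the visiting order."""
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        # BFS takes the oldest URL in the frontier, DFS the newest one.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if url in visited:
            continue  # already crawled: this is what breaks cycles
        visited.add(url)
        order.append(url)
        frontier.extend(WEB.get(url, []))
    return order


print(crawl("A", "bfs"))
print(crawl("A", "dfs"))
```

The only difference between the two strategies is which end of the frontier the next URL is taken from, so combined heuristics can be built by replacing the frontier with a priority queue.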
[Figure: Breadth-First Crawling]
[Figure: Depth-First Crawling]
Web Search Engine System Architecture
[Figure: architecture diagram with the components Robot, User, Internet, Websites, Interface, Temporary storage, Logical Document Representation (based on IR models), Retrieval Mechanism, Parser, Inverted Files (can be based on different physical data structures), Stopper/Stemmer, and Indexer]
Distributed Architecture (example)
• Harvest (http://harvest.sourceforge.net/)
  • Distributed web search engine
  • Distributes the load among different machines
  • The indexer doesn't run on the same machine as the broker or web server
What Makes a SE Good?
• Database of web documents
  • Size of database
  • Freshness (recency or up-to-dateness)
  • Types of documents offered
• Retrieval speed
• The search engine's capabilities
  • Search options
  • Effectiveness of the retrieval mechanism
  • Support for concept-based search (Semantic Web)
    • Concept-based search systems try to determine what you mean, not just what you say.
    • Concept-based search often works better in theory than in practice. Concept-based indexing is a difficult task to perform.
• Presentation of the results
  • Keywords highlighted in context
  • Showing a summary of the web pages that match
Search Engine Example (Google)
Google
• The most popular web search engine:
  • Crawls the web (by robots) and stores a local cache of found pages
  • Builds a lexicon of common words
  • For each word, creates an index list of pages containing it
  • Also uses human-compiled information from the Open Directory
  • Cached links let you see older versions of recently changed pages
• Link analysis system:
  • PageRank heuristic
• Estimated size of index
  • 580 million pages visited and recorded
  • Uses link data to get to another 500 million pages (by link analysis system)
  • Recent estimate is around 4 billion pages (??)
• Index refresh
  • Updated monthly/weekly, or daily for popular pages
• Serves queries from three data centres (service replication)
  • Service updates are synchronized.
  • Two on the West Coast of the US, one on the East Coast.
Google Founders
• Larry Page, Co-founder & President, Products
• Sergey Brin, Co-founder & President, Technology
• PhD students at Stanford
• Became a public company last year
[Figure: search engine market share (0% to 50%), 2001 to 2004, for Google, Yahoo!, MSN, Lycos, AOL, and AltaVista. Source: WebSideStory]
Google Architecture Overview
Google Indexer term frequencies
Google Lexicon
Google Searcher
Google Features
• Combines traditional IR text matching with extremely heavy use of link popularity to rank the pages it has indexed.
• Other services also use link popularity, but none do to the extent that Google does.
  • Traditional IR (LITE)
  • Link popularity (HEAVILY USED)
• Citation importance ranking (quality of links pointing at a page)
• Relevancy
  • Similarity between query and a page
  • Number of links
  • Link quality
  • Link content
  • Ranking boosts on text styles
• PageRank
  • Usage simulation and citation importance ranking
  • User randomly navigates
  • Process modelled by a Markov chain
Collecting Links in Google
• Submission (by web promotion):
  • Add URL page (may not need to do a "deep" submit)
  • The best way to ensure that your site is indexed is to build links. The more other sites point at you, the more likely you are to be crawled and ranked well.
• Crawling and index depth:
  • Aims to refresh its index on a monthly basis.
  • Even if Google doesn't actually index a page, it may still return it in a search because it makes extensive use of the text within hyperlinks.
  • This text is associated with the pages the link points at, and it makes it possible for Google to find matching pages even when these pages cannot themselves be indexed.
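
The anchor text idea can be pictured with a small sketch: the words inside a hyperlink are indexed under the page the link points at, not under the page that contains the link. The link data, URLs, and function name below are invented for illustration and say nothing about how Google actually stores this information.

```python
from collections import defaultdict

# Hypothetical links: (source page, anchor text, target page).
links = [
    ("a.example/index", "cheap travel deals", "t.example/home"),
    ("b.example/blog", "travel guide", "t.example/home"),
    ("b.example/blog", "search engines", "s.example/ir"),
]

# Index each anchor term under the TARGET of the link.
anchor_index = defaultdict(set)
for _source, anchor, target in links:
    for term in anchor.lower().split():
        anchor_index[term].add(target)


def search(term):
    """Return pages that some link describes with the given term."""
    return anchor_index.get(term.lower(), set())


# t.example/home is found for "travel" although its own text was never indexed.
print(search("travel"))
```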
Google Guidelines for Web-submission
Deep SubmitPro
Link Analysis for Relevancy (1)
• Inspired by CiteSeer (NEC International, Princeton, NJ) and the IBM Clever Project
  • CiteSeer…..
  • http://www.almaden.ibm.com/cs/k53/clever.html
• Google ranks web pages based on the number, quality, and content of links pointing at them (citations).
• Number of links
  • All things being equal, a page with more links pointing at it will do better than a page with few or no links to it.
• Link quality
  • Numbers aren't everything. A single link from an important site might be worth more than many links from relatively unknown sites.
  • Weights page importance: links from important pages are weighted higher.
Link Analysis for Relevancy (2)
• Link content
  • The text in and around links relates to the page they point at. For a page to rank well for "travel," it would need to have many links that use the word travel in them or near them on the page. It also helps if the page itself is textually relevant for travel.
• Ranking boosts on text styles
  • The appearance of terms in bold text, in header text, or in a large font size is all taken into account. None of these are dominant factors, but they do figure into the overall equation.
PageRank
• Usage simulation and citation importance ranking:
  • Based on a model of a Web surfer who follows links and makes occasional haphazard jumps, arriving at certain places more frequently than others.
• User randomly navigates
  • Jumps to a random page with probability p
  • Follows a random hyperlink from the page with probability 1 - p
  • Does not go back to a previously visited page by following a previously traversed link backwards
• Google finds a type of universally important page intuitively:
  • Locations that are heavily visited in a random traversal of the Web's link structure.
PageRank Heuristics
• Process modelled by the following heuristics
  • The probability of being in each page is computed; p is set by the system.
  • w_j = PageRank of page j
  • n_i = number of outgoing links on page i
  • m = number of nodes in G (the number of Web pages in the collection)
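PageRank Illustration
[Figure: pages with ranks w_1, w_2, w_3 and out-link counts n_1, n_2, n_3 pointing at page j, combined with the random-jump term p/m]
$w_j = \frac{p}{m} + (1 - p) \sum_{i \to j} \frac{w_i}{n_i}$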
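
A minimal sketch of the computation, assuming the usual reading of the definitions on the PageRank slides: w_j = p/m + (1 - p) times the sum of w_i/n_i over the pages i that link to j. The function name, the value p = 0.15, the fixed iteration count, the toy graph, and the handling of pages without out-links are all assumptions made for this example.

```python
def pagerank(out_links, p=0.15, iterations=50):
    """Iterate w_j = p/m + (1 - p) * sum over i -> j of w_i / n_i.

    out_links maps every page to the list of pages it links to; every
    link target is assumed to appear as a key as well.
    """
    pages = list(out_links)
    m = len(pages)
    w = {page: 1.0 / m for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_w = {page: p / m for page in pages}  # random-jump share
        for i, targets in out_links.items():
            if targets:
                n_i = len(targets)
                for j in targets:
                    new_w[j] += (1 - p) * w[i] / n_i
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                for j in pages:
                    new_w[j] += (1 - p) * w[i] / m
        w = new_w
    return w


# Hypothetical four-page web.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page D has no incoming links, so its rank stays at the random-jump share p/m, while C collects rank from three pages and passes most of it on to A.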
Google Spamming
• The link popularity ranking system leaves it relatively immune to traditional spamming techniques.
• Goes beyond the text on pages to decide how good they are. No links, low rank.
• Common spam idea
  • Create a lot of new pages within a site that link to a single page, in an effort to boost that page's popularity, perhaps spreading out these pages across a network of sites.
• The (Evil) Genius of Comment Spammers, by Steven Johnson, WIRED 12.03, http://www.wired.com/wired/archive/12.03/google.html?pg=7
Topic Search http://www.google.com/options/index.html
Brief Introduction to Semantic Web
Machine-Processable Knowledge on the Web
• Unique identity of resources and objects: URI
• Metadata annotations
  • Data describing the content and meaning of resources
  • But everyone must speak the same language…
• Terminologies
  • Shared and common vocabularies
  • But everyone must mean the same thing…
• Ontologies
  • Shared and common understanding of a domain
  • Essential for exchange and discovery of knowledge
• Inference
  • Apply the knowledge in the metadata and the ontology to create new metadata and new knowledge
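
The inference step can be illustrated with a toy forward-chaining loop over (subject, predicate, object) triples. The two rules are modelled loosely on RDFS-style subclass reasoning; the plain-string predicate names `subClassOf` and `type`, the example facts, and the function `infer` are assumptions for this sketch, not an actual Semantic Web toolkit.

```python
def infer(triples):
    """Apply two subclass rules repeatedly until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == "subClassOf" and o == s2:
                    if p == "subClassOf":
                        # X subClassOf Y, Y subClassOf Z => X subClassOf Z
                        new.add((s, "subClassOf", o2))
                    elif p == "type":
                        # x type Y, Y subClassOf Z => x type Z
                        new.add((s, "type", o2))
        if new <= facts:
            return facts  # fixed point reached
        facts |= new


# Hypothetical ontology (first two triples) and metadata (last triple).
triples = {
    ("SearchEngine", "subClassOf", "IRSystem"),
    ("IRSystem", "subClassOf", "SoftwareSystem"),
    ("ex:Google", "type", "SearchEngine"),
}
facts = infer(triples)
for fact in sorted(facts - triples):
    print(fact)  # the newly derived knowledge
```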