Web Search for X-Informatics

Spring Semester 2002 MW 6:00 pm – 7:15 pm Indiana Time

Geoffrey Fox and Bryan Carpenter

PTLIU Laboratory for Community Grids

Informatics (Computer Science, Physics)

Indiana University

Bloomington IN 47404

gcf@indiana.edu

References I
  • Here is a set of references addressing Web search as one approach to information retrieval
  • http://umiacs.umd.edu/~bonnie/cmsc723-00/CMSC723/CMSC723.ppt
  • http://img.cs.man.ac.uk/stevens/workshop/goble.ppt
  • http://www.isi.edu/us-uk.gridworkshop/talks/goble_-_grid_ontologies.ppt
  • http://www.cs.man.ac.uk/~carole/cs3352.htm has several interesting sub-talks in it
    • http://www.cs.man.ac.uk/~carole/IRintroduction.ppt
    • http://www.cs.man.ac.uk/~carole/SearchingtheWeb.ppt
    • http://www.cs.man.ac.uk/~carole/IRindexing.ppt
    • http://www.cs.man.ac.uk/~carole/metadata.ppt
    • http://www.cs.man.ac.uk/~carole/TopicandRDF.ppt
  • http://www.isi.edu/us-uk.gridworkshop/talks/jeffery.ppt from the excellent 2001 e-Science meeting
References II: Discussion of “real systems”
  • General review stressing the “hidden web” (content stored in databases): http://www.press.umich.edu/jep/07-01/bergman.html
  • IBM “Clever Project” Hypersearching the Web: http://www.sciam.com/1999/0699issue/0699raghavan.html
  • Google Anatomy of a Web Search Engine: http://www.stanford.edu/class/cs240/readings/google.pdf
  • Peking University Search Engine Group: http://net.cs.pku.edu.cn/~webg/refpaper/papers/jwang-log.pdf
  • A huge set of links can be found at: http://net.cs.pku.edu.cn/~webg/refpaper/

This lecture is built around the following presentation by Xiaoming Li

We have inserted material from other cited references

WebGather: towards quality and scalability of a Web search service

LI Xiaoming • Department of Computer Science and Technology, Peking Univ.

A presentation at Supercomputing 2001 through a constellation site in China

November 15, 2001

How many search engines are out there?
  • Yahoo !
  • AltaVista
  • Lycos
  • Infoseek
  • OpenFind
  • Baidu
  • Google
  • WebGather (天网)
  • … there are more than 4000 in the world! (Complete Planet white paper: http://www.press.umich.edu/jep/07-01/bergman.html)
Agenda
  • Importance of Web search service
  • Three primary measures/goals of a Web search service
  • Our approaches to the goals
  • Related work
  • Future work
Importance of Web Search Service
  • Rapid growth of web information
    • >40 million Chinese web pages under .cn
  • The second most popular application on the web
    • email is first; search engines are second
  • Information access: from address-based to content-based
    • who can remember all those URLs?!
    • search engines: a first step towards content-based web information access
  • 4 of the 24 sessions and 15 of the 78 papers at WWW10 were devoted to it!
Primary Measures/Goals of a Search Engine
  • Scale
    • volume of indexed web information, ...
  • Performance
    • “real time” constraint
  • Quality
    • does the end user like the results returned?

they are at odds with one another!

Scale: go for massive!
  • the amount of information that is indexed by the system (e.g. number of web pages, number of ftp file entries, etc.)
  • the number of websites it covers
  • coverage: the percentages of the above relative to the totals out there on the Web
  • the number of information formats that are fetched and managed by the system (e.g. html, txt, asp, xml, doc, ppt, pdf, ps, Big5 as well as GB encodings, etc.)
Primary measures/goals of a search engine
  • Scale
    • volume of indexed information, ...
  • Performance
    • “real time” constraint
  • Quality
    • does the end user like the results returned?

they are at odds with one another!

Performance: “real time” requirement
  • fetch the targeted amount of information within a time frame, say 15 days
    • otherwise the information may be obsolete
  • deliver the results to a query within a time limit (response time), say 1 second
    • otherwise users may turn away from your service and never come back!

larger scale may imply degradation of performance

Primary measures/goals of a search engine
  • Scale
    • volume of information indexed, ...
  • Performance
    • “real time” constraint
  • Quality
    • does the end user like the results returned?

they are at odds with one another!

Quality: do the users like it?
  • recall rate
    • can it return the information that should be returned?
    • a high recall rate requires high coverage
  • accuracy (precision)
    • the percentage of returned results that are relevant to the query (see the sketch below)
    • high accuracy requires better coverage
  • ranking (a special measure of accuracy)
    • do the most relevant results appear before those less relevant?
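As a concrete illustration of recall and accuracy (precision), here is a minimal Python sketch, assuming the set of truly relevant documents is known for a test query (the document ids are illustrative):

def recall_and_precision(retrieved, relevant):
    """Compute recall and precision for one query.

    retrieved: iterable of document ids returned by the engine
    relevant:  iterable of document ids that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant documents actually returned
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: 3 of the 4 relevant pages were returned, alongside 2 irrelevant ones
print(recall_and_precision({"d1", "d2", "d3", "d7", "d9"}, {"d1", "d2", "d3", "d4"}))
# -> (0.75, 0.6)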
Our approach
  • Parallel and distributed processing: reach for large scale and scalability
  • User behavior analysis: yields mechanisms for performance
  • Making use of the content of web pages: suggests innovative algorithms for quality
Towards scalability
  • WebGather 1.0: a million-page-scale system in operation since 1998, with a single crawler.
  • WebGather 2.0: a 30-million-page-scale system in operation since 2001, with a fully parallel architecture.
    • not only boosts the scale
    • but also improves performance
    • and delivers better quality
Architecture of typical search engines

[Diagram: robots/crawlers, driven by a scheduler, fetch pages from the Internet into a raw database; an indexer builds the index database, which the searcher uses to answer queries through the user interface.]
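To make this data flow concrete, here is a hypothetical Python skeleton of such a pipeline; the class and function names (RawStore, InvertedIndex, crawl, index, search) are illustrative, not taken from any of the systems discussed:

from collections import defaultdict

class RawStore:                      # "raw database": fetched page text keyed by URL
    def __init__(self):
        self.pages = {}

class InvertedIndex:                 # "index database": term -> set of URLs
    def __init__(self):
        self.postings = defaultdict(set)

def crawl(seed_urls, fetch, store):  # robots/crawlers driven by a scheduler
    for url in seed_urls:            # (a real scheduler would queue discovered links)
        store.pages[url] = fetch(url)

def index(store, idx):               # indexer: tokenize raw pages into postings
    for url, text in store.pages.items():
        for term in text.lower().split():
            idx.postings[term].add(url)

def search(idx, query):              # searcher behind the user interface
    terms = query.lower().split()
    results = [idx.postings[t] for t in terms if t in idx.postings]
    return set.intersection(*results) if results else set()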

Towards scalability: main technical issues
  • how to assign crawling tasks to multiple crawlers for parallel processing
    • granularity of the tasks: URL or IP address?
    • maintenance of a task pool: centralized or distributed?
    • load balance
    • low communication overhead
  • dynamic reconfiguration
    • in response to failure of crawlers, … (remembering that the crawling process usually takes weeks)
Parallel Crawling in WebGather

[Diagram: the parallel crawling architecture of WebGather; CR = crawler registry]

Task Generation and Assignment

granularity of parallelism: URL or domain name

task pool: distributed, with tasks dynamically created and assigned

A hash function is used for task assignment and load balancing (a sketch follows):

H(URL) = F(URL’s domain part) mod N
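A minimal Python sketch of this assignment scheme, assuming F is simply a hash of the URL’s host (domain) part and N is the number of crawlers; the hash choice and function name are illustrative, not WebGather’s actual code:

from urllib.parse import urlparse
import hashlib

def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to one of N crawlers: H(URL) = F(domain part) mod N.

    Hashing on the domain keeps all pages of a site on the same crawler,
    which helps with politeness and reduces cross-crawler communication.
    """
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).hexdigest()  # illustrative choice of F
    return int(digest, 16) % num_crawlers

# Example: both pages of the same site go to the same crawler
print(assign_crawler("http://www.pku.edu.cn/index.html", 8))
print(assign_crawler("http://www.pku.edu.cn/about.html", 8))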

Simulation result: scalability

[Chart: simulated speedup as a function of the number of crawlers]

Experimental result: scalability

[Chart: measured speedup as a function of the number of crawlers]

Our Approach
  • Parallel and distributed processing: reach for large scale and scalability
  • User behavior analysis: yields mechanisms for performance
  • Making use of the content of web pages: suggests innovative algorithms for quality
Towards high performance
  • “parallel processing”, of course, is a plus for performance, and
  • more importantly, user behavior analysis suggests critical mechanisms for improved performance
    • a search engine not only maintains web information, but also logs user queries
    • a good understanding of the queries gives rise to cache designs and performance-tuning approaches
What do you keep?
  • So you gather data from the web, storing
    • documents and, more importantly, the words extracted from documents
  • After removing dull (stop) words, you store the document number for each word together with additional data
    • the position, and meta-information such as font and enclosing tag (e.g. whether the word is in a meta-data section)
  • Positions are needed to respond to multiword queries with adjacency requirements (see the sketch below)
  • There is a lot of important research on the best way to gather, store and retrieve this information
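As a minimal illustration (a toy layout, not the storage format of any particular engine), a positional inverted index maps each remaining word to the documents and positions where it occurs, which is enough to answer adjacency queries:

from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}   # the "dull" words to drop

def build_positional_index(docs):
    """docs: {doc_id: text}.  Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][doc_id].append(pos)
    return index

def adjacent(index, w1, w2):
    """Documents where w1 is immediately followed by w2 (a simple phrase query)."""
    hits = set()
    for doc_id, positions in index.get(w1, {}).items():
        later = set(index.get(w2, {}).get(doc_id, []))
        if any(p + 1 in later for p in positions):
            hits.add(doc_id)
    return hits

idx = build_positional_index({1: "web search engines index the web", 2: "search the web"})
print(adjacent(idx, "web", "search"))   # {1}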
What Pages should one get?
  • A Web search engine is an information retrieval engine, not a knowledge retrieval engine
  • It looks at a set of text pages with certain additional characteristics
    • URL, titles, fonts, meta-data
  • And matches a query to these pages, returning pages in a certain order
  • This order, and the choices made by the user in dealing with this order, can be thought of as “knowledge”
    • E.g. the user tries different queries and decides which of the returned set to explore
  • People complain about the “number of pages” returned, but I think this is a GOOD model for knowledge and it is good to combine people with the computer
How do you Rank Pages?
  • One can identify at least 4 criteria
  • Content of the document, i.e. the nature of occurrence of query terms in the document (author)
  • Nature of links to and from this document – this is characteristic of a Web page (other authors)
    • Google and the IBM Clever project emphasized this
  • Occurrence of the document in compiled directories (editors)
  • Data on what users of the search service have done (users)
Document Content Ranking
  • Here the TF*IDF method is typical
    • TF is the Term (query word) Frequency
    • IDF is the Inverse Document Frequency
  • This gives a crude ranking which can be refined by other schemes
  • If you have multiple query terms then you can add their values of TF*IDF
  • The next slides come from earlier courses by Goble (Manchester) and Maryland, cited at the start
IR (Information Retrieval) as Clustering
  • A query is a vague specification of a set of objects, A
  • IR is reduced to the problem of determining which documents are in set A and which ones are not
  • Intra-clustering similarity:
    • What are the features that best describe the objects in A?
  • Inter-clustering dissimilarity:
    • What are the features that best distinguish the objects in A from the remaining objects in C?

[Diagram: the retrieved documents A shown as a cluster of points (x) inside the full document collection C]

Index term weighting

Normalised frequency of term t in document d (intra-clustering similarity):

tf(t,d) = occ(t,d) / occ(tmax, d)

  • The raw frequency of a term t inside a document d, normalised by the most frequent term tmax in d.
  • A measure of how well the term describes the document contents.

Inverse document frequency (inter-cluster dissimilarity):

idf(t) = log( N / n(t) )

  • The inverse of the frequency of term t among the N documents in the collection (n(t) is the number of documents containing t).
  • Terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one.

Weight(t,d) = tf(t,d) x idf(t)
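A small, self-contained Python sketch of these formulas (base-10 logarithm chosen to match the idf values in the example that follows; the toy documents are illustrative):

import math
from collections import Counter

def tf(term, doc_terms):
    """Normalised term frequency: occ(t,d) / occ(tmax, d)."""
    counts = Counter(doc_terms)
    return counts[term] / max(counts.values())

def idf(term, all_docs):
    """Inverse document frequency: log10(N / n(t))."""
    n_t = sum(1 for d in all_docs if term in d)
    return math.log10(len(all_docs) / n_t) if n_t else 0.0

def weight(term, doc_terms, all_docs):
    return tf(term, doc_terms) * idf(term, all_docs)

def score(query_terms, doc_terms, all_docs):
    """Rank score for multi-term queries: sum of the per-term TF*IDF weights."""
    return sum(weight(t, doc_terms, all_docs) for t in query_terms)

docs = [["nuclear", "fallout", "siberia", "contaminated"],
        ["information", "retrieval", "interesting"],
        ["contaminated", "retrieval", "nuclear"],
        ["information", "fallout"]]
print(score(["contaminated", "retrieval"], docs[2], docs))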

Term weighting schemes
  • Best known (term frequency x inverse document frequency):

    weight(t,d) = ( occ(t,d) / occ(tmax, d) ) x log( N / n(t) )

  • Variation for query term weights:

    weight(t,q) = ( 0.5 + 0.5 x occ(t,q) / occ(tmax, q) ) x log( N / n(t) )
TF*IDF Example

[Table: eight terms (complicated, contaminated, fallout, information, interesting, nuclear, retrieval, siberia) with idf values 0.301, 0.125, 0.125, 0.000, 0.602, 0.301, 0.125, 0.602 respectively, together with their raw term frequencies and tf*idf weights in four documents; the column layout of the table was lost in transcription.]

  • Unweighted query: contaminated retrieval. Result: 2, 3, 1, 4
  • Weighted query: contaminated(3) retrieval(1). Result: 1, 3, 2, 4
  • IDF-weighted query: contaminated retrieval. Result: 2, 3, 1, 4

Document Length Normalization
  • Long documents have an unfair advantage
    • They use a lot of terms
      • So they get more matches than short documents
    • And they use the same words repeatedly
      • So they have much higher term frequencies
Cosine Normalization Example

[Table: the same eight terms and four documents as in the TF*IDF example, now showing raw term frequencies, tf*idf weights, and cosine-normalized weights, together with each document's vector length (values 1.70, 0.97, 2.67, 0.87); the column layout of the table was lost in transcription.]

Unweighted query: contaminated retrieval, Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4)
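A minimal Python sketch of cosine (length) normalization, illustrating the idea behind the example above rather than reproducing its exact numbers (the sample document weights are made up):

import math

def cosine_normalize(doc_weights):
    """Divide each tf*idf weight by the document's vector length.

    doc_weights: {term: weight} for one document.
    Long documents no longer win simply by containing more (and more repeated) terms.
    """
    length = math.sqrt(sum(w * w for w in doc_weights.values()))
    return {t: w / length for t, w in doc_weights.items()} if length else doc_weights

def score(query_terms, doc_weights):
    """Score a document for a query by summing its normalized weights for the query terms."""
    normalized = cosine_normalize(doc_weights)
    return sum(normalized.get(t, 0.0) for t in query_terms)

doc = {"nuclear": 0.90, "contaminated": 0.13, "retrieval": 0.13}
print(score(["contaminated", "retrieval"], doc))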

Google Page Rank
  • This exploits the nature of links to a page, which are a measure of “citations” for the page
  • Page A has pages T1, T2, T3, … Tn which point to it
  • d is a fudge (damping) factor (say 0.85)
  • PR(A) = (1-d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn) )
  • where C(Tk) is the number of outgoing links from page Tk (a sketch follows)
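A minimal iterative Python sketch of this recurrence; the example graph and iteration count are illustrative assumptions, not Google's production code:

def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it points to]}.  Returns PR for every page.

    Implements PR(A) = (1-d) + d * sum(PR(T)/C(T) for each page T linking to A),
    iterated until the values settle.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = (pr[t] / len(links[t]) for t in pages if page in links.get(t, []))
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))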
HITS: Hypertext Induced Topic Search
  • The ranking scheme depends on the query
  • Considers the set of pages that point to, or are pointed at by, pages in the answer set S
  • Implemented in IBM’s Clever prototype
  • Scientific American article:
  • http://www.sciam.com/1999/0699issue/0699raghavan.html
HITS (2)
  • Authorities:
    • pages in S that have many links pointing to them
  • Hubs:
    • pages that have many outgoing links
  • Positive two-way feedback:
    • better authority pages have incoming edges from good hubs
    • better hub pages have outgoing edges to good authorities
Authorities and Hubs

[Diagram: a link graph with authorities (blue) and hubs (red)]

HITS two-step iterative process

  • assign initial scores to candidate hubs and authorities on a particular topic in the set of pages S
  • use the current guesses about the authorities to improve the estimates of the hubs: locate all the best authorities
  • use the updated hub information to refine the guesses about the authorities: determine where the best hubs point most heavily and call these the good authorities
  • repeat until the scores eventually converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs

H(p) = Σ A(u)  over all u in S such that p links to u
A(p) = Σ H(v)  over all v in S such that v links to p
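A minimal Python sketch of this two-step iteration on a toy graph; the normalization step is an assumption added so the scores converge toward the principal eigenvector, as described above:

import math

def hits(links, iterations=50):
    """links: {page: [pages it points to]} for the focused set S.
    Returns (authority, hub) scores after the two-step iteration.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(v) over pages v that link to p
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, [])) for p in pages}
        # H(p) = sum of A(u) over pages u that p links to
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # normalize so the scores settle toward the principal eigenvector direction
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: x / a_norm for p, x in auth.items()}
        hub = {p: x / h_norm for p, x in hub.items()}
    return auth, hub

authority, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
print(max(authority, key=authority.get))   # a1: pointed to by both hubs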

Cybercommunities

HITS is clustering the web into communities

Google vs Clever

Google:
  • assigns initial rankings and retains them independently of any queries, which enables faster response
  • looks only in the forward direction, from link to link

Clever:
  • assembles a different root set for each search term and then prioritizes those pages in the context of that particular query
  • also looks backward from an authoritative page to see what locations are pointing there; humans are innately motivated to create hub-like content expressing their expertise on specific topics
Peking University: User behavior analysis
  • taking 3 months’ worth of real user queries (about 1 million queries)
  • each query consists of <keywords, time, IP address, …>
  • keyword distribution: we observe that high-frequency keywords dominate (a small analysis sketch follows)
  • grouping the queries into blocks of 1000 and examining the difference between consecutive groups: we observe a quite stable process (the difference is quite small)
  • doing the above for different group sizes: we observe a strong self-similar structure
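A minimal Python sketch of this kind of log analysis, computing the share of total query traffic covered by the most frequent keywords; the log format and sample queries are assumptions, not the actual WebGather log schema:

from collections import Counter

def traffic_share_of_top_keywords(queries, top_fraction=0.2):
    """queries: list of keyword strings, one per logged query.
    Returns the share of total traffic covered by the top `top_fraction`
    of distinct keywords (the 20%/80% observation on the next slide).
    """
    counts = Counter(queries)
    ranked = [c for _, c in counts.most_common()]
    top_n = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

log = ["mp3", "mp3", "mp3", "news", "mp3", "news", "train timetable", "weather"]
print(traffic_share_of_top_keywords(log))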
Distribution of user queries

[Chart: cumulative share of query traffic versus fraction of distinct query terms, showing that roughly 20% of the terms account for roughly 80% of the searches]

  • Only 160,000 different keywords in 960,000 queries
  • The 20% of keywords that are high-frequency account for 80% of the total search volume
Towards high performance
  • Query caching improves system performance dramatically (see the sketch below)
    • more than 70% of user queries can be satisfied in less than 1 millisecond
    • almost all queries are answered within 1 second
  • User behavior may also be used for other purposes
    • evaluation of various ranking metrics; e.g., the link popularity and replica popularity of a URL have a positive influence on its importance
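A minimal Python sketch of a query result cache that exploits this skewed distribution; an LRU policy is one plausible choice and is an assumption here, not necessarily WebGather's design:

from collections import OrderedDict

class QueryCache:
    """Small LRU cache of query -> result list, exploiting the fact that
    a small fraction of distinct queries makes up most of the traffic."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)          # mark as recently used
            return self.entries[query]
        return None

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)         # evict the least recently used query

def answer(query, cache, run_search):
    cached = cache.get(query)
    if cached is not None:                           # hit: the sub-millisecond path
        return cached
    results = run_search(query)                      # miss: full index lookup
    cache.put(query, results)
    return results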

Our approach
  • Parallel and distributed processing: reach for large scale and scalability
  • User behavior analysis: yields mechanisms for performance
  • Making use of the content of web pages: suggests innovative algorithms for quality
Towards good quality
  • Do not miss the important pages: keep the recall rate high
  • Clever algorithm for removing near-replicas: better accuracy
  • New metrics to evaluate pages’ relevance: improved ranking
    • anchor-text based, instead of PageRank based
Fetch the “important” pages first

crawling is normally done within a time frame, so not missing important pages is a practical issue for guaranteeing good search quality later on

besides picking “good” seed URLs, we use a formula to determine the importance of a page

Removing near-replicas

  • Url1 (http://www.a.com/index.html), term frequencies: computer 45, network 33, server 9, …
  • Url2 (http://www.b.com/gbindex.html), term frequencies: computer 45, network 30, server 16, …

[Diagram: the two pages plotted as vectors a (Url1) and b (Url2) in the term space spanned by computer, network, server; they are treated as near-replicas when their difference is small relative to the vectors, e.g. 3/(a+b) < 0.01]

vector based vs. fingerprint based (a sketch of the vector-based idea follows)
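A minimal Python sketch of the vector-based idea, comparing term-frequency vectors with cosine similarity; the similarity measure and threshold are assumptions, and WebGather's actual criterion (the test in the diagram above) may differ in detail:

import math

def cosine_similarity(freq_a, freq_b):
    """freq_a, freq_b: {term: count} term-frequency vectors of two pages."""
    terms = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(t, 0) * freq_b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def near_replicas(freq_a, freq_b, threshold=0.99):      # threshold is an assumption
    return cosine_similarity(freq_a, freq_b) >= threshold

url1 = {"computer": 45, "network": 33, "server": 9}
url2 = {"computer": 45, "network": 30, "server": 16}
print(cosine_similarity(url1, url2), near_replicas(url1, url2))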

Related work
  • Harvest
    • good academic ideas, but a complicated design; not sustained
  • Google
    • the most famous search engine in the world at the moment, but little exposure of the technology used after 1998 (Brin, 1998, WWW-7)
    • character-based, instead of word-based, Chinese processing?
    • more hardware than necessary (10,000 PCs were reported)?