Web Search for X-Informatics

Spring Semester 2002 MW 6:00 pm – 7:15 pm Indiana Time

Geoffrey Fox and Bryan Carpenter

PTLIU Laboratory for Community Grids

Informatics (Computer Science, Physics)

Indiana University

Bloomington IN 47404

gcf@indiana.edu

References I
  • Here is a set of references addressing Web search as one approach to information retrieval
  • http://umiacs.umd.edu/~bonnie/cmsc723-00/CMSC723/CMSC723.ppt
  • http://img.cs.man.ac.uk/stevens/workshop/goble.ppt
  • http://www.isi.edu/us-uk.gridworkshop/talks/goble_-_grid_ontologies.ppt
  • http://www.cs.man.ac.uk/~carole/cs3352.htm has several interesting sub-talks in it
    • http://www.cs.man.ac.uk/~carole/IRintroduction.ppt
    • http://www.cs.man.ac.uk/~carole/SearchingtheWeb.ppt
    • http://www.cs.man.ac.uk/~carole/IRindexing.ppt
    • http://www.cs.man.ac.uk/~carole/metadata.ppt
    • http://www.cs.man.ac.uk/~carole/TopicandRDF.ppt
  • http://www.isi.edu/us-uk.gridworkshop/talks/jeffery.ppt from the excellent 2001 e-Science meeting
References II: Discussion of “real systems”
  • General review stressing the “hidden web” (content stored in databases): http://www.press.umich.edu/jep/07-01/bergman.html
  • IBM “Clever Project” Hypersearching the Web: http://www.sciam.com/1999/0699issue/0699raghavan.html
  • Google Anatomy of a Web Search Engine: http://www.stanford.edu/class/cs240/readings/google.pdf
  • Peking University Search Engine Group: http://net.cs.pku.edu.cn/~webg/refpaper/papers/jwang-log.pdf
  • A huge set of links can be found at: http://net.cs.pku.edu.cn/~webg/refpaper/

This lecture is built around the following presentation by Xiaoming Li

We have inserted material from other cited references

WebGather: towards quality and scalability of a Web search service

LI Xiaoming • Department of Computer Science and Technology, Peking Univ.

A presentation at Supercomputing 2001 through a constellation site in China

November 15, 2001

How many search engines are out there?
  • Yahoo !
  • AltaVista
  • Lycos
  • Infoseek
  • OpenFind
  • Baidu
  • Google
  • WebGather (天网)
  • … there are more than 4000 in the world! (Complete Planet white paper: http://www.press.umich.edu/jep/07-01/bergman.html)
Agenda
  • Importance of Web search service
  • Three primary measures/goals of a Web search service
  • Our approaches to the goals
  • Related work
  • Future work
Importance of Web Search Service
  • Rapid growth of web information
    • >40 million Chinese web pages under .cn
  • The second most popular application on the web
    • email is first; search engines are second
  • Information access: from address-based to content-based
    • who can remember all those URLs?!
    • search engines: a first step towards content-based web information access
  • 4 of the 24 sessions and 15 of the 78 papers at WWW10 were devoted to it!
Primary Measures/Goals of a Search Engine
  • Scale
    • volume of indexed web information, ...
  • Performance
    • “real time” constraint
  • Quality
    • does the end user like the results returned?

they are at odds with one another!

Scale: go for massive!
  • the amount of information that is indexed by the system (e.g. number of web pages, number of ftp file entries, etc.)
  • the number of websites it covers
  • coverage: the percentages of the above relative to the totals out there on the Web
  • the number of information formats that are fetched and managed by the system (e.g. html, txt, asp, xml, doc, ppt, pdf, ps, Big5 as well as GB encodings, etc.)
Primary measures/goals of a search engine
  • Scale
    • volume of indexed information, ...
  • Performance
    • “real time” constraint
  • Quality
    • does the end user like the results returned?

they are at odds with one another!

Performance: “real time” requirement
  • fetch the targeted amount of information within a time frame, say 15 days
    • otherwise the information may be obsolete
  • deliver the results to a query within a time limit (response time), say 1 second
    • otherwise users may turn away from your service and never come back!

larger scale may imply degradation of performance

Primary measures/goals of a search engine
  • Scale
    • volume of information indexed, ...
  • Performance
    • “real time” constraint
  • Quality
    • does the end user like the results returned?

they are at odds with one another!

Quality: do the users like it?
  • recall rate
    • can it return the information that should be returned?
    • a high recall rate requires high coverage
  • accuracy (precision)
    • the percentage of returned results that are relevant to the query (see the sketch below)
    • high accuracy requires better coverage
  • ranking (a special measure of accuracy)
    • do the most relevant results appear before those less relevant?
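As a concrete illustration of recall and accuracy (precision), here is a minimal Python sketch, assuming the set of truly relevant documents is known for a test query (the document ids are illustrative):

def recall_and_precision(retrieved, relevant):
    """Compute recall and precision for one query.

    retrieved: iterable of document ids returned by the engine
    relevant:  iterable of document ids that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant documents actually returned
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Example: 3 of the 4 relevant pages were returned, alongside 2 irrelevant ones
print(recall_and_precision({"d1", "d2", "d3", "d7", "d9"}, {"d1", "d2", "d3", "d4"}))
# -> (0.75, 0.6)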
Our approach
  • Parallel and distributed processing: reach for large scale and scalability
  • User behavior analysis: yields mechanisms for performance
  • Making use of the content of web pages: suggests innovative algorithms for quality
Towards scalability
  • WebGather 1.0: a million-page-scale system in operation since 1998, with a single crawler.
  • WebGather 2.0: a 30-million-page-scale system in operation since 2001, with a fully parallel architecture.
    • not only boosts the scale
    • but also improves performance
    • and delivers better quality
Architecture of typical search engines

[Diagram: robots/crawlers, driven by a scheduler, fetch pages from the Internet into a raw database; an indexer builds the index database, which the searcher uses to answer queries through the user interface.]
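To make this data flow concrete, here is a hypothetical Python skeleton of such a pipeline; the class and function names (RawStore, InvertedIndex, crawl, index, search) are illustrative, not taken from any of the systems discussed:

from collections import defaultdict

class RawStore:                      # "raw database": fetched page text keyed by URL
    def __init__(self):
        self.pages = {}

class InvertedIndex:                 # "index database": term -> set of URLs
    def __init__(self):
        self.postings = defaultdict(set)

def crawl(seed_urls, fetch, store):  # robots/crawlers driven by a scheduler
    for url in seed_urls:            # (a real scheduler would queue discovered links)
        store.pages[url] = fetch(url)

def index(store, idx):               # indexer: tokenize raw pages into postings
    for url, text in store.pages.items():
        for term in text.lower().split():
            idx.postings[term].add(url)

def search(idx, query):              # searcher behind the user interface
    terms = query.lower().split()
    results = [idx.postings[t] for t in terms if t in idx.postings]
    return set.intersection(*results) if results else set()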

Towards scalability: main technical issues
  • how to assign crawling tasks to multiple crawlers for parallel processing
    • granularity of the tasks: URL or IP address?
    • maintenance of a task pool: centralized or distributed?
    • load balance
    • low communication overhead
  • dynamic reconfiguration
    • in response to failure of crawlers, … (remembering that the crawling process usually takes weeks)
Parallel Crawling in WebGather

[Diagram: the parallel crawling architecture of WebGather; CR = crawler registry]

Task Generation and Assignment

granularity of parallelism: URL or domain name

task pool: distributed, with tasks dynamically created and assigned

A hash function is used for task assignment and load balancing (a sketch follows):

H(URL) = F(URL’s domain part) mod N
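A minimal Python sketch of this assignment scheme, assuming F is simply a hash of the URL’s host (domain) part and N is the number of crawlers; the hash choice and function name are illustrative, not WebGather’s actual code:

from urllib.parse import urlparse
import hashlib

def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to one of N crawlers: H(URL) = F(domain part) mod N.

    Hashing on the domain keeps all pages of a site on the same crawler,
    which helps with politeness and reduces cross-crawler communication.
    """
    domain = urlparse(url).netloc.lower()
    digest = hashlib.md5(domain.encode("utf-8")).hexdigest()  # illustrative choice of F
    return int(digest, 16) % num_crawlers

# Example: both pages of the same site go to the same crawler
print(assign_crawler("http://www.pku.edu.cn/index.html", 8))
print(assign_crawler("http://www.pku.edu.cn/about.html", 8))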

Simulation result: scalability

[Chart: simulated speedup as a function of the number of crawlers]

Experimental result: scalability

[Chart: measured speedup as a function of the number of crawlers]

Our Approach
  • Parallel and distributed processing: reach for large scale and scalability
  • User behavior analysis: yields mechanisms for performance
  • Making use of the content of web pages: suggests innovative algorithms for quality
Towards high performance
  • “parallel processing”, of course, is a plus for performance, and
  • more importantly, user behavior analysis suggests critical mechanisms for improved performance
    • a search engine not only maintains web information, but also logs user queries
    • a good understanding of the queries gives rise to cache designs and performance-tuning approaches
What do you keep?
  • So you gather data from the web, storing
    • documents and, more importantly, the words extracted from documents
  • After removing dull (stop) words, you store the document number for each word together with additional data
    • the position, and meta-information such as font and enclosing tag (e.g. whether the word is in a meta-data section)
  • Positions are needed to respond to multiword queries with adjacency requirements (see the sketch below)
  • There is a lot of important research on the best way to gather, store and retrieve this information
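As a minimal illustration (a toy layout, not the storage format of any particular engine), a positional inverted index maps each remaining word to the documents and positions where it occurs, which is enough to answer adjacency queries:

from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}   # the "dull" words to drop

def build_positional_index(docs):
    """docs: {doc_id: text}.  Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][doc_id].append(pos)
    return index

def adjacent(index, w1, w2):
    """Documents where w1 is immediately followed by w2 (a simple phrase query)."""
    hits = set()
    for doc_id, positions in index.get(w1, {}).items():
        later = set(index.get(w2, {}).get(doc_id, []))
        if any(p + 1 in later for p in positions):
            hits.add(doc_id)
    return hits

idx = build_positional_index({1: "web search engines index the web", 2: "search the web"})
print(adjacent(idx, "web", "search"))   # {1}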
What Pages should one get?
  • A Web search engine is an information retrieval engine, not a knowledge retrieval engine
  • It looks at a set of text pages with certain additional characteristics
    • URL, titles, fonts, meta-data
  • And matches a query to these pages, returning pages in a certain order
  • This order, and the choices made by the user in dealing with this order, can be thought of as “knowledge”
    • E.g. the user tries different queries and decides which of the returned set to explore
  • People complain about the “number of pages” returned, but I think this is a GOOD model for knowledge and it is good to combine people with the computer
How do you Rank Pages?
  • One can identify at least 4 criteria
  • Content of the document, i.e. the nature of occurrence of query terms in the document (author)
  • Nature of links to and from this document – this is characteristic of a Web page (other authors)
    • Google and the IBM Clever project emphasized this
  • Occurrence of the document in compiled directories (editors)
  • Data on what users of the search service have done (users)
Document Content Ranking
  • Here the TF*IDF method is typical
    • TF is the Term (query word) Frequency
    • IDF is the Inverse Document Frequency
  • This gives a crude ranking which can be refined by other schemes
  • If you have multiple query terms then you can add their values of TF*IDF
  • The next slides come from earlier courses by Goble (Manchester) and Maryland, cited at the start
IR (Information Retrieval) as Clustering
  • A query is a vague specification of a set of objects, A
  • IR is reduced to the problem of determining which documents are in set A and which ones are not
  • Intra-clustering similarity:
    • What are the features that best describe the objects in A?
  • Inter-clustering dissimilarity:
    • What are the features that best distinguish the objects in A from the remaining objects in C?

[Diagram: the retrieved documents A shown as a cluster of points (x) inside the full document collection C]

Index term weighting

Normalised frequency of term t in document d (intra-clustering similarity):

tf(t,d) = occ(t,d) / occ(tmax, d)

  • The raw frequency of a term t inside a document d, normalised by the most frequent term tmax in d.
  • A measure of how well the term describes the document contents.

Inverse document frequency (inter-cluster dissimilarity):

idf(t) = log( N / n(t) )

  • The inverse of the frequency of term t among the N documents in the collection (n(t) is the number of documents containing t).
  • Terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one.

Weight(t,d) = tf(t,d) x idf(t)
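A small, self-contained Python sketch of these formulas (base-10 logarithm chosen to match the idf values in the example that follows; the toy documents are illustrative):

import math
from collections import Counter

def tf(term, doc_terms):
    """Normalised term frequency: occ(t,d) / occ(tmax, d)."""
    counts = Counter(doc_terms)
    return counts[term] / max(counts.values())

def idf(term, all_docs):
    """Inverse document frequency: log10(N / n(t))."""
    n_t = sum(1 for d in all_docs if term in d)
    return math.log10(len(all_docs) / n_t) if n_t else 0.0

def weight(term, doc_terms, all_docs):
    return tf(term, doc_terms) * idf(term, all_docs)

def score(query_terms, doc_terms, all_docs):
    """Rank score for multi-term queries: sum of the per-term TF*IDF weights."""
    return sum(weight(t, doc_terms, all_docs) for t in query_terms)

docs = [["nuclear", "fallout", "siberia", "contaminated"],
        ["information", "retrieval", "interesting"],
        ["contaminated", "retrieval", "nuclear"],
        ["information", "fallout"]]
print(score(["contaminated", "retrieval"], docs[2], docs))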

Term weighting schemes
  • Best known (term frequency x inverse document frequency):

    weight(t,d) = ( occ(t,d) / occ(tmax, d) ) x log( N / n(t) )

  • Variation for query term weights:

    weight(t,q) = ( 0.5 + 0.5 x occ(t,q) / occ(tmax, q) ) x log( N / n(t) )
TF*IDF Example

[Table: eight terms (complicated, contaminated, fallout, information, interesting, nuclear, retrieval, siberia) with idf values 0.301, 0.125, 0.125, 0.000, 0.602, 0.301, 0.125, 0.602 respectively, together with their raw term frequencies and tf*idf weights in four documents; the column layout of the table was lost in transcription.]

  • Unweighted query: contaminated retrieval. Result: 2, 3, 1, 4
  • Weighted query: contaminated(3) retrieval(1). Result: 1, 3, 2, 4
  • IDF-weighted query: contaminated retrieval. Result: 2, 3, 1, 4

Document Length Normalization
  • Long documents have an unfair advantage
    • They use a lot of terms
      • So they get more matches than short documents
    • And they use the same words repeatedly
      • So they have much higher term frequencies
Cosine Normalization Example

[Table: the same eight terms and four documents as in the TF*IDF example, now showing raw term frequencies, tf*idf weights, and cosine-normalized weights, together with each document's vector length (values 1.70, 0.97, 2.67, 0.87); the column layout of the table was lost in transcription.]

Unweighted query: contaminated retrieval, Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4)
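A minimal Python sketch of cosine (length) normalization, illustrating the idea behind the example above rather than reproducing its exact numbers (the sample document weights are made up):

import math

def cosine_normalize(doc_weights):
    """Divide each tf*idf weight by the document's vector length.

    doc_weights: {term: weight} for one document.
    Long documents no longer win simply by containing more (and more repeated) terms.
    """
    length = math.sqrt(sum(w * w for w in doc_weights.values()))
    return {t: w / length for t, w in doc_weights.items()} if length else doc_weights

def score(query_terms, doc_weights):
    """Score a document for a query by summing its normalized weights for the query terms."""
    normalized = cosine_normalize(doc_weights)
    return sum(normalized.get(t, 0.0) for t in query_terms)

doc = {"nuclear": 0.90, "contaminated": 0.13, "retrieval": 0.13}
print(score(["contaminated", "retrieval"], doc))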

Google Page Rank
  • This exploits the nature of links to a page, which are a measure of “citations” for the page
  • Page A has pages T1, T2, T3, … Tn which point to it
  • d is a fudge (damping) factor (say 0.85)
  • PR(A) = (1-d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn) )
  • where C(Tk) is the number of outgoing links from page Tk (a sketch follows)
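A minimal iterative Python sketch of this recurrence; the example graph and iteration count are illustrative assumptions, not Google's production code:

def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it points to]}.  Returns PR for every page.

    Implements PR(A) = (1-d) + d * sum(PR(T)/C(T) for each page T linking to A),
    iterated until the values settle.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = (pr[t] / len(links[t]) for t in pages if page in links.get(t, []))
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))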
HITS: Hypertext Induced Topic Search
  • The ranking scheme depends on the query
  • Considers the set of pages that point to, or are pointed at by, pages in the answer set S
  • Implemented in IBM’s Clever prototype
  • Scientific American article:
  • http://www.sciam.com/1999/0699issue/0699raghavan.html
HITS (2)
  • Authorities:
    • pages in S that have many links pointing to them
  • Hubs:
    • pages that have many outgoing links
  • Positive two-way feedback:
    • better authority pages have incoming edges from good hubs
    • better hub pages have outgoing edges to good authorities
Authorities and Hubs

[Diagram: a link graph with authorities (blue) and hubs (red)]

HITS two-step iterative process

  • assign initial scores to candidate hubs and authorities on a particular topic in the set of pages S
  • use the current guesses about the authorities to improve the estimates of the hubs: locate all the best authorities
  • use the updated hub information to refine the guesses about the authorities: determine where the best hubs point most heavily and call these the good authorities
  • repeat until the scores eventually converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs

H(p) = Σ A(u)  over all u in S such that p links to u
A(p) = Σ H(v)  over all v in S such that v links to p
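A minimal Python sketch of this two-step iteration on a toy graph; the normalization step is an assumption added so the scores converge toward the principal eigenvector, as described above:

import math

def hits(links, iterations=50):
    """links: {page: [pages it points to]} for the focused set S.
    Returns (authority, hub) scores after the two-step iteration.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(v) over pages v that link to p
        auth = {p: sum(hub[v] for v in pages if p in links.get(v, [])) for p in pages}
        # H(p) = sum of A(u) over pages u that p links to
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # normalize so the scores settle toward the principal eigenvector direction
        a_norm = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        h_norm = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: x / a_norm for p, x in auth.items()}
        hub = {p: x / h_norm for p, x in hub.items()}
    return auth, hub

authority, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
print(max(authority, key=authority.get))   # a1: pointed to by both hubs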

Cybercommunities

HITS is clustering the web into communities

Google vs Clever

Google:
  • assigns initial rankings and retains them independently of any queries, which enables faster response
  • looks only in the forward direction, from link to link

Clever:
  • assembles a different root set for each search term and then prioritizes those pages in the context of that particular query
  • also looks backward from an authoritative page to see what locations are pointing there; humans are innately motivated to create hub-like content expressing their expertise on specific topics
Peking University: User behavior analysis
  • taking 3 months’ worth of real user queries (about 1 million queries)
  • each query consists of <keywords, time, IP address, …>
  • keyword distribution: we observe that high-frequency keywords dominate (a small analysis sketch follows)
  • grouping the queries into blocks of 1000 and examining the difference between consecutive groups: we observe a quite stable process (the difference is quite small)
  • doing the above for different group sizes: we observe a strong self-similar structure
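A minimal Python sketch of this kind of log analysis, computing the share of total query traffic covered by the most frequent keywords; the log format and sample queries are assumptions, not the actual WebGather log schema:

from collections import Counter

def traffic_share_of_top_keywords(queries, top_fraction=0.2):
    """queries: list of keyword strings, one per logged query.
    Returns the share of total traffic covered by the top `top_fraction`
    of distinct keywords (the 20%/80% observation on the next slide).
    """
    counts = Counter(queries)
    ranked = [c for _, c in counts.most_common()]
    top_n = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

log = ["mp3", "mp3", "mp3", "news", "mp3", "news", "train timetable", "weather"]
print(traffic_share_of_top_keywords(log))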
Distribution of user queries

[Chart: cumulative share of query traffic versus fraction of distinct query terms, showing that roughly 20% of the terms account for roughly 80% of the searches]

  • Only 160,000 different keywords in 960,000 queries
  • The 20% of keywords that are high-frequency account for 80% of the total search volume
Towards high performance
  • Query caching improves system performance dramatically (see the sketch below)
    • more than 70% of user queries can be satisfied in less than 1 millisecond
    • almost all queries are answered within 1 second
  • User behavior may also be used for other purposes
    • evaluation of various ranking metrics; e.g., the link popularity and replica popularity of a URL have a positive influence on its importance
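A minimal Python sketch of a query result cache that exploits this skewed distribution; an LRU policy is one plausible choice and is an assumption here, not necessarily WebGather's design:

from collections import OrderedDict

class QueryCache:
    """Small LRU cache of query -> result list, exploiting the fact that
    a small fraction of distinct queries makes up most of the traffic."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)          # mark as recently used
            return self.entries[query]
        return None

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)         # evict the least recently used query

def answer(query, cache, run_search):
    cached = cache.get(query)
    if cached is not None:                           # hit: the sub-millisecond path
        return cached
    results = run_search(query)                      # miss: full index lookup
    cache.put(query, results)
    return results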

Our approach
  • Parallel and distributed processing: reach for large scale and scalability
  • User behavior analysis: yields mechanisms for performance
  • Making use of the content of web pages: suggests innovative algorithms for quality
Towards good quality
  • Do not miss the important pages: keep the recall rate high
  • Clever algorithm for removing near-replicas: better accuracy
  • New metrics to evaluate pages’ relevance: improved ranking
    • anchor-text based, instead of PageRank based
Fetch the “important” pages first

crawling is normally done within a time frame, so not missing important pages is a practical issue for guaranteeing good search quality later on

besides picking “good” seed URLs, we use a formula to determine the importance of a page

Removing near-replicas

  • Url1 (http://www.a.com/index.html), term frequencies: computer 45, network 33, server 9, …
  • Url2 (http://www.b.com/gbindex.html), term frequencies: computer 45, network 30, server 16, …

[Diagram: the two pages plotted as vectors a (Url1) and b (Url2) in the term space spanned by computer, network, server; they are treated as near-replicas when their difference is small relative to the vectors, e.g. 3/(a+b) < 0.01]

vector based vs. fingerprint based (a sketch of the vector-based idea follows)
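A minimal Python sketch of the vector-based idea, comparing term-frequency vectors with cosine similarity; the similarity measure and threshold are assumptions, and WebGather's actual criterion (the test in the diagram above) may differ in detail:

import math

def cosine_similarity(freq_a, freq_b):
    """freq_a, freq_b: {term: count} term-frequency vectors of two pages."""
    terms = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(t, 0) * freq_b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def near_replicas(freq_a, freq_b, threshold=0.99):      # threshold is an assumption
    return cosine_similarity(freq_a, freq_b) >= threshold

url1 = {"computer": 45, "network": 33, "server": 9}
url2 = {"computer": 45, "network": 30, "server": 16}
print(cosine_similarity(url1, url2), near_replicas(url1, url2))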

Related work
  • Harvest
    • good academic ideas, but a complicated design; not sustained
  • Google
    • the most famous search engine in the world at the moment, but little exposure of the technology used after 1998 (Brin, 1998, WWW-7)
    • character-based, instead of word-based, Chinese processing?
    • more hardware than necessary (10,000 PCs were reported)?