Google Search and Information Retrieval on the Internet

马志明 (Zhi-Ming Ma)

May 16, 2008

Email: [email protected]

http://www.amt.ac.cn/member/mazhiming/index.html



About 626,000 results matched the query 中国科学院数学与系统科学研究院 (Academy of Mathematics and Systems Science, Chinese Academy of Sciences); results 1–100 are shown below. (Search took 0.45 seconds.)

How can Google make a ranking of 626,000 pages in 0.45 seconds?



A main task of Internet (Web) Information Retrieval = Design and Analysis of Search Engine (SE) Algorithm

involving

plenty of Mathematics


HITS

1998 Jon Kleinberg Cornell University

PageRank

  • Sergey Brin and Larry Page

  • Stanford University



One of Kleinberg's most important research achievements focuses on the internetwork structure of the World Wide Web.

Prior to Kleinberg's work, search engines focused only on the content of web pages, not on the link structure.

Kleinberg introduced the idea of "authorities" and "hubs":

An authority is a web page that contains information on a particular topic, and a hub is a page that contains links to many authorities.

Zhuzihu thesis.pdf

Nevanlinna Prize (2006): Jon Kleinberg
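The hub/authority idea lends itself to a short mutual-reinforcement iteration. Below is a minimal sketch (not from the slides; the toy graph and all names are illustrative) of the HITS updates on an adjacency matrix:

```python
import numpy as np

def hits(adj, iters=50):
    """Minimal HITS sketch: adj[i][j] = 1 if page i links to page j."""
    n = adj.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iters):
        # A page is a good authority if good hubs point to it ...
        auths = adj.T @ hubs
        # ... and a good hub if it points to good authorities.
        hubs = adj @ auths
        auths /= np.linalg.norm(auths)
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy graph: pages 0 and 1 both link to page 2.
adj = np.array([[0, 0, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)
hubs, auths = hits(adj)
print(auths)  # page 2 gets the highest authority score
```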


PageRank, the ranking system used by the Google search engine:

  • Query-independent

  • Content-independent

  • Uses only the web graph structure (a power-iteration sketch follows this list)
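Because PageRank is query- and content-independent, the whole vector can be computed offline and merely looked up at query time, which is one answer to the 0.45-second question above. A minimal power-iteration sketch (illustrative, not Google's implementation; α = 0.85 is the conventional damping factor):

```python
import numpy as np

def pagerank(adj, alpha=0.85, iters=100):
    """Power iteration for PageRank; adj[i][j] = 1 if page i links to page j."""
    n = adj.shape[0]
    out = adj.sum(axis=1)
    # Row-stochastic transition matrix; dangling pages jump uniformly.
    P = np.where(out[:, None] > 0, adj / np.maximum(out, 1)[:, None], 1.0 / n)
    pi = np.full(n, 1.0 / n)                     # uniform starting distribution
    for _ in range(iters):
        pi = alpha * (pi @ P) + (1 - alpha) / n  # teleport with prob. 1 - alpha
    return pi

adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
print(pagerank(adj))                             # sums to 1, ranks the 3 pages
```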




WWW 2005 paper:

PageRank as a Function of the Damping Factor

Paolo Boldi, Massimo Santini, Sebastiano Vigna

DSI, Università degli Studi di Milano

3 General Behaviour

3.1 Choosing the damping factor

3.2 Getting close to 1

  • can we somehow characterise the properties of the limit $\pi^* = \lim_{\alpha \to 1^-} \pi(\alpha)$?

  • what makes $\pi^*$ different from the other (infinitely many, if $P$ is reducible) limit distributions of $P$?


Conjecture 1:

$\pi^*$ is the limit distribution of $P$ when the starting distribution is uniform, that is,

$\pi^* = \lim_{k \to \infty} \frac{1}{n}\,\mathbf{1}^T P^k .$

Websites provide plenty of information:

Pages in the same website may share the same IP, run on the same web server and database server, and be authored/maintained by the same person or organization.

There might be high correlations between pages in the same website, in terms of content, page layout, and hyperlinks.

Websites contain a higher density of hyperlinks inside them (about 75%) and a lower density of edges between them.


The HostGraph loses much transition information.

Can a surfer jump from page 5 of site 1 to a page in site 2?


From: [email protected] [mailto:s06-pc-chairs-
Sent: April 4, 2006, 8:36
To: Tie-Yan Liu; [email protected]; [email protected]; [email protected]; [email protected]
Subject: [SIGIR2006] Your Paper #191
Title: AggregateRank: Bring Order to Web Sites
Congratulations!!

29th Annual International Conference on Research & Development in Information Retrieval (SIGIR'06, August 6–11, 2006, Seattle, Washington, USA).


Ranking Websites: A Probabilistic View

Internet Mathematics, Volume 3 (2007), Issue 3

Ying Bao, Gang Feng, Tie-Yan Liu, Zhi-Ming Ma, and Ying Wang


--- We suggest evaluating the importance of a website by the mean frequency of visiting the website for the Markov chain on the Internet graph describing random surfing.

--- We show that this mean frequency is equal to the sum of the PageRanks of all the webpages in that website (hence it is referred to as PageRankSum).
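In symbols (notation introduced here for readability, matching the definitions later in the talk): for a website $S$,

```latex
\xi_S(\alpha) \;=\; \sum_{j \in S} \pi_j(\alpha)
\qquad \text{(the PageRankSum of } S\text{)} .
```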


--- We propose a novel algorithm (the AggregateRank Algorithm), based on the theory of stochastic complement, to calculate the rank of a website.

--- The AggregateRank Algorithm can approximate the PageRankSum accurately, while its computational complexity is much lower than that of PageRankSum.


--- By constructing return-time Markov chains restricted to each website, we also describe the probabilistic relation between PageRank and AggregateRank.

--- The complexity and the error bound of the AggregateRank Algorithm, with experiments on real data, are discussed at the end of the paper.



The stationary distribution, known as the PageRank vector, is given by

$\pi(\alpha)\, P(\alpha) = \pi(\alpha), \qquad \pi(\alpha)\, e = 1 .$

We may rewrite the stationary distribution, partitioned by website, as

$\pi(\alpha) = (\pi_1(\alpha), \pi_2(\alpha), \ldots, \pi_N(\alpha)),$

with $\pi_i(\alpha)$ as a row vector of length $n_i$, the number of pages in website $S_i$,

where $P(\alpha)$ is given by $P(\alpha) = \alpha P + (1-\alpha)\frac{1}{n}\, e\, e^{T}$, and $e$ is a column vector of all ones (of the appropriate dimension).

We define the one-step transition probability from the website $S_i$ to the website $S_j$ by

$c_{ij}(\alpha) = \frac{\pi_i(\alpha)}{\lVert \pi_i(\alpha) \rVert_1}\, P_{ij}(\alpha)\, e,$

where $P_{ij}(\alpha)$ is the $(i,j)$ block of $P(\alpha)$.

The $N \times N$ matrix $C(\alpha) = (c_{ij}(\alpha))$ is referred to as the coupling matrix, whose elements represent the transition probabilities between websites. It can be proved that $C(\alpha)$ is an irreducible stochastic matrix, so that it possesses a unique stationary probability vector. We use $\xi(\alpha)$ to denote this stationary probability, which can be obtained from

$\xi(\alpha)\, C(\alpha) = \xi(\alpha), \qquad \xi(\alpha)\, e = 1 .$

Since $\pi(\alpha)$ is the stationary distribution of $P(\alpha)$, one can easily check that

$\xi(\alpha) = \left( \lVert \pi_1(\alpha) \rVert_1, \lVert \pi_2(\alpha) \rVert_1, \ldots, \lVert \pi_N(\alpha) \rVert_1 \right)$

is the unique solution to the above equations. We shall refer to $\xi(\alpha)$ as the AggregateRank.

That is, the probability of visiting a website is equal to the sum of the PageRanks of all the pages in that website. This conclusion is consistent with our intuition.


The transition probability from $S_i$ to $S_j$ actually summarizes all the cases in which the random surfer jumps from any page in $S_i$ to any page in $S_j$ within a one-step transition. Therefore, the transition in this new HostGraph is in accordance with the real behavior of Web surfers. In this regard, the rank calculated from the coupling matrix $C(\alpha)$ is more reasonable than those of previous works.


Let $N_A(n)$ denote the number of visits to website $A$ during the first $n$ steps, that is,

$N_A(n) = \sum_{k=1}^{n} \mathbf{1}_{\{X_k \in A\}} .$


Assume a starting state in website $A$, i.e. $X_0 \in A$, and define inductively

$\tau_0 = 0, \qquad \tau_{k+1} = \inf\{ n > \tau_k : X_n \in A \} .$

It is clear that all these variables are stopping times for $X$.


Similarly, let $P^A$ denote the transition matrix of the return-time Markov chain $(X_{\tau_k})_{k \ge 0}$ for site $A$.

The stationary distribution of $P^A$ is the normalized restriction of $\pi(\alpha)$ to $A$, i.e. $\pi_A(\alpha) / \lVert \pi_A(\alpha) \rVert_1$. Since the long-run fraction of steps spent in $A$ equals $\lVert \pi_A(\alpha) \rVert_1$, the mean frequency of visiting website $A$ is therefore exactly its AggregateRank, i.e. the sum of the PageRanks of all the pages in that website.


  • Based on the above discussion, the direct approach to computing the AggregateRank ξ(α) is to accumulate PageRank values (denoted by PageRankSum).

  • However, this approach is infeasible because the computation of PageRank is not a trivial task when the number of web pages is as large as several billion. Therefore, efficient computation becomes a significant problem.


AggregateRank

1. Divide the $n \times n$ matrix $P(\alpha)$ into $N \times N$ blocks according to the $N$ sites.


3. Determine $u_i$ from $u_i P^{\star}_{ii} = u_i$, $u_i e = 1$.

4. Form an approximation $C^{\star}(\alpha)$ to the coupling matrix, by evaluating

$c^{\star}_{ij}(\alpha) = u_i\, P_{ij}(\alpha)\, e .$

5. Determine the stationary distribution of $C^{\star}(\alpha)$ and denote it $\xi^{\star}(\alpha)$, i.e.,

$\xi^{\star}(\alpha)\, C^{\star}(\alpha) = \xi^{\star}(\alpha), \qquad \xi^{\star}(\alpha)\, e = 1 .$

(A runnable sketch of these steps follows.)
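A runnable sketch of the recipe above. Step 2 does not appear in the transcript; this sketch folds each row's off-block probability mass onto the diagonal of its own block so that every diagonal block becomes stochastic, a simple stand-in for a stochastic-complement construction. Treat that choice, and all names here, as assumptions.

```python
import numpy as np

def stationary(M, iters=200):
    """Power iteration for the stationary row vector of a stochastic matrix."""
    x = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        x = x @ M
    return x / x.sum()

def aggregate_rank(P, sites):
    """Approximate AggregateRank sketch. P: n x n row-stochastic Google matrix
    (alpha already folded in); sites: list of index arrays, one per website."""
    N = len(sites)
    u = []  # per-site stationary vectors u_i (step 3)
    for idx in sites:
        block = P[np.ix_(idx, idx)].copy()
        # Assumed stand-in for step 2: add each row's off-block mass to the
        # diagonal so the block becomes stochastic.
        block[np.diag_indices_from(block)] += 1.0 - block.sum(axis=1)
        u.append(stationary(block))
    # Step 4: approximate coupling matrix, c_ij = u_i * P_ij * e.
    C = np.array([[u[i] @ P[np.ix_(sites[i], sites[j])].sum(axis=1)
                   for j in range(N)] for i in range(N)])
    C /= C.sum(axis=1, keepdims=True)  # guard against numerical drift
    return stationary(C)               # step 5: stationary distribution of C
```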


Experiments

  • In our experiments, the data corpus is the benchmark data for the Web track of TREC 2003 and 2004, which was crawled from the .gov domain in 2002.

  • It contains 1,247,753 webpages in total.


We get 731 sites in the .gov dataset. The largest website contains 137,103 web pages, while the smallest one contains only 1 page.



Similarity (Kendall's distance) between PageRankSum and the other three ranking results.
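Kendall's distance counts pairwise order disagreements between two rankings. A small sketch of such a comparison (hypothetical scores; scipy's kendalltau returns +1 for identical orderings and -1 for reversed ones):

```python
from scipy.stats import kendalltau

# Scores two hypothetical rankers assign to the same five websites.
pagerank_sum = [0.31, 0.22, 0.19, 0.17, 0.11]
other_ranker = [0.28, 0.25, 0.15, 0.20, 0.12]

tau, _ = kendalltau(pagerank_sum, other_ranker)
print(f"Kendall tau similarity: {tau:.3f}")
```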


From: [email protected]
Sent: Thursday, April 03, 2008 9:48 AM

Dear Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li,

We are pleased to inform you that your paper
Title: BrowseRank: Letting Web Users Vote for Page Importance
has been accepted for oral presentation as a full paper and for publication as an eight-page paper in the proceedings of the 31st Annual International ACM SIGIR Conference on Research & Development in Information Retrieval.

Congratulations!!


Building the Model

  • Properties of the Q-process:

    • Stationary distribution:

    • Jumping probability:

    • Embedded Markov chain: a Markov chain with the transition probability matrix


Main Conclusion 1

  • The mean staying time on page i: the more important a page is, the longer the staying time on it is.

  • The mean of the first re-visit time at page i: the more important a page is, the smaller the re-visit time is, and the larger the visit frequency is. (A worked relation follows this list.)
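Both conclusions match the standard stationary theory of a continuous-time Markov process. With notation introduced here for illustration (an assumption, not taken from the slides): let $\tilde\pi$ be the stationary distribution of the embedded jump chain and $T_i$ the mean staying time on page $i$; then

```latex
\pi_i \;=\; \frac{\tilde\pi_i\, T_i}{\sum_j \tilde\pi_j\, T_j},
\qquad
\mathbb{E}[\text{re-visit cycle time at } i] \;=\; \frac{T_i}{\pi_i} .
% A more important page (larger \pi_i) thus has a longer staying time
% and a shorter mean re-visit time, i.e. a higher visit frequency.
```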


Main Conclusion 2

  • The stationary distribution of the embedded Markov chain determines that of the Q-process (see the relation above).

  • The stationary distribution of the discrete model is easy to compute:

    • Power method for the stationary distribution of the embedded chain

    • Log data for the mean staying times

Further Questions

  • What about inhomogeneous processes?

    • Statistical results show that different periods of time possess different visiting frequencies.

    • Poisson processes with different intensities.

  • Marked point processes

    • Hyperlinks are not reliable.

    • Users' real behavior should be considered.


Relevance Ranking

Many features for measuring relevance:

Term distribution (anchor, URL, title, body, proximity, …)

Recommendation & citation (PageRank, click-through data, …)

Statistics or knowledge extracted from web data

Questions:

What is the optimal ranking function to combine different features (or evidences)?

How to measure relevance?


Learning to Rank

What are the optimal weightings for combining the various features?

Use machine learning methods to learn the ranking function.

Human relevance system (HRS)

Relevance verification tests (RVT)

Wei-Ying Ma, Microsoft Research Asia


Learning to Rank

[Diagram: a Learning System learns a Model by minimizing a loss; the Model is then used by the Ranking System.]

Wei-Ying Ma, Microsoft Research Asia


Learning to Rank (Cont.)

  • State-of-the-art algorithms for learning to rank take the pairwise approach:

    • Ranking SVM

    • RankBoost

    • RankNet (employed at Live Search)

Wei-Ying Ma, Microsoft Research Asia


Learning to Rank

  • The goal of learning to rank is to construct a real-valued function that can generate a ranking of the documents associated with a given query. The state-of-the-art methods transform the learning problem into classification and then perform the learning task:


  • For each query, it is assumed that there are two categories of documents: positive and negative (representing relevant and irrelevant with respect to the query). Document pairs are then constructed between positive documents and negative documents; in the training process, the query information is actually ignored (see the sketch below).
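A minimal sketch of this pairwise reduction (illustrative features; a plain perceptron-style hinge update stands in for the actual Ranking SVM / RankBoost optimizers):

```python
import numpy as np

def pairwise_train(X_pos, X_neg, lr=0.1, epochs=100):
    """Learn w so that w @ x_pos > w @ x_neg for every (positive, negative) pair."""
    w = np.zeros(X_pos.shape[1])
    for _ in range(epochs):
        for xp in X_pos:              # pairs are built across the two classes;
            for xn in X_neg:          # note the query itself plays no role here
                if w @ (xp - xn) < 1.0:      # mis-ordered or low-margin pair
                    w += lr * (xp - xn)      # push the pair the right way round
    return w

# Toy data: 2 relevant and 2 irrelevant documents, 3 features each.
X_pos = np.array([[0.9, 0.7, 0.8], [0.8, 0.6, 0.9]])
X_neg = np.array([[0.2, 0.3, 0.1], [0.1, 0.4, 0.2]])
w = pairwise_train(X_pos, X_neg)
print(X_pos @ w, X_neg @ w)           # positives should score higher
```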


[5] Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting Ranking SVM to document retrieval. In Proc. of SIGIR'06, pages 186–193, 2006.

[11] T. Qin, T.-Y. Liu, M.-F. Tsai, X.-D. Zhang, and H. Li. Learning to search web pages with query-level loss functions. Technical Report MSR-TR-2006-156, 2006.




It should be noted that the bound makes sense only when a certain condition holds; this condition can be satisfied in many practical cases.

As case studies, we investigate Ranking SVM and RankBoost.

We show that after introducing query-level normalization to its objective function, Ranking SVM will have query-level stability.

For RankBoost, the query-level stability can be achieved if we introduce both query-level normalization and regularization to its objective function.

These analyses agree largely with our experiments and the experiments in [5] and [11].


Rank Aggregation

  • Rank aggregation combines the ranking results of entities from multiple ranking functions in order to generate a better one. The individual ranking functions are referred to as base rankers, or simply rankers.


Score-based Aggregation

  • Rank aggregation can be classified into two categories [2]. In the first category, the entities in individual ranking lists are assigned scores, and the rank aggregation function is assumed to use these scores (denoted as score-based aggregation) [11][18][28].


Order-based Aggregation

  • In the second category, only the orders of the entities in the individual ranking lists are used by the aggregation function (denoted as order-based aggregation). Order-based aggregation is employed in meta-search, for example, where only order (rank) information from the individual search engines is available.


  • Previously, order-based aggregation was mainly addressed with the unsupervised learning approach, in the sense that no training data is utilized; methods such as the following were proposed (a Borda Count sketch follows this list):

  • Borda Count [2][7][27]

  • median rank aggregation [9]

  • genetic algorithm [4]

  • fuzzy-logic-based rank aggregation [1]

  • Markov Chain based rank aggregation [7]
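As promised above, a Borda Count sketch: each base ranker gives an item one point per item ranked below it, and items are sorted by total points (document names are illustrative):

```python
from collections import defaultdict

def borda_count(rankings):
    """rankings: list of ranked lists (best first) over the same items."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - 1 - pos  # points = number of items ranked below
    return sorted(scores, key=scores.get, reverse=True)

# Three base rankers over four documents.
print(borda_count([["a", "b", "c", "d"],
                   ["b", "a", "d", "c"],
                   ["a", "c", "b", "d"]]))  # -> ['a', 'b', 'c', 'd']
```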


It turns out that the optimization problems for the Markov Chain based methods are hard, because they are not convex optimization problems.

We are able to develop a method for the optimization of one Markov Chain based method, called Supervised MC2.

We prove that the optimization problem can be transformed into a Semidefinite Programming problem. As a result, we can solve it efficiently.


Next Generation Web Search? (Web Search 2.0 → 3.0)

Directions for new innovations

Process-centric vs. data-centric

Infrastructure for Web-scale data mining

Intelligence & knowledge discovery

Wei-Ying Ma, Microsoft Research Asia


Web Search – Past, Present, and Future

Wei-Ying Ma

Web Search and Mining Group

Microsoft Research Asia

next generation.ppt

Web Search - Past Present and Future - public.ppt

