Web Mining - I



Slide 1:Web Mining - I

Group Number 3 Course Instructor: Prof. Anita Wasilewska State University of New York at Stony Brook

Slide 2:Introduction to Web Mining

By Rajat Gogri

Slide 3:References

R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations, 2(1):1-15, 2000.
R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
S. Chakrabarti. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2):1-11, 2000.
S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data.

Slide 4:Introduction

Why do we need it? What is it? How does it differ from classical data mining? How big is the Web? What are the problems? What is the role of web mining? What are the subtasks of web mining? What is the web mining taxonomy?

Slide 5:Why we need Web Mining?

Explosive growth of the amount of content on the internet. Web search engines return thousands of results, so they are difficult to browse. Online repositories are growing rapidly. Using web mining, web documents can easily be BROWSED, ORGANISED and CATALOGED with minimal human intervention. # With the recent explosive growth of the amount of content on the Internet, it has become increasingly difficult for users to find and utilize information and for content providers to classify and catalog documents. # Traditional web search engines often return hundreds or thousands of results for a search, which is time consuming for users to browse. On-line libraries, search engines, and other large document repositories (e.g. customer support databases, product specification databases, press release archives, news story archives, etc.) are growing so rapidly that it is difficult and costly to categorize every document manually. # In order to deal with these problems, researchers look toward automated methods of working with web documents so that they can be more easily browsed, organized, and cataloged with minimal human intervention.

Web mining - the use of data mining techniques to automatically discover and extract information from web documents and services

Slide 6:What is it?

We can simply say that web mining is an extension of KDD (knowledge discovery in databases) applied to web data.

The web is not a relation Textual information and linkage structure Usage data is huge and growing rapidly Google’s usage logs are bigger than their web crawl Data generated per day is comparable to largest conventional data warehouses Ability to react in real-time to usage patterns No human in the loop

Slide 7:How does it differ from “classical” Data Mining?

# The web is not a relation -- data mining input is usually very well structured, in tables etc.; web data is not well structured. It is in the form of text and links (HTML tags etc.), so we can call it semi-structured. # Usage data is huge and growing rapidly.

How big is the Web?
Number of pages: technically infinite, because of dynamically generated content
Lots of duplication (30-40%)
Best estimate of "unique" static HTML pages comes from search engine claims: Google = 8 billion, Yahoo = 20 billion; lots of marketing hype
76,184,000 web sites (Feb 2006), Netcraft survey: http://news.netcraft.com/archives/web_server_survey.html
Web Mining: Problems
The "abundance" problem
Limited coverage of the Web
Limited query interface based on keyword-oriented search
Limited customization to individual users
Dynamic and semi-structured

Slide 10:# Growing very rapidly # Limited coverage: hidden web sources; the majority of data is in databases # Interface is based on keyword-oriented search # Limited customization to individual users # Dynamic and semi-structured

Role of web mining
Finding relevant information
Creating knowledge from information available
Personalization of the information
Learning about customers / individual users
Web Mining: Subtasks
Resource Finding - task of retrieving intended web documents
Information Selection & Pre-processing - automatic selection and pre-processing of specific information from retrieved web resources
Generalization - automatic discovery of patterns in web sites
Analysis - validation and/or interpretation of mined patterns

Slide 12:Resource finding: by resource finding we mean retrieving data, either online or offline, from the text sources available on the web, such as electronic newsletters, electronic newswires, newsgroups, and the text content of HTML documents obtained by removing HTML tags. We can also include text sources that were originally not accessible from the web but are now, such as online text made available for research purposes. Information Selection and Pre-processing: a kind of transformation process applied to the data retrieved by the IR process. These transformations could be pre-processing steps such as removing stop words, stemming, finding phrases, etc. Generalization. Analysis.

Slide 13:Web Mining Taxonomy

The web involves three kinds of data: content of the web page, web logs to see usage patterns, and web structure. Web Content Mining ================ from search results, categorize the documents using phrases in titles and snippets. Web Structure ============ web structure mining tries to discover the model underlying the link structures of the web. The model is based on the topology of the hyperlinks, with or without descriptions of the links. It can be used to categorize web pages and is useful for generating information such as the similarity and relationships between different web sites. #### Web content mining and structure mining utilize the real or primary data on the web, while usage mining uses the data generated by users' interactions with the web. #### Web Usage Mining =============== tries to make sense of the data generated by web surfers' sessions and behaviors. Analyzes access patterns of a user to improve response.

Slide 14:Web Mining Classification

By Pranav Moolwaney

Slide 15:References

http://en.wikipedia.org/wiki/Web_mining
http://en.wikipedia.org/wiki/Shop_bot
Y. S. Mareek and I. Z. B. Shaul. Automatically organizing bookmarks per contents. Proc. Fifth International World Wide Web Conference, May 6-10, 1996.
Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. Proc. IEEE Intl. Conf. Tools with AI, Newport Beach, CA, pp. 558-567, 1997.

Slide 16:Classification

[Taxonomy diagram: Web Mining → Web Content, Web Usage, Web Structure]

Slide 17:Web Content Mining

[Taxonomy diagram: Web Content Mining → Agent Based Approach (Intelligent Search Agent, Information Filtering & Categorization, Personalized Web Agent), Database Approach (Multilevel Databases, Web Query Systems)]

Slide 18:Intelligent Search Agents

These agents concentrate on searching for relevant information, using the characteristics of a particular domain to interpret and organize the collected information. They can be further classified into two types: Interpretation Based on Pre-Specified Information - examples: Harvest, FAQFinder, Information Manifold, OCCAM. Interpretation Based on Unfamiliar Sources - example: ShopBot. Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 19:ShopBot

A ShopBot is an autonomous software agent that combs the internet, providing users with low-price products or product recommendations. A ShopBot basically looks for product information from a variety of vendor sites using general information about the product domain. The following example displays a ShopBot at www.allbookstores.com. http://en.wikipedia.org/wiki/Shop_bot

Slide 23:Information Filtering & Categorization

This makes use of various information retrieval techniques and characteristics of hypertext web documents to interpret and categorize data. Examples: HyPursuit, BO (Bookmark Organizer). Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 24:Bookmark Organizer (BO)

Makes use of hierarchical clustering techniques and involves user interaction to organize a collection of web documents. It operates in two modes: Automatic Manual Frozen Nodes: In a hierarchical structure, if we freeze a node N, then the subtree rooted at N represents a coherent group of documents. Y. S. Mareek and I. Z. B. Shaul. Automatically organizing bookmarks per contents.

Slide 25:Additions & Deletions in BO

For addition, we can use either of the following two mechanisms: Fully Automatic: makes use of an extremely precise search algorithm to find the most relevant frozen cluster to insert into. Semi Automatic: insert the bookmark, climb up the tree to find the closest frozen ancestor, and then re-cluster the subfolder. When we delete, we must re-cluster the containing subfolder. Y. S. Mareek and I. Z. B. Shaul. Automatically organizing bookmarks per contents.

Slide 26:Personalized Web Agents

This category of Web agents learns user preferences and discovers Web information sources based on these preferences and those of other individuals with similar interests. Examples: WebWatcher, PAINT, Syskill & Webert, GroupLens, Firefly. Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 27:Multilevel Databases

Slide 28:Levels of a MLDB

Layer 0 : Unstructured, massive and global information base. Layer 1: Derived from lower layers. Relatively structured. Obtained by data analysis, transformation & Generalization. Higher Layers (Layer n): Further generalization to form smaller, better structured databases for more efficient retrieval. Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 29:Web Query System

These systems attempt to make use of: Standard database query language – SQL Structural information about web documents Natural language processing for queries made in www searches. Examples: WebLog: Restructuring extracted information from Web sources. W3QL: Combines structure query (organization of hypertext) and content query (information retrieval techniques). Cooley, R., B. Mobasher, et al. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web

Slide 30:Architecture of a Global MLDB

Slide 31:Web Mining Classification cntd...

By Dhiraj Chawla

Slide 32:References

mandolin.cais.ntu.edu.sg/wise2002/web-mining-WISE-30
David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In Conference on Hypertext and Hypermedia. ACM, 1998.
www.iprcom.com/papers/pagerank/
http://maya.cs.depaul.edu/~mobasher/webminer/survey/node23.html

Slide 33:Mining the World Wide Web

Web Structure Mining: using links - use interconnections between web pages to give weight to pages. Algorithms: PageRank (Brin et al., 1998), CLEVER (Chakrabarti et al., 1998). [Taxonomy diagram: Web Mining → Web Content Mining (Agent Based Approach, Database Approach), Web Structure Mining, Web Usage Mining]

Slide 34:Mining the World Wide Web

[Taxonomy diagram: Web Mining → Web Content Mining (Agent Based Approach, Database Approach), Web Structure Mining, Web Usage Mining → Customized Usage Tracking]

Slide 35:Mining the World Wide Web

[Taxonomy diagram: Web Mining → Web Content Mining (Agent Based Approach, Database Approach), Web Structure Mining, Web Usage Mining → General Access Pattern Tracking]

Slide 36:Web Structure Mining

Different Algorithms for Web Structures: Page-Rank Method Sergey Brin and Lawrence Page: The anatomy of a large-scale hypertextual web search engine. In Proc. Of WWW, pages 107–117, Brisbane, Australia, 1998. CLEVER Method http://www.almaden.ibm.com/projects/clever.shtml

Slide 37:Page-Rank Method

Introduced by Brin and Page (1998); used in the Google search engine. Mines the hyperlink structure of the web to produce a 'global' importance ranking of every web page; web search results are returned in rank order. Treats links like academic citations. Assumption: highly linked pages are more 'important' than pages with few links. A page has a high rank if the sum of the ranks of its back-links is high. http://infolab.stanford.edu/pub/papers/google.pdf

[Figure: backlink link structure of the Web] http://infolab.stanford.edu/pub/papers/google.pdf

Slide 39:Page Rank: Computation

Assume: PR(Tn): PageRank of web page Tn. C(Tn): number of outgoing links on page Tn. PR(Tn)/C(Tn): share of the vote that page A will get. d: damping factor (between 0 and 1). Page Rank computation, where T1 … Tn are the pages linking to A: PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) http://infolab.stanford.edu/pub/papers/google.pdf
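The formula above can be sketched as a small Python fixed-point iteration. The three-page link graph is hypothetical, this is not Google's implementation, and dangling pages (no outgoing links) are deliberately not handled.

```python
# Illustrative sketch of the slide's PageRank formula over a tiny made-up graph.
# links maps each page to the pages it links to; every page here has out-links.

def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the back-links T of A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # start every page with rank 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # back-links of `page`: all pages t that link to it
            share = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * share
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

With this non-normalized variant the ranks converge so that their sum equals the number of pages, and C, which is linked by both A and B, ends up with the highest rank.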

Slide 40:Page Rank: Results

Google utilizes a number of factors to rank search results: proximity, anchor text, PageRank. The benefits of PageRank are greatest for underspecified queries; for example, a 'Stanford University' query using PageRank lists the university home page first. http://infolab.stanford.edu/pub/papers/google.pdf

Slide 41:Page Rank: Advantages

Global ranking of all web pages - regardless of their content, based solely on their location in the web's link structure. Higher quality search results - central, important, and authoritative web pages are given preference. Helps find representative pages to display for a cluster center. Other applications: traffic estimation, back-link prediction, user navigation, personalized page rank. http://infolab.stanford.edu/pub/papers/google.pdf

Slide 42:CLEVER Method

CLient-side EigenVector-Enhanced Retrieval. Developed by a team of IBM researchers at the IBM Almaden Research Centre. Ranks pages primarily by measuring links between them. A continued refinement of HITS (Hypertext Induced Topic Selection). Basic principles - authorities and hubs: good hubs point to good authorities; good authorities are referenced by good hubs. http://www.almaden.ibm.com/projects/clever.shtml
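The hub/authority principle can be sketched as the basic HITS-style iteration below. The link graph is hypothetical, and real CLEVER layers anchor-text weighting and pagelets on top of this core loop.

```python
# Minimal sketch of the hub/authority iteration underlying HITS (which CLEVER
# refines). links maps each page to the pages it links to; graph is made up.
import math

def hits(links, iterations=20):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # good authorities are referenced by good hubs
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # good hubs point to good authorities
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalize so the scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

links = {"hub1": ["site1", "site2"], "hub2": ["site1"], "site1": [], "site2": []}
hub, auth = hits(links)
```

Here site1, referenced by both hubs, gets the highest authority score, and hub1, which points to both authorities, gets the highest hub score.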

Slide 43:Problems Prior to CLEVER

Because textual content is ignored, some features of the web lead to problems: HITS returns good resources for a more general topic when query topics are narrowly focused. HITS occasionally drifts when hubs discuss multiple topics. Pages from a single web site often take over a topic, and often use the same HTML template, therefore pointing to a single popular site irrelevant to the query topic. http://www.almaden.ibm.com/projects/clever.shtml

Slide 44:CLEVER: Solution

Extension 1: Anchor Text - using the text that surrounds hyperlink definitions (href's) in web pages, often referred to as 'anchor text'; boost the weights of links that occur near instances of the query terms. Extension 2: Mini Hubs/Pagelets - breaking a large hub into smaller units; treat contiguous subsets of links as mini-hubs or 'pagelets'; contiguous sets of links on a hub page are more focused on a single topic than the entire page. http://www.almaden.ibm.com/projects/clever.shtml

Slide 45:CLEVER: The Process

Starts by collecting a set of pages. Gathers all pages linked from the initial set, plus any pages linking to them. Ranks the result by counting links. Because links are noisy and it is not clear which pages are best, the scores are recalculated: pages with the most links are established as most important, and their links transmit more weight. The calculation is repeated a number of times until the scores are refined. http://www.almaden.ibm.com/projects/clever.shtml

Slide 46:CLEVER: Advantages

Used to populate categories of different subjects with minimal human assistance Able to leverage links to fill category with best pages on web Can be used to compile large taxonomies of topics automatically Emerging new directions: Hypertext classification, focused crawling, mining communities http://www.almaden.ibm.com/projects/clever.shtml

Slide 47:Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

J. Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan Computer Science Department University of Minnesota Proceedings from SIGKDD Exploration, vol. 2, issue 1, 1999

Slide 48:Web Usage Mining

Web usage mining, also known as Web log mining, applies mining techniques to discover interesting usage patterns in the data derived from users' interactions while surfing the web - mining Web log records to discover user access patterns of Web pages. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 49:Web Data

Content, Structure, Usage, User Profile. Web logs provide rich information about Web dynamics. A typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf
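Extracting those three fields can be sketched as parsing a Common Log Format line; the sample entry below is invented, and real log formats vary by server configuration.

```python
# Hedged sketch: pull the URL, IP address, and timestamp the slide mentions
# out of a Common Log Format line. The sample log line is made up.
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+'
)

def parse_entry(line):
    m = LOG_PATTERN.match(line)
    if not m:
        return None           # line does not follow Common Log Format
    return {
        "ip": m.group("ip"),
        "url": m.group("url"),
        "timestamp": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "status": int(m.group("status")),
    }

entry = parse_entry(
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
)
```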

Slide 51:Web Usage Mining – Three Phases

http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Content Preprocessing - converting text, images, scripts and other files into forms that can be used by usage mining. Structure Preprocessing - the structure of a website is formed by the hyperlinks between page views; structure preprocessing can be done by parsing and reformatting this information. Usage Preprocessing - the most difficult task in the usage mining process; data cleaning techniques eliminate the impact of irrelevant items on the analysis result.

Slide 52:Preprocessing

http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 53:Pattern Discovery

Statistical Analysis - the most common method to extract knowledge about visitors to a Web site. Association Rules - refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Clustering - a technique to group together users or data items (pages) with similar characteristics. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 54:Classification - helps to establish a profile of users belonging to a particular class or category. Sequential Patterns - attempt to find inter-session patterns. Dependency Modeling - the goal is to develop a model capable of representing dependencies.

Pattern Discovery http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 55:Pattern Analysis

Pattern Analysis - the final stage of Web usage mining. It eliminates irrelevant rules or patterns and extracts the interesting rules or patterns from the output of the pattern discovery process. Mechanisms used: SQL, OLAP, visualization, etc. http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf

Slide 56:Techniques for Web usage mining

Construct a multidimensional view on the Weblog database. Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. Perform data mining on Weblog records: find association patterns, sequential patterns, and trends of Web accessing. May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer. Conduct studies to analyze system performance and improve system design via Web caching, Web page prefetching, and Web page swapping.
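The simplest of the summaries above (top-N pages, busiest time periods) can be sketched with in-memory counters over hypothetical parsed log records; a real system would roll these dimensions up in an OLAP data cube instead.

```python
# Toy sketch of top-N summaries over parsed web log records. The records are
# hypothetical stand-ins for the output of a log-parsing step.
from collections import Counter

records = [
    {"user": "u1", "url": "/", "hour": 13},
    {"user": "u2", "url": "/", "hour": 13},
    {"user": "u1", "url": "/faq.html", "hour": 14},
]

# top N accessed pages, and most frequently accessed hour of day
top_pages = Counter(r["url"] for r in records).most_common(2)
top_hours = Counter(r["hour"] for r in records).most_common(1)
```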

Slide 57:WEBMINER: introduces a general architecture for Web usage mining, automatically discovering association rules and sequential patterns from server access logs; proposes an SQL-like query mechanism for querying the discovered knowledge in the form of association rules and sequential patterns. WebLogMiner: the Web log is filtered to generate a relational database; data mining is performed on a web log data cube and the web log database.

Software for Web Usage Mining http://maya.cs.depaul.edu/~mobasher/papers/webminer-kais.pdf http://www-sop.inria.fr/axis/Publications/uploads/pdf/arnoux_synasc03.pdf

Slide 58:WEBMINER

SQL-like query. A framework for Web mining: the application of data mining and knowledge discovery techniques (association rules and sequential patterns) to Web data. Association rules, using the Apriori algorithm: 40% of clients who accessed the Web page with URL /company/products/product1.html also accessed /company/products/product2.html. Sequential patterns: 60% of clients who placed an online order in /company/products/product1.html also placed an online order in /company/products/product4.html within 15 days. http://maya.cs.depaul.edu/~mobasher/papers/webminer-kais.pdf
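The frequent-itemset counting behind such association rules can be sketched over hypothetical page-view sessions. Unlike real Apriori, this toy version enumerates candidate itemsets naively instead of pruning via the downward-closure property, and the sessions and support threshold are made up.

```python
# Toy sketch of frequent-itemset discovery over user sessions (sets of pages
# visited), in the spirit of WEBMINER's use of Apriori. Sessions are invented.
from itertools import combinations

def frequent_itemsets(sessions, min_support=0.4, max_size=2):
    """Return {itemset tuple: support} for itemsets above min_support."""
    n = len(sessions)
    items = sorted({page for s in sessions for page in s})
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # support = fraction of sessions containing every page in candidate
            support = sum(1 for s in sessions if set(candidate) <= s) / n
            if support >= min_support:
                frequent[candidate] = support
    return frequent

sessions = [
    {"/product1.html", "/product2.html"},
    {"/product1.html", "/product2.html", "/faq.html"},
    {"/product1.html"},
    {"/faq.html"},
]
freq = frequent_itemsets(sessions)
```

From such itemsets, a rule like "clients who accessed product1 also accessed product2" gets its confidence as support(both) / support(product1).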

Slide 59:WebLogMiner

Database construction from server log file: data cleaning data transformation Multi-dimensional web log data cube construction and manipulation Data mining on web log data cube and web log database http://www-sop.inria.fr/axis/Publications/uploads/pdf/arnoux_synasc03.pdf

Slide 60:Mining the World-Wide Web

Design of a Web Log Miner: the Web log is filtered to generate a relational database; a data cube is generated from the database; OLAP is used to drill-down and roll-up in the cube; OLAM is used for mining interesting knowledge. [Pipeline diagram: Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Mining → Knowledge] http://www-sop.inria.fr/axis/Publications/uploads/pdf/arnoux_synasc03.pdf

Slide 61:Web Usage Mining

Applications Target potential customers for electronic commerce Enhance the quality and delivery of Internet information services to the end user Improve Web server system performance Identify potential prime advertisement locations Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict user’s actions (allows prefetching)

Slide 62:WebSIFT : The web site information filter system

Robert Cooley, Pang-Ning Tan, Jaideep Srivastava Computer Science Department University of Minnesota Proceedings of the Web Usage Analysis and User Profiling Workshop, WEBKDD workshop, August 1999. By Abhijit Aparadh

Slide 63:Introduction

Web usage mining is the application of data mining techniques to large web data repositories in order to extract usage patterns. We need to find interesting patterns rather than simply generating them, and to quantify what is considered uninteresting in order to form a basis for comparison. Web usage mining uses three types of domain information: usage, content, and structure. WebSIFT uses content and structure to identify interesting results in usage mining data.

Slide 64:Web usage mining

Helps in designing complex websites. Discovers frequent itemsets, association rules, clusters of similar web pages, path analysis, etc. Uses support and confidence to restrict discovered rules. A high threshold rarely discovers any new knowledge; a low threshold yields unmanageably many rules.

Slide 65:“Interesting” knowledge

Novelty and unexpectedness of a rule: unexpectedness is deviation from a set of beliefs. A rule that doesn't contradict existing beliefs but points out a relationship that hadn't yet been considered is also interesting.

Slide 66:WebSIFT architecture

Input: three types of server logs (access, referrer, and agent), the HTML files that make up the site, and optional data such as registration data. Preprocessing: this phase uses the input data to create a user session file, which gives an idea of user browsing behavior, based on predefined methods and heuristics. Knowledge Discovery: generation of general usage statistics such as hits per page, most frequent page, common start page, average time spent on a page, etc. This information is fed to the pattern analysis tools.

http://www.cs.umn.edu/research/websift/papers/webkdd99.ps
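The user-session-file step above can be illustrated with the common timeout heuristic for splitting one visitor's requests into sessions. The 30-minute threshold and the request data are assumptions for illustration, not WebSIFT's exact method.

```python
# Hedged sketch of timeout-based sessionization used in usage preprocessing:
# a gap longer than `timeout` between a visitor's requests starts a new session.
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """requests: list of (ip, timestamp, url), sorted by timestamp per ip."""
    sessions = {}           # ip -> list of sessions, each a list of urls
    last_seen = {}          # ip -> timestamp of that visitor's previous request
    for ip, ts, url in requests:
        if ip not in sessions or ts - last_seen[ip] > timeout:
            sessions.setdefault(ip, []).append([])   # start a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = ts
    return sessions

t0 = datetime(2000, 1, 1, 12, 0)
requests = [
    ("10.0.0.1", t0, "/"),
    ("10.0.0.1", t0 + timedelta(minutes=5), "/products.html"),
    ("10.0.0.1", t0 + timedelta(hours=2), "/"),   # long gap -> new session
]
sessions = sessionize(requests)
```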

Slide 68:Information filtering

Content and structure provide evidence about beliefs, and content and structure data can be used as surrogates for domain knowledge. Links between pages show that they are related; the strength of the evidence that a set of pages is related is proportional to the strength of the topological connection between them. The table gives examples of interesting beliefs in the web usage mining domain.

Slide 69:Interesting beliefs in web usage mining domain

http://www.cs.umn.edu/research/websift/papers/webkdd99.ps

Slide 70:Methods to find “interesting” results

BME: Beliefs with Mined Evidence - declare itemsets that contain pages not directly connected to be interesting: a set of pages has no domain or existing evidence of being related, but there is mined evidence. BCE: Beliefs with Contradicting Evidence - the absence of certain frequent itemsets is evidence against a belief that pages are related: if domain evidence suggests the pages are related (linked), then the absence of a frequent itemset can be interesting.
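The two filters can be sketched as a simple boolean comparison of mined evidence (frequent itemsets) against domain evidence (hyperlinks). The page pairs are hypothetical and WebSIFT's actual filter is richer, but the BME/BCE logic is as described above.

```python
# Simplified boolean sketch of WebSIFT-style information filtering over
# hypothetical pairs of pages: compare frequent itemsets with site links.

def interesting(frequent_pairs, linked_pairs):
    bme = []  # Beliefs with Mined Evidence: frequent together, but not linked
    bce = []  # Beliefs with Contradicting Evidence: linked, but never frequent
    for a, b in frequent_pairs:
        if (a, b) not in linked_pairs and (b, a) not in linked_pairs:
            bme.append((a, b))
    for a, b in linked_pairs:
        if (a, b) not in frequent_pairs and (b, a) not in frequent_pairs:
            bce.append((a, b))
    return bme, bce

frequent = [("/a.html", "/c.html")]    # pages users access together
links = [("/a.html", "/b.html")]       # hyperlinks present in the site
bme, bce = interesting(frequent, links)
```

Here the unlinked-but-co-accessed pair lands in BME, and the linked-but-never-co-accessed pair lands in BCE.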

Slide 71:Conclusions and future work

Conclusions: the simplest use of structural information to represent domain knowledge is highly effective in filtering discovered rules. Experimental results: of 700 discovered frequent itemsets, 21 interesting itemsets were identified; of these 21, 2 identified out-of-date information and 1 pointed out poor design. Future work: filtering frequent itemsets, sequential patterns, and clusters discovered from usage data using both structure and content data; to extend the simple boolean logic in BME and BCE, probabilities and fuzzy logic can be used in the information filter.

Slide 72:Thank you!
