Disambiguation Web Appearance of People by Exploiting Link-based and Content-based Information of Social Network Relati

Disambiguation Web Appearance of People by Exploiting Link-based and Content-based Information of Social Network Relation Reporter : Kun-Yan Chiou Advisor : Prof. Hahn-Ming Lee, Prof. Jan-Ming Ho Date: 2007/07/18

Outline • Introduction • Motivation • Goal • Contribution • Related Work • DWAP System • Experiments • Discussion • Conclusion • Further Work

Introduction • With the exponential growth of the World Wide Web (WWW), an increasing number of useful information resources can be retrieved from the limitless database. • Searching for personal data about people of interest is one of the most popular types of search activity. • This is because the WWW allows us to expand our network of social relations, we have many unlimited opportunities to meet and communicate with other people via this virtual environment. • When a person’s name is input to a search engine as query keyword, the database usually returns several URLs for that name, but the user does not know with certainty which pages refer to the person of interest.

Introduction (cont.) • We often use some background knowledge like the person’s occupation or title to filter out unmatched data, but it causes some side effects as following descriptions. • Omitted data: • Some raw query result pages do not contain any related background knowledge, so they are omitted. • Ambiguous properties: • Two individuals may have a common characteristic, so background knowledge cannot be used to disambiguate the web pages.

Motivation • Most well-known search engines have better solution on the page ranking problem, and web page classification problem. However, they still do not provide a better function for the namesake on the webpage. • If the number of retrieved URLs is too much, users must spend a great deal of time and effort disambiguating a large number of web pages one by one. • Retrieving and choosing suitable background knowledge is a difficult task. • We attempt to provide an effective grouping of the search results with no background knowledge, the user could easily choose the web pages that refer to the person of interest.

Problem Statement • One personal name as a query term q is sent to search engine E and no other additional keywords as background knowledge, then through a system S to automatically group web pages from search result, where the pages of each group only referred to single real individual who have personal name q. • We assume each searched web page only referred to one individual that contain personal name in searching keyword. • A personal name N maybe consist of one word or multi-words, in other word, the query term can be a short name or a full name, but not multi different terms both referred to single individual .

Goal Append Keywords or Other Information Query Personal Name Related Pages Search Engine Browse Web Pages One by One Meaningful Keywords User Selected Data Disambiguation Web Appearance of People System Figure 1-1 The procedure of people information search in the WWW

Contribution • We propose a system to disambiguate web appearance of people with no background knowledge. • The proposed system need no additional learning process. • Applying Link-based and Content-based information of social network relation to retrieve pairwise relations of web pages. • User can more effectively to search public information of people from the WWW. • Kai-Hsiang Yang, Kun-Yan Chiou, Hahn-Ming Lee, and Jan-Ming Ho, "Web Appearance Disambiguation of Personal Names Based on Network Motif", the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI-06), Hong Kong, December, 2006.

Related Work

Related Work (cont.)

Ideas • Link-based • Users often click on the hyperlinks of pages and explore the information that the authors recommend for readers. • Hyperlinks on the web usually point to other pages which contain related information and topics, this is termed as topic locality and anchor description. • Content-based • Most people try to mark some meaningful keywords, and then match them with other pages that include the target keywords. • Bekkerman and McCalum proposed using the hyperlinks of web pages as disambiguation information, but their approach needs to collect a name list of strong social relations as background knowledge.

The DWAP System • We assume that every person has different characteristics, which causes almost none possibility of duplicates. This condition becomes rarer for people who have same names. • Based on this assumption, we consider that multi-referenced web pages often appear on the websites of different domains. • The system architecture of DWAP (Disambiguation Web Appearanceof People) is illustrated in Figure 1-2.

System Architecture R-pages Expanded Pages Link Structure Collection and Pairwise Relation Extraction L S R E Web Pages Collection Personal Name In-Out links Raw Query Result Pages (R-pages) Anchor Text Retrieve and Pairwise Relation Extraction A P R E Relational Weight matrix of R-pages Grouping R-Pages G R Grouped R-Pages Relational Weight Matrix of R-pages Figure 1-2 System Architecture of DWAP

Modules of DWAP • Web Pages Collectionmodule • This module gathers raw data (R-Pages), which it assigns to two other modules for the extraction of extension hyperlinks and anchor texts. • Link Structure Collection and Pairwise Relation Extraction module (LSRE) • The kernel module of the system. • Its function is to retrieve relational hyperlink patterns from the link structure of R-Pages and construct pairwise relations for the R-Pages. • Anchor Text Retrieval and Pairwise Relation Extractionmodule (APRE) • This module only uses part of the content of web pages. • In our design, there is no direct interaction between the APRE and LSRE because hyperlinks and anchor texts can be processed simultaneously. • Grouping R-Pages module (GR) • The purpose of this module is to combine the benefits of hyperlink and anchor text.

The LSRE Module I-link and O-link URLs of R-pages Link Data DB Link Structure Construction Full Content data Web Page Filtering Noise Hyperlinks Hyperlinks in LS and I-link degree of URLs Figure 1-3 The framework of Link Structure Collection and Pairwise Relation Extraction Retrieve Network Motif Detected Pairwise Relation Network Motif Types Relational Weight Matrix of R-pages

Link Structure Construction and Link Data Database Usage • The query results returned by most search engines only contain URL links, so there is not enough relational information about the data. • Web PagesCollection component collects the full content of web pages, but the content of R-Pages is not enough if we want to utilize hyperlink information as a relational description of the pages. • Therefore, we try to construct the link structure of R-Pages, where the link structure is part of the web structure in the WWW. • we adopt two kinds of extension process to gather more hyperlinks for link structure construction: • (a) An L-level (Page-based) extension process • (b) A Host-based extension process.

L-level (Page-based) Extension Process • The L-level extension process is based on two types of hyperlink information: • Outgoing link (O-link) • An outgoing link means a hyperlink of one page points to another web page • O-links are retrieved by parsing the HTML tags of web pages. • The form of the extraction rule is <a>href="http://…"</a> • Incoming link (I-link) • An incoming link (or inbound link) means a hyperlink from another web page to the current page. • I-links are collected by using the link:URL query function provided by Google, which displays all pages that point to an object page and counts the number of pages.

L-level Extension Process (cont.) Figure 1-4 Two-level dataset extension

The Link Structure Construction Algorithm

The Link Structure Construction Algorithm (cont.) • RP is the raw dataset of previously described R-Pages, which will be extended to a larger dataset, EP. • The l value controls the depth of the link relation, where the I-link and O-link can be retrieved simultaneously. • In addition, this extension process needs to collect many web pages; thus, the processing time increases with number of pages in RP. This proposed approach requires extra time cost. • However, if the system is embedded in a search engine, above disadvantage will become minor impact, because most pages linked to R-Pages have been collected into the search engine’s database in advance.

Host-based Extension Process Figure 1-5 The host-based link relation extension in Pij, where i denotes different hosts and j denotes different pages belonging to the same host. The virtual link displays a hidden link relation between Pa2 and Pb1.

Host-based Extension Process (cont.) • During our initial testing, we found that the number of I-links is substantially less than the number of O-links. • This may be because users normally use hyperlink settings that point to host pages rather than to directory pages of a host. • Ex: A web page related to a researcher usually contains a hyperlink to the main page of his college or institution, instead of to a specific web page in the host. • We define the HI-link that represents this type of hyperlink as follows. Given a web page p, the host-based hyperlink information (HI-link) of page p is defined as all the I-links of the main page in the host of page p. • Ex: www.csie.ntust.edu.tw\teacher\~xxx only contains the www.csie.ntust.edu.tw host address. The URL is then sent to Google’s query function “link: www.csie.ntust.edu.tw”, and the query result will contain all URLs that point to the host address.

Filtering Noisy Hyperlinks • Some of the hyperlinks are not relevant to main topic of web pages. • Kleinberg observed that the In-Out links extension approach retrieves a large number of pathological links that contain endorsements, advertisements, or some other information of “collusion” among the referred pages. • Ex: • The URLs of popular web sites like Amazon, Yahoo, and CNN • Popular software links like Adobe Reader, Flash player • Web page validation marks like W3C, and ICRA • Note that most of these popular pages are linked to many other pages, and their hosts’ main pages have the same property. Therefore, we use the number of incoming links as the parameter to decide which hyperlinks to filter out.

Filtering Noisy Hyperlinks (cont.) • For a page p, let Si be the number of I-links and Shi be the number of HI-links. If both Si and Shi are larger than the pre-defined thresholds t0 and t1 respectively, the hyperlink will be removed. • If only Si or Shi is larger than the threshold and other smaller, then the square root of the product of Si and Shi will be compared with the t0. Because some web sites provide a web space service, the HI-links of the main pages may have fewer I-links than the directory pages.

Network Motif Types • The Network Motif Type component predefines some meaningful directed graph patterns in a link structure graph. • In essence, network motifs are the basic units of structural elements (sub-graphs) that build the complex networks. Each edge in a network motif has a direction and connects two nodes. • We define the network motifs that connect two pages in the R-Pages. In our work, we consider 2-node (1 type), 3-node (3types), 4-node (4 types), and 5-node (10 types) network motifs .

Network Motif Types Figure 1-6 The 2, 3, 4-node network motifs Figure 1-7 The 5-node network motifs

Social Network Relations in Network Motifs

Retrieve Network Motif Detected Pairwise Relation • Each type of Network Motif is matching with constructed link structure. • If one type of Network Motif is found in link structure, then the relational weight of pairwise data add one. • Page detected method: • A link path of Network Motif is linked by full URL address. • Host detected method: • A link path of Network Motif is linked by host address of URL. • Ex: P1 and P2 are searched result pages. If there are 2-F(4), 3-FB(1), 4-BBF(2) motif types in the link structure, then the pairwise weight of P1 and P2 is 4+1+2 = 7.

The APRE Module • Anchor Text Parsing and Pairwise Relation Extraction • Parts of web pages that refer to the same person only contain a few hyperlinks; hence, these pages are ignored by link-based approaches. • In order to collect these isolated pages, the content of web pages must be utilized. • Anchor text keywords can provide more correlative descriptions of web pages, it is easy to perceive the difference between anchor text and full document text. Brin and Page (the founders of Google) consider that anchor text contains more accurate descriptions of web pages than the pages themselves. • Anchor text can be considered as a succinct and exact summary of web pages.

Term Definitions and Description of Anchor Text • Anchor text refers to “underlined clickable text” that shows a hyperlink to web pages when a web browser is visited. • Ex: <a href=“http://www.smith.com/main.htm” >Smith Homepage<a> • Each piece of anchor text is denoted as ac of a web page p∈RP, where c is the list sequence number of the anchor text on page p. • If ac is a phrase, the probability of matching it with the content of other pages will be very low. Consequently, each piece of anchor text ac of a web page p is split into single keyword terms ts,i by our defined rules. • AnchorTextSplit(ac ) = { ts,i | s∈R, each ts1,i ≠ts2,i where s1≠s2 } • Ex: “ACM, ACM Digital Library Search” => “ACM”, “Digital”, “Library”, and “Search”.

Detection of the Pairwise Relations of R-Pages • Search full text of web pages, if some word belong Pj match with single anchor keywords Ws,i , then make a relation for Pi and Pj. The relation weight of two pages count by number of different Ws,i A B C D E F G H I A P P S E O K K F L O L S S O Weight: 3 W s,i P1 P2 Figure 3-7. The pairwise weight relations of web pages detected by using the keyword terms of anchor text. The relational weight of P1 and P2 is 3.

Solving Problem Caused by Using Anchor Text • The problem is that anchor texts contain noisy keyword terms. • Eliminating stop words is the most common preprocessing step of document clustering. • Traditional document retrieval techniques apply the Term Frequency/Inverted Document Frequency (TF-IDF) or other modifications to filter out irrelevant keyword terms. • We propose a measure called the Inverted Term Frequency-Inverted Document Frequency (ITF-IDF) for filtering noisy data, which enables us to hold special keywords in web pages.

ITFIDF • where i is a keyword term, j is a document in a document set D, n is the number of keyword terms, and k is the number of: keyword terms that occur in document j. • The relational weight values of pairwise pages are reformulated as follows: • W(pu , pv) = # ts,i∈pu or pv, • where the anchor terms ts,i of pu appear in the text of pv ; ITFIDF(ts,i) > θ is a constant threshold

The GR Module • To group R-pages, we use Warshall’s algorithm to compute the transitive closure of the relation matrix RM. • The most important procedure in this module is to combine pairwise relations of hyperlink and anchor Text. We propose a hybrid method to combine the two kinds of relation. • Link-based (hyperlink) method has higher precision rate • Content-based (anchor text) method has higher recall rate • Thus, we attempt to use link relations constructed groups as seed groups and then use anchor text constructed relations to combine more related web pages.

Combining the Retrieved Pairwise Relations of Hyperlink and Anchor Text

Experiments • We describe three main experiments in following. • The first experiment evaluates the effect on different step distances of hyperlink of network motif and the performance of page-based and host-based methods. • The second is a measurement for anchor text noise filter approach by applying ITFIDF and TFIDF approach to hold useful anchor text keywords. • The last experiment evaluates the approach that combines anchor texts with hyperlinks. • The evaluation metrics used in the experiments are the precision rate, the recall rate, the F-measure, and the number of groups.

Dataset

Evaluation Approach • The accuracy and error rates of pairwise R-pages are formulated as follows: • Accuracy (typex) = • Error (typex) = • where Pairs (pi, pj) denotes the number of pairs belonging to the same class, Paird (pi, pj) denotes the number of pairs belonging to different classes, Pairo (pi, pj) denotes the number of pairs where one page belongs to the target class and the other belongs to the “Other” class, and S is the number of all possible pair-wise relations.

The Accuracy and Error Rate of Different Network Motifs Figure 2-1 The X-axis shows different types of Network Motif and the Y-axis shows the percentage of relational pairwise detected network motif.

Evaluation Approach • Precision (Pi, Gj) = • Recall (Pi, Gj) = • The local F-measure of class Pi and group Gj is denoted as below: • F (Pi , Gj) = • The local F-measure only compares the performance of a single class and a single group, so the F-measure of any class in P is chosen as the maximum value for the corresponding group. The global F-measure for P and G is: • Fglobal =

The Performance of Network Motif Detection Methods Figure 2-2 The performance of different network motifs.

The Performance of Different Network Motif Detection Methods Figure 2-3 The performance of different network motif detection methods.

Performance for Combinations of Motifs AP: append page HAP: host append pages PD: page-based detected HD: host-based detected Figure 2-4 The performance of combinations of motifs.

The Comparison of TFIDF and ITFIDF

Experiment Comparison

Discussion • Meaningful hyperlink relation of Network Motif • Most meaningful relations of hyperlinks are edited by web page editor, and we believe that these editors have known a personal name appearance in different pages which referred to the same individual. • Hyperlink and Anchor text are well applied with time change • Web pages in WWW are highly dynamic as time changes, so the changeover of content data of web pages is always on the go. • The robust and replacement property of the DWAP system • Parts of web pages in WWW only contain few texts in content, so hyperlink is more robust for these kinds of pages. • In our observation, personal name and trade name have many similar properties, so if we make some modification on the method then we can replace the system to solve the trade name uncertainty problem.

Conclusion • The experiment results showed that our proposed approach can get 70% F-measure on average, and the number of groups in experiment result is close to the actual number of individuals with same person name. • We applied both the Link-based method and the Content-based approach to grouping web pages which refers to single individuals. Users can effectively browse web pages when they attempt to search some opened personal information from WWW. • Our proposed system is more fit actual operation situation of user , they need no additional effort to finding a suitable background knowledge when searching people information.

Further Work • A hyperlink point to other web page is created by many different reasons, so how to distinguish the created relations of hyperlink is an interesting issue in our further work. • In addition, how famous the personal names are will affect the hyperlink distribution of other individuals with the same personal names, thus content analysis need to be done in following. • Finally, we think the hyperlink and anchor text information are worth to adopt to disambiguate web appearance of personal name; it can be discovered by other researchers who are interested to study in depth.

Thanks for your attention!

Table 1-1 Splitting Anchor Text Rules

Disambiguation Web Appearance of People by Exploiting Link-based and Content-based Information of Social Network Relati

Disambiguation Web Appearance of People by Exploiting Link-based and Content-based Information of Social Network Relati

Presentation Transcript

Combining Link and Content Information in Web Search

Web-Based Network Management

Network Payload-based Anomaly Detection and Content-based Alert Correlation

Content Based Learning Social Learning By Hilda Hernandez

SpyProxy : Execution-based Detection of Malicious Web Content

“Analysis of Social Network based Sybil Defenses”

EVALUATING WEB BASED INFORMATION

PageSim: A Link-based Measure of Web Page Similarity

Multimedia- and Web-based Information Systems

Web-Based Information Systems

Web-Based Information Systems

Web-based Information Architectures

Web-Based Information Systems

Managing Information – Web Based Information

Unleashing the Power of Web Based Content

Multimedia- and Web-based Information Systems

Multimedia- and Web-based Information Systems

Web-Based Information Systems

Web Based Information texts

Study of Road Maintenance Fund Needs Approach with Link Based and Network Based

Web-Based Information Systems

The Design of Web-based Management Interface for Network Processor based Content Switch