Building a Web Thesaurus from Web link Structure

Zheng Chen1, Shengping Liu2, Liu Wenyin3, Geguang Pu2, Wei-Ying Ma1 • SIGIR’03 • 1Microsoft Research Asia {zhengc, wyma}@microsoft.com • 2Dept. of Information Science, Peking University {lsp, pgg}@is.pku.edu.tw • 3Dept. of Computer Science, City Univ. of Hong Kong csliuwy@cityu.edu.hk

Introduction • Although existing search engines work well to a certain extent, they still face many problems. • Word mismatch: web editors and users often do not use the same vocabulary. • Short queries: these are often ambiguous in expressing the user’s intention. • Query expansion (QE) has long been suggested to address these problems. • Global analysis constructs a thesaurus from corpus-wide statistics of term co-occurrence and selects the most similar terms. • Local analysis uses only some initially retrieved documents. • QE performance drops significantly when applied to the web. • A web page contains too much irrelevant information, e.g. banners, navigation bars, and hyperlinks. • We need to build a thesaurus from the web itself when dealing with web pages.

Introduction • The characteristic that distinguishes a web page from pure text is its hyperlinks. • Topic locality: web pages connected by links are more likely to be on the same topic. • The web’s link structure is a semantic network, in which words or phrases appearing in anchor text are nodes and semantic relations are edges. • Since HTML is a visual representation language, much information about content organization is lost, and we want to extract the latent content structure from the web link structure.

Introduction • The domain-specific thesaurus is constructed in three steps. • First, a set of high-quality websites from a given domain is selected. • Second, link analysis is used to remove noisy links and convert the web into a content structure. • Third, the mutual information of words or phrases within the structure is calculated to form the domain-specific thesaurus. • Experiments show a great improvement in search precision compared to the traditional association thesaurus built from full text.

Related Works • Although manually built thesauri are quite precise, creating them and keeping them up to date with changes in the domain is time-consuming. • Many automatic thesaurus construction methods have been proposed. • Our thesaurus is different in that it is built from web link structure information. • Early works focus on aiding the user’s web navigation by dynamically generating site maps. • Recent hot spots are finding authorities and hubs from web link structure and applying them to search and community detection. • Web link structure can also be used in page ranking and web page classification.

Related Works • These works stress the navigational relationships among web pages, but semantic relationships also exist between them. • As far as we know, no work yet gives a formal definition of the semantic relationship between web pages or provides an efficient method to extract the content information.

Building the Web Thesaurus • We first need some high-quality and representative websites for the domain. • We use the domain name as a query to Google Directory search (http://directory.google.com) to obtain a list of authority websites. • These websites are believed to be popular in the domain based on Google’s ranking mechanism. • A content structure is built for every selected website, then all the obtained content structures are merged to construct the thesaurus for this particular domain. • Figure 1 shows the entire process.

Building the Web Thesaurus

Web Site Construction • Website content structure can be represented as a directed graph. • Each node is a web page, which is assumed to represent a concept. • The concept is the generic meaning of the web page. • Thus the semantic relationships among web pages can be seen as the semantic relationships among the concepts of those pages. • There are two general semantic relationships for concepts: aggregation and association. • Aggregation is a hierarchical relationship, in which the concept of the parent node is broader than that of a child node. • Association is a horizontal relationship, in which concepts are semantically related to each other. • In addition, two child nodes have an association relationship if they share the same parent node.

Web Site Construction • Generally speaking, hyperlinks have two functions: navigation convenience and connecting semantically related web pages. • For the latter, we further distinguish explicit and implicit semantic relationships. • An explicit semantic relationship must be represented by a hyperlink, while an implicit one can be inferred from explicit semantic relationships and thus does not necessarily correspond to a hyperlink. • Accordingly, a link is a semantic link if the connected web pages have an explicit semantic relationship; otherwise it is a navigational link. • Figure 2 shows an example, in which arrows with solid lines are semantic links and arrows with dashed lines are navigational links.

Constructing Website Content Structure • Given a website navigation structure, the construction of the website content structure includes three tasks: • 1. Distinguishing semantic links from navigational links. • 2. Discovering the semantic relationship between web pages. • 3. Summarizing a web page to a concept category.

Distinguishing Semantic Links from Navigational Links • To understand the designer’s intention, the structural information encoded in the URL can be used. Links can be categorized into five types: • Upward link: target page in a parent directory • Downward link: target page in a subdirectory • Forward link: target page in a sub-subdirectory • Sibling link: target page in the same directory • Crosswise link: target page in a directory not covered by the above cases • A link is classified as a navigational link if it is one of the following: • Upward link: a website is generally hierarchical, and upward links always function as a return to the previous page. • Link within a high-level navigation bar: high-level means that the link is not a downward link. • Link within a navigation list that exists in many web pages: the link is not specific to a page and thus is not semantically related to the linked pages.
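The five URL-based link types above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, any ancestor directory is treated as "upward", exactly one level down is treated as "downward", and anything deeper is treated as "forward". Navigation bars and navigation lists are not handled here.

```python
from posixpath import dirname
from urllib.parse import urlparse


def directory_parts(url):
    """Return the directory components of a URL path, e.g. /a/b/c.html -> ['a', 'b']."""
    path = urlparse(url).path
    return [part for part in dirname(path).split("/") if part]


def classify_link(source_url, target_url):
    """Classify a link by comparing the directories of its source and target pages."""
    src = directory_parts(source_url)
    dst = directory_parts(target_url)
    if dst == src:
        return "sibling"
    # Target directory is a proper prefix of the source directory (assumed: any ancestor).
    if len(dst) < len(src) and src[:len(dst)] == dst:
        return "upward"
    # Source directory is a proper prefix of the target directory.
    if len(dst) > len(src) and dst[:len(src)] == src:
        # Assumption: one level down is "downward", deeper is "forward".
        return "downward" if len(dst) == len(src) + 1 else "forward"
    return "crosswise"
```

Under these assumptions, a link from `/a/p.html` to `/a/b/q.html` is downward, while a link from `/a/b/p.html` back to `/a/index.html` is upward and would be treated as navigational.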

Discovering the Semantic Relationship between Web Pages • The remaining links are considered semantic links. We then analyze the semantic relationship between web pages according to the following heuristics: • A link in a content page conveys an association relationship. • A content page always represents a concrete concept and is assumed to be the minimal information unit that has no aggregation relationship with other concepts. • A link in an index page usually conveys an aggregation relationship. • An index page always functions as a hub for accessing the content pages. • If two web pages have an aggregation relationship in both directions, the relationship is changed to association.
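A minimal sketch of these heuristics, assuming each page has already been labeled as an index page or a content page (how that labeling is done is not covered on the slide). The function name and data layout are hypothetical.

```python
def label_relationships(semantic_links, page_type):
    """Label each semantic link as 'aggregation' or 'association'.

    semantic_links: iterable of (source, target) page pairs.
    page_type: dict mapping a page to 'index' or 'content' (assumed to be given).
    """
    labels = {}
    for source, target in semantic_links:
        # A link in an index page conveys aggregation; one in a content page, association.
        if page_type[source] == "index":
            labels[(source, target)] = "aggregation"
        else:
            labels[(source, target)] = "association"
    # Aggregation in both directions is changed to association.
    for (source, target), label in list(labels.items()):
        reverse = (target, source)
        if label == "aggregation" and labels.get(reverse) == "aggregation":
            labels[(source, target)] = "association"
            labels[reverse] = "association"
    return labels
```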

Summarizing a Web Page to a Concept • Since the anchor text of a hyperlink has been shown to be a good description of the target web page, we simply choose anchor text as the semantic summary of the target web page. • There may be multiple hyperlinks pointing to the same page; the best anchor text is selected by evaluating the discriminative power of each anchor text with TFIDF weighting. • Each anchor text is regarded as a term, and all anchor texts appearing in the same web page are regarded as a document. We can estimate the weight of each term, and the one with the highest weight is chosen to represent the concept.
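The selection step can be sketched as follows. The slide does not give the exact TFIDF variant, so this sketch assumes raw term frequency multiplied by the log of the inverse document frequency; the function name and input layout are hypothetical.

```python
import math
from collections import Counter


def best_anchor_text(pages, target):
    """Pick the anchor text with the highest TF-IDF weight among links to `target`.

    pages: dict mapping a source page to a list of (anchor_text, target_page) links.
    Each source page's anchor texts form one "document"; each anchor text is a "term".
    """
    n_docs = len(pages)
    # Document frequency: number of pages whose anchors include the text.
    doc_freq = Counter()
    for links in pages.values():
        doc_freq.update({text for text, _ in links})

    best_text, best_weight = None, float("-inf")
    for links in pages.values():
        term_freq = Counter(text for text, _ in links)
        for text, linked_page in links:
            if linked_page != target:
                continue
            # Assumed weighting: raw term frequency times log inverse document frequency.
            weight = term_freq[text] * math.log(n_docs / doc_freq[text])
            if weight > best_weight:
                best_text, best_weight = text, weight
    return best_text
```

With this weighting, a generic anchor such as "click here" that appears on every page gets weight zero, so a more specific anchor pointing to the same target is preferred.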

Merging Website Content Structure • We propose a statistical approach to extract the common knowledge and eliminate the effect of wrong links. • In the traditional automatic thesaurus method, some relevant documents are selected as a training corpus, and a gliding window is moved over the documents to divide each document into small overlapping pieces. • Then a statistical approach is used to count the terms that co-occur in the gliding window. Term pairs with higher mutual information form relationships in the constructed thesaurus.
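The gliding-window counting in the traditional method can be sketched as below. The window size, step, and function name are illustrative choices rather than values from the talk; a step smaller than the window size gives overlapping pieces.

```python
from collections import Counter
from itertools import combinations


def window_cooccurrence(tokens, window_size, step):
    """Count unordered term pairs that co-occur inside overlapping windows."""
    pair_counts = Counter()
    last_start = max(len(tokens) - window_size, 0)
    # Note: with step > 1 the final tokens may not be covered by any window.
    for start in range(0, last_start + 1, step):
        # Each distinct term counts once per window; pairs are stored in sorted order.
        window = sorted(set(tokens[start:start + window_size]))
        pair_counts.update(combinations(window, 2))
    return pair_counts
```

For example, `window_cooccurrence(["digital", "camera", "lens", "camera", "film"], 3, 1)` counts the pair `("camera", "lens")` three times, once for each window that contains both terms.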

Merging Website Content Structure • We apply a similar algorithm to find the relationships between terms in the content structure of websites. • The sub-tree of a node, with constrained depth, plays the role of the gliding window on the content structure. • The mutual information of the terms within this gliding window can then be computed to construct the relationships between terms. • Anchor text is segmented by NLPWin, a natural language processing system developed by Microsoft Research, and then formalized into a set of terms:

Merging Website Content Structure • The term relationships may be more complex than in traditional documents because we must consider structural information. • For each node ni in the content structure, generate three corresponding sub-trees STi, with a depth restriction, for the three relationships: ancestor, offspring, and sibling, as shown below. • While it is easy to generate the ancestor and offspring sub-trees, generating the sibling sub-tree is difficult because a sibling of a sibling does not necessarily stand in a sibling relationship. Let us first calculate the first two relationships.
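The ancestor and offspring sub-trees with a depth restriction might be generated as in the sketch below. It assumes, for simplicity, that the content structure is a tree in which every node has at most one parent; the slides describe a directed graph, so this is a simplification, and the function names are hypothetical.

```python
def offspring_subtree(children, node, max_depth):
    """Collect `node` and its descendants at most `max_depth` levels below it.

    children: dict mapping a node to the list of its child nodes.
    """
    collected = [node]
    frontier = [node]
    for _ in range(max_depth):
        frontier = [child for parent in frontier for child in children.get(parent, [])]
        collected.extend(frontier)
    return collected


def ancestor_subtree(parent_of, node, max_depth):
    """Collect `node` and its ancestors at most `max_depth` levels above it.

    parent_of: dict mapping a node to its single parent (assumed tree structure).
    """
    collected = [node]
    current = node
    for _ in range(max_depth):
        current = parent_of.get(current)
        if current is None:
            break
        collected.append(current)
    return collected
```

Each returned list plays the role of one gliding window: the terms attached to its nodes are the terms that count as co-occurring.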

Merging Website Content Structure • For each sub-tree, the mutual information (MI) of a term pair is computed as: • MI shows the strength of the relationship between two terms. Another factor is entropy, which reflects how the term pair is distributed across sub-trees. The more sub-trees contain the term pair, the more similar the two terms are.
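The formula itself is not preserved in this transcript, so the sketch below uses standard textbook definitions as a stand-in rather than the authors' exact formulation: pointwise mutual information with probabilities estimated as fractions of sub-trees, and Shannon entropy over the pair's per-sub-tree co-occurrence counts. Taking the co-occurrence count inside one sub-tree as the minimum of the two term frequencies is a further assumption.

```python
import math
from collections import Counter


def pointwise_mi(subtrees, term_a, term_b):
    """PMI of two terms, with probabilities estimated as fractions of sub-trees."""
    n = len(subtrees)
    count_a = sum(1 for terms in subtrees if term_a in terms)
    count_b = sum(1 for terms in subtrees if term_b in terms)
    count_ab = sum(1 for terms in subtrees if term_a in terms and term_b in terms)
    if count_ab == 0:
        return float("-inf")  # the terms never co-occur
    return math.log((count_ab / n) / ((count_a / n) * (count_b / n)))


def pair_entropy(subtrees, term_a, term_b):
    """Entropy of the pair's co-occurrence counts across sub-trees."""
    counts = []
    for terms in subtrees:
        freq = Counter(terms)
        # Assumed per-sub-tree co-occurrence count: the smaller of the two frequencies.
        counts.append(min(freq[term_a], freq[term_b]))
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

Here `subtrees` is a list of term lists, one per sub-tree. The entropy grows as the pair is spread evenly over more sub-trees, which matches the intuition stated on the slide.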

Merging Website Content Structure • The two measures can be combined as follows: • α is the tuning parameter. In our experiment, α=1. • Term pairs whose similarity exceeds a threshold are selected as candidates for constructing the thesaurus for the ancestor and offspring relationships. • For a term w, the set of possible sibling nodes has three components: • Terms that share the same parent node with w. • Terms that share the same child node with w. • Terms that have an association relationship with w.
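Collecting the three components of the sibling candidates for a term w can be sketched as follows, assuming the aggregation relationships are available as (parent, child) pairs and the association relationships as unordered pairs. The function name and data layout are hypothetical.

```python
def sibling_candidates(aggregation, association, term):
    """Collect possible sibling terms of `term`.

    aggregation: set of (parent, child) pairs.
    association: set of (term_a, term_b) pairs, treated as unordered.
    """
    parents = {p for p, c in aggregation if c == term}
    children = {c for p, c in aggregation if p == term}
    candidates = set()
    # 1. Terms that share a parent with `term`.
    candidates |= {c for p, c in aggregation if p in parents}
    # 2. Terms that share a child with `term`.
    candidates |= {p for p, c in aggregation if c in children}
    # 3. Terms that have an association relationship with `term`.
    for a, b in association:
        if a == term:
            candidates.add(b)
        elif b == term:
            candidates.add(a)
    candidates.discard(term)
    return candidates
```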

Merging Website Content Structure • Apply the same algorithm and choose the terms with similarity higher than the threshold to form the final thesaurus.

Experimental Results • A performance evaluation of semantic link selection is conducted because distinguishing semantic links from navigational links is important for ensuring thesaurus quality. • Three domains (“online shopping”, “photography”, and “personal digital assistant”) are selected, and the top 13 websites are chosen. Table 1 shows the detailed information.

Experimental Results • We randomly select 25 web pages from every website and ask 4 users to manually label each link as either a semantic or a navigational link. • However, the anchor text on a semantic link is not necessarily a good concept in the content structure, e.g. anchor texts made of numbers and letters, so high recall is usually accompanied by high noise. Thus only precision is shown in Table 2. • We see that the precision of recognizing navigational links is 92.82%, which shows that the simple method presented here is effective.

Experimental Results

Experimental Results • A thesaurus is generally evaluated by how well it performs when used for query expansion. • We compare the performance with a full-text automatic thesaurus. • Here the full-text thesaurus is constructed by counting the co-occurrence of term pairs in a gliding window. • In this experiment, we choose the Windows 2000 version of the Okapi system as the baseline full-text search engine for evaluating query expansion. • For each domain, 10 queries are used to retrieve documents, and the top 30 ranked documents are evaluated by 4 users. • Then the automatic thesaurus built from pure full text is applied to expand the initial queries. The expansion terms are extracted according to their similarity with the queries; six relevant terms are added, and the weight ratio is 2.0.
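The expansion step can be illustrated with the sketch below. The slide does not say how the weight ratio is applied, so this sketch assumes that original query terms are weighted 2.0 relative to 1.0 for added terms, and that a candidate's similarity to the query is the sum of its similarities to the individual query terms. Both are assumptions, and the function name and thesaurus layout are hypothetical.

```python
def expand_query(query_terms, thesaurus, num_terms=6, weight_ratio=2.0):
    """Return a term -> weight mapping for the expanded query.

    thesaurus: dict mapping a term to a dict of {related_term: similarity}.
    """
    scores = {}
    for term in query_terms:
        for related, similarity in thesaurus.get(term, {}).items():
            if related not in query_terms:
                # Assumed query similarity: sum over the individual query terms.
                scores[related] = scores.get(related, 0.0) + similarity
    # Keep the highest-scoring candidates; ties are broken alphabetically.
    top = sorted(scores, key=lambda t: (-scores[t], t))[:num_terms]
    # Assumed meaning of the weight ratio: original terms count 2.0, added terms 1.0.
    weights = {term: weight_ratio for term in query_terms}
    weights.update({term: 1.0 for term in top})
    return weights
```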

Experimental Results • The queries are the following. • Online shopping domain: women shoes, mother day gift, children’s clothes, antivirus software, listening jazz, wedding dress, palm, movie about love, Canon camera, cartoon products. • Photography domain: Kodak products, digital camera, color film, light control, camera battery, Nikon lenses, accessories of Canon, photo about animal, photo knowledge, adapter. • PDA domain: PDA history, PDA game, price, top sellers, software, PDA OS, size of PDA, Linux, Sony, and java.

Experimental Results • Similarly, our constructed thesaurus is also used for query expansion. Here we use only two of the relationships, i.e. offspring and sibling. • The ancestor relationship would make the query broader and thus decrease precision. • We ask 4 users to evaluate the results. The search precisions are shown below. The result for the PDA domain is similar to that for the photography domain.

Conclusions and Future Works • We find that QE with the offspring relationship improves search precision significantly, while QE with the sibling relationship yields almost no improvement or even makes precision worse. • The contribution of the offspring relationship varies from domain to domain: 40% for online shopping, but a much lower 16% for PDA. • For some websites the content structure is easy to extract, while for others it is difficult. • Figure 3 shows the average precision for all domains. We find that the baseline is quite high (55%), and QE with the full-text and sibling thesauri does not help at all. • The average precision improvement with the offspring relationship is 22.8%, 24.2%, and 9.6% on the top 10, 20, and 30 web pages.

Conclusions and Future Works

Conclusions and Future Works • The precision of baseline retrieval is quite high; the reason is that the corpus is less divergent than the web. • QE with the full-text thesaurus decreases performance. One reason is that we did not try to tune the parameters to optimize the result. It also seems that the naïve automatic text thesaurus is not good for QE. • For example, the six most relevant terms for “children’s clothes” are “book”, “clothing”, “toy”, “accessory”, “fashion”, and “vintage”. • QE with the sibling relationship performs poorly. This relationship tends to yield words that are relevant to some extent but not similar. • For example, the six most relevant terms for “children’s clothes” are “book”, “toy”, “video”, “women”, “accessories”, and “design”.

Conclusions and Future Works • QE with the offspring relationship improves performance because the terms are semantically narrower and can refine the query. • The most relevant terms for “children’s clothes” are “baby”, “boy”, “girl”, “shirt”, and “sweater”. • To keep up with the growth of new terms and concepts on the web, automatic thesaurus construction continues to be an important research area. • The proposed scheme can identify new terms and reflect the latest relationships between terms as the web evolves, and experimental results show that, when applied to QE, it outperforms the traditional thesaurus.

Conclusions and Future Works • The limitation of this work is that the current experiment is small in scale. A larger collection of data is needed for analysis and testing. • We plan to extend this work to construct a personalized thesaurus based on the user’s navigation history, which could find many interesting applications such as making web search more personal.
