
Collective annotation of Wikipedia entities in web text




Presentation Transcript


  1. Collective annotation of Wikipedia entities in web text - Presented by Avinash S Bharadwaj (1000663882)

  2. Abstract • The aims of the paper • Annotation of open-domain unstructured web text with uniquely identified entities from a collaboratively edited knowledge base like Wikipedia • Use of these annotations for search and mining tasks

  3. What is entity disambiguation? • An entity is something real with a distinct existence • Wikipedia articles can be considered entities • Entity disambiguation is the task of resolving the correspondence between mentions of entities in natural language and real-world entities • In this paper, disambiguation is carried out between spots in web pages and Wikipedia articles

  4. Entity Disambiguation Example

  5. Previous work in disambiguation • SemTag: • First web-scale disambiguation system • Annotated about 250 million web pages with IDs from the Stanford TAP knowledge base • SemTag preferred high precision over recall, averaging about two annotations per page • Wikify!: • Wikify! performed both keyword extraction and disambiguation • Wikify! could not achieve collective disambiguation across spots • Milne and Witten (M&W): • A form of collective disambiguation that gives better results than Wikify!; M&W achieves an F1 measure of 0.83, compared to 0.53 for Wikify! • Cucerzan’s algorithm: • Each entity is represented as a high-dimensional feature vector • Cucerzan annotates sparingly: only about 4.5% of all possible tokens are annotated

  6. Terminologies • Spot • An occurrence of text on a page that can possibly be linked to a Wikipedia article • Attachment • A candidate entity in Wikipedia to which a spot can be linked • Annotation • The act of making an attachment to a spot on a page • Gamma list • The list of all possible annotations
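A minimal Python sketch of these terms as data structures; the names (`Spot`, `gamma_list`) are illustrative, not from the paper.

```python
# Illustrative data structures for spots, attachments, and the gamma list.
# All names here are hypothetical, chosen to mirror the slide's terminology.
from dataclasses import dataclass, field

@dataclass
class Spot:
    text: str                 # surface text on the page, e.g. "Jaguar"
    offset: int               # character position of the spot on the page
    attachments: list = field(default_factory=list)  # candidate Wikipedia entities

def gamma_list(spots):
    """All possible annotations: one (spot, candidate entity) pair per attachment."""
    return [(s, gamma) for s in spots for gamma in s.attachments]

spot = Spot("Jaguar", 42, attachments=["Jaguar_Cars", "Jaguar_(animal)"])
pairs = gamma_list([spot])    # two candidate annotations for one spot
```

An annotation then corresponds to picking one pair per spot from this list.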

  7. Terminologies illustrated (figure showing spots on a sample page, their attachments, and the resulting gamma list)

  8. Collective entity disambiguation • Sometimes disambiguation cannot be carried out using a single spot on a page • Multiple spots on a page may be required to disambiguate an entity • All spots in an article are considered to be related

  9. Collective entity disambiguation example

  10. Calculating relatedness between Wikipedia entities • Relatedness between two entities is defined as r(γ, γ’) = g(γ) · g(γ’) • Cucerzan’s proposal defined relatedness between entities using the cosine measure • Milne et al.’s proposal: with c = number of Wikipedia pages, g(γ)[p] = 1 if page p links to page γ, 0 otherwise
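The two definitions above can be sketched in a few lines; this is a toy implementation over a tiny in-link table, not the paper's code.

```python
# Relatedness between two Wikipedia entities via in-link indicator vectors,
# per the slide: g(gamma)[p] = 1 if page p links to gamma, 0 otherwise.
import numpy as np

def link_vector(entity, inlinks, num_pages):
    """g(gamma): indicator vector over all Wikipedia pages."""
    g = np.zeros(num_pages)
    g[list(inlinks[entity])] = 1.0
    return g

def relatedness(e1, e2, inlinks, num_pages, cosine=False):
    g1 = link_vector(e1, inlinks, num_pages)
    g2 = link_vector(e2, inlinks, num_pages)
    r = g1 @ g2                      # r(gamma, gamma') = g(gamma) . g(gamma')
    if cosine:                       # Cucerzan-style cosine normalisation
        norm = np.linalg.norm(g1) * np.linalg.norm(g2)
        r = r / norm if norm else 0.0
    return r

# toy corpus of 5 pages; pages 1 and 2 link to both entities
inlinks = {"A": {0, 1, 2}, "B": {1, 2, 3}}
dot_r = relatedness("A", "B", inlinks, 5)         # counts common in-links
cos_r = relatedness("A", "B", inlinks, 5, cosine=True)
```

The unnormalised dot product simply counts common in-links; the cosine variant scales it by the entities' popularity.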

  11. Contributions of this paper • Poses entity disambiguation as an optimization problem with a single optimization objective • Solved exactly using integer linear programs • Solved approximately using heuristics • Describes rich node features with systematic learning • Describes a back-off strategy for controlled annotation

  12. Modeling compatibility between Wikipedia articles • Entities are modeled using a feature vector fs(γ) • The feature vector expresses the local textual compatibility between the (context of) spot s and a candidate label γ • Components of the feature vector • Spot side • Context of the spot • Wikipedia side • Snippet • Full text • Anchor text • Anchor text with context • Similarity measures • Dot product • Cosine similarity • Jaccard similarity
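The three similarity measures listed above can be sketched over sparse term-weight dictionaries; the example contexts are made up for illustration.

```python
# Dot product, cosine, and Jaccard similarity over sparse term vectors
# (dict of term -> weight). Toy inputs; not the paper's feature pipeline.
import math

def dot(u, v):
    return sum(u[k] * v.get(k, 0.0) for k in u)

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def jaccard(u, v):
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if a | b else 0.0

spot_ctx = {"basketball": 2.0, "chicago": 1.0}       # context of a spot
wiki_snippet = {"basketball": 3.0, "nba": 1.0}       # Wikipedia-side text
d = dot(spot_ctx, wiki_snippet)
c = cosine(spot_ctx, wiki_snippet)
j = jaccard(spot_ctx, wiki_snippet)
```

Each measure would fill one component of fs(γ) for a given spot-side/Wikipedia-side text pairing.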

  13. Methods for evaluating the model • The authors use two scores to evaluate a candidate annotation: a node score and a clique score • Node score • A linear function w · fs(γ) of the feature vector • w is a weight vector learned from training data using RankSVM • Clique score • Uses the relatedness measure of Milne and Witten • Total objective • The sum of the node scores and the clique score
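The combined objective can be sketched as below; the assumed form (node score w · fs(γ) plus pairwise relatedness over all chosen labels) follows the slide, while the helper names and toy inputs are invented for illustration.

```python
# Sketch of the total objective: sum of per-spot node scores (w . f)
# plus pairwise clique scores (relatedness of the chosen labels).
import itertools

def total_objective(assignment, features, w, relatedness):
    """assignment: {spot: entity}; features[(spot, entity)]: feature vector."""
    node = sum(
        sum(wi * fi for wi, fi in zip(w, features[(s, g)]))  # node score w . f_s(gamma)
        for s, g in assignment.items()
    )
    clique = sum(
        relatedness(g1, g2)      # e.g. Milne-Witten relatedness of the two labels
        for (_, g1), (_, g2) in itertools.combinations(assignment.items(), 2)
    )
    return node + clique

features = {("s1", "A"): [1.0, 0.0], ("s2", "B"): [0.0, 1.0]}
w = [2.0, 3.0]                                   # stand-in for the learned weights
assignment = {"s1": "A", "s2": "B"}
total = total_objective(assignment, features, w, lambda g1, g2: 0.5)
```

Maximising this total over all assignments is what the ILP and heuristic solvers on the next slides attempt.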

  14. Back-off method • Not all spots on a web page may be taggable • A special label “NA” is used for spots that cannot be tagged • Spots marked “NA” do not contribute to the clique potential • A parameter R_NA controls the aggressiveness of the tagging algorithm

  15. Implementation • Integer linear program (ILP) based formulation • Cast as a 0/1 integer linear program • Relaxed to an LP • Simpler heuristics • Hill climbing for approximate optimization
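A minimal hill-climbing sketch for the heuristic route: repeatedly relabel one spot at a time whenever doing so increases the objective. The scoring table is a toy stand-in for the real objective.

```python
# Greedy hill climbing over spot-label assignments (illustrative sketch,
# not the paper's implementation).
def hill_climb(spots, candidates, score, max_iters=100):
    """Repeatedly relabel one spot if it improves the objective; stop at a local optimum."""
    assign = {s: candidates[s][0] for s in spots}   # start from the first candidate
    best = score(assign)
    for _ in range(max_iters):
        improved = False
        for s in spots:
            for g in candidates[s]:
                trial = dict(assign)
                trial[s] = g
                v = score(trial)
                if v > best:
                    assign, best, improved = trial, v, True
        if not improved:
            break                                   # local optimum reached
    return assign, best

# toy objective favouring the coherent pair ("B", "D")
table = {("A", "C"): 1, ("A", "D"): 2, ("B", "C"): 0, ("B", "D"): 5}
score = lambda a: table[(a["s1"], a["s2"])]
candidates = {"s1": ["A", "B"], "s2": ["C", "D"]}
result = hill_climb(["s1", "s2"], candidates, score)
# result: ({'s1': 'B', 's2': 'D'}, 5)
```

Unlike the LP relaxation, this gives no optimality guarantee, but each step only rescores one spot, so it is cheap.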

  16. Evaluating the algorithm • Evaluation measures used • Precision • Number of spots tagged correctly out of the total number of spots tagged • Recall • Number of spots tagged correctly out of the total number of spots in the ground truth • F1 • The harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall)
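The three measures can be computed directly from annotation sets; the sample predictions below are invented for illustration.

```python
# Precision, recall, and F1 over sets of (spot, entity) annotation pairs.
def precision_recall_f1(predicted, gold):
    correct = len(predicted & gold)                 # spots tagged with the right entity
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean
    return p, r, f1

pred = {("s1", "A"), ("s2", "B"), ("s3", "C")}
gold = {("s1", "A"), ("s2", "X"), ("s3", "C"), ("s4", "D")}
p, r, f1 = precision_recall_f1(pred, gold)
# p = 2/3, r = 1/2, f1 = 4/7
```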

  17. Datasets used for evaluation • Web pages crawled and stored in the IITB database • Publicly available data from Cucerzan’s experiments (CZ)

  18. Experimental results

  19. Named entity disambiguation in Wikipedia • The name ambiguity problem has created demand for efficient, high-quality disambiguation methods • Not a trivial task: the application must decide whether a group of name occurrences refers to the same entity • Traditional methods of named entity disambiguation use the bag-of-words (BOW) model

  20. Wikipedia as a semantic network • Wikipedia is an open database covering most useful topics • The title of a Wikipedia article describes the content of the article • Titles may sometimes be noisy; these are filtered using rules from Hu et al.

  21. Semantic relations between Wikipedia concepts • Wikipedia contains rich relation structures within its pages • Relatedness is represented by the links between Wikipedia pages

  22. Working of named entity disambiguation using Wikipedia • Each Wikipedia entity is represented as a concept vector • Similarity between the vectors is measured to disambiguate named entities

  23. Measuring similarity between two Wikipedia entities • The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia • The algorithm works in three steps, described as follows

  24. Step 1 • To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined • Semantic relations between articles are used to align the concepts

  25. Step 2 • Compute the semantic relatedness from one concept vector representation to the other • Using the alignments from the previous step, SR(MJ1→MJ2) is computed as (0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65)/(0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62, and SR(MJ2→MJ1) is computed as (0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66)/(0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60

  26. Step 3 • Compute the similarity between two concept vector representations. • Similarity SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10 and SIM(MJ1, MJ3) is computed as 0.0.
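Steps 2 and 3 can be reproduced directly from the slide's numbers: each alignment contributes the product of the two concept weights times the pair's relatedness, normalised by the sum of the weight products, and Step 3 averages the two directions.

```python
# Reproduces the Step 2 / Step 3 arithmetic from the slides.
# Each alignment is (weight in the source vector, weight in the target vector,
# relatedness of the aligned concept pair).
def directed_sr(alignments):
    """Weighted average relatedness in one direction, SR(V1 -> V2)."""
    num = sum(w1 * w2 * rel for w1, w2, rel in alignments)
    den = sum(w1 * w2 for w1, w2, _ in alignments)
    return num / den if den else 0.0

def sim(sr_ab, sr_ba):
    """Step 3: symmetric similarity as the average of the two directions."""
    return (sr_ab + sr_ba) / 2

mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
mj2_to_mj1 = [(0.47, 0.42, 0.54), (0.52, 0.54, 0.58),
              (0.52, 0.51, 0.60), (0.51, 0.54, 0.66)]

sr12 = directed_sr(mj1_to_mj2)          # rounds to 0.62
sr21 = directed_sr(mj2_to_mj1)          # rounds to 0.60
sim12 = sim(sr12, sr21)                 # rounds to 0.61
```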

  27. Results

  28. Questions?
