
Collective annotation of Wikipedia entities in web text




Presentation Transcript


  1. Collective annotation of Wikipedia entities in web text - Presented by Avinash S Bharadwaj (1000663882)

  2. Abstract • The aims of the paper • Annotation of open-domain unstructured web text with uniquely identified entities from a collaboratively edited knowledge base like Wikipedia • Use of these annotations for search and mining tasks

  3. What is entity disambiguation? • An entity is something real with a distinct existence • Wikipedia articles can be considered entities • Entity disambiguation is the task of resolving the correspondence between mentions of entities in natural language and real-world entities • In this paper, disambiguation is carried out between spots in web pages and Wikipedia articles

  4. Entity Disambiguation Example

  5. Previous work in disambiguation • SemTag: • First web-scale disambiguation system • Annotated about 250 million web pages with IDs from the Stanford TAP knowledge base • SemTag preferred high precision over recall, averaging about two annotations per page • Wikify!: • Wikify! performed both keyword extraction and disambiguation • Wikify! could not achieve collective disambiguation across spots • Milne and Witten (M&W): • A form of collective disambiguation that gives better results than Wikify!; M&W achieves an F1 measure of 0.83, compared to 0.53 for Wikify! • Cucerzan’s algorithm: • Each entity is represented as a high-dimensional feature vector • Cucerzan annotates sparingly: only about 4.5% of all possible tokens are annotated

  6. Terminologies • Spot • An occurrence of text on a page that can possibly be linked to a Wikipedia article • Attachment • A candidate entity in Wikipedia to which a spot can be linked • Annotation • The act of making an attachment to a spot on a page • Gamma list • The list of all possible annotations
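A minimal Python sketch of these terms as data structures; the names (`Spot`, `gamma_list`) are illustrative, not from the paper.

```python
# Illustrative data structures for spots, attachments, and the gamma list.
# All names here are hypothetical, chosen to mirror the slide's terminology.
from dataclasses import dataclass, field

@dataclass
class Spot:
    text: str                 # surface text on the page, e.g. "Jaguar"
    offset: int               # character position of the spot on the page
    attachments: list = field(default_factory=list)  # candidate Wikipedia entities

def gamma_list(spots):
    """All possible annotations: one (spot, candidate entity) pair per attachment."""
    return [(s, gamma) for s in spots for gamma in s.attachments]

spot = Spot("Jaguar", 42, attachments=["Jaguar_Cars", "Jaguar_(animal)"])
pairs = gamma_list([spot])    # two candidate annotations for one spot
```

An annotation then corresponds to picking one pair per spot from this list.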

  7. Terminologies illustrated (figure showing spots on a sample page, their attachments, and the resulting gamma list)

  8. Collective entity disambiguation • Sometimes disambiguation cannot be carried out using a single spot on a page • Multiple spots on a page may be required to disambiguate an entity • All spots in an article are considered to be related

  9. Collective entity disambiguation example

  10. Calculating relatedness between Wikipedia entities • Relatedness between two entities is defined as r(γ, γ’) = g(γ) · g(γ’) • Cucerzan’s proposal defined relatedness between entities using the cosine measure • Milne et al.’s proposal: with c = number of Wikipedia pages, g(γ)[p] = 1 if page p links to page γ, 0 otherwise
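The two definitions above can be sketched in a few lines; this is a toy implementation over a tiny in-link table, not the paper's code.

```python
# Relatedness between two Wikipedia entities via in-link indicator vectors,
# per the slide: g(gamma)[p] = 1 if page p links to gamma, 0 otherwise.
import numpy as np

def link_vector(entity, inlinks, num_pages):
    """g(gamma): indicator vector over all Wikipedia pages."""
    g = np.zeros(num_pages)
    g[list(inlinks[entity])] = 1.0
    return g

def relatedness(e1, e2, inlinks, num_pages, cosine=False):
    g1 = link_vector(e1, inlinks, num_pages)
    g2 = link_vector(e2, inlinks, num_pages)
    r = g1 @ g2                      # r(gamma, gamma') = g(gamma) . g(gamma')
    if cosine:                       # Cucerzan-style cosine normalisation
        norm = np.linalg.norm(g1) * np.linalg.norm(g2)
        r = r / norm if norm else 0.0
    return r

# toy corpus of 5 pages; pages 1 and 2 link to both entities
inlinks = {"A": {0, 1, 2}, "B": {1, 2, 3}}
dot_r = relatedness("A", "B", inlinks, 5)         # counts common in-links
cos_r = relatedness("A", "B", inlinks, 5, cosine=True)
```

The unnormalised dot product simply counts common in-links; the cosine variant scales it by the entities' popularity.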

  11. Contributions of this paper • Poses entity disambiguation as an optimization problem with a single optimization objective • Solved exactly using integer linear programs • Solved approximately using heuristics • Describes rich node features with systematic learning • Describes a back-off strategy for controlled annotation

  12. Modeling compatibility between Wikipedia articles • Entities are modeled using a feature vector fs(γ) • The feature vector expresses the local textual compatibility between the (context of) spot s and a candidate label γ • Components of the feature vector • Spot side • Context of the spot • Wikipedia side • Snippet • Full text • Anchor text • Anchor text with context • Similarity measures • Dot product • Cosine similarity • Jaccard similarity
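The three similarity measures listed above can be sketched over sparse term-weight dictionaries; the example contexts are made up for illustration.

```python
# Dot product, cosine, and Jaccard similarity over sparse term vectors
# (dict of term -> weight). Toy inputs; not the paper's feature pipeline.
import math

def dot(u, v):
    return sum(u[k] * v.get(k, 0.0) for k in u)

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def jaccard(u, v):
    a, b = set(u), set(v)
    return len(a & b) / len(a | b) if a | b else 0.0

spot_ctx = {"basketball": 2.0, "chicago": 1.0}       # context of a spot
wiki_snippet = {"basketball": 3.0, "nba": 1.0}       # Wikipedia-side text
d = dot(spot_ctx, wiki_snippet)
c = cosine(spot_ctx, wiki_snippet)
j = jaccard(spot_ctx, wiki_snippet)
```

Each measure would fill one component of fs(γ) for a given spot-side/Wikipedia-side text pairing.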

  13. Methods for evaluating the model • The authors use two scores to evaluate a candidate annotation: a node score and a clique score • Node score • A linear function w · fs(γ) of the feature vector • w is a weight vector learned from training data using RankSVM • Clique score • Uses the relatedness measure of Milne and Witten • Total objective • The sum of the node scores and the clique score
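The combined objective can be sketched as below; the assumed form (node score w · fs(γ) plus pairwise relatedness over all chosen labels) follows the slide, while the helper names and toy inputs are invented for illustration.

```python
# Sketch of the total objective: sum of per-spot node scores (w . f)
# plus pairwise clique scores (relatedness of the chosen labels).
import itertools

def total_objective(assignment, features, w, relatedness):
    """assignment: {spot: entity}; features[(spot, entity)]: feature vector."""
    node = sum(
        sum(wi * fi for wi, fi in zip(w, features[(s, g)]))  # node score w . f_s(gamma)
        for s, g in assignment.items()
    )
    clique = sum(
        relatedness(g1, g2)      # e.g. Milne-Witten relatedness of the two labels
        for (_, g1), (_, g2) in itertools.combinations(assignment.items(), 2)
    )
    return node + clique

features = {("s1", "A"): [1.0, 0.0], ("s2", "B"): [0.0, 1.0]}
w = [2.0, 3.0]                                   # stand-in for the learned weights
assignment = {"s1": "A", "s2": "B"}
total = total_objective(assignment, features, w, lambda g1, g2: 0.5)
```

Maximising this total over all assignments is what the ILP and heuristic solvers on the next slides attempt.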

  14. Back-off method • Not all spots on a web page may be taggable • A special label “NA” is used for spots that cannot be tagged • Spots marked “NA” do not contribute to the clique potential • A parameter R_NA controls the aggressiveness of the tagging algorithm

  15. Implementation • Integer linear program (ILP) based formulation • Cast as a 0/1 integer linear program • Relaxed to an LP • Simpler heuristics • Hill climbing for approximate optimization
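A minimal hill-climbing sketch for the heuristic route: repeatedly relabel one spot at a time whenever doing so increases the objective. The scoring table is a toy stand-in for the real objective.

```python
# Greedy hill climbing over spot-label assignments (illustrative sketch,
# not the paper's implementation).
def hill_climb(spots, candidates, score, max_iters=100):
    """Repeatedly relabel one spot if it improves the objective; stop at a local optimum."""
    assign = {s: candidates[s][0] for s in spots}   # start from the first candidate
    best = score(assign)
    for _ in range(max_iters):
        improved = False
        for s in spots:
            for g in candidates[s]:
                trial = dict(assign)
                trial[s] = g
                v = score(trial)
                if v > best:
                    assign, best, improved = trial, v, True
        if not improved:
            break                                   # local optimum reached
    return assign, best

# toy objective favouring the coherent pair ("B", "D")
table = {("A", "C"): 1, ("A", "D"): 2, ("B", "C"): 0, ("B", "D"): 5}
score = lambda a: table[(a["s1"], a["s2"])]
candidates = {"s1": ["A", "B"], "s2": ["C", "D"]}
result = hill_climb(["s1", "s2"], candidates, score)
# result: ({'s1': 'B', 's2': 'D'}, 5)
```

Unlike the LP relaxation, this gives no optimality guarantee, but each step only rescores one spot, so it is cheap.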

  16. Evaluating the algorithm • Evaluation measures used • Precision • Number of spots tagged correctly out of the total number of spots tagged • Recall • Number of spots tagged correctly out of the total number of spots in the ground truth • F1 • The harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall)
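The three measures can be computed directly from annotation sets; the sample predictions below are invented for illustration.

```python
# Precision, recall, and F1 over sets of (spot, entity) annotation pairs.
def precision_recall_f1(predicted, gold):
    correct = len(predicted & gold)                 # spots tagged with the right entity
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean
    return p, r, f1

pred = {("s1", "A"), ("s2", "B"), ("s3", "C")}
gold = {("s1", "A"), ("s2", "X"), ("s3", "C"), ("s4", "D")}
p, r, f1 = precision_recall_f1(pred, gold)
# p = 2/3, r = 1/2, f1 = 4/7
```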

  17. Datasets used for evaluation • Web pages crawled and stored in the IITB database • Publicly available data from Cucerzan’s experiments (CZ)

  18. Experimental results

  19. Named entity disambiguation in Wikipedia • The name ambiguity problem has created demand for efficient, high-quality disambiguation methods • Not a trivial task: the application must decide whether a group of name occurrences refers to the same entity • Traditional methods of named entity disambiguation use the bag-of-words (BOW) model

  20. Wikipedia as a semantic network • Wikipedia is an open database covering most useful topics • The title of a Wikipedia article describes the content of the article • Titles may sometimes be noisy; these are filtered using rules from Hu et al.

  21. Semantic relations between Wikipedia concepts • Wikipedia contains rich relation structures within its pages • Relatedness is represented by the links between Wikipedia pages

  22. Working of named entity disambiguation using Wikipedia • Each Wikipedia entity is represented as a concept vector • Similarity between the vectors is measured to disambiguate named entities

  23. Measuring similarity between two Wikipedia entities • The similarity measure takes into account the full semantic relations indicated by hyperlinks in Wikipedia • The algorithm works in three steps, described as follows

  24. Step 1 • To measure the similarity between two vector representations, the correspondence between the concepts of one vector and those of the other has to be defined • Semantic relations between articles are used to align the concepts

  25. Step 2 • Compute the semantic relatedness from one concept vector representation to the other • Using the alignments from the previous step, SR(MJ1→MJ2) is computed as (0.42×0.47×0.54 + 0.54×0.51×0.66 + 0.51×0.51×0.65)/(0.42×0.47 + 0.54×0.51 + 0.51×0.51) = 0.62, and SR(MJ2→MJ1) is computed as (0.47×0.42×0.54 + 0.52×0.54×0.58 + 0.52×0.51×0.60 + 0.51×0.54×0.66)/(0.47×0.42 + 0.52×0.54 + 0.52×0.51 + 0.51×0.54) = 0.60

  26. Step 3 • Compute the similarity between two concept vector representations. • Similarity SIM(MJ1, MJ2) is computed as (0.60 + 0.62)/2 = 0.61, SIM(MJ2, MJ3) is computed as 0.10 and SIM(MJ1, MJ3) is computed as 0.0.
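Steps 2 and 3 can be reproduced directly from the slide's numbers: each alignment contributes the product of the two concept weights times the pair's relatedness, normalised by the sum of the weight products, and Step 3 averages the two directions.

```python
# Reproduces the Step 2 / Step 3 arithmetic from the slides.
# Each alignment is (weight in the source vector, weight in the target vector,
# relatedness of the aligned concept pair).
def directed_sr(alignments):
    """Weighted average relatedness in one direction, SR(V1 -> V2)."""
    num = sum(w1 * w2 * rel for w1, w2, rel in alignments)
    den = sum(w1 * w2 for w1, w2, _ in alignments)
    return num / den if den else 0.0

def sim(sr_ab, sr_ba):
    """Step 3: symmetric similarity as the average of the two directions."""
    return (sr_ab + sr_ba) / 2

mj1_to_mj2 = [(0.42, 0.47, 0.54), (0.54, 0.51, 0.66), (0.51, 0.51, 0.65)]
mj2_to_mj1 = [(0.47, 0.42, 0.54), (0.52, 0.54, 0.58),
              (0.52, 0.51, 0.60), (0.51, 0.54, 0.66)]

sr12 = directed_sr(mj1_to_mj2)          # rounds to 0.62
sr21 = directed_sr(mj2_to_mj1)          # rounds to 0.60
sim12 = sim(sr12, sr21)                 # rounds to 0.61
```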

  27. Results

  28. Questions?
