Explore how Wikipedia's structured links can enrich information discovery. Learn about a machine-learning approach to wikification, improving recall and precision in detecting and disambiguating links. Discover related work on topic indexing and disambiguation algorithms. Dive into the disambiguation process, balancing commonness and relatedness to determine link validity. Gain insights into the classification algorithm used and how configuration parameters impact precision and recall, with a focus on evaluation methodologies.
Learning to Link with Wikipedia David Milne, Ian H. Witten, CIKM'08 2009/12/18 Henrik Schmitz
Introduction • Wikipedia • Largest, most visited encyclopedia • Densely structured: millions of links • Guides readers to unintended information (serendipity) • Approach • Bring Wikipedia's accessibility and serendipity to all documents • Automatically find topics in unstructured text and link them to Wikipedia articles CIKM'08: Learning to Link with Wikipedia - Henrik Schmitz
Introduction WIKIFICATION!
Introduction • What's new • Wikipedia is not only the source of information • Wikipedia is also used as training data to create links • Improvements in recall and precision • In this paper • A machine-learning approach to wikification • In two stages • Link disambiguation • Link detection
Related Work: Wikify • Wikify system by Mihalcea and Csomai (2007) • The basis of this paper • Wikify also has two steps, but in the opposite order • Detection • Disambiguation • This paper's order may seem odd • But it uses disambiguation to inform detection (one key difference!)
Related Work: Wikify • Detection • Identify valuable phrases by link probability: (# articles using the term as an anchor) / (# articles mentioning the term) • Thus: find all n-grams exceeding a threshold on this probability • Precision: 53%, Recall: 56% • Disambiguation • Link detected phrases to the appropriate Wikipedia articles, resolving ambiguity • Requires enormous preprocessing: the entire Wikipedia must be parsed • Precision: 93%, Recall: 86%
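Wikify's link probability can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the count tables are invented stand-ins for statistics mined from a Wikipedia dump.

```python
# Sketch of Wikify's link probability: the fraction of articles mentioning
# a term that actually use it as the anchor text of a link.
# All counts below are illustrative, not real Wikipedia statistics.

def link_probability(term, anchor_counts, mention_counts):
    """P(link | term) = # articles using term as anchor
                        / # articles mentioning the term."""
    mentions = mention_counts.get(term, 0)
    if mentions == 0:
        return 0.0
    return anchor_counts.get(term, 0) / mentions

# Toy counts: "machine learning" is almost always linked, "the" never is.
anchor_counts = {"machine learning": 900, "the": 0}
mention_counts = {"machine learning": 1000, "the": 5_000_000}

print(link_probability("machine learning", anchor_counts, mention_counts))  # 0.9
print(link_probability("the", anchor_counts, mention_counts))               # 0.0
```

Phrases whose probability exceeds a threshold become link candidates; everything else (stop words, incidental phrases) falls away.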
Related Work: topic indexing • Medelyan et al. (2008) • Similar approach to wikification • Additionally, the most important topics are identified • This paper improves on that approach through weighting and machine learning
Disambiguation: Algorithm • Uses links found in Wikipedia articles for training • Wikipedians create these links manually and deliberately • Millions of ground-truth examples to learn from • Preparation • Wikipedia version with around 2 million links • Articles with >50 links; no lists or disambiguation pages • 700 articles • 500 for training • 100 for configuration • 100 for evaluation
Disambiguation: Algorithm • Each link in an article yields several training instances • The connection from anchor to destination is a positive example • The remaining possible destinations are negative examples
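The way one observed link fans out into training instances can be sketched as follows. The sense inventory is a made-up stand-in for Wikipedia's anchor-to-article statistics.

```python
# One Wikipedia link yields several training instances: the linked
# destination is a positive example, every other sense the anchor could
# refer to is a negative one. The sense inventory below is invented.

senses = {"tree": ["Tree", "Tree (data structure)", "Tree (graph theory)"]}

def training_instances(anchor, destination):
    """Return (sense, label) pairs for one observed link."""
    return [(s, s == destination) for s in senses[anchor]]

# A link [[Tree (data structure)|tree]] gives one positive, two negatives.
instances = training_instances("tree", "Tree (data structure)")
```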
Disambiguation: Algorithm • Commonness (prior probability) of a sense: how often it is used as the destination of that anchor in Wikipedia, relative to all uses of the anchor • Balance commonness and relatedness
Disambiguation: Algorithm • Relatedness: compare candidate senses with the surrounding context • Circular for more ambiguous terms: the context terms may themselves be ambiguous • But generally unambiguous terms exist to anchor the comparison
Disambiguation: Algorithm • Relatedness • Select the sense article that has most in common with the context articles • Relatedness between articles a and b: (log(max(|A|, |B|)) − log(|A ∩ B|)) / (log(|W|) − log(min(|A|, |B|))) • where A, B are the sets of articles linking to a and b, and W is the set of all articles in Wikipedia • Relatedness of a candidate sense is the weighted average of its relatedness to each context article
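The formula above can be implemented directly over in-link sets. Note it is a distance (adapted from Normalized Google Distance), so lower values mean more closely related; the tiny link sets below are invented for illustration.

```python
import math

# The relatedness formula from the slide, computed over the sets A and B
# of articles that link to a and b. Lower values = more closely related.
# The toy in-link sets below are invented.

def wiki_distance(A, B, W):
    """(log(max(|A|,|B|)) - log(|A∩B|)) / (log(|W|) - log(min(|A|,|B|)))"""
    A, B = set(A), set(B)
    overlap = len(A & B)
    if overlap == 0:
        return float("inf")  # no shared in-links: maximally unrelated
    numerator = math.log(max(len(A), len(B))) - math.log(overlap)
    denominator = math.log(W) - math.log(min(len(A), len(B)))
    return numerator / denominator

# Two articles whose in-link sets overlap heavily are close (small distance).
A = {"a1", "a2", "a3", "a4"}
B = {"a2", "a3", "a4", "a5", "a6"}
d = wiki_distance(A, B, W=1_000_000)  # small positive value
```

An article compared with itself has distance 0; articles with no shared in-links come out as unrelated.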
Disambiguation: Algorithm • Weight of comparisons • Do not consider all context terms equally • E.g. "the" has zero value • 1. Find the term's link probability (as in Wikify) • 2. Check the context term's relatedness to the central topic: calculate its average semantic relatedness to the other context terms, using the measure above • Weight = average of 1. and 2.
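The two-step weighting can be sketched as below; the probability and relatedness values are made-up inputs, since computing them requires the full Wikipedia statistics.

```python
# Weighting a context term: average its link probability with its mean
# relatedness to the other context terms. Input numbers are illustrative.

def context_weight(link_prob, relatedness_to_others):
    """Average of (1) link probability and (2) mean relatedness."""
    avg_rel = sum(relatedness_to_others) / len(relatedness_to_others)
    return (link_prob + avg_rel) / 2

# A term that is often linked (0.8) and well related to the rest of the
# context (0.6 and 0.4) gets weight (0.8 + 0.5) / 2 = 0.65.
w = context_weight(0.8, [0.6, 0.4])
```

Summing these weights over all context terms also yields the "context quality" feature used later.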
Disambiguation: Algorithm • Combining commonness and relatedness • Use machine learning to adjust the balance for each document • Homogeneous, plentiful context • Relatedness prioritized • Ambiguous, sparse context • Commonness prioritized • Context quality • Sum of the weights of all context terms • Already calculated
Disambiguation: Algorithm • Classifier • Resulting features capture • The number of involved terms • The extent of their relations to each other • How frequently they are used as Wikipedia links • Produces the probability that a sense is valid
Disambiguation: Configuration • One parameter • The minimum probability a sense must have to be considered • A higher threshold gains speed • More precision • But less recall • Threshold set to around 2% • Classification algorithm • C4.5 (generates a decision tree)
Disambiguation: Evaluation • Compared against the heuristic approach by Medelyan et al. • Difference: no machine learning and no weighting of context • The 100 randomly chosen articles contain 11,000 anchors, which were automatically disambiguated • This paper's approach: always ≥88% precision; 45% of articles perfect • Always ≥75% recall; 14% perfect • Recall increases by selecting all valid senses • But precision gets worse
Disambiguation: Evaluation • Advantages of this paper's approach • No parsing of text required • Fewer resources required • Less training: 500 articles versus the whole Wikipedia • Facts • PC: 3 GHz dual core, 4 GB RAM • Disambiguator trained in 13 minutes • Tested in four minutes • Three of those minutes spent loading data into memory
Detection: Algorithm • The algorithm builds on Wikify • Key difference: Wikipedia articles are used to learn which terms should be linked and which should not, and context is taken into account • Wikify's approach relies exclusively on link probability • This always makes mistakes: sometimes discarding relevant links, sometimes retaining irrelevant ones • Better: use link probability as one feature among many
Detection: Algorithm • Gather all n-grams, retain those exceeding a threshold (configured later) • Discard nonsense phrases and stop words • The remaining phrases are disambiguated using the classifier from before • Result: a set of associations between terms and Wikipedia articles, without any part-of-speech analysis
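The first detection step — gathering n-grams and keeping those above a link-probability threshold — can be sketched like this. The probability table is an invented stand-in for statistics mined from Wikipedia, and the 6.5% threshold is the value the slides settle on later.

```python
# Candidate extraction: enumerate all n-grams in the text and keep those
# whose link probability exceeds a threshold. The probability table is an
# invented stand-in for real Wikipedia statistics.

def candidate_phrases(tokens, link_prob, threshold=0.065, max_n=3):
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if link_prob.get(phrase, 0.0) > threshold:
                out.append(phrase)
    return out

link_prob = {"hillary clinton": 0.9, "clinton": 0.4, "the": 0.001}
tokens = "the speech by hillary clinton".split()
print(candidate_phrases(tokens, link_prob))  # ['clinton', 'hillary clinton']
```

Stop words like "the" fall below the threshold and are dropped without any part-of-speech analysis.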
Detection: Algorithm • Used features: • Link probability • Relatedness • Disambiguation confidence • Generality • Location and spread
Detection: Algorithm • Features: link probability • When several candidate link locations overlap (e.g. "Hillary Clinton" and "Clinton") there are multiple link probabilities • Combined into average and maximum • The average is more consistent, the maximum more indicative (e.g. "Democratic Party" vs. "Party") • Information is lost when probabilities are averaged • Features: relatedness • Average relatedness between each topic and all of the other candidates
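The average/maximum combination can be made concrete with a toy pair of probabilities (the values are illustrative):

```python
# Combining link probabilities of overlapping candidate locations
# (e.g. "Democratic Party" vs. the bare "Party") into average and
# maximum features. The probabilities are illustrative.

def combine(probs):
    """Return (average, maximum) of a list of link probabilities."""
    return sum(probs) / len(probs), max(probs)

# "Democratic Party" is almost always linked (0.9), "Party" rarely (0.1):
# the maximum preserves the strong signal the average would dilute.
avg, mx = combine([0.9, 0.1])
```

Feeding both values to the classifier lets it learn when each is the more reliable signal.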
Detection: Algorithm • Features: disambiguation confidence • Not just a yes/no judgment, but also a confidence in that answer • More confident topics get a greater chance of being linked • Also combined as average and maximum values • Features: generality • Links to specific topics are more useful than general ones • Defined as the minimum depth at which the article is located in Wikipedia's category tree
Detection: Algorithm • Features: location and spread • I.e. of the n-grams from which terms are mined • Frequency • First occurrence: is it mentioned in the introduction? • Last occurrence: is it mentioned in the conclusion? • Spread: the distance between first and last occurrence • Indicates how consistently the term is used • Must be normalized by the length of the document
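The location features above can be sketched from a topic's token offsets; the offsets and document length below are invented for illustration.

```python
# Location and spread features for one candidate topic, given the token
# offsets where it occurs. Everything is normalized by document length
# so documents of different sizes are comparable. Numbers are invented.

def location_features(occurrences, doc_length):
    first = min(occurrences) / doc_length
    last = max(occurrences) / doc_length
    return {
        "frequency": len(occurrences),
        "first": first,            # near 0: mentioned in the introduction
        "last": last,              # near 1: mentioned in the conclusion
        "spread": last - first,    # large: used consistently throughout
    }

f = location_features([10, 250, 480], doc_length=500)
```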
Detection: Configuration • Articles • Same 500 articles as for training the disambiguation classifier • Fewer disambiguation errors • Terms must be disambiguated into the appropriate articles before being used as training instances • Same 100 articles as for configuring the disambiguation classifier • One parameter: the initial link-probability threshold • Discards nonsense phrases and stop words • Trade-off between speed and precision on one side and recall on the other • Set to 6.5%
Detection: Evaluation • 100 new randomly selected articles for evaluation • Ground truth from 9,300 manually created links • All markup was stripped and the link detector run over the plain text • Recall, precision and f-measure all around 74% • A clear improvement over Wikify
Detection: Evaluation • Facts • Link detector trained in 37 minutes • Tested in eight minutes
Wikification in the Wild • What about documents not obtained from Wikipedia? • Verify with new documents and human evaluators • Experimental data • 50 documents from the AQUAINT text corpus (news) • Random stories of about 300 words (to fit evaluators' attention span) • 500 new training articles • Also around 300 words long; the 50 with the highest link proportion were selected • The classifier identified 449 link-worthy topics, on average 9 per article
Wikification in the Wild • Participants • Amazon's crowdsourcing service Mechanical Turk • Enables a labor-intensive experiment without physically gathering people • Concern about anonymous workers • Identify and reject low-quality responses and undesirable participants
Wikification in the Wild • Evaluating detected links • 449 tasks, one for each link • The original text is shown with one link • The participant specifies whether the link is valid or not • Three participants per link
Wikification in the Wild • Identifying missing links • 50 tasks, one for each article • The article is shown with all its links • The participant reads the article and can list additional Wikipedia topics • Five participants per article
Wikification in the Wild • Results • 76% of links were judged correct • 24% were not • Mostly due to incorrect candidate identification • Similar results as before • The algorithm works the same "in the wild" as on Wikipedia
Wikification in the Wild • Wikification online • The results were used to correct the automatically tagged articles and generate ground truth • A corpus with only manually verified links • www.nzdl.org/wikification
Example of an application • A tool for building cross-referenced documents • Structured knowledge about any unstructured document • A graph representation (ontology) of the discussed concepts • Links between topics indicate a significant relation • No ambiguity • Example built from the content of this paper (just a few relations shown)
Thank you for your attention! Fuhaha! The Mechanical Turk, or Automaton Chess Player, was a fake chess-playing machine constructed in the late 18th century…that allowed a human chess master hiding inside to operate the machine.