807 - TEXT ANALYTICS • Massimo Poesio • Lecture 7: Wikipedia for Text Analytics
WIKIPEDIA The free encyclopedia that anyone can edit • Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. • Wikipedia's articles have been written collaboratively by volunteers around the world. • Almost all of its articles can be edited by anyone who can access the Wikipedia website. ----http://en.wikipedia.org/wiki/Wikipedia
WIKIPEDIA • Wikipedia is: 1. domain independent • it has a large coverage 2. up-to-date • to process current information 3. multilingual • to process information in many languages
ANATOMY OF A WIKIPEDIA PAGE • Title • Abstract • Infoboxes • Geo-coordinates • Categories • Images • Links (to other languages, to other wiki pages, to the web) • Redirects • Disambiguation pages
WIKIPEDIA FOR TEXT ANALYTICS • Wikipedia has proven an extremely useful resource for text analytics, being used for • Text classification / clustering • Enriching documents through ‘Wikification’ • NER • Relation extraction • ….
Wikipedia as Thesaurus for text classification / clustering • Unlike standard ontologies such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus. • However, it is more… • Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia) • Accurate: a study by Giles (2005) found Wikipedia can compete with Encyclopædia Britannica in accuracy*. • Up to date: current and emerging concepts are incorporated in a timely manner. * Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900–901.
Wikipedia as Thesaurus • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links. • It contains a hierarchical categorization system, in which each article belongs to at least one category.
The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
Wikipedia as Thesaurus • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links. • It contains a hierarchical categorization system, in which each article belongs to at least one category. • Polysemous concepts are disambiguated by Disambiguation Pages.
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING • Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems • A number of possibilities: • Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification • Use WIKIFICATION to enrich documents • Use Wikipedia category system as category repertoire
WIKIPEDIA FOR TEXT CLASSIFICATION • Automatic identification of the topic/category of a text (e.g., computer science, psychology) • Applications: books, learning objects • [Figure: Wikipedia concepts weighted against the sentence "The United States was involved in the Cold War", e.g. United States 0.3793, Cold War 0.3111, Cat: Wars Involving the United States 0.00779, Cat: Global Conflicts 0.00779, World War I 0.0023, Vietnam War 0.0023, Ronald Reagan 0.0027, Communism 0.0027, Michail Gorbachev 0.0023]
USING WIKIPEDIA FOR TEXT CLASSIFICATION • Either directly use Wikipedia categories or map one’s categories to Wikipedia categories • Use the documents associated with those categories as training documents
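The idea above can be sketched in a few lines. This is a toy illustration, not the lecture's actual system: it treats the articles under each Wikipedia category as training documents and classifies a new text by word overlap with each category's aggregated bag of words (all category names and document strings here are made up for the example).

```python
from collections import Counter

# Toy sketch: articles grouped under a Wikipedia category serve as the
# training documents for that category.
def train(category_articles):
    return {cat: Counter(w for doc in docs for w in doc.lower().split())
            for cat, docs in category_articles.items()}

def classify(text, profiles):
    # Score each category by how often the text's words occur in its profile.
    words = text.lower().split()
    return max(profiles, key=lambda c: sum(profiles[c][w] for w in words))

profiles = train({
    "Computer science": ["algorithms and data structures",
                         "machine learning algorithms"],
    "Psychology": ["human memory and cognition",
                   "behavioral cognition studies"],
})
print(classify("new learning algorithms", profiles))  # Computer science
```

A real system would use a proper classifier (e.g., tf-idf features with an SVM) rather than raw overlap, but the data-collection step is the same.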
TEXT WIKIFICATION • Wikification = adding links to Wikipedia pages to documents
WIKIFICATION • Text: Giotto was called to work in Padua, and also in Rimini • Wikipedia: the mentions Giotto, Padua and Rimini are linked to the corresponding Wikipedia articles Truc-Vien T. Nguyen
Keyword Extraction • Finding important words/phrases in raw text • Two-stage process • Candidate extraction • Typical methods: n-grams, noun phrases • Candidate ranking • Rank the candidates by importance • Typical methods: • Unsupervised: information theoretic • Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia 1. Candidate extraction • Semi-controlled vocabulary • Wikipedia article titles and anchor texts (surface forms). • E.g. “USA”, “U.S.” = “United States of America” • More than 2,000,000 terms/phrases • Vocabulary is broad (e.g., the, a are included)
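Candidate extraction against such a vocabulary is often done by greedy longest-match over the token stream. The sketch below is an assumption about one simple way to do it (the vocabulary shown is a tiny toy subset, not the real 2,000,000-term list):

```python
# Greedy longest-match extraction of candidate keyphrases against a
# vocabulary of Wikipedia article titles and anchor texts (toy data).
def extract_candidates(tokens, vocabulary, max_len=5):
    """Scan the tokens, always preferring the longest phrase in the vocabulary."""
    candidates = []
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                match = (phrase, n)
                break
        if match:
            candidates.append(match[0])
            i += match[1]
        else:
            i += 1
    return candidates

vocab = {"United States of America", "U.S.", "Cold War", "the"}
tokens = "the U.S. was involved in the Cold War".split()
print(extract_candidates(tokens, vocab))
# ['the', 'U.S.', 'the', 'Cold War']
```

Note how even "the" is extracted: as the slide says, the vocabulary is broad, which is why the ranking stage is essential.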
Keyword Extraction using Wikipedia 2. Candidate ranking • tf * idf • Wikipedia articles as document collection • Chi-squared independence of phrase and text • The degree to which it appears more times than expected by chance • Keyphraseness: the prior probability that the phrase occurs as a link in Wikipedia
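The first ranking option, tf*idf with Wikipedia articles as the background collection, can be sketched as follows (the counts and document frequencies are invented toy numbers):

```python
import math

# Rank candidate phrases by tf*idf, using Wikipedia articles as the
# background document collection (toy statistics, for illustration only).
def tfidf_rank(candidate_counts, doc_freq, n_wiki_articles):
    scores = {}
    for phrase, tf in candidate_counts.items():
        df = doc_freq.get(phrase, 1)          # smooth unseen phrases
        scores[phrase] = tf * math.log(n_wiki_articles / df)
    return sorted(scores, key=scores.get, reverse=True)

# "cold war" appears in far fewer Wikipedia articles than "the",
# so despite a lower term frequency it ranks first.
counts = {"cold war": 2, "the": 10}
df = {"cold war": 30_000, "the": 2_900_000}
print(tfidf_rank(counts, df, 3_000_000))
# ['cold war', 'the']
```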
Our own Approach • (Cfr. Milne & Witten 2008, 2012; Ratinov et al, 2011) • Use Wikipedia dump to compute two statistics: • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article • COMMONNESS: probability that phrase is used to refer to specific Wikipedia article • Two versions of system: • UNSUPERVISED: use statistics only • SUPERVISED: use distant learning to create training data
KEYPHRASENESS • the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten’s prior link probability) • Examples: • The term "Georgia" • Is found as a link in 22631 Wikipedia articles • appears in 75000 Wikipedia articles keyphraseness = 22631/75000 = 0.3017466 • Cfr. the term “the”: keyphraseness = 0.0006
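The computation in the "Georgia" example above is just a ratio of two counts from the dump:

```python
# Keyphraseness: link count over total occurrence count in Wikipedia
# (counts taken from the slide's "Georgia" example).
def keyphraseness(n_as_link, n_occurrences):
    return n_as_link / n_occurrences

print(round(keyphraseness(22631, 75000), 7))  # 0.3017467
```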
COMMONNESS • the probability that a term t is a link to a SPECIFIC Wikipedia article a • for example, the surface form "Georgia" was found to be linked to • a1 = "University_of_Georgia" 166 times • a2 = "Republic_of_Georgia" 18 times • a3 = "Georgia_(United_States)" 5 times • commonness(t, a1) = 166/(166+18+5) = 0.8783
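Given per-target link counts for a surface form, commonness is the count for one target normalized by the total (counts from the slide's "Georgia" example):

```python
from collections import Counter

# Commonness of a surface form with respect to one target article,
# from link counts extracted from a Wikipedia dump.
def commonness(link_counts, article):
    return link_counts[article] / sum(link_counts.values())

georgia = Counter({"University_of_Georgia": 166,
                   "Republic_of_Georgia": 18,
                   "Georgia_(United_States)": 5})
print(round(commonness(georgia, "University_of_Georgia"), 4))  # 0.8783
```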
Extracting dictionaries and statistics from a Wikipedia dump • Parsing: • In three phases • Identify articles of relevance • Extract (among other things) • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles) • Set of LINKS [article|surface_form] • [[Pedanius Dioscorides|Dioscorides]]
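The link-extraction step can be sketched with a regular expression over raw wikitext. This is an assumption about the parsing, not the lecture's actual parser; note that links without a pipe (e.g. [[Padua]]) use the article title itself as the surface form:

```python
import re

# Pull [[article|surface_form]] links out of raw wikitext.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    links = []
    for target, surface in LINK_RE.findall(wikitext):
        # No pipe -> the title doubles as the surface form.
        links.append((target.strip(), (surface or target).strip()))
    return links

print(extract_links(
    "Herbs known to [[Pedanius Dioscorides|Dioscorides]] grew in [[Padua]]."))
# [('Pedanius Dioscorides', 'Dioscorides'), ('Padua', 'Padua')]
```

Aggregating these pairs over the whole dump yields the surface-form dictionary and the link-frequency statistics used above.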
The Wikipedia Dump from July 2011 • 11,459,639 pages in total • 12,525,583 links • specifying: surface form / target / frequency • ranked by frequency • for example, the mention "Georgia" is linked to • "University_of_Georgia" 166 times • "Republic_of_Georgia" 18 times • "Georgia_(United_States)" 5 times
Surface forms, titles, articles • Some definitions and figures • surface form: the occurrence of a mention inside an article • target article: the Wiki article a surface form links to • Note: files for Polish are arranged in a repository different from English/Italian
The Unsupervised Approach • Use Keyphraseness to identify candidate terms • Retain terms whose keyphraseness is above a certain threshold (currently 0.01) • Use commonness to rank • Retain top 10
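The four steps above can be sketched directly; the statistics passed in are toy values standing in for those computed from the dump:

```python
# Unsupervised wikification: threshold on keyphraseness, rank by
# commonness, keep the top k (toy statistics for illustration).
def wikify_unsupervised(terms, keyphraseness, commonness, threshold=0.01, k=10):
    candidates = [t for t in terms if keyphraseness.get(t, 0.0) > threshold]
    # commonness[t] = (most common target article, probability of that target)
    ranked = sorted(candidates, key=lambda t: commonness[t][1], reverse=True)
    return [(t, commonness[t][0]) for t in ranked[:k]]

kp = {"Georgia": 0.30, "the": 0.0006, "Cold War": 0.25}
cm = {"Georgia": ("University_of_Georgia", 0.8783),
      "Cold War": ("Cold_War", 0.95)}
print(wikify_unsupervised(["Georgia", "the", "Cold War"], kp, cm))
# [('Cold War', 'Cold_War'), ('Georgia', 'University_of_Georgia')]
```

Note how the keyphraseness threshold filters out "the" (keyphraseness 0.0006 in the earlier example) before commonness is ever consulted.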
The Supervised Approach • Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page • RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)
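Milne & Witten's link-based measure adapts the Normalized Google Distance to Wikipedia in-links: A and B are the sets of articles that link to the two candidate pages, and W is the total number of Wikipedia articles. The in-link sets below are toy data:

```python
import math

# Normalized link distance between two articles, from their in-link sets.
# Lower values mean the articles are more related.
def link_distance(A, B, n_articles):
    overlap = len(A & B)
    if overlap == 0:
        return float("inf")   # no shared in-links: maximally distant
    num = math.log(max(len(A), len(B))) - math.log(overlap)
    den = math.log(n_articles) - math.log(min(len(A), len(B)))
    return num / den

A = {1, 2, 3, 4}        # articles linking to candidate page a (toy IDs)
B = {3, 4, 5, 6, 7, 8}  # articles linking to candidate page b
print(round(link_distance(A, B, 1000), 4))
```

In a wikifier this feature scores a candidate sense by its average distance to the unambiguous Wikipedia pages already found in the surrounding text.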
Training a supervised wikifier • Using WIKIPEDIA ITSELF as source of training materials (see next)
Wikifying queries: the Bridgeman datasets • BAL Data sets: • 1049 Query set • 1 annotator, up to 3 manual annotations • 1 automatic annotation • 100 Query set • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000: Y3 Accuracy up by 17 points (36%)
The GALATEAS D2W web services • Available as open source • Deployed within LinguaGrid • API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard • Tested on 1.5M queries, achieves throughput of 600 characters per second • Integrated with LangLog tool
Use of the service in LangLog (See Domoina’s demo)
Other applications • The UK Data Archive
WIKIPEDIA FOR NER [The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …..
WIKIPEDIA FOR NER http://en.wikipedia.org/wiki/FCC: The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action .
WIKIPEDIA FOR NER • Wikipedia has been used in NER systems • As a source of features for normal NER • To automatically create training materials (DISTANT LEARNING) • To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning • Automatically extract examples • positive examples from mentions and their actual link targets in Wikipedia pages • negative examples from the same mentions paired with other candidate links • Use positive and negative examples to train model
The Supervised Approach: Using Wikipedia links to generate training data • Example • Giotto was called to work in Padua, and also in Rimini. (sentence taken from Wikipedia text, with links available) • Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer) • Dataset • +1 Giotto was called to work -- Giotto_di_Bondone • -1 Giotto was called to work -- Giotto_Griffiths • -1 Giotto was called to work -- Giotto_Bizzarrini http://en.wikipedia.org/wiki/Giotto_di_Bondone
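Generating this dataset from the Giotto example is mechanical: the actual link target gives the positive instance, every other candidate sense of the mention gives a negative one. A minimal sketch:

```python
# Distant learning: build labeled instances from a linked Wikipedia sentence.
# One positive example for the true link target, negatives for the
# other candidate senses of the same mention.
def make_training_examples(context, mention, true_target, candidate_targets):
    examples = []
    for target in candidate_targets:
        label = 1 if target == true_target else -1
        examples.append((label, context, mention, target))
    return examples

examples = make_training_examples(
    "Giotto was called to work in Padua, and also in Rimini.",
    "Giotto",
    "Giotto_di_Bondone",
    ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, _, _, target in examples:
    print(label, target)
```

This prints the +1/-1 dataset shown on the slide; a classifier trained on such instances learns to pick the right sense from features of the context and candidate page.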
MORE ADVANCED USES OF WIKIPEDIA • As a source of ONTOLOGICAL KNOWLEDGE • DBpedia
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA • Taxonomic information: category structure • Attributes: infobox, text