807 - TEXT ANALYTICS • Massimo Poesio • Lecture 7: Wikipedia for Text Analytics
WIKIPEDIA The free encyclopedia that anyone can edit • Wikipedia is a free, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. • Wikipedia's articles have been written collaboratively by volunteers around the world. • Almost all of its articles can be edited by anyone who can access the Wikipedia website. ----http://en.wikipedia.org/wiki/Wikipedia
WIKIPEDIA • Wikipedia is: 1. domain independent • it has a large coverage 2. up-to-date • to process current information 3. multilingual • to process information in many languages
ANATOMY OF A WIKIPEDIA PAGE • Title • Abstract • Infoboxes • Geo-coordinates • Categories • Images • Links (to other languages, to other wiki pages, to the web) • Redirects • Disambiguation pages
WIKIPEDIA FOR TEXT ANALYTICS • Wikipedia has proven an extremely useful resource for text analytics, being used for • Text classification / clustering • Enriching documents through ‘Wikification’ • NER • Relation extraction • ….
Wikipedia as Thesaurus for text classification / clustering • Unlike standard ontologies such as WordNet and MeSH, Wikipedia itself is not a structured thesaurus. • However, it is more… • Comprehensive: it contains 12 million articles (2.8 million in the English Wikipedia) • Accurate: a study by Giles (2005) found Wikipedia can compete with Encyclopædia Britannica in accuracy*. • Up to date: current and emerging concepts are incorporated in a timely manner. * Giles, J. 2005. Internet encyclopaedias go head to head. Nature 438: 900–901.
Wikipedia as Thesaurus • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links. • It contains a hierarchical categorization system, in which each article belongs to at least one category.
The concept Artificial Intelligence belongs to four categories: Artificial intelligence, Cybernetics, Formal sciences & Technology in society
Wikipedia as Thesaurus • Moreover, Wikipedia has a well-formed structure • Each article only describes a single concept • The title of the article is a short and well-formed phrase like a term in a traditional thesaurus. • Equivalent concepts are grouped together by redirected links. • It contains a hierarchical categorization system, in which each article belongs to at least one category. • Polysemous concepts are disambiguated by Disambiguation Pages.
The different meanings that Artificial intelligence may refer to are listed in its disambiguation page.
WIKIPEDIA FOR TEXT CATEGORIZATION / CLUSTERING • Objective: use information in Wikipedia to improve performance of text classifiers / clustering systems • A number of possibilities: • Use similarity between documents and Wikipedia pages on a given topic as a feature for text classification • Use WIKIFICATION to enrich documents • Use Wikipedia category system as category repertoire
WIKIPEDIA FOR TEXT CLASSIFICATION • Automatic identification of the topic/category of a text (e.g., computer science, psychology) • Applications: books, learning objects • [Figure: Wikipedia concepts weighted against the sentence "The United States was involved in the Cold War", e.g. United States 0.3793, Cold War 0.3111, Cat: Wars Involving the United States 0.00779, Cat: Global Conflicts 0.00779, World War I 0.0023, Vietnam War 0.0023, Ronald Reagan 0.0027, Communism 0.0027, Michail Gorbachev 0.0023]
USING WIKIPEDIA FOR TEXT CLASSIFICATION • Either directly use Wikipedia categories or map one’s categories to Wikipedia categories • Use the documents associated with those categories as training documents
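The idea above can be sketched in a few lines. This is a toy illustration, not the lecture's actual system: it treats the articles under each Wikipedia category as training documents and classifies a new text by word overlap with each category's aggregated bag of words (all category names and document strings here are made up for the example).

```python
from collections import Counter

# Toy sketch: articles grouped under a Wikipedia category serve as the
# training documents for that category.
def train(category_articles):
    return {cat: Counter(w for doc in docs for w in doc.lower().split())
            for cat, docs in category_articles.items()}

def classify(text, profiles):
    # Score each category by how often the text's words occur in its profile.
    words = text.lower().split()
    return max(profiles, key=lambda c: sum(profiles[c][w] for w in words))

profiles = train({
    "Computer science": ["algorithms and data structures",
                         "machine learning algorithms"],
    "Psychology": ["human memory and cognition",
                   "behavioral cognition studies"],
})
print(classify("new learning algorithms", profiles))  # Computer science
```

A real system would use a proper classifier (e.g., tf-idf features with an SVM) rather than raw overlap, but the data-collection step is the same.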
TEXT WIKIFICATION • Wikification = adding links to Wikipedia pages to documents
WIKIFICATION • Text: Giotto was called to work in Padua, and also in Rimini • Wikipedia: the mentions Giotto, Padua and Rimini are linked to the corresponding Wikipedia articles Truc-Vien T. Nguyen
Keyword Extraction • Finding important words/phrases in raw text • Two-stage process • Candidate extraction • Typical methods: n-grams, noun phrases • Candidate ranking • Rank the candidates by importance • Typical methods: • Unsupervised: information theoretic • Supervised: machine learning using positional and linguistic features
Keyword Extraction using Wikipedia 1. Candidate extraction • Semi-controlled vocabulary • Wikipedia article titles and anchor texts (surface forms). • E.g. “USA”, “U.S.” = “United States of America” • More than 2,000,000 terms/phrases • Vocabulary is broad (e.g., the, a are included)
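Candidate extraction against such a vocabulary is often done by greedy longest-match over the token stream. The sketch below is an assumption about one simple way to do it (the vocabulary shown is a tiny toy subset, not the real 2,000,000-term list):

```python
# Greedy longest-match extraction of candidate keyphrases against a
# vocabulary of Wikipedia article titles and anchor texts (toy data).
def extract_candidates(tokens, vocabulary, max_len=5):
    """Scan the tokens, always preferring the longest phrase in the vocabulary."""
    candidates = []
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary:
                match = (phrase, n)
                break
        if match:
            candidates.append(match[0])
            i += match[1]
        else:
            i += 1
    return candidates

vocab = {"United States of America", "U.S.", "Cold War", "the"}
tokens = "the U.S. was involved in the Cold War".split()
print(extract_candidates(tokens, vocab))
# ['the', 'U.S.', 'the', 'Cold War']
```

Note how even "the" is extracted: as the slide says, the vocabulary is broad, which is why the ranking stage is essential.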
Keyword Extraction using Wikipedia 2. Candidate ranking • tf * idf • Wikipedia articles as document collection • Chi-squared independence of phrase and text • The degree to which it appears more times than expected by chance • Keyphraseness: the prior probability that the phrase occurs as a link in Wikipedia
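The first ranking option, tf*idf with Wikipedia articles as the background collection, can be sketched as follows (the counts and document frequencies are invented toy numbers):

```python
import math

# Rank candidate phrases by tf*idf, using Wikipedia articles as the
# background document collection (toy statistics, for illustration only).
def tfidf_rank(candidate_counts, doc_freq, n_wiki_articles):
    scores = {}
    for phrase, tf in candidate_counts.items():
        df = doc_freq.get(phrase, 1)          # smooth unseen phrases
        scores[phrase] = tf * math.log(n_wiki_articles / df)
    return sorted(scores, key=scores.get, reverse=True)

# "cold war" appears in far fewer Wikipedia articles than "the",
# so despite a lower term frequency it ranks first.
counts = {"cold war": 2, "the": 10}
df = {"cold war": 30_000, "the": 2_900_000}
print(tfidf_rank(counts, df, 3_000_000))
# ['cold war', 'the']
```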
Our own Approach • (Cfr. Milne & Witten 2008, 2012; Ratinov et al, 2011) • Use Wikipedia dump to compute two statistics: • KEYPHRASENESS: prior probability that a term is used to refer to a Wikipedia article • COMMONNESS: probability that phrase is used to refer to specific Wikipedia article • Two versions of system: • UNSUPERVISED: use statistics only • SUPERVISED: use distant learning to create training data
KEYPHRASENESS • the probability that a term t is a link to a Wikipedia article (cfr. Milne & Witten’s prior link probability) • Examples: • The term "Georgia" • Is found as a link in 22631 Wikipedia articles • appears in 75000 Wikipedia articles keyphraseness = 22631/75000 = 0.3017466 • Cfr. the term “the”: keyphraseness = 0.0006
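The computation in the "Georgia" example above is just a ratio of two counts from the dump:

```python
# Keyphraseness: link count over total occurrence count in Wikipedia
# (counts taken from the slide's "Georgia" example).
def keyphraseness(n_as_link, n_occurrences):
    return n_as_link / n_occurrences

print(round(keyphraseness(22631, 75000), 7))  # 0.3017467
```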
COMMONNESS • the probability that a term t is a link to a SPECIFIC Wikipedia article a • for example, the surface form "Georgia" was found to be linked to • a1 = "University_of_Georgia" 166 times • a2 = "Republic_of_Georgia" 18 times • a3 = "Georgia_(United_States)" 5 times • commonness(t, a1) = 166/(166+18+5) = 0.8783
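Given per-target link counts for a surface form, commonness is the count for one target normalized by the total (counts from the slide's "Georgia" example):

```python
from collections import Counter

# Commonness of a surface form with respect to one target article,
# from link counts extracted from a Wikipedia dump.
def commonness(link_counts, article):
    return link_counts[article] / sum(link_counts.values())

georgia = Counter({"University_of_Georgia": 166,
                   "Republic_of_Georgia": 18,
                   "Georgia_(United_States)": 5})
print(round(commonness(georgia, "University_of_Georgia"), 4))  # 0.8783
```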
Extracting dictionaries and statistics from a Wikipedia dump • Parsing: • In three phases • Identify articles of relevance • Extract (among other things) • Set of SURFACE FORMS (terms that are used to link to Wikipedia articles) • Set of LINKS [article|surface_form] • [[Pedanius Dioscorides|Dioscorides]]
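The link-extraction step can be sketched with a regular expression over raw wikitext. This is an assumption about the parsing, not the lecture's actual parser; note that links without a pipe (e.g. [[Padua]]) use the article title itself as the surface form:

```python
import re

# Pull [[article|surface_form]] links out of raw wikitext.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_links(wikitext):
    links = []
    for target, surface in LINK_RE.findall(wikitext):
        # No pipe -> the title doubles as the surface form.
        links.append((target.strip(), (surface or target).strip()))
    return links

print(extract_links(
    "Herbs known to [[Pedanius Dioscorides|Dioscorides]] grew in [[Padua]]."))
# [('Pedanius Dioscorides', 'Dioscorides'), ('Padua', 'Padua')]
```

Aggregating these pairs over the whole dump yields the surface-form dictionary and the link-frequency statistics used above.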
The Wikipedia Dump from July 2011 • 11,459,639 pages in total • 12,525,583 links • specifying: surface form / target / frequency • ranked by frequency • for example, the mention "Georgia" is linked to • "University_of_Georgia" 166 times • "Republic_of_Georgia" 18 times • "Georgia_(United_States)" 5 times
Surface forms, titles, articles • Some definitions and figures • surface form: the occurrence of a mention inside an article • target article: the Wiki article a surface form links to • Note: files for Polish are arranged in a repository different from English/Italian
The Unsupervised Approach • Use Keyphraseness to identify candidate terms • Retain terms whose keyphraseness is above a certain threshold (currently 0.01) • Use commonness to rank • Retain top 10
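The four steps above can be sketched directly; the statistics passed in are toy values standing in for those computed from the dump:

```python
# Unsupervised wikification: threshold on keyphraseness, rank by
# commonness, keep the top k (toy statistics for illustration).
def wikify_unsupervised(terms, keyphraseness, commonness, threshold=0.01, k=10):
    candidates = [t for t in terms if keyphraseness.get(t, 0.0) > threshold]
    # commonness[t] = (most common target article, probability of that target)
    ranked = sorted(candidates, key=lambda t: commonness[t][1], reverse=True)
    return [(t, commonness[t][0]) for t in ranked[:k]]

kp = {"Georgia": 0.30, "the": 0.0006, "Cold War": 0.25}
cm = {"Georgia": ("University_of_Georgia", 0.8783),
      "Cold War": ("Cold_War", 0.95)}
print(wikify_unsupervised(["Georgia", "the", "Cold War"], kp, cm))
# [('Cold War', 'Cold_War'), ('Georgia', 'University_of_Georgia')]
```

Note how the keyphraseness threshold filters out "the" (keyphraseness 0.0006 in the earlier example) before commonness is ever consulted.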
The Supervised Approach • Features: in addition to commonness, use measures of SIMILARITY between the text containing the term and the candidate Wikipedia page • RELATEDNESS: a measure of similarity between the LINKS (cfr. Milne & Witten's NORMALIZED LINK DISTANCE)
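Milne & Witten's link-based measure adapts the Normalized Google Distance to Wikipedia in-links: A and B are the sets of articles that link to the two candidate pages, and W is the total number of Wikipedia articles. The in-link sets below are toy data:

```python
import math

# Normalized link distance between two articles, from their in-link sets.
# Lower values mean the articles are more related.
def link_distance(A, B, n_articles):
    overlap = len(A & B)
    if overlap == 0:
        return float("inf")   # no shared in-links: maximally distant
    num = math.log(max(len(A), len(B))) - math.log(overlap)
    den = math.log(n_articles) - math.log(min(len(A), len(B)))
    return num / den

A = {1, 2, 3, 4}        # articles linking to candidate page a (toy IDs)
B = {3, 4, 5, 6, 7, 8}  # articles linking to candidate page b
print(round(link_distance(A, B, 1000), 4))
```

In a wikifier this feature scores a candidate sense by its average distance to the unambiguous Wikipedia pages already found in the surrounding text.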
Training a supervised wikifier • Using WIKIPEDIA ITSELF as source of training materials (see next)
Wikifying queries: the Bridgeman datasets • BAL Data sets: • 1049 Query set • 1 annotator, up to 3 manual annotations • 1 automatic annotation • 100 Query set • 3 annotators, each up to 3 manual annotations
Results on Bridgeman 1000: Y3 Accuracy up by 17 points (36%)
The GALATEAS D2W web services • Available as open source • Deployed within LinguaGrid • API based on the Morphosyntactic Annotation Framework (MAF), an ISO standard • Tested on 1.5M queries, achieves throughput of 600 characters per second • Integrated with LangLog tool
Use of the service in LangLog (See Domoina’s demo)
Other applications • The UK Data Archive
WIKIPEDIA FOR NER [The FCC] took [three specific actions] regarding [AT&T]. By a 4-0 vote, it allowed AT&T to continue offering special discount packages to big customers, called Tariff 12, rejecting appeals by AT&T competitors that the discounts were illegal. …..
WIKIPEDIA FOR NER http://en.wikipedia.org/wiki/FCC: The Federal Communications Commission (FCC) is an independent United States government agency, created, directed, and empowered by Congressional statute (see 47 U.S.C. § 151 and 47 U.S.C. § 154).
WIKIPEDIA FOR NER Number of glucocorticoid receptors in lymphocytes and their sensitivity to hormone action .
WIKIPEDIA FOR NER • Wikipedia has been used in NER systems • As a source of features for normal NER • To automatically create training materials (DISTANT LEARNING) • To go beyond NE tagging towards proper ENTITY DISAMBIGUATION
Distant learning • Automatically extract examples • positive examples from mentions and their actual link targets in Wikipedia pages • negative examples from the same mentions paired with other candidate links • Use positive and negative examples to train model
The Supervised Approach: Using Wikipedia links to generate training data • Example • Giotto was called to work in Padua, and also in Rimini. (sentence taken from Wikipedia text, with links available) • Giotto_di_Bondone (painter), Giotto_Griffiths (Welsh rugby player), Giotto_Bizzarrini (automobile engineer) • Dataset • +1 Giotto was called to work -- Giotto_di_Bondone • -1 Giotto was called to work -- Giotto_Griffiths • -1 Giotto was called to work -- Giotto_Bizzarrini http://en.wikipedia.org/wiki/Giotto_di_Bondone
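Generating this dataset from the Giotto example is mechanical: the actual link target gives the positive instance, every other candidate sense of the mention gives a negative one. A minimal sketch:

```python
# Distant learning: build labeled instances from a linked Wikipedia sentence.
# One positive example for the true link target, negatives for the
# other candidate senses of the same mention.
def make_training_examples(context, mention, true_target, candidate_targets):
    examples = []
    for target in candidate_targets:
        label = 1 if target == true_target else -1
        examples.append((label, context, mention, target))
    return examples

examples = make_training_examples(
    "Giotto was called to work in Padua, and also in Rimini.",
    "Giotto",
    "Giotto_di_Bondone",
    ["Giotto_di_Bondone", "Giotto_Griffiths", "Giotto_Bizzarrini"],
)
for label, _, _, target in examples:
    print(label, target)
```

This prints the +1/-1 dataset shown on the slide; a classifier trained on such instances learns to pick the right sense from features of the context and candidate page.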
MORE ADVANCED USES OF WIKIPEDIA • As a source of ONTOLOGICAL KNOWLEDGE • DBpedia
SEMANTIC NETWORK KNOWLEDGE IN WIKIPEDIA • Taxonomic information: category structure • Attributes: infobox, text