
Natural Language Processing using Wikipedia



  1. Natural Language Processing using Wikipedia • Rada Mihalcea, University of North Texas

  2. Text Wikification • Finding key terms in documents and linking them to relevant encyclopedic information.

  3. Text Wikification (continued) • Motivation: • Help Wikipedia contributors • NLP applications (summarization, text categorization, metadata annotation, text similarity) • Enrich educational materials • Annotating web pages (semantic web) • Combined problem • Finding the important concepts • Keyword extraction • Finding the correct article • Word sense disambiguation

  4. Wikification pipeline

  5. Keyword Extraction • Finding important words/phrases in raw text • Two-stage process • Candidate extraction • Typical methods: n-grams, noun phrases • Candidate ranking • Rank the candidates by importance • Typical methods: • Unsupervised: information theoretic • Supervised: machine learning using positional and linguistic features
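A minimal Python sketch of stage 1 (candidate extraction) is given below; the tokenizer and the maximum n-gram length are assumptions, not the exact system settings. Candidate ranking is sketched after slides 6 and 7.

```python
# Stage 1: candidate extraction with word n-grams (a common baseline method).
import re

def ngram_candidates(text, max_n=3):
    """Return all word n-grams of length 1..max_n as keyword candidates."""
    tokens = re.findall(r"\w+", text.lower())
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(tokens[i:i + n]))
    return candidates

print(sorted(ngram_candidates("finding important words in raw text", max_n=2)))
```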

  6. Keyword Extraction using Wikipedia 1. Candidate extraction • Semi-controlled vocabulary • Wikipedia article titles and anchor texts (surface forms). • E.g. “USA”, “U.S.” = “United States of America” • More than 2,000,000 terms/phrases • Vocabulary is broad (e.g., the, a are included)
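Restricting candidates to the Wikipedia-derived vocabulary can be sketched as a lookup against a surface-form dictionary; the toy dictionary below is an assumption standing in for the 2,000,000+ titles and anchor texts, and the n-gram generator is reused from the earlier sketch.

```python
# Stage 1 with a semi-controlled vocabulary: keep only n-grams that occur as a
# Wikipedia article title or anchor text, and normalize each surface form to
# the article it points to. Reuses ngram_candidates from the sketch above.
SURFACE_TO_ARTICLE = {                      # toy stand-in for the full vocabulary
    "united states of america": "United States of America",
    "usa": "United States of America",
    "keyword extraction": "Keyword extraction",
}

def wiki_candidates(text, surface_to_article=SURFACE_TO_ARTICLE, max_n=4):
    """Map surface forms found in the text to Wikipedia article titles."""
    return {ngram: surface_to_article[ngram]
            for ngram in ngram_candidates(text, max_n)
            if ngram in surface_to_article}

print(wiki_candidates("Keyword extraction was evaluated on news from the USA."))
```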

  7. Keyword Extraction using Wikipedia 2. Candidate ranking • tf * idf • Wikipedia articles as document collection • Chi-squared independence of phrase and text • The degree to which it appeared more times than expected by chance • Keyphraseness: the probability that a phrase is selected as a keyword (link) when it appears in a Wikipedia article
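The keyphraseness ranker can be sketched as the fraction of Wikipedia articles containing a phrase in which that phrase is also used as a link; the count dictionaries below are assumed to be precomputed from a Wikipedia dump, and tf*idf or chi-squared could be swapped in as alternative scorers.

```python
# Stage 2: rank candidates by keyphraseness, i.e. how often a phrase is chosen
# as a link when it appears in Wikipedia articles.
def keyphraseness(phrase, link_count, appearance_count):
    """Estimate P(keyword | phrase) = count(D_link) / count(D_phrase)."""
    seen = appearance_count.get(phrase, 0)
    return link_count.get(phrase, 0) / seen if seen else 0.0

def top_keywords(candidates, link_count, appearance_count, n_words, ratio=0.06):
    """Keep roughly ratio * n_words keywords, as in the evaluation setup."""
    ranked = sorted(candidates,
                    key=lambda c: keyphraseness(c, link_count, appearance_count),
                    reverse=True)
    return ranked[:max(1, int(ratio * n_words))]
```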

  8. Evaluations • Gold standard • 85 documents containing 7,286 links • Links selected by Wikipedia users • Have undergone the continuous editorial process of Wikipedia • Extract N keywords from the ranking • N = 6% of the number of words

  9. Results

  10. Example Keyword Extraction

  11. Wikification Pipeline

  12. Word Sense Disambiguation • Aida (café): In most shops a quick coffee while standing up at the bar is possible. • Channel: A channel is also the natural or man-made deeper course through a reef, bar, bay, or any shallow body of water. • Meter: Each bar has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit.

  13. Aida (café): In most shops a quick coffee while standing up at the bar is possible.

  14. Wikipedia as a Sense Tagged Corpus Wikipedia links = Sense annotations • In most shops a quick coffee while standing up at the [[bar (counter)| bar]] is possible. • A channel is also the natural or man-made deeper course through a reef, [[bar (landform)| bar]], bay, or any shallow body of water. • Each [[bar (music)| bar]] has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit.
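Reading piped links as sense annotations can be sketched with a small parser; the regex below is an assumption about well-formed [[target|surface]] links, not the system's actual wikitext handling.

```python
# Parse [[article title | surface form]] links and treat the article title as
# the sense label of the surface form.
import re

PIPED_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")

def sense_annotations(wikitext):
    """Yield (surface form, Wikipedia label) pairs from piped links."""
    for label, surface in PIPED_LINK.findall(wikitext):
        yield surface.strip(), label.strip()

text = ("In most shops a quick coffee while standing up at the "
        "[[bar (counter)| bar]] is possible.")
print(list(sense_annotations(text)))       # [('bar', 'bar (counter)')]
```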

  15. Sense Inventory • Alternative 1: disambiguation webpages • Does not include all possible annotations • [[measure (music) | bar]]: measure (music) not listed • Inconsistent • identifier of disambiguation page: paper (disambiguation) vs. paper • Alternative 2: extract all link annotations • bar (counter), bar (music), bar (landform) • map them to WordNet senses

  16. Building a Sense Tagged Corpus Given ambiguous word W • Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link • Collect all the possible Wikipedia labels = leftmost component of each link • Map the Wikipedia labels to WordNet senses
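The corpus-building steps above might be sketched as follows; the paragraph list and label handling are illustrative assumptions, and the final mapping of Wikipedia labels to WordNet senses is left as a separate mapping step.

```python
# Group Wikipedia paragraphs by the label used when linking the ambiguous word,
# producing a sense-tagged corpus keyed by Wikipedia label.
import re
from collections import defaultdict

def build_sense_tagged_corpus(paragraphs, word):
    link = re.compile(r"\[\[([^\]|]+)\|\s*" + re.escape(word) + r"\s*\]\]",
                      re.IGNORECASE)
    corpus = defaultdict(list)                 # Wikipedia label -> examples
    for paragraph in paragraphs:
        for label in link.findall(paragraph):
            corpus[label.strip()].append(paragraph)
    return corpus

# A separate, manually curated mapping then assigns each Wikipedia label
# (e.g. "bar (counter)") to the corresponding WordNet sense.
```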

  17. An Example Given ambiguous word W = BAR • Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link • 1,217 paragraphs • remove examples with [[bar]] (ambiguous): 1,108 examples • Collect all the possible Wikipedia labels = leftmost component of each link • 40 Wikipedia labels • bar (music); measure (music); musical notation • Map the Wikipedia labels to WordNet senses • 9 WordNet senses

  18. [Table: for each word sense, the Wikipedia label and Wikipedia definition aligned with the corresponding WordNet definition]

  19. Supervised Word Sense Disambiguation • Local and topical features in a Naïve Bayes classifier • Good performance on Senseval-2 and Senseval-3 data • Local features • Current word and part-of-speech • Surrounding context of three words • Collocational features • Topical features • Five keywords per sense, occurring at least three times • (Ng & Lee, 1996), (Lee & Ng, 2002)
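A rough sketch of the classifier setup, assuming scikit-learn and simplifying the local/topical features to bag-of-words and bigram context features (not the exact feature set of Ng & Lee, 1996 / Lee & Ng, 2002):

```python
# Naive Bayes word sense disambiguation over context features extracted from
# sense-tagged examples; the feature extraction is deliberately simplified.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_wsd_classifier(contexts, sense_labels):
    """contexts: text windows around the target word;
       sense_labels: the Wikipedia/WordNet sense of each occurrence."""
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(contexts, sense_labels)
    return model

# model = train_wsd_classifier(train_contexts, train_senses)
# model.predict(["a quick coffee while standing up at the bar"])
```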

  20. Experiments on Senseval-2 / Senseval-3 Lexical sample WSD • 49 ambiguous nouns from Senseval-2 (29), Senseval-3 (20) • Remove the words with one Wikipedia sense • detention • Remove the words with all Wikipedia senses mapped to one WordNet sense • Roman church, Catholic church → Catholic church • Final set: 30 nouns with Wikipedia labels mapped to at least two WordNet senses

  21. Ten-fold cross validations • [WSD] Supervised word sense disambiguation on Wikipedia sense tagged corpora • [MFS] Most frequent sense: choose the most frequent sense by default • [Similarity] Similarity between current example and training data available for each sense
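The [WSD] and [MFS] settings could be sketched as below, assuming the classifier and labeled data from the previous sketch; the similarity-based method is sketched after slide 24.

```python
# [MFS] most-frequent-sense baseline and [WSD] ten-fold cross validation of
# the supervised classifier.
from collections import Counter
from sklearn.model_selection import cross_val_score

def mfs_accuracy(sense_labels):
    """Accuracy of always predicting the most frequent sense."""
    counts = Counter(sense_labels)
    return counts.most_common(1)[0][1] / len(sense_labels)

def wsd_cv_accuracy(model, contexts, sense_labels):
    """Mean accuracy over ten folds of the Wikipedia sense-tagged corpus."""
    return cross_val_score(model, contexts, sense_labels, cv=10).mean()
```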

  22. Results on Senseval-2 / Senseval-3

  23. Some Notes • Words with no improvement • Small number of examples in Wikipedia • restraint (9), shelter (17) • Skewed sense distributions • bank: 1044 occurrences as “financial institution”, 30 occurrences as “river bank” • Different granularity • Coarser grained senses in Wikipedia • Missing senses: atmosphere: ambiance • Coarse distinctions: grasp: act of grasping (#1) = hold (#2) • Exceptions: dance performance, theatre performance

  24. Experiments on Wikipedia All-words WSD • “Link disambiguation” • Find the link assigned by the Wikipedia annotators • Data set • The same data set used in keyword evaluation • 85 documents containing 7,286 links • Three methods • Supervised • Similarity • Unsupervised: measure similarity of context and candidate article • Combined: voting
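The unsupervised method (similarity of context and candidate article) could be sketched with TF-IDF cosine similarity; the exact similarity measure used in the system may differ.

```python
# Score each candidate Wikipedia article by the similarity between its text
# and the local context of the link, then keep the best-scoring article.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_article(context, candidate_articles):
    """candidate_articles: dict mapping article title -> article text."""
    titles = list(candidate_articles)
    texts = [candidate_articles[t] for t in titles]
    vectorizer = TfidfVectorizer().fit([context] + texts)
    context_vec = vectorizer.transform([context])
    article_vecs = vectorizer.transform(texts)
    scores = cosine_similarity(context_vec, article_vecs)[0]
    return max(zip(titles, scores), key=lambda ts: ts[1])[0]
```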

  25. Results

  26. Wikification

  27. Wikify! system (http://lit.csci.unt.edu/~wikify/ or www.wikifyer.com)

  28. Overall System Evaluation • Turing-like test • Annotation of educational materials

  29. Turing-like Test • Given a Wikipedia article, decide if it was annotated by humans or our automated system

  30. Turing-like Test • 20 test subjects (mixed background) • 10 document pairs for each subject (side by side) • Average accuracy: 57% • Ideal case = 50% success rate (total confusion)

  31. Annotation of Educational Materials • Studies in cognitive science • “An important part of the learning process is the ability to connect the learning material to the prior knowledge of the learner” (Walter Kintsch, 1998) • Amount of required background material • Depends on the level of explicitness of the text • Knowledge of the learner • Low-knowledge vs. high-knowledge learners • Use the text wikifier to facilitate access to background knowledge

  32. A History Test • A test consisting of 14 questions taken from a quiz in an online history course at UNT • Multiple-choice questions • Half the questions linked to Wikipedia, half left in their original format • 60 students taking the test • Either the first or the last 7 questions were wikified, chosen at random • Students were instructed: • they were allowed to use any information they wanted to answer the questions • they were not required to use the Wikipedia links

  33. Results (p<0.1) (p<0.05)

  34. Lessons Learned • Wikipedia can be used as a source of evidence for text processing tasks • Keyword extraction • Word sense disambiguation • Text wikification: linking documents to encyclopedic knowledge • Enrich educational materials • Annotation of web pages (semantic web) • NLP applications • summarization, information retrieval, text categorization • text adaptation, topic identification, multilingual semantic networks

  35. Ongoing Work: Text Adaptation • Planning for a Long Trip (Magellan’s Stories): “Serrao’s letters helped build in my mind the location of the Spice Islands, which later became the destination for my great voyage. I asked the King of Portugal to support my journey, but he refused. After that, I begged the King of Spain. He was interested in my plan since Spain was looking for a better sea route to Asia than the Portuguese route around the southern tip of Africa. It was going to be hard to find sailors, though. None of the Spanish sailors wanted to sail with me because I was Portuguese.” • Def: long trip with a specific objective, esp. by sea or air; En: trip, journey; Es: travesía, viaje • Def: to travel by boat; En: navigate; Es: salir, navegar • Funded by the National Science Foundation under CAREER IIS-0747340, 2008-2013 • Collaboration with Educational Testing Service (ETS)

  36. Ongoing Work: Topic Identification • Automatic identification of the topic/category of a text (e.g., computer science, psychology) • Applied to books and learning objects • Example: “The United States was involved in the Cold War.” – linked Wikipedia concepts and categories with weights: United States 0.3793, Cold War 0.3111, Vietnam War 0.0023, World War I 0.0023, Ronald Reagan 0.0027, Communism 0.0027, Mikhail Gorbachev 0.0023, Cat: Wars Involving the United States 0.00779, Cat: Global Conflicts 0.00779 • Funded by the Texas Higher Education Coord. Board, Google 2008-2010

  37. Ongoing Work: Multilingual Semantic Networks • Example: “John Williams served as the principal conductor of the Boston Pops Orchestra” • [Figure: multilingual semantic network with nodes MUSICIAN, COMPOSER, PIANIST, CONDUCTOR, ORCHESTRA, BOSTON POPS ORCHESTRA, JOHN WILLIAMS, and CONDUCTOR OF THE BOSTON POPS ORCHESTRA, connected by isA / instanceOf / partOf relations; each node carries English, French, and German labels, e.g. En: conductor, Fr: chef d’orchestre, De: Dirigent] • Funded by the National Science Foundation under IIS-1018613, 2010-2013

  38. Thank You! Questions?

  39. Wikipedia for Natural Language Processing • Word similarity • (Strube & Ponzetto, 2006) • (Gabrilovich & Markovitch, 2007) • Text categorization • (Gabrilovich & Markovitch, 2006) • Named entity disambiguation • (Bunescu & Pasca, 2006)

  40. Wikipedia vs. WordNet (Senseval) • Different granularity • Coarser grained senses in Wikipedia • Missing senses: atmosphere: ambiance • Coarse distinctions: grasp: act of grasping (#1) = hold (#2) • Exceptions: dance performance, theatre performance • Wikipedia vs. Senseval – different sense distribution • Low sense distribution correlation r = 0.51

  41. Sense Disambiguation Learning Curve • Disambiguation accuracy using 10%, 20%… 100% of the data

  42. Text Wikification • Finding key terms in documents and linking them to relevant encyclopedic information.

  43. Text Wikification • Finding key terms in documents and linking them to relevant encyclopedic information.

  44. Text Wikification • Finding key terms in documents and linking them to relevant encyclopedic information.

  45. Text Wikification • Finding key terms in documents and linking them to relevant encyclopedic information.

  46. Lexical Semantics • Find the meaning of all words in unrestricted text • Required for automatic machine translation, information retrieval, text understanding • SenseLearner – minimally supervised learning • Senseval-2, Senseval-3, Semeval (Semeval @ ACL 2007) • Publicly available: http://lit.csci.unt.edu/~senselearner • GWSD – unsupervised graph-based algorithms • Random walks on text structures • Find the most central meanings in a text • http://lit.csci.unt.edu/index.php/Downloads • Funded by the National Science Foundation
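A very rough sketch of the graph-based idea behind GWSD (random walks over a graph of candidate senses); the graph construction, similarity function, and use of PageRank here are assumptions for illustration, not the released system.

```python
# Build a graph whose nodes are candidate senses of the words in a text, weight
# edges by sense-to-sense similarity, and rank nodes with a random walk
# (PageRank); each word keeps its most "central" sense.
import networkx as nx

def graph_wsd(candidate_senses, sense_similarity):
    """candidate_senses: {word: [sense, ...]};
       sense_similarity: callable(sense_a, sense_b) -> float."""
    graph = nx.Graph()
    words = list(candidate_senses)
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            for s1 in candidate_senses[w1]:
                for s2 in candidate_senses[w2]:
                    weight = sense_similarity(s1, s2)
                    if weight > 0:
                        graph.add_edge((w1, s1), (w2, s2), weight=weight)
    rank = nx.pagerank(graph, weight="weight")
    return {w: max(candidate_senses[w], key=lambda s: rank.get((w, s), 0.0))
            for w in words}
```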

  47. Lexical Semantics • Lexical substitution: SubFinder • Find semantically equivalent substitutes for a target word in a given context • Combine corpus-based and knowledge-based approaches • Combine monolingual and multilingual resources: WordNet, Encarta, bilingual dictionaries, large corpora • Fared well in the Semeval 2007 lexical substitution task • TransFinder • Find the translation of a target word in a given context • Assist Hispanic students with the understanding of English texts • Task at Semeval 2010 • Funded by the National Science Foundation

  48. Lexical Semantics • Text-to-text semantic similarity • Find if two pieces of text contain the same information • Useful for information retrieval (search engines), text summarization • Focus on automatic student answer grading • Given the instructor answer and the student answer, assign a grade and identify potential misunderstandings and areas that need clarifications Funded by the National Science Foundation
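Text-to-text similarity for answer grading could be sketched with a simple word-overlap score; this Jaccard overlap and the grading scale are stand-in assumptions for the corpus-based and knowledge-based measures actually combined in this line of work.

```python
# Grade a student answer by its word overlap with the instructor answer.
def jaccard_similarity(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def grade_answer(instructor_answer, student_answer, max_points=5.0):
    """Scale the similarity to a point value on an assumed grading scale."""
    return round(jaccard_similarity(instructor_answer, student_answer) * max_points, 1)

print(grade_answer("Water boils at 100 degrees Celsius at sea level",
                   "At sea level water boils at 100 degrees Celsius"))
```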

  49. Metadata Annotation for Learning Object Repositories • Learning object repositories: support sharing and reuse of educational materials • Identify keywords and related concepts for the automatic annotation of learning object repositories • Keyword extraction using • Graph-based algorithms • Knowledge drawn from Wikipedia • Funded by the Texas Higher Education Coordinating Board (THECB)
