
Linguistic Processing in Lattice-Based Taxonomy Construction

Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva. State University Higher School of Economics, Moscow, School of Applied Mathematics and Computer Science.


Presentation Transcript


  1. Linguistic Processing in Lattice-Based Taxonomy Construction • Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva • State University Higher School of Economics, Moscow, School of Applied Mathematics and Computer Science • CLA 2010, Seville, Spain, October 19-21, 2010

  2. Outline • Motivation in Social Studies and the Data • Building a lattice-based taxonomy over a text corpus • Natural language processing techniques for automatic attribute acquisition: • Keyword extraction • Probabilistic latent modeling of text • Named entity recognition

  3. Motivation • Represent the structure of a given domain in the form of a lattice-based taxonomy • Interdisciplinary research project "Discrete mathematical models for political analysis of democratic institutions and human rights" • Speeches of Western leaders and international organizations • The context in which Russia is addressed • The role and importance of the democracy and human rights agenda • Construct a context from the text corpus • Extract a set of attributes from the texts to describe the documents • Analyze and develop natural language processing methods

  4. The Data: 26 full speeches of foreign leaders

  5. Constructing a lattice-based taxonomy over a text corpus • Preliminary text processing • Attribute extraction for describing the documents • Building and pruning the lattice
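The last step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy context, the names `formal_concepts` and `prune`, and the pruning rule (keep concepts whose extent contains at least `min_docs` documents) are hypothetical. The slides do not say which lattice algorithm or pruning criterion was used.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a binary context.

    `context` maps each document name to the set of attributes it has.
    Every intent is an intersection of document attribute sets (or the
    full attribute set), so the intents are built by closing under
    intersection.
    """
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    # The extent of an intent is every document that has all its attributes.
    return [
        (frozenset(doc for doc, attrs in context.items() if intent <= attrs), intent)
        for intent in intents
    ]


def prune(concepts, min_docs):
    """Assumed pruning rule: drop concepts covering fewer than `min_docs` documents."""
    return [(extent, intent) for extent, intent in concepts if len(extent) >= min_docs]


# Hypothetical toy context: three documents described by three attributes.
context = {
    "speech_a": {"security", "europe"},
    "speech_b": {"security", "energy"},
    "speech_c": {"security", "europe", "energy"},
}
concepts = formal_concepts(context)
for extent, intent in sorted(prune(concepts, 2), key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

With a support threshold of this kind, rarely shared attributes disappear from the pruned lattice even though they are present in the full one.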

  6. Three kinds of taxonomies • The kind of taxonomy depends on the attribute type: • frequent words • latent topics • named entities

  7. Building a taxonomy with frequent words • eliminating stop words • stemming: collapsing all morphological variants of a term to a single root form • describing each document with its N most frequent terms • building and pruning the lattice
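A minimal sketch of the first three steps, assuming English text. The stop-word list, the crude suffix-stripping `naive_stem` (a stand-in for a real stemmer such as Porter's) and the function names are illustrative assumptions, not the tools used in the study.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "are", "with", "on", "for"}


def naive_stem(word):
    """Strip a few common suffixes; keeps at least three leading characters."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_terms(text, n):
    """Return the set of the n most frequent stemmed, non-stop-word terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return {term for term, _ in Counter(stems).most_common(n)}


def build_context(documents, n):
    """Map each document name to its top-n terms, giving a binary context."""
    return {name: top_terms(text, n) for name, text in documents.items()}


sample = "Security and security of Europe; European security matters, energy matters."
print(top_terms(sample, 2))
```

The resulting mapping from documents to term sets is the formal context from which the lattice is then built and pruned.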

  8. 31 formal concepts of the lattice based on frequent words • Figures in squares show the number of documents in each concept

  9. According to the word-frequency taxonomy: • security issues and Russia's relationships with Europe are the most discussed topics, along with some global problems • democracy and human rights are not included in the presented taxonomy due to pruning • the words "democracy", "human" and "right" appear in the concepts that include speeches by Barack Obama and Hillary Clinton.

  10. Probabilistic latent semantic analysis (pLSA) • P( z ) – the distribution over topics z in a particular document • P( w | z ) – the probability distribution over words w given topic z • T is the number of topics
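The formula these symbols belong to is not in the transcript. The definitions are consistent with the usual mixture form of a topic model, in which the probability of a word in a document is a sum over the T topics. The expression below is that standard form, given as an assumption rather than a copy of the slide:

```latex
P(w) = \sum_{z=1}^{T} P(w \mid z)\, P(z)
```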

  11. Building a taxonomy with latent topics • probabilistic modeling of text: • documents are represented as random mixtures over latent topics • each topic is characterized by a distribution over words • 20 topics were derived from the 26 documents • these 20 topics were used as attributes for describing the documents
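The modeling step can be illustrated with a small pure-Python EM fit of pLSA. This is a sketch, not the authors' implementation: the function names, the random initialisation, the iteration count and the toy counts are all assumptions. The slides also do not say how topics were turned into binary attributes, so the threshold in `topic_attributes` is a hypothetical choice.

```python
import random


def normalise(values):
    """Scale positive numbers so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA with EM.

    counts: one dict per document, mapping word -> number of occurrences.
    Returns (p_w_z, p_z_d): per-topic word distributions and
    per-document topic distributions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in counts for w in doc})
    topics = range(n_topics)
    # Random positive initialisation of both distributions.
    p_w_z = [dict(zip(vocab, normalise([rng.random() + 0.01 for _ in vocab])))
             for _ in topics]
    p_z_d = [normalise([rng.random() + 0.01 for _ in topics]) for _ in counts]

    for _ in range(n_iter):
        # Expected counts; the tiny floor keeps every normaliser positive.
        acc_w_z = [dict.fromkeys(vocab, 1e-12) for _ in topics]
        acc_z_d = [[1e-12] * n_topics for _ in counts]
        for d, doc in enumerate(counts):
            for w, n in doc.items():
                # E-step: P(z | d, w) is proportional to P(w | z) * P(z | d).
                post = normalise([p_w_z[z][w] * p_z_d[d][z] for z in topics])
                for z in topics:
                    acc_w_z[z][w] += n * post[z]
                    acc_z_d[d][z] += n * post[z]
        # M-step: renormalise the expected counts.
        p_w_z = [dict(zip(vocab, normalise([acc[w] for w in vocab])))
                 for acc in acc_w_z]
        p_z_d = [normalise(row) for row in acc_z_d]
    return p_w_z, p_z_d


def topic_attributes(p_z_d, threshold):
    """Assumed binarisation: a document has topic z if P(z | d) >= threshold."""
    return [{f"topic_{z}" for z, p in enumerate(row) if p >= threshold}
            for row in p_z_d]


# Hypothetical word counts for three short documents.
toy_counts = [
    {"security": 3, "europe": 2},
    {"energy": 4, "gas": 1},
    {"security": 1, "energy": 1},
]
p_w_z, p_z_d = plsa(toy_counts, n_topics=2, n_iter=20)
print(topic_attributes(p_z_d, threshold=0.4))
```

The sets returned by `topic_attributes` play the same role as the frequent-term sets: one attribute set per document, from which a lattice can be built.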

  12. 6 of the 20 topics obtained from the documents: word distributions over topics

  13. 17 formal concepts of the lattice based on latent topics

  14. 17 formal concepts of the lattice based on latent topics

  15. According to the latent-topic taxonomy • The most topical issues are those connected with: • European Union • global problems • security issues • energy resources • Russian-Georgian conflict • possible ways of solving conflicts and problems • The topic of democracy and human rights is not included in the presented taxonomy due to pruning • the concept with this topic includes speeches by Barack Obama and Nicolas Sarkozy

  16. Building a taxonomy with Named Entities • 38 paragraphs were extracted from the 26 speeches; they deal solely with issues concerning Russia • three types of named entities were used to describe the documents: • names of persons • organizations • geographical objects

  17. 21 concepts of a lattice built from paragraphs and named entities

  18. Concluding remarks • several techniques have been proposed to build a context over a text corpus • frequent words made it possible to identify the questions that foreign leaders raise most often regarding Russia • latent topic modeling made it possible to specify and describe these issues more thoroughly • named entities would be more informative if used in the context of latent topics • the text corpus should be expanded

  19. Thank you!
