
VACNET: Extracting and analyzing non-trivial linguistic structures at scale

VACNET: Extracting and analyzing non-trivial linguistic structures at scale. Matthew Brook O’Donnell, Nick C. Ellis, Ute Römer & Gin Corden, English Language Institute, mbod@umich.edu. The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining, April 22, 2011.


Presentation Transcript


  1. VACNET: Extracting and analyzing non-trivial linguistic structures at scale • Matthew Brook O’Donnell, Nick C. Ellis, Ute Römer & Gin Corden • English Language Institute • mbod@umich.edu • The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining • April 22, 2011

  2. Challenge of natural language for data mining • Much work in NLP, IR and text classification relies upon frequency analysis of • single words • n-grams (contiguous word sequences of various lengths) • Units are computationally trivial to retrieve • Map-Reduce ‘Hello World’! • Techniques tend to use a ‘bag of words’ approach, disregarding structure • Frequency and statistical measures highlight distinctive items and document ‘aboutness’ • But this is a weak proxy for meaning, which remains somewhat elusive! • Can linguistic theory help?... [Figure: Typical NLP pipeline. NLP tools: sentence splitting, word tokenization, POS tagging, chunking/parsing, named-entity recognition; end point: meaning???]
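
To make the "computationally trivial" point concrete, here is a minimal sketch of n-gram frequency counting. The whitespace tokenizer, the function names and the two sample sentences are assumptions made for illustration; they are not part of the VACNET pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(sentences, n):
    """Count n-grams over lowercased, whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts

# Hypothetical sample sentences, used only to exercise the functions.
sentences = ["She talked about her book", "He talked about the work"]
unigrams = ngram_counts(sentences, 1)
bigrams = ngram_counts(sentences, 2)
```

Counts like these capture which word sequences recur, but nothing about which word depends on which, and that is the structure the rest of the talk is concerned with.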

  3. Challenge of natural language for data mining Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous enough approach that relevant context can be taken into account. Matthew Russell, Author Can linguistic theory help?... What is relevant context?

  4. Learning meaning in language • How are we able to learn what novel words mean? • each word contributes individual meaning • verb meaning central; yet verbs are highly polysemous • larger configuration of words carries meaning • these we call CONSTRUCTIONS: ‘recurrent patterns of linguistic elements that serve some well-defined linguistic function’ (Ellis 2003) • V about n: She moogles about her book • moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC), words like: • talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip

  5. VACNET • Collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using: • the COBUILD Verb Grammar Patterns descriptions • tools from computational and corpus linguistics • techniques from data mining, machine learning and network analysis • The project has two components: • a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs • psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected.

  6. V about n: some examples • He grumbled incessantly about the ‘disgusting’ provincial life we had to lead on the island • You should try to think ahead about your financial situation • He worried persistently about the poverty of his social life • She would keep banging on about her son • He wondered briefly about the effects of prolonged exposure to solar radiation • The housekeeper left the room, muttering about ingratitude • I do not want to carp about the work of the Committee • ‘Any views expressed about Master Matthew?’ • There are several other valid justifications for teaching explicitly about language • Those who gossip about him tend to meet with nasty accidents.

  7. VACNET: Language engineering challenge • TASK • retrieval of 700+ verb argument constructions from a 100-million-word corpus with minimal intervention but with a requirement for high precision and high recall • Multidisciplinary TEAM • linguists, psychologists, information scientists • undergraduate/graduate student RAs, faculty • TOOLS • dependency parsed corpus in GraphML format • web-based precision analysis tool • processing pipeline

  8. Architecture: Large scale extraction of constructions [Diagram components] • CORPUS: BNC, 100 mill. words • POS tagging & Dependency Parsing • Word Sense Disambiguation • DISCO • WordNet • COBUILD Verb Patterns • Construction Descriptions • CouchDB document database • Statistical analysis of distributions • Network Analysis & Visualization • Web application

  9. Method: Collaborative semi-automatic extraction

  10. Method: Collaborative semi-automatic extraction • DEFINE search graph • ENCODE in XML • CONVERT to Python code • SEARCH corpus and RECORD matches • ERROR CODE
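
The SEARCH step can be sketched as follows. This is only an illustration of matching a small search graph against a dependency parse; it is not the project's generated code. The token dictionaries, the Penn-style POS tags and the `prep`/`pobj` relation labels are all assumptions, and the real corpus is stored as GraphML rather than as Python lists.

```python
def find_v_about_n(tokens):
    """Return (verb lemma, noun lemma) pairs matching V about n in one sentence.

    Each token is a dict with keys id, lemma, pos, head and deprel
    (a hypothetical, simplified stand-in for a parsed sentence graph).
    """
    by_id = {t["id"]: t for t in tokens}
    matches = []
    for noun in tokens:
        # The noun must be the object of a preposition.
        if not (noun["pos"].startswith("N") and noun["deprel"] == "pobj"):
            continue
        prep = by_id.get(noun["head"])
        # That preposition must be 'about', attached as a prepositional modifier.
        if prep is None or prep["lemma"] != "about" or prep["deprel"] != "prep":
            continue
        verb = by_id.get(prep["head"])
        # The preposition's head must be a verb.
        if verb is not None and verb["pos"].startswith("V"):
            matches.append((verb["lemma"], noun["lemma"]))
    return matches

# Hand-built parse of "He worried persistently about the poverty".
sentence = [
    {"id": 1, "lemma": "he", "pos": "PRP", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "worry", "pos": "VBD", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "persistently", "pos": "RB", "head": 2, "deprel": "advmod"},
    {"id": 4, "lemma": "about", "pos": "IN", "head": 2, "deprel": "prep"},
    {"id": 5, "lemma": "the", "pos": "DT", "head": 6, "deprel": "det"},
    {"id": 6, "lemma": "poverty", "pos": "NN", "head": 4, "deprel": "pobj"},
]
matches = find_v_about_n(sentence)
```

Because the match follows dependency links rather than word adjacency, the intervening adverb does not block it, which a contiguous n-gram search for "worried about" would miss.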

  11. Precision analysis interface

  12. Recall analysis

  13. Results: V about n • Types (list of different verbs occurring in VAC) • Frequency (Zipfian?) • Contingency (attraction between verb and construction) • Semantics: prototypicality of meaning & radial structure (Zipfian?)
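
The slide does not say how contingency is measured. One common choice for verb-construction association is the directional measure ΔP, sketched here as an assumption rather than as the project's method. The counts in the example are invented purely to show the arithmetic.

```python
def delta_p(a, b, c, d):
    """Delta P = P(outcome | cue) - P(outcome | no cue) from a 2x2 table.

    a: cue and outcome      b: cue without outcome
    c: outcome without cue  d: neither
    """
    return a / (a + b) - c / (c + d)

# Invented counts for one verb and one construction (not project data):
# 30 tokens of the verb in the construction, 70 other verbs in the construction,
# 20 tokens of the verb elsewhere, 880 other verb tokens elsewhere.
in_both, other_verbs_in_vac, verb_elsewhere, neither = 30, 70, 20, 880

# Construction as cue, verb as outcome.
dp_construction_to_verb = delta_p(in_both, other_verbs_in_vac, verb_elsewhere, neither)
# Verb as cue, construction as outcome (rows and columns swapped).
dp_verb_to_construction = delta_p(in_both, verb_elsewhere, other_verbs_in_vac, neither)
```

The two directions generally differ, which fits the later finding that verbs select constructions and constructions select verbs.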

  14. Results: V about n

  15. Initial Findings • The frequency distributions for the types occupying each VAC are Zipfian • The most frequent verb for each VAC is much more frequent than the other members, taking the lion’s share of the distribution • The most frequent verb in each VAC is prototypical of that construction’s functional interpretation • generic in its action semantics • VACs are selective in their verb form family occupancy: • Individual verbs select particular constructions • Particular constructions select particular verbs • There is greater contingency between verb types and constructions • VACs are coherent in their semantics.
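
A rough way to check whether a verb frequency list looks Zipfian is to regress log frequency on log rank and see whether the slope is close to -1. The sketch below does this with ordinary least squares; the function name and the synthetic frequencies are assumptions for illustration, not VACNET results, and a serious analysis would use a proper goodness-of-fit test.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) on log(rank), with ranks from 1."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic, perfectly Zipfian frequencies: f(rank) = 1000 / rank.
toy = [1000 / rank for rank in range(1, 51)]
slope = zipf_slope(toy)

# Share of all tokens taken by the most frequent type (the 'lion's share').
top_share = toy[0] / sum(toy)
```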

  16. What do speakers know about verbs in VACs? • Two experiments: 276 native & 276 L1 German speakers of English • Asked to fill the gap with the first word that comes to mind given the prompt: s/he/it _____ about the …

  17. But what about meaning?... • We want to quantify the semantic coherence or ‘clumpiness’ of the verbs extracted in the previous steps • {think, know, hear, worry, care,…} ABOUT • Construction patterns are productive units in language and subject to polysemy just like words. Can we separate meaning groups within verb distributions? • Communication: {talk, write, ask, say, argue,…} ABOUT • Cognition: {think, know, hear, worry, care,…} ABOUT • Motion: {move, walk, run, fall, wander,…} ABOUT • The semantic sources must not be based on localized distributional language analysis • Use WordNet and Roget’s • Pedersen et al. (2004) WordNet similarity measures • Kennedy, A. (2009). The Open Roget's Project: Electronic lexical knowledge base

  18. Building a semantic network • Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.) to create a network • nodes = lemma forms from VAC/CEC distribution • edges = link between nodes for top n similarity scores for a pair of verbs [Figure: network with clusters labeled Cognition and Communication]
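
A minimal sketch of the network construction, under one reading of the slide: each verb is linked to its top n most similar verbs. The function name, the adjacency-set representation and the similarity scores are all assumptions; the scores are invented and do not come from WordNet, Roget or DISCO.

```python
from collections import defaultdict

def build_network(similarity, top_n):
    """Build an undirected graph from pairwise similarity scores.

    similarity: dict mapping (verb1, verb2) -> score, one entry per pair.
    Returns a dict mapping each verb to the set of its neighbours.
    """
    scored = defaultdict(list)
    for (v1, v2), score in similarity.items():
        scored[v1].append((score, v2))
        scored[v2].append((score, v1))
    graph = {verb: set() for verb in scored}
    for verb, candidates in scored.items():
        # Keep an edge to each of this verb's top_n highest-scoring partners.
        for _, other in sorted(candidates, reverse=True)[:top_n]:
            graph[verb].add(other)
            graph[other].add(verb)
    return graph

# Invented similarity scores for six verbs that occur in V about n.
similarity = {
    ("talk", "say"): 0.9, ("talk", "write"): 0.7, ("say", "write"): 0.6,
    ("think", "know"): 0.9, ("think", "worry"): 0.7, ("know", "worry"): 0.6,
    ("talk", "think"): 0.2, ("say", "know"): 0.1,
}
graph = build_network(similarity, top_n=2)
```

With these toy scores the weak cross-group pairs fall below every verb's top two, so the graph splits into a communication triangle and a cognition triangle.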

  19. Community detection: top 100 verbs in VAC V about n

  20. Semantic Networks • Exploring community detection algorithms • Edge Betweenness (Girvan and Newman, 2002) • Fast Greedy (Clauset, Newman and Moore, 2004) • Label Propagation (Raghavan, Albert and Kumara, 2007) • Leading Eigenvector (Newman 2006) • Spinglass (Reichardt and Bornholdt, 2006) • Walktrap (Pons and Latapy, 2005) • Louvain (Blondel, Guillaume, Lambiotte and Lefebvre, 2008)
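
To show the idea behind one of the listed algorithms, here is a simplified, deterministic variant of label propagation. The published algorithm visits nodes in random order and breaks ties at random; this sketch uses sorted order and alphabetical tie-breaking so that its result is reproducible. It is not the project's implementation, and the toy graph is an assumption.

```python
from collections import Counter

def label_propagation(graph, max_passes=100):
    """Group nodes by repeatedly adopting the most common neighbour label.

    graph: dict mapping each node to the set of its neighbours.
    Returns a list of communities (sets of nodes).
    """
    labels = {node: node for node in graph}  # every node starts in its own group
    for _ in range(max_passes):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue
            counts = Counter(labels[nb] for nb in graph[node])
            top = max(counts.values())
            best = sorted(label for label, c in counts.items() if c == top)
            # Keep the current label if it is among the most common ones.
            if labels[node] not in best:
                labels[node] = best[0]
                changed = True
        if not changed:
            break
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return sorted(communities.values(), key=lambda c: sorted(c))

# Toy graph: two triangles joined by a single talk-think edge.
graph = {
    "talk": {"say", "write", "think"},
    "say": {"talk", "write"},
    "write": {"talk", "say"},
    "think": {"know", "worry", "talk"},
    "know": {"think", "worry"},
    "worry": {"think", "know"},
}
communities = label_propagation(graph)
```

Although the toy graph is a single connected component, the procedure still separates the two densely linked groups, which is the behaviour that distinguishes community detection from simply finding components.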

  21. VACNET Summary • Challenge of natural language for data mining • Project investigates usage of VACs at scale • constructions = meaning through patterns • IR challenge: retrieving non-trivial structures at scale • Corpus analysis examines the distributions of verbs in VACs • frequency distribution • contingency • semantics • Psycholinguistic experiments explore the psychological reality of VACs • VACNET structured inventory • verb to construction and construction to verb • valuable for NLP and DM tasks • Future explorations: • Train classifiers on our datasets • Tackle ‘big data’ sets • Thank you! mbod@umich.edu
