Using Corpora for Language Research

COGS 523-Lecture 4

Using Corpora with Other Resources;

Corpus Software

COGS 523 - Bilge Say


Related Readings

Readings:

Buchholz and Green (2006); Miller and Fellbaum (2007); Sampson and McCarthy Ch 29.

Extra – Information sheet for Resources

Optional (can be used in software reviews!!)

Garretson, G. (2008) Desiderata for Linguistics Software Design. International Journal of English Studies 8(1), 67-74. (The link is available on METU Online)

Lexical and Ontological Resources

  • Useful for Natural Language Processing, Psycholinguistics, Corpus Annotation (e.g. automating semantic annotation)

  • A selected review is to follow, but there are others...

WordNet - Preliminaries

  • Lexeme vs Sense

  • Homonyms (Homophones or homographs): Words that have the same form with unrelated meanings

  • Polysemy: Multiple related meanings of a single lexeme (e.g. bank in sperm bank)

  • Hard to distinguish between polysemy and homonymy sometimes.

WordNet - Preliminaries

  • Synonymy: Different lexemes, same (or nearly same) meanings

  • Hyponymy: A subclass of: poodle -> dog; car -> vehicle (the opposite direction is hypernymy)

  • Meronymy: A part of: leg -> table

  • Antonymy: Opposites

WordNet

  • A lexical database for English (and 30 other languages, see Balkanet and EuroWordnet projects); most extensive use: word sense disambiguation (Wordnet book available at the library)

  • Synsets: A set of synonyms

    • Each sense entry contains synsets, a dictionary style definition, some example uses (and a frequency number)

    • Four separate databases: nouns (hyponymy, meronymy), verbs (hyponymy, manner, causation, etc.), adjectives and adverbs

    • Synsets are chained to their hyponyms and hypernyms; multiple chains are possible


Bass -> musical instrument -> instrument -> device ....-> entity

Bass -> singer, vocalist -> musician -> performer ....-> entity
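
A toy sketch in Python of how such chains can be followed. The dictionary is hand-copied from the two chains on this slide and stops where the slide elides intermediate nodes; it is not WordNet data, and the names HYPERNYMS and hypernym_chains are illustrative only.

```python
# Toy hypernym graph: each node maps to its direct hypernyms.
# Hand-copied from the slide, truncated where the slide writes "....".
HYPERNYMS = {
    "bass": ["musical instrument", "singer, vocalist"],
    "musical instrument": ["instrument"],
    "instrument": ["device"],
    "singer, vocalist": ["musician"],
    "musician": ["performer"],
}


def hypernym_chains(node):
    """Return every chain from node up to a node with no recorded hypernym."""
    parents = HYPERNYMS.get(node, [])
    if not parents:
        return [[node]]
    chains = []
    for parent in parents:
        for chain in hypernym_chains(parent):
            chains.append([node] + chain)
    return chains


chains = hypernym_chains("bass")
for chain in chains:
    print(" -> ".join(chain))
```

Because "bass" has two direct hypernyms in this toy graph, the function returns two chains, one per sense.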

Extensions

  • WordNetPlus: Dense, weighted X-database of automatically learned evocation ratings (how much one concept brings another to mind). Started from 120,000 human-rated pairs from 1000 synsets (the most frequent concepts in the BNC).

  • ImageNet: Enhancing WordNet with images and icons.

An example of Wordnet Query

Turkish WordNet project

  • http://www.hlst.sabanciuniv.edu/TL/

  • Combined with phonetic rendering, morphological analysis, English equivalent etc.

  • http://www.ceid.upatras.gr/Balkanet/index.htm

    Part of Balkanet project for 6 Balkan languages

  • 12,000 synsets

An example of Turkish Wordnet Query

An Alternative to Turkish WordNet

  • 60000 hypernyms, 72 layers

  • Machine learning from TDK dictionary

  • Ongoing work, needs disambiguation

  • More coverage than Turkish WordNet

  • By Tunga Güngör and Onur Güngör, Boğaziçi Univ


Ontologies - Cyc

  • A knowledge base of human commonsense knowledge and an associated inference engine.

  • http://www.opencyc.org/ (Free version) http://research.cyc.com/ (Academic version)

  • Doug Lenat’s project – 1984+

  • 300,000 concepts

  • Nearly 3,000,000 assertions (facts and rules), using 26,000+ relations, that interrelate, constrain, and, in effect, (partially) define the concepts.

  • Natural Language Query and Information Entry Tools

The graph representation of the Cyc Knowledge Base

http://www.cyc.com/cyc/technology/whatiscyc_dir/whatdoescycknow

An example of a knowledge representation sample

coded with CycL

ConceptNet

  • http://web.media.mit.edu/~hugo/conceptnet/

  • Part of Open Mind Initiative

  • A large wiki-style effort to create a commonsense knowledge base represented as a semantic network

  • 1.6 million edges (assertions) connecting more than 300 000 nodes, where nodes are semi-structured English fragments.

  • Interrelated by an ontology of twenty semantic relations such as EffectOf (causality), SubeventOf (event hierarchy), CapableOf (agent’s ability), PropertyOf, LocationOf, and MotivationOf (affect).


from Liu, H. & Singh, P. (2004) ConceptNet: A Practical Commonsense Reasoning Toolkit. BT Technology Journal

FrameNet

  • FrameNet is a lexicon-building project for English, based on frame semantics, carried out at the International Computer Science Institute in Berkeley.

  • Frame: schematic representation of a situation type (eating, spying, removing, classifying, etc.) together with lists of the kinds of participants, props, and other conceptual roles that are seen as components of such situations. The semantic arguments of a predicating word correspond to what we call the frame elements (FEs) of the frame associated with that word.

FrameNet

  • Uses BNC and ANC

  • Currently (version 1.3), there are more than 10,000 lexical units, more than 6,000 of which are fully annotated, in more than 800 hierarchically-related semantic frames, exemplified in more than 135,000 annotated sentences in the database.

  • WordNet – ConceptNet hybrid, with a grammar theory in the background (Fillmore’s Frame Semantics).

Interface of the Frame Grapher

Sample Output From Frame Grapher

input: Crime_Scenario

Software for Working with Corpora

“Corpus Linguistics in its current form cannot work without the help of the computer.” (Mason)

  • According to Function: Corpus Building Software vs Corpus Query Software

  • According to Design: Standard Software for Non-Technical Users vs Specialized Toolkits Providing Standard Functions vs Using Non-Corpus Specific Tools and Programming Languages (e.g. grep, egrep, Perl, Python, Tcl/Tk, Java)

Corpus Software

  • Standard Software: MonoConcPro, WConcord, Wordsmith, IMS CQP (Corpus Query Processor), Qwick, Xaira, Gsearch

  • More General Purpose NLP Suites/Toolkits for Programmers: CUE (Corpus Universal Examiner), NLTK, GATE

Corpus Query/Analysis Software

  • Text Analysis Software -> Corpus Query Software -> Concordancers

  • Collocations in KWIC format (Keyword in Context)

  • General Features

    • Search

    • Display, Save, Export

    • Statistics

Features

  • Search

    • Word, phrase, POS etc search

    • Regular expression search

    • Context-sensitive search

    • Header info search

  • Display, save, export

    • KWIC or sentence format

    • Sorting

    • Saving results or search patterns

  • Statistics

    • Frequency and various statistics

    • Plotting graphs
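
A minimal sketch of the search, display and sorting features for a KWIC concordance, in Python. It assumes whitespace tokenisation and a window measured in tokens; the function name kwic, its parameters and the sample sentence are illustrative and do not come from any of the tools named in this lecture.

```python
import re


def kwic(text, pattern, width=3):
    """Return (left, keyword, right) triples for tokens fully matching pattern."""
    tokens = text.split()
    regex = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, tok in enumerate(tokens):
        # Strip surrounding punctuation before matching the keyword.
        if regex.fullmatch(tok.strip(".,;:!?\"'()")):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


sample = "The corpus is large. A corpus query tool searches corpora quickly."
hits = kwic(sample, r"corp(us|ora)", width=2)

# Sort by right context, as many concordancers allow.
for left, key, right in sorted(hits, key=lambda h: h[2].lower()):
    print(f"{left:>20}  [{key}]  {right}")
```

Real concordancers work on indexed corpora rather than a split string, so this only shows the shape of the output.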

A Comparison Framework

  • Platform/Operating System

  • Price

  • Ease of Installation

  • User friendliness

  • Speed

  • Ease of setting up a corpus/texts

  • Query syntax

  • Query search power (collocational, discontinuous constituents)

  • Statistical Analysis

  • Standard markup scheme handling

  • Whole text browsing

  • Character set handling

  • Output for presentation

Desiderata – some maxims

  • Do not build linguistic theory into the program any more than necessary

  • Do separate markup from annotation

  • Do not gloss over complexities in data – sensible defaults that can be overridden are fine

  • Allow users to supply their own analytical categories – e.g. Annotation of concordance lines

  • Make use of standards

  • Use Unicode

IMS Corpus Workbench (CWB)

  • http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/

  • IMS Corpus Query Processor (CQP): query system for CWB

  • Allowing use of multiple knowledge sources (corpora, machine readable dictionaries etc)

  • Allowing the use of stored information and calculating information on-line (from remote corpora)

  • For both human and machine use, but not really for novice users...

  • Regular Expression based syntax.

From CWB web site

Query language

  • unrestricted number of attributes per corpus position

  • regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)

  • regular expressions over sequences of corpus positions

  • (partial) support of structural annotations (e.g. SGML)

  • incremental concordancing

  • application of a query to all items of a list

  • 'virtual attributes', i.e. runtime access to external applications (e.g. WordNet)

  • queries on parallel translated texts

From CWB web site

Display of results

  • user-definable size of 'keyword in context' display

  • 'keyword in context' lines can be sorted in various ways

  • frequency counts, e.g. for word combinations

  • multilingual concordances from aligned corpora

  • html and latex output supported

  • query history

From CWB web site

  • registration of corpora

  • 'encoding' of corpora, i.e. indexing (and compression) (for text sources in one-word-per-line format, using ISO8859/Latin-1 8bit character sets, and maybe others). For example, the BNC corpus with part-of-speech and lemma annotation will need about 1 GB of disk space.

  • incremental addition of types of corpus annotations ('attributes'). E.g. add part-of-speech values to a corpus once you have access to a POS-tagger.

Regular Expressions

  • Regular expressions describe exactly the regular languages, i.e. the languages accepted by finite automata

  • Take the empty language and the languages containing a single string, and apply concatenation, union or Kleene star to them. Every language that can be generated in this way is regular. (Partee et al., 1993)
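
A small sketch of that construction, using Python's re module. The helper names union, concat, star and accepts are illustrative; each one builds a pattern string for the corresponding operation, starting from the single-string languages {"a"} and {"b"}.

```python
import re


def union(a, b):
    """Pattern for the union of two languages."""
    return f"(?:{a}|{b})"


def concat(a, b):
    """Pattern for the concatenation of two languages."""
    return f"(?:{a}{b})"


def star(a):
    """Pattern for the Kleene star of a language."""
    return f"(?:{a})*"


# Start from the single-string languages {"a"} and {"b"}.
a, b = re.escape("a"), re.escape("b")

# (a|b)* followed by "ab": all strings over {a, b} that end in "ab".
pattern = concat(star(union(a, b)), concat(a, b))


def accepts(s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None


for s in ["ab", "bbab", "ba", ""]:
    print(repr(s), accepts(s))
```

Note that Python's regex syntax also offers features (such as backreferences) that go beyond regular languages; the sketch uses only the three operations named above.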

Regular Expressions

From CQP Tutorial...

  • Basic syntax of regular expressions

  • letters and digits are matched literally (including all non-ASCII characters): word -> word; C3PO -> C3PO; déjà -> déjà

  • . matches any single character ("matchall"): r.ng -> ring, rung, rang, rkng, r3ng, ...

  • character set: [...] matches any of the characters listed: moderni[sz]e -> modernise, modernize; [a-c5-9] -> a, b, c, 5, 6, 7, 8, 9; [^aeiou] -> b, c, d, f, ..., 1, 2, 3, ..., ä, à, á, ...

  • repetition of the preceding element (character or group): ? (0 or 1), * (0 or more), + (1 or more), {n} (exactly n), {n,m} (between n and m): colou?r -> color, colour; go{2,4}d -> good, goood, gooood; [A-Z][a-z]+ -> "regular" capitalised word such as British

  • grouping with parentheses: (...): (bla)+ -> bla, blabla, blablabla, ...; (school)?bus(es)? -> bus, buses, schoolbus, schoolbuses

  • | separates alternatives (use parentheses to limit scope): mouse|mice -> mouse, mice; corp(us|ora) -> corpus, corpora
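
The patterns in this list can be tried out with Python's re module, as in the sketch below. It uses re.fullmatch on the assumption that a pattern has to cover the whole word form, as in a word-level corpus query. The non-matching tokens are added here for contrast and are not from the CQP tutorial.

```python
import re

# Pattern -> (tokens that should match, tokens that should not).
examples = {
    r"r.ng": (["ring", "rung", "r3ng"], ["rng", "ringing"]),
    r"moderni[sz]e": (["modernise", "modernize"], ["modernile"]),
    r"colou?r": (["color", "colour"], ["colouur"]),
    r"go{2,4}d": (["good", "goood", "gooood"], ["god", "goooood"]),
    r"(school)?bus(es)?": (["bus", "buses", "schoolbus", "schoolbuses"], ["school"]),
    r"corp(us|ora)": (["corpus", "corpora"], ["corp"]),
}


def matches(pattern, token):
    """True if the pattern covers the whole token."""
    return re.fullmatch(pattern, token) is not None


for pattern, (good, bad) in examples.items():
    assert all(matches(pattern, t) for t in good)
    assert not any(matches(pattern, t) for t in bad)
    print(pattern, "ok")
```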

Regular Expressions

Complex regular expressions can be used to model (regular) inflection:

  • ask(s|ed|ing)? -> ask, asks, asked, asking (equivalent to the less compact expression ask|asks|asked|asking)

  • sa(y(s|ing)?|id) -> say, says, saying, said

  • [a-z]+i[sz](e[sd]?|ing) -> any form of a verb with -ise or -ize suffix
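
A short Python sketch applying two of these inflection patterns to a list of tokens. The sample sentence is made up for illustration, and whole-token matching via re.fullmatch is an assumption about how the patterns are meant to be applied.

```python
import re
from collections import Counter

SAY = re.compile(r"sa(y(s|ing)?|id)")
ISE = re.compile(r"[a-z]+i[sz](e[sd]?|ing)")

tokens = "she said they say he is saying it organises and realized nothing".split()

# Count the inflected forms of "say" and collect -ise/-ize verb candidates.
say_forms = Counter(t for t in tokens if SAY.fullmatch(t))
ise_forms = [t for t in tokens if ISE.fullmatch(t)]

print(say_forms)
print(ise_forms)
```

The -ise/-ize pattern works purely on spelling, so it would also accept words such as rises or raises; hits still need checking by hand or against POS tags.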

Some examples from CQP

  • the specified word is interpreted as a regular expression: "interest(s|(ed|ing)(ly)?)?";

  • a verb whose lemma begins with under: [(lemma="under.+") & (pos="V.*")];

  • a noun, followed by either is or was, followed by a verb ending in ed: [pos="N.*"] "is|was" [pos="V.*" & word=".*ed"];

  • similar, but is or was followed by a past participle (which is described by a special POS tag): [pos="N.*"] "is|was" [pos="VBN"];

  • catch or caught, followed by a determiner, any number of adjectives and a noun, or a noun, followed by was or were, followed by caught: "catch|caught" [pos="DT"] [pos="JJ"]* [pos="N.*"] | [pos="N.*"] "was|were" "caught";

  • look or bring, followed by either up or down with at most 10 non-verbs in between: "look|bring" [pos != "VB.*"]{0,10} "up|down";
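
To make the idea of regular expressions over sequences of corpus positions concrete, here is a rough Python sketch of the noun + is/was + verb-in-ed query. It is not how CQP is implemented: the helpers tok and find are hypothetical, the tagged tokens are made up, and only fixed-length sequences are handled (no repetition such as * or {0,10}).

```python
import re


def tok(word=None, pos=None):
    """Build a predicate over one corpus position; omitted attributes are unconstrained."""
    def test(token):
        w, p = token
        return ((word is None or re.fullmatch(word, w) is not None)
                and (pos is None or re.fullmatch(pos, p) is not None))
    return test


def find(query, corpus):
    """Yield start indices where each predicate in query matches consecutive tokens."""
    n = len(query)
    for i in range(len(corpus) - n + 1):
        if all(pred(corpus[i + k]) for k, pred in enumerate(query)):
            yield i


# Hand-tagged toy corpus of (word, pos) pairs, made up for illustration.
corpus = [("The", "DT"), ("door", "NN"), ("was", "VBD"), ("opened", "VBN"),
          ("and", "CC"), ("it", "PRP"), ("is", "VBZ"), ("open", "JJ")]

# [pos="N.*"] "is|was" [pos="V.*" & word=".*ed"]
query = [tok(pos="N.*"), tok(word="is|was"), tok(pos="V.*", word=".*ed")]

hits = [corpus[i:i + len(query)] for i in find(query, corpus)]
print(hits)
```

Each corpus position carries several attributes (here word and pos), and a query is a sequence of constraints on them, which is the model the CQP examples rely on.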

Searching for more complex patterns

  • Gsearch Corpus Query System

    • http://www.hcrc.ed.ac.uk/gsearch/

    • Facilitating the investigation of lexical and syntactic phenomena in unparsed but tagged corpora (can work with external taggers too)

    • Users specify their own context free grammar

    • A search on the 100-million-word BNC can take something like 167 minutes

    • False positives should be manually eliminated

    • Visualization tools to display tree structures

Alternative: Using a class library

  • Mason, O. Programming for Corpus Linguistics: How to do text analysis with Java, Edinburgh University Press, 2000.

  • CUE (Corpus Universal Examiner): class library in Java that takes care of indexing and compressing large corpora, with support for XML and Unicode

  • Qwick: a concordancing application that is developed using CUE

A Professional Alternative

  • http://athel.com/

  • MonoConcPro ($95)

  • Features: Context Search, Regular Expression search, Part-of-Speech Tag Search, Collocations, and Corpus Comparison.

  • Not language specific

  • You can also buy a Chinese (and other languages) concordance T-shirt 

From an older version of MonoConc Pro

Quality Control in Corpora

  • Format: Punctuation, delimiters, character encoding

  • Presence and order of all fields

  • Typos in labels and annotation

  • Explicit Documentation

  • Format Checker – Structure Checker

  • Solution: Versioning and Patching mechanism in Treebanks and Corpora

Interrater agreements - reliability

  • Cochran’s Q test – binary values

  • Kappa – multivalued (Carletta, 1996)

    • Sensibly chosen unit of agreement

    • Expert vs naive coders

    • K > 0.8: good

  • Generalizability Theory (G-Theory) (Bayerl and Paul, 2007) – finer grained
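
As a worked illustration of kappa, the sketch below computes Cohen's kappa for two coders, K = (P(A) - P(E)) / (1 - P(E)), where P(A) is observed agreement and P(E) is chance agreement estimated from each coder's own label distribution. This is one common variant and is an assumption here; the statistic discussed in Carletta (1996) may estimate P(E) differently. The label sequences are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


coder1 = ["N", "V", "N", "ADJ", "N", "V", "N", "ADJ", "V", "N"]
coder2 = ["N", "V", "N", "N", "N", "V", "N", "ADJ", "V", "V"]

k = cohen_kappa(coder1, coder2)
print(round(k, 3))
```

Here the coders agree on 8 of 10 items, yet kappa comes out well below 0.8 because part of that agreement is expected by chance.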

Lecture 5

See articles on METU Turkish Corpus and METU-Sabanci Treebank under Lecture Notes.
