STRUCTURE AND FREQUENCY OF LEXICAL SEMANTIC CLASSES. Paola Merlo (University of Geneva), Suzanne Stevenson (University of Toronto).

What is the role of quantitative approaches?
  • Can quantitative investigations be the subject matter of linguistic research, or are they only methodological tools?
  • Investigating the relationship between richly structured representations and distributional properties of language
    • provides richer data
    • supports falsifiable and predictive reasoning within weaker theories
Case Study: Verb Classes

Manner of Motion

TRANS The rider raced the horse past the barn

INTR The horse raced past the barn

Change of State

TRANS The cook melted the butter

INTR The butter melted

Creation/Transformation

TRANS The contractors built the house

INTR The contractors built all summer

Quantitative Investigations
  • Observation: Different lexical semantic classes have different patterns of frequency distributions
  • Q1: Are these distributional properties related to other underlying properties?
  • Q2: Are the differences in distribution strong enough to support generalisation to new verbs and other verb classes?
  • Q3: Does the relation between underlying properties and frequency hold typologically?
Frequency and thematic roles
  • Different lexical semantic classes show different frequency distributions in the use of the transitive construction
  • The difference in the frequency of the transitive use is related to different thematic assignments
English Verb Classes

Manner of Motion

TRANS The rider raced the horse past the barn

(Causal) Agent

INTR The horse raced past the barn

Change of State

TRANS The cook melted the butter

(Causal) Theme

INTR The butter melted

Creation/Transformation

TRANS The contractors built the house

Agent Theme

INTR The contractors built all summer


Transitive Use
  • Transitivity by causation: MoM, CoS
  • Agentive object: MoM
Relationship between frequency and transitivity
  • Transitivity by causation: MoM, CoS
    • Greater complexity, two events
  • Agentive object: MoM (transitive unergative)
    • Infrequent in English: only MoM and SE
    • Infrequent typologically (ungrammatical in Italian, French, German, Portuguese, and Czech; Vietnamese only comitative)
    • Difficult to process (Stevenson and Merlo 97, Filip et al. CUNY 98)
  • Explains frequency of transitive use: MoM < CoS < C/T
Other frequency facts

Are there other properties specific to verb classes that we can expect to surface as statistical differences?

  • Themes are more likely to be inanimate
Animacy and thematic hierarchies
  • Thematic hierarchy: AGENT > THEME
  • Animacy hierarchy: 1,2 > 3,Proper > Human > Animate > Inanimate
  • Harmonic Alignment
    • 1,2/AG > 3,Proper/AG > Human/AG > Animate/AG > Inanimate/AG
    • 1,2/TH < 3,Proper/TH < Human/TH < Animate/TH < Inanimate/TH
  • Expected frequency of animacy: CoS < MoM, C/T
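
The two aligned scales can be derived mechanically from the two hierarchies. The following is a minimal sketch, not taken from the slides: the function name `harmonic_alignment` and the list encoding of the animacy scale are assumptions made for illustration. The higher role (Agent) inherits the animacy ranking as is, while the lower role (Theme) gets it reversed.

```python
def harmonic_alignment(prominence_scale, high_role, low_role):
    """Align a prominence scale with a two-point role hierarchy (high_role > low_role).

    Returns two lists, each ordered from most to least harmonic combination.
    """
    # High role: combinations are ranked in the same order as the prominence scale.
    high = [f"{level}/{high_role}" for level in prominence_scale]
    # Low role: the preference runs the other way, so the ranking is reversed.
    low = [f"{level}/{low_role}" for level in reversed(prominence_scale)]
    return high, low


# Animacy hierarchy as written on the slide, most prominent first.
ANIMACY = ["1,2", "3,Proper", "Human", "Animate", "Inanimate"]

agent_scale, theme_scale = harmonic_alignment(ANIMACY, "AG", "TH")
print(" > ".join(agent_scale))
print(" > ".join(theme_scale))
```

The second printed line starts with Inanimate/TH, which is the same ordering the slide writes with "<" signs, and it is the reason themes are expected to be inanimate more often than agents.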
Causative use
  • Transitivity by causation: MoM, CoS
    • Causer subject; the subject of the intransitive and the object of the transitive bear the same thematic role
  • Expected frequency of overlap: MoM, C/T < CoS
Empirical Validation
  • How do we verify empirically that the distributional properties are as predicted based on the verb class representation?
  • The properties we have hypothesized are abstract; how do we count them in a sufficiently large corpus?
    • by hand, sampling
    • automatically, by approximation with indicators
Data Collection – Materials
  • Verbs
    • Manner of motion (20) -- jump, march
    • Change of state (19) -- open, explode
    • Creation/Transformation/Performance (C/T) (20) -- played, painted
  • Verb form: the "-ed" form is assumed to be representative
  • Corpora
    • 65 million words: tagged Brown + tagged WSJ corpus (LDC)
    • 29 million words: parsed WSJ (LDC corpus, Collins 97 parser)
Data Collection -- Method
  • TRANS: A verb token immediately followed by a potential object is counted as transitive, otherwise as intransitive.
    • Potential object = closest nominal group after the verb token
    • (or also count passive or past participle frequency)
  • CAUS: Calculate the overlap of the multiset of subjects and the multiset of objects.
    • Take the ratio between the cardinality of the overlap multiset and the sum of the cardinalities of the subject and object multisets.
  • ANIM: Ratio of occurrences of pronoun subjects to all subjects
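
The three indicators above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' extraction code: the function names, the small pronoun list, and the toy observations for one verb are all hypothetical, and the overlap is taken to be the ordinary multiset intersection (minimum count per noun), which the slide does not spell out. Inputs are assumed to be non-empty.

```python
from collections import Counter

# Assumption: a small hand-picked pronoun list stands in for a tagger's pronoun tag.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}


def trans_score(followed_by_object):
    """TRANS: share of verb tokens immediately followed by a potential object."""
    flags = list(followed_by_object)
    return sum(flags) / len(flags)


def caus_score(subjects, objects):
    """CAUS: |overlap| / (|subjects| + |objects|), all counted as multisets."""
    subj, obj = Counter(subjects), Counter(objects)
    overlap = subj & obj  # multiset intersection: minimum count per noun (assumption)
    return sum(overlap.values()) / (sum(subj.values()) + sum(obj.values()))


def anim_score(subjects):
    """ANIM: share of subject occurrences that are pronouns."""
    return sum(s.lower() in PRONOUNS for s in subjects) / len(subjects)


# Hypothetical observations for a single verb (five tokens of "melted").
followed = [True, False, True, False, False]
subjects = ["cook", "butter", "she", "ice", "butter"]
objects = ["butter", "ice"]

trans = trans_score(followed)        # 2 of 5 tokens have a potential object
caus = caus_score(subjects, objects)  # overlap {butter, ice} = 2, divided by 5 + 2
anim = anim_score(subjects)          # 1 pronoun subject out of 5
print(trans, round(caus, 3), anim)
```

Using pronouns as a stand-in for animacy is the kind of approximation "by indicators" the previous slide mentions: it is cheap to count, but it miscounts inanimate pronouns such as "it".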
Statistical Analysis of the Data
  • Mean relative frequencies
    • All statistically significant at p < .01
  • Answer to Q1: different lexical semantic classes have different frequency distributions of properties systematically related to the verb's thematic assignments
  • How well do these distributional properties generalise
    • across verbs
    • across classes
    • across languages?
The Classification Problem
  • The Given: Statistics reflecting thematic information about a given set of verb classes
  • The Goal: Automatically classify unseen verbs
  • Experimental Setup
    • Materials
      • Vector template: [verb, TRANS, …, CAUS, ANIM, class]
      • Example: [open, .69, …, .16, .36, CoS]
    • Method
      • Learner: C5.0 (decision tree induction algorithm)
      • Training/Testing: 10-fold cross-validation repeated 50 times
  • Overall results: accuracy 69.8% (baseline 33.9%, expert upper bound 86.5%)
    • 54% reduction in error rate on previously unseen verbs
    • (recent extensions range from 62% to 82% accuracy)
  • Effectiveness of frequency distributions
    • All distributions are useful in classification
  • Class-by-class accuracy
    • MoM verbs are most accurately classified
  • Analysis of errors
    • TRANS sharpens the 3-way distinction
    • ANIM is particularly helpful in discriminating CoS
  • Relation between frequencies and thematic assignments is confirmed
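
The 54% figure is consistent with the reported accuracy and baseline if the reduction is measured relative to the baseline's error rate. The slides do not state the formula, so the function below is an assumption about how the number was obtained; its name is hypothetical.

```python
def error_rate_reduction(accuracy, baseline):
    """Relative reduction in error rate; both arguments are percentages."""
    baseline_error = 100.0 - baseline  # error of the baseline classifier
    model_error = 100.0 - accuracy     # error of the trained classifier
    return (baseline_error - model_error) / baseline_error


# Figures from the slide: 69.8% accuracy against a 33.9% baseline.
reduction = error_rate_reduction(accuracy=69.8, baseline=33.9)
print(f"{reduction:.1%}")  # 54.3%
```

The baseline misclassifies 66.1% of verbs and the classifier 30.2%, so 35.9 of the 66.1 error points are removed.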
Generalising to a new class
  • New class: Psychological State Verbs
  • New thematic roles: Experiencer, Stimulus
  • Example
    • The rich love money (Experiencer, Stimulus)
    • The rich love too (Experiencer)
  • Properties: TRANS, CAUS, ANIM
  • PROG: use of the progressive (stative/non-stative)
  • Results: 74.6% accuracy (baseline 57%)
    • TRANS, CAUS, ANIM best features
  • Relationship between frequencies and thematic properties holds across classes
  • Some specific frequency distributions carry across thematic roles
  • Discovery: We do not need to investigate new frequency distributions for each new class
  • Conjecture: Thematic roles are decomposed into more primitive features
Multi-lingual Generalisations
  • Accurate investigation of the relation between grammar and frequency requires
    • a well-founded theory of lexical representation
    • a distributional analysis of language
  • Multi-linguality provides
    • an abstract, general level of linguistic description
    • more data
  • Greater coverage and accuracy are possible by looking at several languages
Multi-lingual Generalisations
  • Extension of the mono-lingual method to a new language (Italian)
    • Shows similarities in the relations between frequency distributions and thematic relations across languages
    • Extends coverage to new languages
  • Extension to the use of multi-lingual data to classify verbs in a given language (Chinese and English data to classify English verbs)
    • Shows that surface differences across languages are related to a similar underlying representation
    • Improves accuracy in the classification of a given language
Exploiting similarities: Extension to Italian (Merlo, Stevenson, Tsang, and Allaria 2002, Allaria 2001)
  • Verbs: 20 CoS, 20 C/T, 19 Psych
  • Properties: TRANS, CAUS, ANIM, PROG
  • Corpus: PAROLE, 22 million words (CNR, Pisa)
  • Counts: relative frequencies, hand counts (exact)
  • 79% reduction in error rate on unseen verbs
  • TRANS and ANIM are the best features
  • Relationship between frequencies and thematic properties holds across languages
Leveraging Cross-language Differences (Tsang, Stevenson and Merlo, 2002)
  • What is abstract/underlying in one language might be explicit in another
    • Revealing an underlying common/similar classification
    • e.g. causative forms in Chinese are morphologically marked
  • Data from several languages are used to classify one language
    • Training: Chinese and English
    • Testing: English

Materials and Method

English verb classes: MoM, CoS, C/T, 20 verbs in each class

English properties: TRANS, PASS, VBN, CAUS, ANIM

Chinese translations of the verbs (several)

Counts of new frequencies adapted to Chinese:

Relative frequencies of

- POS tags (indication of subcategorization and stative/active)

- passive particle

- periphrastic causative particle


Materials and Method

English data from BNC (tagged and chunked)

Chinese data from Mandarin News (165 million characters)

Learning: C5.0 (decision tree induction)

Training/Testing: 10-fold cross-validation, 50 repeats
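
The evaluation protocol (10-fold cross-validation, 50 repeats) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the function names are hypothetical, the verbs are placeholders, and the learner is a majority-class stand-in rather than C5.0, so only the splitting and scoring logic is illustrated.

```python
import random
from collections import Counter


def repeated_kfold_accuracy(items, labels, fit_predict, k=10, repeats=50, seed=0):
    """Accuracy pooled over all held-out predictions of `repeats` k-fold runs."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    correct = total = 0
    for _ in range(repeats):
        rng.shuffle(indices)                        # new random partition per repeat
        folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            predictions = fit_predict(
                [items[i] for i in train],
                [labels[i] for i in train],
                [items[i] for i in fold],
            )
            correct += sum(p == labels[i] for p, i in zip(predictions, fold))
            total += len(fold)
    return correct / total


def majority_class(train_x, train_y, test_x):
    """Stand-in learner (not C5.0): always predict the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_x)


# Hypothetical data: 60 placeholder verbs, 20 per class, mirroring the class sizes above.
verbs = [f"verb{i}" for i in range(60)]
classes = ["MoM"] * 20 + ["CoS"] * 20 + ["C/T"] * 20

accuracy = repeated_kfold_accuracy(verbs, classes, majority_class)
print(f"{accuracy:.3f}")
```

Any learner with the same `fit_predict(train_x, train_y, test_x)` shape can be dropped in place of `majority_class`. On balanced classes the stand-in cannot exceed chance, which makes it a convenient floor to compare a real classifier against.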



  • Best result in classification of English verbs: combination of Chinese and English frequencies
    • ANIM, TRANS, CKIP: 83.5% accuracy
    • (English frequencies: 67.6%)
  • Same or at least similar underlying abstract classification; otherwise different views would make the classifier diverge
  • Advantage of working at different levels of description


  • Distributional properties are correlated with thematic properties for several verb classes, several thematic roles, several languages
  • Relevant for
    • Notion of verb class: point in a multi-dimensional space?
    • Representation and inventory of thematic roles
    • Language acquisition studies: what are the properties necessary for learning verb meaning (Gillette et al. 99)

Thank you to our students

Gianluca Allaria (Geneva)

Eva Esteve Ferrer (Geneva)

Eric Joanis (Toronto)

Vivian Tsang (Toronto)