
Collocations and Terminology

Vasileios Hatzivassiloglou

University of Texas at Dallas

Collocations
  • Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
  • Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
  • Technical and non-technical
Examples of collocations
  • The Dow Jones average of industrials
  • The Dow average
  • The Dow industrials
  • *The Jones industrials
  • The Dow Jones industrial
  • *The industrial Dow
  • *The Dow industrial
Collocation properties
  • Arbitrary (dialect dependent)
    • ride a bike, set the table
  • Domain dependent
    • dry suit, wet suit
  • Recurrent
  • Cohesive
    • Part of a collocation primes for the rest
  • Lexicography
  • Grammatical restrictions (compare with/to but associate with)
  • Generation
  • Translation
Types of collocations
  • Predicative relations
    • make a decision, hostile takeover
    • flexible (syntactic variability, intervening words)
  • Rigid word groups
    • over the counter market
  • Phrases with open slots
    • fluency in a domain
Issues in finding collocations
  • Possibly more than two words
    • Need measure that extends beyond the binary case
  • Possibly intervening words
  • Possibly morphological and syntactic variation
  • Semantic constraints (cf. doctors-dentists and doctors-hospitals)
Xtract stage one
  • For a given word, find all collocates at positions -5 to +5
  • Three criteria:
    • strength (normalized frequency): the threshold rejects 95% of candidates, versus the 68% expected under a normal distribution
    • position histogram must not be flat
    • select peak from histogram
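
The three criteria above can be sketched in a few lines of Python. This is a minimal illustration rather than Smadja's implementation: the function names and the threshold parameters `k0`, `u0`, and `k1` (with their default values) are assumptions made for the example, and the input is assumed to be an already tokenized corpus.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

def collocates(tokens, target, window=5):
    """For each word near `target`, count occurrences at each offset in -window..+window."""
    hist = defaultdict(Counter)  # word -> Counter(offset -> count)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for off in range(-window, window + 1):
            j = i + off
            if off != 0 and 0 <= j < len(tokens):
                hist[tokens[j]][off] += 1
    return hist

def stage_one(tokens, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return (word, strength, peak_offsets) for collocates passing all three criteria.

    k0, u0 and k1 are illustrative thresholds (assumed values, not from the slides).
    """
    hist = collocates(tokens, target, window)
    if not hist:
        return []
    offsets = [o for o in range(-window, window + 1) if o != 0]
    freqs = {w: sum(c.values()) for w, c in hist.items()}
    f_mean, f_sd = mean(freqs.values()), pstdev(freqs.values())
    results = []
    for w, c in hist.items():
        # Criterion 1: strength = frequency normalized as a z-score over all collocates.
        strength = (freqs[w] - f_mean) / f_sd if f_sd else 0.0
        # Criterion 2: the position histogram must not be flat (variance across offsets).
        counts = [c[o] for o in offsets]
        p_mean = mean(counts)
        spread = sum((x - p_mean) ** 2 for x in counts) / len(counts)
        if strength < k0 or spread < u0:
            continue
        # Criterion 3: keep only offsets whose count stands out as a peak.
        cutoff = p_mean + k1 * spread ** 0.5
        peaks = [o for o in offsets if c[o] >= cutoff]
        if peaks:
            results.append((w, strength, peaks))
    return results
```

On a toy corpus in which "dow" is always followed by "jones", the sketch keeps "jones" with a single peak at offset +1 and discards words that appear only once near the target.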
Xtract stage two
  • Start from word pairs
  • Look at each position in between, to the left, and to the right
  • Keep words that appear very often
  • If that fails, keep parts of speech that satisfy this criterion
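
A hedged sketch of this position-by-position expansion follows. It assumes the sentences containing a word pair have already been aligned on the pair and tagged with parts of speech; the function name `expand_pair`, the `<TAG>` and `*` output conventions, and the 0.75 threshold are assumptions of this example.

```python
from collections import Counter

def expand_pair(contexts, threshold=0.75):
    """Build a pattern from aligned contexts of a word pair.

    contexts: equal-length lists of (word, pos_tag) tuples, aligned on the pair.
    At each position, keep the word if it is frequent enough; otherwise fall
    back to the part of speech; otherwise emit a wildcard.
    """
    pattern = []
    for column in zip(*contexts):
        words = Counter(w for w, _ in column)
        tags = Counter(t for _, t in column)
        word, word_count = words.most_common(1)[0]
        tag, tag_count = tags.most_common(1)[0]
        if word_count / len(column) >= threshold:
            pattern.append(word)
        elif tag_count / len(column) >= threshold:
            pattern.append(f"<{tag}>")
        else:
            pattern.append("*")
    return pattern
```

For four contexts of "dow jones" that share "the ... average" on either side and are followed by different verbs, this produces a pattern such as `the dow jones average <VBD>`.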
Xtract stage three
  • Applied to pairs of words
  • Requires (partial) parsing
  • Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
  • Ask lexicographer to evaluate output
  • 40% precision after stages one and two
  • 80% precision after stage three
  • 94% conditional recall
  • Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
  • Terms refer to concepts
  • Terms key for populating a domain ontology
  • Terms are typically nominal compounds of certain structure, e.g., NN, N of N
Defining terms
  • Unique reference
  • Unique translation
  • Term extension by
    • modification (e.g., addition of an adjective)
    • substitution
    • extension of structure
    • coordination
  • Apply syntactic constraints to match pairs of words in a candidate term
  • Filter by application of an association measure
  • Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
  • Compare with reference list
  • Frequency a strong predictor
  • Log-likelihood ratio works best
  • Additional criteria:
    • diversity of the distribution of each word
    • distance between the two words (determines flexibility but not term status)
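
The three association measures named above can all be computed from a 2×2 contingency table for a candidate pair (w1, w2). The sketch below uses the standard textbook definitions, with the log-likelihood ratio in its usual G² form (including the factor of 2); the function name and the cell names `a`, `b`, `c`, `d` are assumptions of this example. It expects `a > 0` and non-zero row and column totals.

```python
import math

def association(a, b, c, d):
    """Association scores for a word pair from its 2x2 contingency table.

    a: pairs with w1 and w2      b: pairs with w1 but not w2
    c: pairs with w2 but not w1  d: pairs with neither
    Returns (pointwise mutual information, phi-squared, log-likelihood ratio).
    """
    n = a + b + c + d
    # PMI: log of observed joint probability over the product of the marginals.
    pmi = math.log2(a * n / ((a + b) * (a + c)))
    # Phi-squared: chi-square statistic divided by n.
    phi2 = (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

    def term(obs, row, col):
        # Contribution obs * ln(obs / expected); empty cells contribute zero.
        return obs * math.log(obs * n / (row * col)) if obs else 0.0

    llr = 2 * (term(a, a + b, a + c) + term(b, a + b, b + d)
               + term(c, c + d, a + c) + term(d, c + d, b + d))
    return pmi, phi2, llr
```

When the two words are independent (for example, all four cells equal), all three scores are zero; when they always occur together, Φ² reaches its maximum of 1.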
Justeson and Katz
  • Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.
  • Examined association measures
  • Well-known problems:
    • eliminating general-language constructs (e.g., collocations)
    • what to do with single word terms?
  • Frequency works well
  • But a stronger predictor is P(k>1) compared to P(k≥1) in the same document, where k is the number of times the candidate occurs in that document
  • Use syntactic patterns to propose terms, then check if they reappear in the same document
  • Require this across multiple documents
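
The propose-then-check-repetition idea can be sketched as follows. The pattern used here (a run of adjectives and nouns ending in a noun) is a simplified stand-in chosen for illustration, not necessarily the exact pattern of the paper, and the single-letter tags `A`, `N`, `O` and the function name are assumptions. The sketch handles one document; the multi-document requirement is not implemented.

```python
import re
from collections import Counter

# One letter per token: A = adjective, N = noun, O = anything else.
# Simplified pattern (an assumption): adjectives/nouns ending in a noun, length >= 2.
PATTERN = re.compile(r"[AN]+N")

def candidate_terms(tagged, min_count=2):
    """Return phrases matching PATTERN that occur at least min_count times.

    tagged: list of (word, tag) tuples for one document, tag in {"A", "N", "O"}.
    Nested matches (e.g. a two-word phrase inside a three-word one) are counted too.
    """
    tags = "".join(t if t in ("A", "N") else "O" for _, t in tagged)
    counts = Counter()
    for start in range(len(tags)):
        for end in range(start + 2, len(tags) + 1):
            if PATTERN.fullmatch(tags, start, end):
                counts[" ".join(w for w, _ in tagged[start:end])] += 1
    return {term: n for term, n in counts.items() if n >= min_count}
```

A phrase that matches the pattern but occurs only once in the document is proposed and then dropped by the repetition check.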
Term Expansion
  • Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.
  • Need to expand a given list of terms, especially for scientific domains
Term variation
  • Syntactic (same words, different structure)
  • Morphosyntactic (derivational forms of words)
  • Semantic (synonyms are used)
  • In IR, normalization through stemming and removal of stop words
  • Process corpus matching new candidate terms to old ones via unification
  • Matching based on
    • inflectional morphology (transducer)
    • derivational morphology (rule-based)
    • syntactic transformations
    • additions of words
  • Manual inspection of several thousand proposed terms
  • Precision of 89%
  • Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)
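
For contrast with the unification-based matching, the baseline IR normalization mentioned above (stemming plus stop-word removal) can be sketched as below. The stop-word list is a tiny illustrative sample and `toy_stem` is a deliberately crude stand-in for a real stemmer such as Porter's; both are assumptions of this example.

```python
STOP_WORDS = {"of", "the", "a", "an", "in", "for", "and"}  # tiny illustrative list

def toy_stem(word):
    """Strip a plural 's'; a crude stand-in for a real stemming algorithm."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(phrase):
    """Map a phrase to an order-free set of stems with stop words removed."""
    return frozenset(toy_stem(w) for w in phrase.lower().split()
                     if w not in STOP_WORDS)
```

Under this normalization, "Retrieval of documents" and "document retrieval" map to the same key, which covers simple syntactic variants but not derivational or semantic ones.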