HLT, Data Sparsity and Semantic Tagging

  • HLT, Data Sparsity and Semantic Tagging

  • Louise Guthrie (University of Sheffield)

  • Roberto Basili (University of Tor Vergata, Rome)

  • Hamish Cunningham (University of Sheffield)



Outline

  • A ubiquitous problem: data sparsity

  • The approach:

    • coarse-grained semantic tagging

    • learning by combining multiple evidence

  • The evaluation: intrinsic and extrinsic measures

  • The expected outcomes: architectures, tools, development support



Present: we’ve seen growing interest in a range of HLT tasks, e.g. IE and MT


  • Fully portable IE, unsupervised learning

  • Content Extraction vs. IE


Data Sparsity

  • Language Processing depends on a model of the features important to an application.

    • MT - Trigrams and frequencies

    • Extraction - Word patterns

  • New texts always seem to have lots of phenomena we haven’t seen before


Different kinds of patterns

Person was appointed as post of company

Company named person to post

  • Almost all extraction systems tried to find patterns of mixed words and entities.

    • People, Locations, Organizations, dates, times, currencies


Can we do more?

Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday

Humans aboard space_vehicle dodge satellite timeref.


Could we know these are the same?

The IRA bombed a family owned shop in Belfast yesterday.

FMLN set off a series of explosions in central Bogota today.
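A minimal sketch of how coarse-grained semantic tags could make the two sentences above look alike. The lexicon and its category names are hypothetical and chosen only for this illustration; they are not the category inventory used in the project, and a real tagger would disambiguate in context rather than look words up in a table.

```python
# Hypothetical coarse-grained lexicon: every entry is an illustrative
# assumption, not a category set taken from the presentation.
TOY_LEXICON = {
    "ira": "organization",
    "fmln": "organization",
    "bombed": "attack_event",
    "explosions": "attack_event",
    "belfast": "location",
    "bogota": "location",
    "yesterday": "timeref",
    "today": "timeref",
}


def class_skeleton(sentence, lexicon=TOY_LEXICON):
    """Return the semantic classes found in the sentence, in order.

    Words that are not in the lexicon are simply dropped.
    """
    tokens = sentence.lower().replace(".", "").split()
    return [lexicon[t] for t in tokens if t in lexicon]


skeleton_a = class_skeleton("The IRA bombed a family owned shop in Belfast yesterday.")
skeleton_b = class_skeleton("FMLN set off a series of explosions in central Bogota today.")

# Both sentences reduce to the same sequence of coarse classes.
same_event_type = skeleton_a == skeleton_b
```

With this toy lexicon the two sentences share no content words, yet their class sequences are identical, which is the kind of generalization the slide is asking about.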



Machine translation

  • Ambiguity of words often means that a word can translate several ways.

  • Would knowing the semantic class of a word help us to choose the translation?


Sometimes . . .

  • Crane the bird vs crane the machine

  • Bat the animal vs bat for cricket and baseball

  • Seal on a letter vs the animal


SO ..

P(translation(crane) = grulla | animal) > P(translation(crane) = grulla)

P(translation(crane) = grua | machine) > P(translation(crane) = grua)

Can we show the overall effect lowers entropy?
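One way to make the entropy question concrete is to compare H(T), the entropy of the translation choice, with H(T | C), the entropy once the semantic class is known. The sketch below does this for "crane" using a joint distribution whose numbers are entirely hypothetical; it assumes the class is given, whereas in practice it would come from an imperfect tagger.

```python
import math

# Hypothetical joint probabilities P(class, translation) for "crane".
# The numbers are invented for illustration only.
JOINT = {
    ("animal", "grulla"): 0.27,
    ("animal", "grua"): 0.03,
    ("machine", "grulla"): 0.07,
    ("machine", "grua"): 0.63,
}


def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def marginal(joint, index):
    """Sum the joint distribution down to one variable (0 = class, 1 = translation)."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out


def conditional_entropy(joint):
    """H(T | C) = sum over classes c of P(c) * H(T | C = c)."""
    total = 0.0
    for c, pc in marginal(joint, 0).items():
        cond = [p / pc for (cc, _), p in joint.items() if cc == c]
        total += pc * entropy(cond)
    return total


h_t = entropy(marginal(JOINT, 1).values())   # entropy without the class
h_t_given_c = conditional_entropy(JOINT)     # entropy given the class
```

Conditioning never increases entropy on average, so H(T | C) is at most H(T); the open empirical question is how large the reduction is on real data and with automatically assigned classes.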


Language Modeling – Data Sparseness again ..

  • We need to estimate Pr (w3 | w1 w2)

  • What if we have never seen w1 w2 w3 before?

  • Can we instead develop a model and estimate Pr (w3 | C1 C2) or Pr (C3 | C1 C2), where Ci is the semantic class of wi?
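The fallback from word histories to class histories can be sketched as follows. This is a simplified illustration with unsmoothed relative frequencies, not the model used in the project; the corpus, the word-to-class map and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical word-to-class map; words without an entry act as their own class.
WORD_CLASS = {"astronauts": "human", "pilots": "human", "cosmonauts": "human"}

# Hypothetical toy training corpus.
CORPUS = [
    ["the", "astronauts", "dodged", "debris"],
    ["the", "pilots", "dodged", "debris"],
    ["the", "astronauts", "repaired", "satellites"],
]


def train(sentences):
    """Count trigrams with word histories and with class histories."""
    counts = {"word_tri": Counter(), "word_hist": Counter(),
              "class_tri": Counter(), "class_hist": Counter()}
    for sent in sentences:
        for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
            c1, c2 = WORD_CLASS.get(w1, w1), WORD_CLASS.get(w2, w2)
            counts["word_tri"][(w1, w2, w3)] += 1
            counts["word_hist"][(w1, w2)] += 1
            counts["class_tri"][(c1, c2, w3)] += 1
            counts["class_hist"][(c1, c2)] += 1
    return counts


def prob(w3, w1, w2, counts):
    """Estimate Pr(w3 | w1 w2); fall back to Pr(w3 | C1 C2) for unseen word histories."""
    if counts["word_hist"][(w1, w2)] > 0:
        return counts["word_tri"][(w1, w2, w3)] / counts["word_hist"][(w1, w2)]
    c1, c2 = WORD_CLASS.get(w1, w1), WORD_CLASS.get(w2, w2)
    if counts["class_hist"][(c1, c2)] > 0:
        return counts["class_tri"][(c1, c2, w3)] / counts["class_hist"][(c1, c2)]
    return 0.0


model = train(CORPUS)
p_seen = prob("dodged", "the", "astronauts", model)    # word history was observed
p_unseen = prob("dodged", "the", "cosmonauts", model)  # falls back to the class history
```

The history "the cosmonauts" never occurs in the toy corpus, but "the" followed by a word of class human does, so the class model still yields a non-zero estimate. A production model would add smoothing or interpolation rather than a hard switch.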


A Semantic Tagging technology. How?

  • We will exploit similarity with NE tagging, ...

    • Development of pattern matching rules as incremental wrapper induction

  • ... with semantic (sense) disambiguation

    • Use as much evidence as possible

    • Exploit existing resources like MRD or LKBs

  • ... and with machine learning tasks

    • Generalize from positive examples in training data


Multiple Sources of Evidence

  • Lexical information (priming effects)

  • Distributional information from general and training texts

  • Syntactic features

    • SVO patterns or Adjectival modifiers

  • Semantic features

    • Structural information in LKBs

    • (LKB-based) similarity measures


Machine Learning for ST

  • Similarity estimation

    • among contexts (text overlaps, …)

    • among lexical items wrt MRD/LKBs

  • We will experiment with

    • Decision tree learning (e.g. C4.5)

    • Support Vector Machines (e.g. SVM light)

    • Memory-based Learning (TiMBL)

    • Bayesian learning
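To illustrate the memory-based option in the list above, here is a minimal nearest-neighbour sketch that classifies a noun phrase by counting matching feature values against stored examples. It does not use TiMBL or any other toolkit named on the slide; the feature set (head noun, governing verb, modifier), the training instances and the labels are hypothetical.

```python
from collections import Counter

# Hypothetical training instances: (head noun, governing verb, modifier) -> class.
TRAIN = [
    (("crane", "flew", "white"), "animal"),
    (("crane", "lifted", "steel"), "machine"),
    (("bat", "flew", "small"), "animal"),
    (("bat", "swung", "wooden"), "artifact"),
    (("seal", "swam", "grey"), "animal"),
]


def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def classify(features, train=TRAIN, k=1):
    """Label an instance by majority vote among its k most similar stored examples."""
    ranked = sorted(train, key=lambda item: overlap(features, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


bird_reading = classify(("crane", "flew", "grey"))
machine_reading = classify(("crane", "lifted", "heavy"))
```

The same feature tuples could be handed to a decision tree, SVM or Bayesian learner; the point of the sketch is only that lexical and syntactic evidence are combined into one instance representation.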


What’s New?

  • Granularity

    • Semantic categories are coarser than word senses (cf. homograph level in MRD)

  • Integration of existing ML methods

    • Pattern induction is combined with probabilistic description of word semantic classes

  • Co-training

    • Annotated data are used to drive the sampling of further evidence from unannotated material (active learning)


  • How we know what we’ve done: measurement, the corpus

    • Hand-annotated corpus

    • from the BNC, 100-million word balanced corpus

    • 1 million words annotated

    • a little under ½ million categorised noun phrases

    • Extrinsic evaluation: perplexity of lexical choice in Machine Translation

    • Intrinsic evaluation: standard measures of precision, recall, false positives

    • (baseline: tag with most common category = 33%)
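The intrinsic measures and the most-common-category baseline can be computed as in the sketch below. The gold and predicted labels are a hypothetical toy sample, so the figures it produces are unrelated to the 33% baseline reported for the real corpus.

```python
from collections import Counter


def precision_recall(gold, predicted, category):
    """Precision and recall for one semantic category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)  # false positives
    fn = sum(1 for g, p in pairs if g == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def most_common_baseline(gold):
    """Accuracy obtained by tagging every item with the most frequent category."""
    label, count = Counter(gold).most_common(1)[0]
    return label, count / len(gold)


# Hypothetical gold annotations and system output for six noun phrases.
GOLD = ["human", "location", "human", "artifact", "human", "location"]
PRED = ["human", "human", "human", "artifact", "location", "location"]

human_precision, human_recall = precision_recall(GOLD, PRED, "human")
baseline_label, baseline_accuracy = most_common_baseline(GOLD)
```

A tagger is only interesting if its accuracy clearly exceeds `baseline_accuracy`, which is why the slide quotes the baseline next to the evaluation measures.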


Ambiguity levels in the training data

NPs by semantic categories: [table of per-category counts not preserved in the transcript]

Total NPs (interim): 453360


  • Maximising project outputs: software infrastructure for HLT

    • Three outputs from the project:

    • 1. A new resource

      • Automatic annotation of the whole corpus

    • 2. Experimental evidence re. 1.

      • how accurate the final results are

      • how accurate the various methods employed are

    • 3. Component tools for doing 1., based on GATE (a General Architecture for Text Engineering)


  • What is GATE?

  • An architecture: a macro-level organisational picture for LE software systems.

  • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.

  • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.

  • Some free components, and wrappers for other people's components

  • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

  • Free software (LGPL). Download at http://gate.ac.uk/download/


  • Where did GATE come from?

    • A number of researchers realised in the early to mid-1990s (e.g. in TIPSTER):

    • Increasing trend towards multi-site collaborative projects

    • Role of engineering in scalable, reusable, and portable HLT solutions

    • Support for large data, in multiple media, languages, formats, and locations

    • Lower the cost of creation of new language processing components

    • Promote quantitative evaluation metrics via tools and a level playing field

    • History:

    • 1996 – 2002: GATE version 1, proof of concept

    • March 2002: version 2, rewritten in Java, component based, LGPL, more users

    • Fall 2003: new development cycle


  • Role of GATE in the project

    • Productivity

      • reuse some baseline components for simple tasks

      • development environment support for implementors (MATLAB for HLT?)

      • reduce integration overhead (standard interfaces between components)

      • system takes care of persistence, visualisation, multilingual edit, ...

    • Quantification

      • tool support for metrics generation

      • visualisation of key/response differences

      • regression test tool for nightly progress verification

    • Repeatability

      • open source supported, maintained, documented software

      • cross-platform (Linux, Windows, Solaris, others)

      • easy install and proven usability (thousands of people, hundreds of sites)

      • mobile code if you write in Java; web services otherwise

