RePortS: A Simpler, Intuitive Approach to Morpheme Induction

Emily Pitler

Samarth Keshava

Yale University

Goals
  • Segment English words into morphemes
  • Simple algorithm
  • Minimize assumptions and “magic numbers”
Approach
  • Identify common morphemes in the language
    • “prefix” and “suffix” lists
  • Use these to segment the test words
Intuition and Motivation
  • The resulting word fragment, after removing a potential morpheme, is often still a word
  • Examples:
    • training = train+ing
    • chairman = chair+man
    • insufferable = insuffer+able
  • This test is used only to identify candidate morphemes, not to segment words
Intuition and Motivation
  • Use fluctuations in transitional probabilities (Harris 1955, Hafer and Weiss 1974)
  • Examples:
    • Expect Pr(t | repor) ≈ 1
    • Expect Pr(s | report) < 1
      • Because there are other words such as reported, reporting, report, etc.
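
A minimal Python sketch of this transitional-probability intuition, assuming a tiny hypothetical word list in place of a real corpus. The function name `transitional_prob` and the end-of-word marker `"$"` are assumptions made for illustration, and each distinct word is counted once (the slides do not say whether counts are weighted by word frequency).

```python
from collections import Counter

def transitional_prob(words, prefix, next_char):
    """Estimate Pr(next_char | prefix) over the distinct words in `words`.

    The denominator counts words that start with `prefix`; the numerator
    counts those that continue with `next_char`. A word equal to `prefix`
    itself is counted as continuing with the end-of-word marker "$".
    """
    continuations = Counter()
    for w in set(words):
        if w.startswith(prefix):
            nxt = w[len(prefix)] if len(w) > len(prefix) else "$"
            continuations[nxt] += 1
    total = sum(continuations.values())
    return continuations[next_char] / total if total else 0.0

# Hypothetical toy corpus
toy_corpus = ["report", "reports", "reported", "reporting", "reporter"]
p_t = transitional_prob(toy_corpus, "repor", "t")   # every word continues with "t"
p_s = transitional_prob(toy_corpus, "report", "s")  # only "reports" continues with "s"
```

On this toy list `p_t` is 1.0 and `p_s` is 0.2, which matches the expected pattern: the probability drops at the morpheme boundary after "report".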
Four Steps
  • Preprocessing: build the lexicographic trees
  • Score word fragments to determine morphemes
  • Prune the morpheme lists
  • Segment words using the trees and morpheme lists
Step 1: Build the trees
  • We build a “forward tree” and a “backward tree”
  • We use these trees to calculate transitional probabilities in O(1) time
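
One way the two trees could look, sketched in Python as character tries that store a count at every node. The names `TrieNode`, `build_tree`, `find`, and `prob` are hypothetical, and the backward tree is assumed to be the same structure built over reversed words.

```python
class TrieNode:
    def __init__(self):
        self.count = 0        # number of words passing through this node
        self.children = {}    # next character -> TrieNode

def build_tree(words, reverse=False):
    """Build a forward tree, or a backward tree if reverse=True."""
    root = TrieNode()
    for w in words:
        chars = w[::-1] if reverse else w
        node = root
        node.count += 1
        for ch in chars:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root

def find(root, fragment):
    """Walk down from the root; return the node for `fragment` or None."""
    node = root
    for ch in fragment:
        node = node.children.get(ch)
        if node is None:
            return None
    return node

def prob(root, context, ch):
    """Pr(ch | context): child count divided by parent count."""
    parent = find(root, context)
    if parent is None or ch not in parent.children:
        return 0.0
    return parent.children[ch].count / parent.count

words = ["report", "reports", "reported", "reporting"]  # hypothetical toy corpus
forward = build_tree(words)
backward = build_tree(words, reverse=True)
```

Once the node for a context has been reached, the probability is a single division of two stored counts, which is one way to read the O(1) claim. The walk down to the node is still proportional to the fragment length in this sketch. Contexts for the backward tree are given in reversed order.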
Step 2: Scoring morphemes
  • Example: scoring “s” in “reports”
    • Check if “report” is a word in the corpus
    • Check if Pr(t | repor) ≈ 1
    • Check if Pr(s | report) < 1
  • If “s” passes all three tests, we add 19 to its suffix score; otherwise we subtract 1
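
The three tests and the +19/-1 update can be sketched as follows for suffixes. This is an illustration under stated assumptions, not the authors' Perl program: the corpus is a hypothetical toy list, prefix counts stand in for the forward tree, and `near_one` is a hypothetical threshold for "≈ 1" that the slides do not specify. Prefix scoring would presumably mirror this on the backward tree, but the slides only illustrate the suffix case.

```python
from collections import defaultdict

def prefix_counts(vocab):
    """Map every prefix of every word (including the word itself) to the
    number of distinct words that start with it."""
    counts = defaultdict(int)
    for w in vocab:
        for i in range(len(w) + 1):
            counts[w[:i]] += 1
    return dict(counts)

def prob(counts, context, ch):
    """Pr(ch | context), estimated from the prefix counts."""
    total = counts.get(context, 0)
    return counts.get(context + ch, 0) / total if total else 0.0

def score_suffixes(words, near_one=1.0):
    """Try every stem+suffix split of every word; add 19 to the suffix's
    score when all three tests pass, otherwise subtract 1."""
    vocab = set(words)
    counts = prefix_counts(vocab)
    scores = defaultdict(int)
    for w in vocab:
        for i in range(1, len(w)):
            stem, suffix = w[:i], w[i:]
            passes = (
                stem in vocab                                      # test 1: stem is a word
                and prob(counts, stem[:-1], stem[-1]) >= near_one  # test 2: Pr(t | repor) ≈ 1
                and prob(counts, stem, suffix[0]) < 1.0            # test 3: Pr(s | report) < 1
            )
            scores[suffix] += 19 if passes else -1
    return scores

toy_corpus = ["report", "reports", "reported", "reporting",
              "train", "trains", "training"]
suffix_scores = score_suffixes(toy_corpus)
```

On this toy corpus only "s", "ing", and "ed" end up with positive scores; fragments such as "g" or "ts" fail the first test every time and go negative.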
Step 2: Scoring morphemes
  • We declare fragments to be morphemes if they have positive scores
  • +19/-1 scheme
    • Chosen so that a fragment's score is positive iff it passes more than 5% of its tests
    • More frequent morphemes have higher scores
    • Scaling both numbers by the same positive factor would produce the same results
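
The 5% figure follows from a short calculation. Here $p$ is the number of times a fragment passes all three tests and $n$ is the total number of times it is tested; both symbols are introduced only for this illustration.

```latex
\text{score} = 19p - (n - p) = 20p - n > 0
\iff \frac{p}{n} > \frac{1}{20} = 5\%
```

Multiplying both constants by a factor $k > 0$ gives $k(20p - n)$, which has the same sign and preserves the ordering of fragments by score.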
Step 3: Pruning
  • Don’t want “er”, “s” and “ers” all in the morpheme list
  • Remove any morpheme composed of two other morphemes with higher scores
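
A small Python sketch of this pruning rule. The scores below are made up for illustration, and the sketch assumes each morpheme is checked against the original, unpruned list; the slides do not say whether pruning is applied iteratively.

```python
def prune(scores):
    """Drop any morpheme that is a concatenation of two other listed
    morphemes that both have higher scores than it does."""
    kept = {}
    for m, s in scores.items():
        composite = any(
            m[:i] in scores and m[i:] in scores
            and scores[m[:i]] > s and scores[m[i:]] > s
            for i in range(1, len(m))
        )
        if not composite:
            kept[m] = s
    return kept

# Hypothetical suffix scores
suffix_scores = {"er": 120, "s": 300, "ers": 40, "ing": 200}
pruned = prune(suffix_scores)
```

Here "ers" is removed because it splits into "er" + "s", both of which score higher, while "er", "s", and "ing" are kept.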
Top English Morphemes
  • Prefixes and suffixes later in the list: well, water, servo, make, quick, ier, box, town, line, more
Step 4: Segmenting Words
  • politeness = polite+ness or politenes+s ?
  • Use transitional probabilities again
    • Expect Pr(n | polite) < Pr(s | politenes)
  • Peel off morpheme with smallest probability (unless all probabilities are 1)
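
The peeling rule can be sketched for suffixes like this. The corpus and suffix list are hypothetical, prefix counts again stand in for the forward tree, and the sketch assumes the word being segmented appears in the corpus. Prefix peeling on the backward tree is not shown.

```python
def prefix_counts(words):
    """Map every prefix of every distinct word to the number of words
    that start with it."""
    counts = {}
    for w in set(words):
        for i in range(len(w) + 1):
            counts[w[:i]] = counts.get(w[:i], 0) + 1
    return counts

def prob(counts, context, ch):
    """Pr(ch | context), estimated from the prefix counts."""
    total = counts.get(context, 0)
    return counts.get(context + ch, 0) / total if total else 0.0

def segment_suffixes(word, suffixes, counts):
    """Repeatedly peel the candidate suffix whose first letter is least
    predictable from the remaining stem; stop when there is no candidate
    or when every candidate has probability 1."""
    pieces = []
    while True:
        candidates = [
            (prob(counts, word[:-len(s)], s[0]), s)
            for s in suffixes
            if word.endswith(s) and len(s) < len(word)
        ]
        if not candidates:
            break
        p, best = min(candidates)
        if p >= 1.0:
            break
        pieces.insert(0, best)
        word = word[:-len(best)]
    return [word] + pieces

corpus = ["polite", "politely", "politeness", "report", "reports"]
suffixes = ["ness", "s", "ly"]  # hypothetical pruned suffix list
counts = prefix_counts(corpus)
segmented = segment_suffixes("politeness", suffixes, counts)
```

For "politeness", Pr(n | polite) is 1/3 on this toy corpus while Pr(s | politenes) is 1, so "ness" is peeled and the result is polite+ness rather than politenes+s.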
Results
  • English results
    • On the provided 532-word Gold Standard
    • On the organizers’ test data
Results
  • Breakdown
    • Contribution of the different intuitions
Results
  • Finnish
  • Turkish
Simple and Effective
  • Based on intuition, not a complex model
    • How we personally would segment words
  • Program was relatively short: 252 lines of Perl
  • Other variations had slightly better F-scores
  • Best mixture of performance and elegance

Thank you for listening.

Emily Pitler

Samarth Keshava