Syntactic And Sub-lexical Features For Turkish Discriminative Language Models
ICASSP 2010
Ebru Arısoy, Murat Saraçlar, Brian Roark, Izhak Shafran

Bang-Xuan Huang
Department of Computer Science & Information Engineering
National Taiwan Normal University


Outline
  • Introduction
  • Sub-lexical language models
  • Feature sets for DLM
    • Morphological Features
    • Syntactic Features
    • Sub-lexical Features
  • Experiments
  • Conclusions and Discussion
Introduction
  • In this paper we make use of both sub-lexical recognition units and discriminative training in Turkish language models.
  • Turkish is an agglutinative language: most words are formed by joining morphemes together.
  • Its agglutinative nature leads to a high number of out-of-vocabulary (OOV) words, which degrade ASR accuracy.
  • To handle the OOV problem, vocabularies composed of sub-lexical units have been proposed for agglutinative languages.

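Figure: levels of text (article, sentence, lexical unit or word), with "syntactic" labeling the sentence level. Ex: 今天 下午 需要 開會 ("We need to hold a meeting this afternoon")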

Introduction
  • Discriminative language modeling (DLM) is a complementary approach to the baseline language model.
  • In contrast to the generative language model, it is trained on acoustic sequences with their transcripts to optimize discriminative objective functions using both positive (reference transcriptions) and negative (recognition errors) examples.
  • DLM is a feature-based language modeling approach. Therefore, each candidate hypothesis in DLM training data is represented as a feature vector of the acoustic input, x, and the candidate hypothesis, y.
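
The feature-based view above can be illustrated with a small sketch. It assumes a linear model that scores each hypothesis in an N-best list as a weighted sum of its features (the baseline recognizer score plus n-gram counts) and a perceptron-style update with a fixed baseline weight. The slides do not specify the training algorithm, so the update rule, the feature names, and the helpers `features`, `score`, `rerank`, and `perceptron_update` are all illustrative assumptions.

```python
from collections import Counter

def features(baseline_score, tokens, max_n=2):
    """Map one candidate hypothesis to a sparse feature vector (assumed feature set)."""
    phi = Counter()
    phi["baseline"] = baseline_score  # score from the baseline recognizer
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phi["ngram:" + " ".join(tokens[i:i + n])] += 1
    return phi

def score(weights, phi):
    """Linear model: dot product of weights and feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in phi.items())

def rerank(weights, nbest):
    """Return the index of the highest-scoring hypothesis in the N-best list."""
    return max(range(len(nbest)), key=lambda i: score(weights, features(*nbest[i])))

def perceptron_update(weights, nbest, oracle_index):
    """Move weights toward the oracle hypothesis and away from the current best."""
    best = rerank(weights, nbest)
    if best != oracle_index:
        for name, value in features(*nbest[oracle_index]).items():
            if name != "baseline":  # assumption: baseline weight stays fixed
                weights[name] = weights.get(name, 0.0) + value
        for name, value in features(*nbest[best]).items():
            if name != "baseline":
                weights[name] = weights.get(name, 0.0) - value
    return weights

# Toy N-best list: (baseline score, tokens); index 1 is taken to be the reference.
nbest = [(-10.0, ["the", "cat", "sad"]), (-10.5, ["the", "cat", "sat"])]
weights = {"baseline": 1.0}
before = rerank(weights, nbest)
perceptron_update(weights, nbest, oracle_index=1)
after = rerank(weights, nbest)
```

In this toy run the baseline score alone prefers the erroneous hypothesis; after one update the n-gram features seen only in the reference are rewarded and those seen only in the error are penalized, so the reference is ranked first.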

Figure: each candidate hypothesis for sentence x (e.g., from an N-best list or lattice; candidates 1 to 4 shown) is mapped to a feature vector indexed 0, 1, 2, 3, ..., i.

Sub-lexical models
  • In this approach, the recognition lexicon is composed of sub-lexical units instead of words.
  • Both grammatically derived units (stems, affixes, or their groupings) and statistically derived units (morphs) have been proposed as lexical items for Turkish ASR.
  • Morphs are learned statistically from words by the Morfessor algorithm. Morfessor uses a Minimum Description Length principle to learn a sub-word lexicon in an unsupervised manner.
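
To make the motivation for a sub-lexical lexicon concrete, here is a toy sketch of how such a lexicon can cover word forms that a word lexicon misses. It does not run Morfessor: the morph lexicon is hand-written, and `segment`, `oov_rate`, and the small Turkish word lists are hypothetical names and data chosen only for illustration.

```python
def segment(word, morphs):
    """Split `word` into the fewest units from `morphs`; return None if impossible."""
    # best[i] holds the shortest segmentation of word[:i], or None if none exists.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in morphs:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]

def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are not in the vocabulary."""
    return sum(token not in vocabulary for token in tokens) / len(tokens)

# Hypothetical toy lexicons (not learned from data).
word_vocab = {"ev", "evler", "evde"}
morph_vocab = {"ev", "ler", "de", "den", "imiz"}
test_words = ["evler", "evlerden", "evimizde", "evde"]

word_oov = oov_rate(test_words, word_vocab)
segmentations = {word: segment(word, morph_vocab) for word in test_words}
morph_oov = sum(seg is None for seg in segmentations.values()) / len(test_words)
```

With these toy lists, half of the test words are out of vocabulary for the word lexicon, while every test word can be assembled from the five morphs.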
Feature sets for DLM
  • Morphological Features
  • Syntactic Features
  • Sub-lexical Features
    • Clustering of sub-lexical units
      • Brown et al.’s algorithm
      • minimum edit distance (MED)
    • Long distance triggers
Feature sets for DLM
  • Root (base form)

ex: able => dis-able, en-able, un-able, comfort-able-ly, ...

  • Inflectional groups (IG)
  • Brown et al.’s algorithm

- semantically-based, syntactically-based

  • minimum edit distance (MED)
    • The minimum number of edits (insertion, deletion, substitution) needed to transform one string into another
    • Ex: intention -> execution

del ‘i’ => ntention

sub ‘n’ to ‘e’ => etention

sub ‘t’ to ‘x’ => exention

ins ‘u’ => exenution

sub ‘n’ to ‘c’ => execution
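
The edit-distance definition above can be computed with the standard dynamic-programming recurrence. The sketch below assumes unit cost for insertion, deletion, and substitution; the slides do not state the costs used in the paper, and `min_edit_distance` is an illustrative name.

```python
def min_edit_distance(source, target):
    """Minimum number of insertions, deletions, and substitutions (unit cost each)."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute s with t, or match for free
            ))
        previous = current
    return previous[-1]

distance = min_edit_distance("intention", "execution")
```

For the intention/execution pair this gives 5, matching the five edit steps listed in the example.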

Feature sets for DLM
  • Long distance triggers
  • Considering initial morphs as stems and non-initial morphs as suffixes, we assume that the existence of a morph can trigger another morph in the same sentence.
  • We extract all the morph pairs between the morphs of any two words in a sentence as the candidate morph triggers.
  • Among the possible candidates, we try to select only the pairs whose morphs occur together for a special function.
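
The candidate-extraction step described above can be sketched as follows. The sketch only enumerates and counts cross-word morph pairs; the slides do not say how the final pairs are selected, so no selection criterion is implemented here. The function names and the two toy Turkish sentences are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def candidate_triggers(sentence):
    """Yield every (morph, morph) pair taken from two different words of a sentence.

    `sentence` is a list of words, and each word is a list of morphs.
    """
    for first_word, second_word in combinations(sentence, 2):
        for a in first_word:
            for b in second_word:
                yield (a, b)

def count_triggers(corpus):
    """Count in how many sentences each candidate pair occurs."""
    counts = Counter()
    for sentence in corpus:
        counts.update(set(candidate_triggers(sentence)))  # once per sentence
    return counts

# Toy corpus: "ben eve gittim" and "ben okula geldim", segmented by hand.
corpus = [
    [["ben"], ["ev", "e"], ["git", "ti", "m"]],
    [["ben"], ["okul", "a"], ["gel", "di", "m"]],
]
counts = count_triggers(corpus)
```

Here the pair ("ben", "m") appears in both sentences, which is the kind of recurring co-occurrence a selection step could then look for; how to rank or filter the counts is left open.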
Conclusions and Discussion
  • The main contributions of this paper are

(i) syntactic information is incorporated into Turkish DLM

(ii) the effect of language modeling units on DLM is investigated

(iii) morpho-syntactic information is explored when using sub-lexical units.

  • It is shown that DLM with basic features yields more improvement for morphs than for words.
  • Our final observation is that the high number of features masks the expected gains of the proposed features, mostly due to the sparseness of observations per parameter.
  • This makes feature selection a crucial issue for our future research.
Weekly report
  • Generate word graph
  • Recognition result