STATISTICAL LANGUAGE MODELS FOR CROATIAN WEATHER-DOMAIN CORPUS

Lucia Načinović, Sanda Martinčić-Ipšić and Ivo Ipšić

Department of Informatics, University of Rijeka

{lnacinovic, smarti, ivoi}@inf.uniri.hr


Introduction

Statistical language modelling estimates the regularities in natural languages

the probabilities of word sequences, usually derived from large collections of text material

Employed in:

Speech recognition

Optical character recognition

Handwriting recognition

Machine translation

Spelling correction

...


N-gram language models

The most widely used LMs

Based on the probability of a word w_n given the preceding sequence of words w_1, …, w_{n-1}

Bigram models (2-grams)

determine the probability of a word given the previous word

Trigram models (3-grams)

determine the probability of a word given the previous two words
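As a quick illustration of the bigram case, here is a minimal Python sketch on a made-up miniature corpus (the words and counts are invented for the example):

```python
from collections import Counter

# A minimal sketch of the bigram idea on a toy corpus (for illustration only).
corpus = "danas sunčano sutra pretežno oblačno sutra sunčano".split()

bigrams = Counter(zip(corpus, corpus[1:]))                  # c(prev, word)
context = Counter(prev for prev, _ in bigrams.elements())   # c(prev)

def p_ml(word, prev):
    """Maximum likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
    return bigrams[(prev, word)] / context[prev] if context[prev] else 0.0

print(p_ml("sunčano", "sutra"))    # 0.5: "sutra" is followed by "sunčano" once out of twice
print(p_ml("pretežno", "sutra"))   # 0.5
```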


Language model perplexity

The most common metric for evaluating a language model is the probability that the model assigns to test data, or the measures derived from it:

cross-entropy

perplexity


Cross-entropy

  • The cross-entropy of a model p(T) on data T:

    $H_p(T) = -\frac{1}{W_T} \log_2 p(T)$

  • $W_T$ - the length of the text T measured in words


Perplexity

  • The reciprocal of the average probability assigned by the model to each word in the test set T

  • The perplexity $PP_p(T)$ of a model is related to cross-entropy by the equation:

    $PP_p(T) = 2^{H_p(T)}$

  • Lower cross-entropies and perplexities are better
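A minimal sketch tying the two measures together, using a hypothetical toy bigram model (the probabilities and the fallback floor value are made up for illustration):

```python
import math

# Hypothetical bigram model: (previous word, word) -> probability.
bigram_prob = {
    ("<s>", "sunčano"): 0.5,
    ("sunčano", "vrijeme"): 0.4,
    ("vrijeme", "</s>"): 0.6,
}

def cross_entropy(words, model, floor=1e-6):
    """H_p(T) = -(1/W_T) * log2 p(T); unseen bigrams get a small floor value."""
    log_p = sum(math.log2(model.get(bg, floor)) for bg in zip(words, words[1:]))
    return -log_p / (len(words) - 1)   # W_T: number of scored words

test = ["<s>", "sunčano", "vrijeme", "</s>"]
h = cross_entropy(test, bigram_prob)
print(f"cross-entropy = {h:.3f} bits, perplexity = {2 ** h:.2f}")
```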


Smoothing

Data sparsity problem

N-gram models are trained from a finite corpus

some perfectly acceptable N-grams are missing and receive probability = 0

Solution – smoothing techniques

adjust the maximum likelihood estimate of probabilities to produce more accurate probabilities

adjust low probabilities such as zero probabilities upward, and high probabilities downward


Smoothing techniques used in our research

Additive smoothing

Absolute discounting

Witten-Bell technique

Kneser-Ney technique


Additive smoothing

one of the simplest types of smoothing

we add a factor δ (0 < δ ≤ 1) to every count

Formula for additive smoothing:

$p_{add}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\delta + c(w_{i-n+1}^{i})}{\delta |V| + \sum_{w_i} c(w_{i-n+1}^{i})}$

V - the vocabulary (the set of all words considered)

c - the number of occurrences of an n-gram in the training data

values of the δ parameter used in our research: 0.1, 0.5 and 1
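A sketch of additive smoothing for bigrams, on the same kind of toy corpus (made up for illustration); δ spreads a little probability mass to unseen bigrams:

```python
from collections import Counter

corpus = "danas sunčano sutra pretežno oblačno sutra sunčano".split()
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(prev for prev, _ in bigrams.elements())
vocab = set(corpus)

def p_add(word, prev, delta=0.5):
    """(delta + c(prev, word)) / (delta * |V| + c(prev))"""
    return (delta + bigrams[(prev, word)]) / (delta * len(vocab) + context[prev])

print(p_add("sunčano", "sutra"))   # seen bigram
print(p_add("danas", "sutra"))     # unseen bigram still gets probability > 0
```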


Absolute discounting

When there is little data for directly estimating an n-gram probability, useful information can be provided by the corresponding (n-1)-gram

Absolute discounting - the higher-order distribution is created by subtracting a fixed discount D from each non-zero count and interpolating with the lower-order distribution:

$p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$

Values of D used in our research: 0.3, 0.5, 1
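A sketch of interpolated absolute discounting for bigrams (toy corpus, made-up D): the mass freed by discounting is handed to the unigram model:

```python
from collections import Counter

corpus = "danas sunčano sutra pretežno oblačno sutra sunčano".split()
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(prev for prev, _ in bigrams.elements())
unigrams = Counter(corpus)
N = len(corpus)

def p_abs(word, prev, D=0.5):
    if context[prev] == 0:                       # unseen history: unigram only
        return unigrams[word] / N
    discounted = max(bigrams[(prev, word)] - D, 0) / context[prev]
    n_followers = sum(1 for p, _ in bigrams if p == prev)   # distinct continuations
    backoff_mass = D * n_followers / context[prev]          # freed probability mass
    return discounted + backoff_mass * unigrams[word] / N

print(p_abs("sunčano", "sutra"))
print(p_abs("danas", "sutra"))   # unseen bigram receives backed-off mass
```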


Witten-Bell technique

  • The number of different words in the corpus is used to help determine the probability of n-grams that never occur in the corpus

  • Example for a bigram:

    $p_{WB}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + N_{1+}(w_{i-1} \bullet)\, p_{WB}(w_i)}{c(w_{i-1}) + N_{1+}(w_{i-1} \bullet)}$

  • $N_{1+}(w_{i-1} \bullet)$ - the number of different words that follow $w_{i-1}$ in the training data
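A sketch of the bigram case in Python (toy corpus for illustration; the recursive lower-order term $p_{WB}(w_i)$ is approximated here by the ML unigram): the number of distinct words seen after a history sets how much weight the unigram distribution receives.

```python
from collections import Counter

corpus = "danas sunčano sutra pretežno oblačno sutra sunčano".split()
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(prev for prev, _ in bigrams.elements())
unigrams = Counter(corpus)
N = len(corpus)

def p_wb(word, prev):
    n1plus = sum(1 for p, _ in bigrams if p == prev)  # distinct followers of prev
    if context[prev] + n1plus == 0:                   # unseen history
        return unigrams[word] / N
    return (bigrams[(prev, word)] + n1plus * unigrams[word] / N) / (context[prev] + n1plus)

print(p_wb("sunčano", "sutra"))
print(p_wb("danas", "sutra"))
```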


Kneser-Ney technique

  • An extension of absolute discounting

  • the lower-order distribution that one combines with a higher-order distribution is built in a novel manner:

    • it matters only when few or no counts are present in the higher-order distribution, so it is estimated from the number of different contexts in which a word appears rather than from raw counts (see the sketch below)
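A sketch of the Kneser-Ney idea for the lower-order distribution (toy corpus for illustration): a word's unigram weight is the number of distinct bigram types it completes, not its raw frequency.

```python
from collections import Counter

corpus = "danas sunčano sutra pretežno oblačno sutra sunčano".split()
bigram_types = set(zip(corpus, corpus[1:]))

def p_continuation(word):
    """N_1+(. word) / N_1+(. .): share of distinct histories preceding word."""
    return sum(1 for _, w in bigram_types if w == word) / len(bigram_types)

print(p_continuation("sunčano"))   # 2/6: follows both "danas" and "sutra"
print(p_continuation("pretežno"))  # 1/6: follows only "sutra"
```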


Smoothing implementation

  • 2-gram, 3-gram and 4-gram language models were built

  • Corpus: 290 480 words

    • 2 398 1-grams,

    • 18 694 2-grams,

    • 23 021 3-grams and

    • 29 736 4-grams

  • Four different smoothing techniques were applied to each of these models


Corpus

  • The major part was developed from 2002 to 2005; some parts were added later

  • Includes vocabulary related to weather, bio and maritime forecasts, river water levels and weather reports

  • Divided into 10 parts

    • 9/10 used for building language models

    • 1/10 used for evaluating those models in terms of their estimated perplexities


Results: perplexities of the LMs


Conclusion

  • In this paper, we described the process of building language models from the Croatian weather-domain corpus

  • We built models of different order:

    • 2-grams

    • 3-grams

    • 4-grams


Conclusion

  • We applied four different smoothing techniques:

    • additive smoothing

    • absolute discounting

    • Witten-Bell technique

    • Kneser-Ney technique

  • We estimated and compared perplexities of those models

  • The Kneser-Ney smoothing technique gives the best results


Further work

  • Prepare a more balanced corpus of Croatian text and thus build a more complete language model

  • Other LMs

    • Class-based

  • Other smoothing techniques



References

  • Chen, Stanley F.; Goodman, Joshua. An empirical study of smoothing techniques for language modeling. Cambridge, MA: Computer Science Group, Harvard University, 1998

  • Chou, Wu; Juang, Biing-Hwang. Pattern recognition in speech and language processing. CRC Press, 2003

  • Jelinek, Frederick. Statistical Methods for Speech Recognition. Cambridge, MA: The MIT Press, 1998

  • Jurafsky, Daniel; Martin, James H. Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, New Jersey: Prentice Hall, 2000

  • Manning, Christopher D.; Schütze, Hinrich. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999

  • Martinčić-Ipšić, Sanda. Raspoznavanje i sinteza hrvatskoga govora kontekstno ovisnim skrivenim Markovljevim modelima (Recognition and Synthesis of Croatian Speech Using Context-Dependent Hidden Markov Models), doctoral dissertation. Zagreb: FER, 2007

  • Milharčič, Grega; Žibert, Janez; Mihelič, France. Statistical Language Modeling of SiBN Broadcast News Text Corpus. // Proceedings of the 5th Slovenian and 1st International Language Technologies Conference 2006 / Erjavec, T.; Žganec Gros, J. (eds.). Ljubljana: Jožef Stefan Institute, 2006

  • Stolcke, Andreas. SRILM – An Extensible Language Modeling Toolkit. // Proceedings of the International Conference on Spoken Language Processing. Denver, 2002, vol. 2, pp. 901-904


SRILM toolkit

The models were built and evaluated with the SRILM toolkit

http://www.speech.sri.com/projects/srilm/

ngram-count -text TRAINDATA -lm LM

ngram -lm LM -ppl TESTDATA
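A sketch of how the whole experiment could be scripted around these two commands (assuming the SRILM binaries are on PATH; TRAINDATA/TESTDATA are the placeholder file names from the slide, and the discount values are illustrative):

```python
import subprocess

# Each flag set selects one of the smoothing techniques discussed above;
# note that -kndiscount is SRILM's modified Kneser-Ney variant.
smoothing = {
    "additive":    ["-addsmooth", "0.5"],
    "absolute":    ["-cdiscount", "0.5"],
    "witten-bell": ["-wbdiscount"],
    "kneser-ney":  ["-kndiscount"],
}

for name, flags in smoothing.items():
    lm = f"{name}.lm"
    subprocess.run(["ngram-count", "-text", "TRAINDATA", "-order", "3",
                    *flags, "-lm", lm], check=True)
    # ngram -ppl reports log-probability, cross-entropy and perplexity.
    subprocess.run(["ngram", "-lm", lm, "-ppl", "TESTDATA"], check=True)
```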


Language model

  • Speech recognition – converting an acoustic signal into a sequence of words

  • Through language modelling, the word sequences of the language are statistically modelled

  • A language model estimates the probability Pr(W) of all possible word strings W = (w1, w2, …, wi)
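As a worked equation in the standard formulation: the recognizer chooses the word string that best explains the acoustic signal A, and the language model supplies Pr(W), which factors by the chain rule:

```latex
\hat{W} = \operatorname*{arg\,max}_{W} \Pr(W \mid A)
        = \operatorname*{arg\,max}_{W} \Pr(A \mid W)\,\Pr(W),
\qquad
\Pr(W) = \prod_{i} \Pr(w_i \mid w_1, \ldots, w_{i-1})
```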


System diagram of a generic speech recognizer based on statistical models


Statistical language models for Croatian weather-domain corpus

Bigram language models (2-grams)

Central goal: to determine the probability of a word given the previous word

Trigram language models (3-grams)

Central goal: to determine the probability of a word given the previous two words

The simplest way to approximate this probability is to compute:

$p_{ML}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}$

This value is called the maximum likelihood (ML) estimate


Statistical language models for Croatian weather-domain corpus

  • Linear interpolation - a simple method for combining the information from lower-order n-gram models when estimating higher-order probabilities


Statistical language models for Croatian weather-domain corpus

  • A general class of interpolated models is described by Jelinek and Mercer:

    $p_{interp}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{interp}(w_i \mid w_{i-n+2}^{i-1})$

  • The nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood model and the (n-1)th-order smoothed model

  • Given fixed p_ML, it is possible to search efficiently for the λ values that maximize the probability of some data using the Baum-Welch algorithm
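A sketch of the bigram case with a single fixed, made-up weight (in practice the λ values are tuned on held-out data, e.g. with Baum-Welch as noted above):

```python
from collections import Counter

corpus = "danas sunčano sutra pretežno oblačno sutra sunčano".split()
bigrams = Counter(zip(corpus, corpus[1:]))
context = Counter(prev for prev, _ in bigrams.elements())
unigrams = Counter(corpus)
N = len(corpus)

def p_interp(word, prev, lam=0.7):
    """lam * p_ML(word | prev) + (1 - lam) * p_ML(word)"""
    p_bi = bigrams[(prev, word)] / context[prev] if context[prev] else 0.0
    return lam * p_bi + (1 - lam) * unigrams[word] / N

print(p_interp("sunčano", "sutra"))
print(p_interp("danas", "sutra"))   # unseen bigram still gets unigram mass
```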


Statistical language models for Croatian weather-domain corpus

In absolute discounting smoothing, instead of multiplying the higher-order maximum-likelihood distribution by a factor λ, the higher-order distribution is created by subtracting a fixed discount D from each non-zero count:

$p_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D,\, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}})\, p_{abs}(w_i \mid w_{i-n+2}^{i-1})$

Values of D used in research: 0.3, 0.5, 1

