The Hidden Vector State Language Model

Vidura Senevitratne, Steve Young

Cambridge University Engineering Department

- Young, S. J., “The Hidden Vector State language model”, Tech. Report CUED/F-INFENG/TR.467, Cambridge University Engineering Department, 2003.
- He, Y. and Young, S. J., “Hidden Vector State Model for hierarchical semantic parsing”, in Proc. ICASSP, Hong Kong, 2003.
- Fine, S., Singer, Y., and Tishby, N., “The Hierarchical Hidden Markov Model: Analysis and applications”, Machine Learning 32(1): 41-62, 1998.

- Introduction
- HVS Model
- Experiments
- Conclusion

- Language model:
- Issues: data sparseness, inability to capture long-distance dependencies, and inability to model nested structural information
- Class-based language model
- POS tag information

- Structured language model
- Syntactic information

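- An HHMM (hierarchical hidden Markov model) is a structured multi-level stochastic process.
- Each state is itself an HHMM
- Internal state: a hidden state that does not emit observable symbols directly
- Production state: a leaf state, which emits observable symbols

- The states of an ordinary HMM correspond to the production states of an HHMM.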

- Parameters of HHMM:

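- Transition probability: horizontal (between states at the same level)
- Initial probability: vertical (from a parent state to one of its children)
- Observation probability: emission from production states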

- Current node is the root:
- Choose a child according to the initial probability

- Child is a production state:
- Produce an observation
- Make a transition within the same level
- On reaching the end state, return control to the parent of the end state

- Child is an internal state:
- Choose one of its children
- Wait until control returns from the children
- Make a transition within the same level
- On reaching the end state, return control to the parent of the end state
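
The generation procedure above can be sketched as a short recursive sampler. This is a minimal illustration, not the formulation of Fine et al.: the toy model, its state names (`root`, `A`, `B`, `a1`, `a2`), the `END` marker, and all probabilities are hypothetical.

```python
import random

# Hypothetical toy HHMM; names and probabilities are illustrative only.
# Internal states carry "init" (vertical) and "trans" (horizontal) tables;
# production states carry an "emit" table. "END" is the end state of a level.
HHMM = {
    "root": {"init": {"A": 1.0},
             "trans": {"A": {"B": 0.5, "END": 0.5}, "B": {"END": 1.0}}},
    "A": {"init": {"a1": 1.0},
          "trans": {"a1": {"a2": 1.0}, "a2": {"END": 1.0}}},
    "B": {"emit": {"x": 0.5, "y": 0.5}},
    "a1": {"emit": {"p": 1.0}},
    "a2": {"emit": {"q": 1.0}},
}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} table."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(state, rng, out):
    """Emit from a production state, or run the sub-model of an internal state."""
    node = HHMM[state]
    if "emit" in node:                      # production state: produce an observation
        out.append(draw(node["emit"], rng))
        return
    child = draw(node["init"], rng)         # vertical move: choose a child
    while child != "END":
        generate(child, rng, out)           # wait until control comes back
        child = draw(node["trans"][child], rng)  # horizontal move, same level
    # END reached: returning here hands control back to the parent


def sample(seed=0):
    rng = random.Random(seed)
    out = []
    generate("root", rng, out)
    return out
```

In this toy model every sample begins with `p`, `q` (emitted inside internal state `A`) and may be followed by one symbol from production state `B`.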

- Another application: modeling stock trends (IDEAL 2004)

The semantic information relating to any single word can be stored as a vector of semantic tag names.

- If state transitions were unconstrained:
- The model would be equivalent to a full HHMM

- Transitions between states can be factored into a two-stage stack shift: pop, then push
- The stack size is limited, and the number of new concepts pushed per transition is limited to one
- This makes the model more efficient
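
The constrained stack shift can be sketched in a few lines. This is an illustrative helper only: the function name `stack_shift`, the `max_depth` default, and the convention that index 0 is the top of the stack are assumptions, not taken from the slides.

```python
def stack_shift(stack, n_pop, new_label, max_depth=4):
    """Pop n_pop labels from the top of the stack, then push exactly one label.

    The stack is a list whose index 0 is the most recently pushed label.
    max_depth is an assumed limit on the stack size.
    """
    if not 0 <= n_pop <= len(stack):
        raise ValueError("cannot pop more labels than the stack holds")
    remaining = stack[n_pop:]              # stage 1: pop
    new_stack = [new_label] + remaining    # stage 2: push a single new concept
    if len(new_stack) > max_depth:
        raise ValueError("stack depth limit exceeded")
    return new_stack


# Moving from the CITY concept to a sibling concept under TOLOC:
print(stack_shift(["CITY", "TOLOC", "FLIGHT"], 1, "TIME_RELATIVE"))
# Descending one level without popping anything:
print(stack_shift(["FLIGHT"], 0, "TOLOC"))
```

Because each transition is fully described by the pair (number popped, label pushed), the model only needs distributions over those two choices rather than over arbitrary state-to-state transitions.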

- The joint probability is defined:

- Approximation (assumption):
- So,
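
The equations for these three bullets are not present in the text. A plausible form, following the cited He and Young paper and the three-step generative process of the constrained model, is sketched below. Here $N$, $C$ and $W$ denote the sequences of pop counts, vector states and words, $c_t[1]$ is the top (preterminal) element of the stack at position $t$, and $D_t$ is the stack depth; treat the exact conditioning as an assumption rather than a quotation from the slides.

```latex
% Joint probability (exact chain-rule decomposition)
P(N, C, W) = \prod_{t=1}^{T}
    P(n_t \mid W_1^{t-1}, C_1^{t-1}) \,
    P(c_t[1] \mid W_1^{t-1}, C_1^{t-1}, n_t) \,
    P(w_t \mid W_1^{t-1}, C_1^{t})

% Approximation (assumption)
P(n_t \mid W_1^{t-1}, C_1^{t-1}) \approx P(n_t \mid c_{t-1})
P(c_t[1] \mid W_1^{t-1}, C_1^{t-1}, n_t) \approx P(c_t[1] \mid c_t[2 \ldots D_t])
P(w_t \mid W_1^{t-1}, C_1^{t}) \approx P(w_t \mid c_t)

% So,
P(N, C, W) \approx \prod_{t=1}^{T}
    P(n_t \mid c_{t-1}) \,
    P(c_t[1] \mid c_t[2 \ldots D_t]) \,
    P(w_t \mid c_t)
```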

- The generative process associated with this constrained version of the HVS model consists of three steps for each word position t:
1. Choose a value for n_t (the number of stack elements to pop)

2. Select a preterminal concept tag c_t[1] (the new label to push)

3. Select a word w_t
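
The three steps can be sketched as a toy sampler. Everything concrete here is hypothetical: the `ROOT` symbol, the concept set, the word lists, the depth limit, and the uniform distributions stand in for parameters that a real model would estimate from data.

```python
import random

MAX_DEPTH = 4                                    # assumed stack depth limit
CONCEPTS = ["FLIGHT", "TOLOC", "CITY", "DUMMY"]  # toy concept set
WORDS = {                                        # toy word lists per top label
    "FLIGHT": ["flights"],
    "TOLOC": ["to", "arriving"],
    "CITY": ["Boston", "Denver"],
    "DUMMY": ["show", "me"],
}


def uniform(items):
    items = list(items)
    return {x: 1.0 / len(items) for x in items}


def p_pop(prev_stack):
    """Toy P(n_t | previous stack): never pops ROOT, leaves room for one push."""
    n_min = max(0, len(prev_stack) + 1 - MAX_DEPTH)
    return uniform(range(n_min, len(prev_stack)))


def p_push(remaining):
    """Toy P(c_t[1] | labels left after the pop); uniform, ignores its argument."""
    return uniform(CONCEPTS)


def p_word(stack):
    """Toy P(w_t | current stack); here it depends only on the top label."""
    return uniform(WORDS[stack[0]])


def draw(dist, rng):
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def sample_sequence(length, seed=0):
    rng = random.Random(seed)
    stack = ["ROOT"]                  # index 0 is the top of the stack
    out = []
    for _ in range(length):
        n = draw(p_pop(stack), rng)                          # step 1: choose n_t
        remaining = stack[n:]
        stack = [draw(p_push(remaining), rng)] + remaining   # step 2: push c_t[1]
        out.append((draw(p_word(stack), rng), tuple(stack)))  # step 3: emit w_t
    return out


for word, state in sample_sequence(5):
    print(word, state)
```

The three functions `p_pop`, `p_push` and `p_word` mark where the conditioning happens; replacing the uniform tables with estimated ones would not change the control flow.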

- It is reasonable to ask an application designer to provide examples of utterances which would yield each type of semantic schema.
- It is not reasonable to require utterances with manually transcribed parse trees.
- Assume only that abstract semantic annotations and a set of domain-specific lexical classes are available.

Abstract semantic annotations:

- Show me flights arriving in X at T.
- List flights arriving around T in X.
- Which flight reaches X before T?
= FLIGHT(TOLOC(CITY(X),TIME_RELATIVE(TIME(T))))

Class set:

CITY: Boston, New York, Denver…
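
To make the annotation format concrete, the sketch below expands a bracketed annotation into one vector of tag names per class placeholder, read from the preterminal tag up to the root. The function name `leaf_stacks` and the choice of a plain dict as output are illustrative assumptions; the slides do not describe how annotations are processed.

```python
def leaf_stacks(annotation):
    """Map each leaf placeholder to its list of enclosing tags, innermost first."""
    stacks = {}
    path = []       # tags currently open, root first
    token = ""
    for ch in annotation:
        if ch == "(":
            path.append(token.strip())   # the token just read is a concept tag
            token = ""
        elif ch in ",)":
            if token.strip():            # a bare token before , or ) is a leaf
                stacks[token.strip()] = list(reversed(path))
            token = ""
            if ch == ")":
                path.pop()               # close the innermost open tag
        else:
            token += ch
    return stacks


annotation = "FLIGHT(TOLOC(CITY(X),TIME_RELATIVE(TIME(T))))"
for leaf, stack in leaf_stacks(annotation).items():
    print(leaf, stack)
# X ['CITY', 'TOLOC', 'FLIGHT']
# T ['TIME', 'TIME_RELATIVE', 'TOLOC', 'FLIGHT']
```

Each resulting list is the kind of tag vector that a word filling that slot (for example a member of the CITY class) would carry.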

Experimental Setup

Training set: ATIS-2, ATIS-3

Test set: ATIS-3 NOV93, DEC94

Baseline: FST (finite-state tagger)

Smoothing: Good-Turing (GT) for FST, Witten-Bell for HVS

Show me flights from Boston to New York

Goal: FLIGHT

Slots: FROMLOC.CITY = Boston

TOLOC.CITY = New York

Figure legend: dashed line = goal detection accuracy; solid line = F-measure
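
As a reminder of what the F-measure summarizes, the sketch below scores a set of hypothesized slot/value pairs against a reference set. It assumes the balanced F-measure (F1) over exact slot/value matches; the slides do not state which variant or matching rule was used, and the function name is illustrative.

```python
def slot_f_measure(reference, hypothesis):
    """Balanced F-measure over exact (slot, value) matches."""
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = [("FROMLOC.CITY", "Boston"), ("TOLOC.CITY", "New York")]
hypothesis = [("FROMLOC.CITY", "Boston"), ("TOLOC.CITY", "Denver")]
print(slot_f_measure(reference, hypothesis))  # 0.5: one of two slots is right
```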

- Key features of the HVS model:
- Its ability to represent hierarchical information in a constrained way
- Its ability to be trained directly from target semantics without explicit word-level annotation.

- The basic HVS model is a regular HMM in which each state encodes history in a fixed-dimension stack-like structure.
- Each state consists of a stack in which each element is a label chosen from a finite set of cardinality M+1: C = {c_1, …, c_M, c_#}
- A depth-D HVS model state can be characterized by a vector of dimension D, with the most recently pushed element at index 1 and the oldest at index D

- Each HVS model state transition is restricted:
(i) exactly n_t class labels are popped off the stack

(ii) exactly one new class label c_t is pushed onto the stack

- The number of elements to pop, n_t, and the choice of the new class label to push, c_t, are determined probabilistically:

- n_t is conditioned on all the class labels on the stack at t-1, whereas c_t is conditioned only on the class labels that remain on the stack after the pop operation
- The former distribution can encode embedding, whereas the latter focuses on modeling long-range dependencies.

- Joint probability:
- Assumption:

- Training: EM algorithm
- C, N: latent data; W: observed data

- E-step:

- M-Step:
- Q function (auxiliary):
- Substituting P(W,C,N|λ)

- Calculate probability distributions separately.
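
The equations for the E-step and M-step are not present in the text. The standard EM auxiliary function is sketched below, with $\lambda'$ the current parameters and $\lambda$ the re-estimated ones. The factorization substituted for $P(W, C, N \mid \lambda)$ is an assumption based on the conditioning described for $n_t$ and $c_t$; in particular the word term $P(w_t \mid c_t)$ is assumed, not quoted from the slides.

```latex
% E-step: posterior over the latent sequences under the current parameters
P(C, N \mid W, \lambda')

% M-step: maximize the auxiliary function
Q(\lambda \mid \lambda') = \sum_{C, N} P(C, N \mid W, \lambda') \, \log P(W, C, N \mid \lambda)

% Substituting the factored form of P(W, C, N | \lambda):
Q(\lambda \mid \lambda') = \sum_{C, N} P(C, N \mid W, \lambda')
    \sum_{t=1}^{T} \Big[ \log P(n_t \mid c_{t-1})
                       + \log P(c_t[1] \mid c_t[2 \ldots D])
                       + \log P(w_t \mid c_t) \Big]
```

Because the logarithm turns the product into a sum of three terms, each involving a different distribution, each distribution can be re-estimated separately.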

- State space S, if fully populated:
- |S| = M^D states, for M = 100+, D = 3 to 4
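
A quick calculation shows why a fully populated state space is impractical. The sketch uses M = 100 as a lower bound for "100+"; the function name is illustrative.

```python
def state_space_size(num_labels, depth):
    """Number of distinct depth-`depth` stacks over `num_labels` labels: M^D."""
    return num_labels ** depth


for depth in (3, 4):
    print(depth, state_space_size(100, depth))
# 3 1000000
# 4 100000000
```

With a training set of a few hundred thousand words, most of these states can never be observed.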

- Due to data sparseness, backoff is needed.

- Backoff weight:
- Modified version of absolute discounting
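
For orientation, the sketch below shows standard interpolated absolute discounting for a word bigram model backing off to unigrams. It is not the modified version used for the HVS model, whose details are not given here; the discount value, the function names, and the use of word bigrams rather than stack states are all assumptions made for illustration.

```python
from collections import Counter


def absolute_discount_bigram(tokens, discount=0.5):
    """Return prob(w, h) using interpolated absolute discounting."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])               # how often h is followed by a word
    followers = Counter(h for (h, _) in bigrams)  # distinct words seen after h
    total = len(tokens)

    def prob(w, h):
        p_uni = unigrams[w] / total
        if history[h] == 0:                      # unseen history: back off fully
            return p_uni
        discounted = max(bigrams[(h, w)] - discount, 0.0) / history[h]
        backoff_weight = discount * followers[h] / history[h]
        return discounted + backoff_weight * p_uni

    return prob


tokens = "show me flights show me fares".split()
prob = absolute_discount_bigram(tokens)
print(prob("me", "show"))      # seen bigram: mostly its discounted count
print(prob("fares", "show"))   # unseen bigram: only the backoff mass
```

The backoff weight is exactly the probability mass removed by discounting, so the distribution over the vocabulary still sums to one for every seen history.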

- Training set:
- ATIS-3: 276K words, 23K sentences.

- Development set:
- ATIS-3 Nov93

- Test set:
- ATIS-3 Dec94: 10K words, 1K sentences.

- OOV (out-of-vocabulary) words were removed
- k=850

- The HVS language model is able to make better use of context than standard class n-gram models.
- The HVS model is trainable using EM.