Recognition of spoken and spelled proper names

Authors: Michael Meyer, Hermann Hild Reporter: CHEN, TZAN HWEI

Outline • Introduction • Experiments • Summary

Introduction • Recognizing increasingly large sets of spoken names is difficult: • Very large recognition vocabularies contain many easily confused words or even homophones. • This paper compares proper name recognition performance when a name is spoken only, spelled only, or both spoken and spelled.

Introduction (cont) • In what contexts do people speak and spell names? Table 1: Three scenarios for speaking and spelling a proper name

Experiments Speech data: • A database of about 2800 German last names spoken by 57 different speakers, recorded according to scenario 2. • Recorded with a close-talking microphone at a sampling rate of 16 kHz. • The boundaries between all spoken and spelled names were identified so that scenario 1 could also be evaluated.

Experiments (cont) • The pronunciation dictionary covers about half of the 2800 names in the speech data. • The resulting set of 1337 spoken and spelled names is used in all the experiments described below. • An MS-TDNN is used as a specialized letter recognizer. • The LVCSR of the JANUS system is used as the spoken name recognizer.

Experiments (cont) JANUS: • 60.0% names correct on the test set of the 1337 spoken last names. • When JANUS is used to recognize the spelled names, 93.3% of the names are correct. MS-TDNN: • 96.5% of the spelled names are correct on the test set.

Experiments (cont) Small list : • We assume that the list of names to be recognized is small enough, so that every name can be explicitly represented in the dictionary. • How can we combine the different information provided by the spoken and spelled names?

Experiments (cont) • After all, the pronunciations of the spelled letters approximate the sounds of the letters in the fluently spoken word. • “TOM” versus “T-O-M” • Exceptions are letters with unusual pronunciations and letter combinations that define their own pronunciation, such as: • “sch”, “ch” • In the following, the two representations are combined on the basis of their acoustic scores only.

Experiments (cont) Scenario 1

Experiments (cont) Scenario 2 • With 86.1% correct, recognition on the entire utterance is worse than on the spelled part alone. • It is possible to adopt a similar approach as in scenario 1. • The boundary of the first-best hypothesis is used for the weighting of all hypotheses. • This results in a recognition rate of 89.1%. • Incorporating the MS-TDNN letter recognizer results in a 95.8% recognition rate.

Experiments (cont) Figure 1: % names correct for a weighted combination of the N-best lists of spoken and spelled names (scenario 1 and 2)
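The weighted combination of N-best lists named in Figure 1 can be sketched as follows. This is only an illustration under assumptions that are not in the slides: the scores are treated as log-likelihoods, the combination is a linear interpolation with a weight between 0 and 1, and a name missing from one list receives a fixed floor score. The function name `combine_nbest` and the example scores are hypothetical.

```python
from typing import Dict


def combine_nbest(spoken: Dict[str, float],
                  spelled: Dict[str, float],
                  weight: float,
                  floor: float = -1e4) -> str:
    """Pick the name with the best weighted sum of two acoustic scores.

    spoken / spelled map each hypothesized name to a log-score
    (higher is better). weight = 1.0 trusts only the spoken name,
    weight = 0.0 trusts only the spelled name. The floor score for
    names absent from one list is an assumption of this sketch.
    """
    candidates = sorted(set(spoken) | set(spelled))

    def combined(name: str) -> float:
        return (weight * spoken.get(name, floor)
                + (1.0 - weight) * spelled.get(name, floor))

    return max(candidates, key=combined)


# Hypothetical N-best lists for one utterance.
spoken_nbest = {"Meier": -10.0, "Meyer": -10.5, "Maier": -11.0}
spelled_nbest = {"Meyer": -4.0, "Beyer": -5.0, "Meier": -6.0}

print(combine_nbest(spoken_nbest, spelled_nbest, weight=0.5))
```

Sweeping `weight` over a grid and counting correct names on held-out data is one way a curve like the one in Figure 1 could be produced.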

Experiments (cont) Large lists • If the number of names exceeds the recognizer’s maximum vocabulary size, a different approach has to be taken. • A two-step approach is employed. • A coarse recognition run produces a reduced list of name candidates. • These candidates are then processed in a second run, in which all the previously described techniques for small lists can be applied.

Experiments (cont) • In the case of scenario 1, the list of candidates can be easily reduced if only the spelled names are considered in the first pass. • For scenario 2, JANUS’s recognition vocabulary contains only phonemes and letters.

Experiments (cont) • For scenario 2 • A special language model is employed. • A list of the most similar names can be retrieved, and then used in another JANUS recognition run. • The letter segments are then re-recognized with the MS-TDNN
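The two-step idea for large lists (a coarse run to shortlist similar names, then a finer run on the shortlist) can be sketched as below. This is a stand-in and not the JANUS or MS-TDNN procedure: the coarse pass here uses string similarity from Python's `difflib` on a hypothesized letter string, and the second pass is an arbitrary scoring function supplied by the caller. The names `shortlist`, `two_pass`, and the example scores are hypothetical.

```python
import difflib
from typing import Callable, List


def shortlist(letter_hypothesis: str, names: List[str], size: int = 3) -> List[str]:
    """Coarse pass: keep the names whose spelling is closest to the hypothesis.

    cutoff=0.0 disables the similarity threshold, so exactly `size`
    names are returned when the list is long enough, best match first.
    """
    return difflib.get_close_matches(letter_hypothesis, names, n=size, cutoff=0.0)


def two_pass(letter_hypothesis: str,
             names: List[str],
             rescore: Callable[[str], float],
             size: int = 3) -> str:
    """Second pass: rescore only the shortlisted names and return the best."""
    candidates = shortlist(letter_hypothesis, names, size)
    return max(candidates, key=rescore)


# Hypothetical name list and a noisy letter hypothesis.
all_names = ["meyer", "meier", "maier", "hild", "schmidt", "schneider"]
print(shortlist("meyar", all_names))

# Hypothetical second-pass scores (higher is better); unknown names score low.
second_pass_scores = {"meyer": -2.0, "meier": -1.0}
print(two_pass("meyar", all_names, lambda n: second_pass_scores.get(n, -99.0)))
```

The shortlist size trades speed against the risk of dropping the correct name before the second pass sees it.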

Experiments (cont) Table 3: Summary of results for the separated and combined recognition of fluently spoken and spelled last names

Summary • By combining the N-best lists of both the spoken and spelled recognition, the overall performance can be improved. • Whether an input is L or FL can be distinguished with almost 99% accuracy, resulting in 95.5% names correct without knowing a priori whether L or FL was spoken.

Caller Identification from Spelled-Out Personal Data Over the Telephone Reporter : CHEN, TZAN HWEI

Outline • Introduction • The personal identification algorithm • Tests and results • Conclusions

Introduction • The problem of automatically identifying the caller in a telephone conversation from the information spoken in the call is extremely difficult. • The identification must take place despite rather substantial speech recognition errors that may be made by the machine.

Introduction (cont) • We can find a solution to the problem if we make two assumptions. • We assume that there is a database of records containing personal information about the caller, which can serve as a reference during the identification process. • We ask the caller to spell the personal identifying items so that the spoken vocabulary is small and we can look for correlations with the items in the database.

The personal identification algorithm Fig 1. The algorithm for personal identification from spelled tokens

The personal identification algorithm (cont) • A Bayesian computation starts with an estimate, for each record in the database, of the probability that the record represents the identity of the caller. • It uses the acquired information to update each record’s probability that it corresponds to the current caller’s identity. • The computation is incremental.

The personal identification algorithm (cont) • Bayesian update of probabilities
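A minimal sketch of such an incremental Bayesian update is given below. The slides do not show the likelihood model, so the one used here is an assumption: each spelled character is recognized correctly with a fixed probability `p_correct`, errors are spread uniformly over the other characters, and characters are independent. The function names, record layout, and example records are hypothetical.

```python
from typing import Dict


def token_likelihood(observed: str, truth: str,
                     p_correct: float = 0.8, alphabet_size: int = 26) -> float:
    """P(observed | truth) under an assumed independent per-character error model."""
    if len(observed) != len(truth):
        return 1e-9  # assumed small constant for length mismatches
    p_wrong = (1.0 - p_correct) / (alphabet_size - 1)
    likelihood = 1.0
    for o, t in zip(observed, truth):
        likelihood *= p_correct if o == t else p_wrong
    return likelihood


def bayes_update(probs: Dict[int, float],
                 records: Dict[int, Dict[str, str]],
                 field: str,
                 observed: str) -> Dict[int, float]:
    """One incremental step: posterior is proportional to prior times likelihood."""
    unnormalized = {rid: p * token_likelihood(observed, records[rid][field])
                    for rid, p in probs.items()}
    total = sum(unnormalized.values())
    return {rid: value / total for rid, value in unnormalized.items()}


# Hypothetical three-record database with a uniform prior.
records = {
    1: {"last": "meye", "city": "bath"},
    2: {"last": "meie", "city": "york"},
    3: {"last": "hild", "city": "bath"},
}
probs = {rid: 1.0 / len(records) for rid in records}

# Each recognized token updates the probabilities in turn.
probs = bayes_update(probs, records, "last", "meye")
probs = bayes_update(probs, records, "city", "bath")
print(max(probs, key=probs.get))
```

Because each step multiplies in one more likelihood and renormalizes, a single misrecognized field lowers the correct record's probability without ruling it out, which fits the stated goal of tolerating recognition errors.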

Tests and results • The system was tested using a database of one million records that was constructed from random combinations of 4,375 female first names, 1,129 male first names, and 88,799 last names. • The account numbers were generated so that the values for the last four digits of the number occurred with equal frequency throughout the database. • The city, postal code, and phone number fields were generated to correspond to locations in the U.K.

Tests and results (cont) • Our test involved identifying 300 different records in the database. • If the system was unable to make an identification of the target record after asking the user for all of the information, the caller was asked to make a second attempt using the same information. • If the system failed to produce a result after the second attempt, the call was terminated at that point.

Tests and results (cont) • For each telephone call, the users were asked eight questions: • Enter your ID, using your telephone keypad, followed by the pound key. If you make a mistake, press the star key. • You entered (the value entered in (1)). If this is correct, press ‘1’. If it is not, press ‘2’. • Please say the first four letters of your last name. • Please say the first four letters of your first name.

Tests and results (cont) • Please say the last four digits of your card number. • Could you please spell the city currently listed on your account? • Please say your phone number. • Please say the postal code currently listed on your account.

Tests and results (cont) Fig 2. Summary of results from 300 calls

Tests and results (cont) Fig 3. Rate of ASR character misrecognition by field

Tests and results (cont) Fig 4. Rate of misrecognition of field by ASR (misrecognition = at least one error made in spelled field value)

Tests and results (cont) Fig 5. Average cumulative number of records examined by system

Conclusions • The method tolerates high misrecognition rates. • The method can be used with off-the-shelf components; it does not require a specialized ASR. • Future work: allow the personal identification information to be spoken instead of spelled.

[Flowchart: record Rv that will be verified; request T from caller; collect subset near T; add subset to S; update the risk for each record in S; Rm <- min risk in S; test Rm == Rv and Risk(Rv) < Risk(reject); outcomes: Accept, Reject, select another T, or operator]
