Digital Speech Processing
This presentation is the property of its rightful owner.
Sponsored Links
1 / 27

Digital Speech Processing 數位語音處理 李琳山 PowerPoint PPT Presentation


  • 121 Views
  • Uploaded on
  • Presentation posted in: General

Digital Speech Processing 數位語音處理 李琳山. Speech Signal Processing. ^. x[n]. Major Application Areas Speech Coding:Digitization and Compression Considerations : 1) bit rate (bps) 2) recovered quality 3) computation complexity/feasibility

Download Presentation

Digital Speech Processing 數位語音處理 李琳山

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Digital speech processing

  • Digital Speech Processing

  • 數位語音處理

  • 李琳山


Digital speech processing

Speech Signal Processing

^

x[n]

  • Major Application Areas

    • Speech Coding:Digitization and Compression

    • Considerations : 1) bit rate (bps)

    • 2) recovered quality

    • 3) computation

    • complexity/feasibility

    • Voice-based Network Access —

    • User Interface, Content Analysis, User-content Interaction

x(t)

x[n]

Processing Algorithms

LPF

output

  • Speech Signals

    • Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.

    • Double Levels of Information: Acoustic Signal Level/Symbolic or Linguistic Level

    • Processing and Interaction of the Double-level Information

x[n]

xk

110101…

Inverse Processing

Processing

Storage/transmission


Speech signal processing processing of double level information

今 天 的

天 氣 非

常 好

今天的 天氣

非常 好

今天 的

Speech Signal Processing – Processing of Double-Level Information

  • Speech Signal

  • Sampling

  • Processing

Algorithm

Chips or Computers

  • Linguistic Structure

  • Linguistic Knowledge

Lexicon

Grammar


Digital speech processing

Voice-based Network Access

Internet

User Interface

Content Analysis

User-Content Interaction

  • User Interface

    —when keyboards/mice inadequate

  • Content Analysis

    — help in browsing/retrieval of multimedia content

  • User-Content Interaction

    —all text-based interaction can be accomplished by spoken language


Digital speech processing

User Interface —Wireless Communications Technologies are Creating a Whole Variety of User Terminals

Text Content

Internet Networks

Multimedia Content

  • at Any Time, from Anywhere

  • Handsets, Hand-held Devices, PDA’s, Personal Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…

  • Small in Size, Light in Weight, Ubiquitous, Invisible…

  • Evolving towards a “Post-PC Era”

  • Keyboard/Mouse Most Convenient for PC’s not Convenient any longer

    — human fingers never shrink, and application environment is changed

  • Service Requirements Growing Exponentially

  • Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere


Digital speech processing

Content Analysis—Multimedia Technologies are Creating a New World of Multimedia Content

Future Integrated Networks

  • Real–time

  • Information

  • weather, traffic

  • flight schedule

  • stock price

  • sports scores

  • Private Services

  • personal notebook

  • business databases

  • home appliances

  • network

  • entertainments

  • Intelligent Working

  • Environment

  • e–mail processors

  • intelligent agents

  • teleconferencing

  • distant learning

  • Knowledge

  • Archieves

  • digital libraries

  • virtual museums

  • Electronic

  • Commerce

  • virtual banking

  • on–line transactions

  • on–line investments

  • Most Attractive Form of the Network Content will be in Multimedia, which usually Includes Speech Information (but Probably not Text)

  • Multimedia Content Difficult to be Summarized and Shown on the Screen, thus Difficult to Browse

  • The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval

  • Multimedia Content Analysis based on Speech Information


Digital speech processing

User-Content Interaction— Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing

text

information

Multimedia Content

Text-to-Speech Synthesis

Text Content

voice information

Spoken and multi-modal Dialogue

Voice-based Information Retrieval

Internet

voice input/ output

MultimediaContent Analysis

Text Information Retrieval

  • Network Access is Primarily Text-based today, but almost all Roles of Texts can be Accomplished by Speech

  • User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues

  • Many Hand-held Devices with Multimedia Functionalities Commercially Available Today

  • Using Speech Instructions to Access Multimedia Content whose Key Concepts Specified by Speech Information


Digital speech processing

Voice Instructions

Text Instructions

我想找有關紐約受到恐怖攻擊的新聞?

Text Information

d1

Voice Information

d2

d1

d3

d2

美國總統布希今天早上…

d3

Voice-based Information Retrieval

  • Speech may become a New Data Type

  • Both the User Instructions and Network Content Can be in form of Speech


Digital speech processing

Spoken and Multi-modal Dialogues

  • Almost All User-Content Interaction can be Accomplished by Spoken or Multi-modal Dialogues

  • An Example of Client-Server Computing Environment

Internet

Sentence Generation

and Speech Synthesis

Users

Output Speech

Response to the user

Databases

Discourse Context

Dialogue

Manager

Wireless Networks

User’s Intention

Dialogue Server

Input Speech

Speech Recognition and Understanding


Digital speech processing

Convergence of PSTN and Internet

handsets

Internet

PSTN

PCs

servers

telephones

  • PSTN (for Voice) and Internet (for Data and Multi-media Contents) are Converging

  • Driving Force for the Convergence

    • “anywhere, any time” of wireless services

    • voice provides the most convenient and natural interaction interface

    • attractive contents over the Internet

    • contents (human information) are why the Internet is attractive, while voice directly carries human information

    • Speech-enabled Access of Web-based Applications


Digital speech processing

Wireless Access of Global Information

  • As Handset Size Shrinks While Required Functionalities Grows and the User Environment Changes, Voice Interface will be Useful for all Different User Terminals

  • As More Network Content becomes Multi-media, Content Analysis based on Speech Information will be Essential

  • Integration of Many Different Technologies

    • information processing, networking, transmission, internet, wireless, speech processing

  • Speech Processing is the only Major Missing Link in the Semi-mature Technology Chain


Digital speech processing

  • Multi-media Technologies

satellites

Networks

fiber

C

radio

Global

Knowledge,

Information

and Services

servers

cable

... 0110...

...1101...

  • Communications and Networking Technologies

  • Information Processing Technologies

Future World of Communications and Computing

  • Wireless Technologies

  • Speech Processing Technologies


Outline

Both Theoretical Issues and Practical Problems will be Discussed

Starting with Fundamentals, but Entering Research Topics Gradually

Part I: Fundamental Topics

1.0 Introduction to Digital Speech Processing

2.0 Fundamentals of Speech Recognition

3.0 Map of Subject Areas

4.0 More about Hidden Markov Models

5.0 Acoustic Modeling

6.0 Language Modeling

7.0 Speech Signals and Front-end Processing

8.0 Search Algorithms for Speech Recognition

Part II: Advanced Topics

9.0 Speaker Variabilities: Adaption and Recognition

10.0 Latent Semantic Analysis for Linguistic Processing

11.0 Spoken Document Understanding and Organization

12.0 Voice-based Information Retrieval

13.0 Robustness for Acoustic Environment

14.0 Some Fundamental Problem-solving Approaches

15.0 Utterance Verification and Keyword/Key Phrase Spotting

16.0 Spoken Dialogues

17.0 Distributed Speech Recognition and Wireless Environment

18.0 Some Recent Developments in NTU

19.0 Conclusion

Outline


Outline1

教科書:無

主要參考書:

X. Huang, A. Acero, H. Hon, “Spoken Language Processing”, Prentice Hall, 2001,松瑞

F. Jelinek, “Statistical Methods for Speech Recognition”, MIT Press, 1999

L. Rabiner, B.H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993, 民全

C. Becchetti, L. Prina Ricotti, “Speech Recognition- Theory and C++ implementation”, Johy Wiley and Sons, 1999, 民全

其他參考文獻課堂上提供

教材:

available on web before the day of class (http://speech.ee.ntu.edu.tw)

適合年級:三、四(電機系、資工系)

課程目的:提供同學進入此一充滿機會與挑戰的新領域所需的基本知識,體驗數學模型與軟體程式如何相輔相成,學習進入一個新領域由基礎進入研究的歷程,體會吸收非結構性知識(Unstructured Knowledge)的經驗

成績評量方式

Midterm Exam 25%

Homeworks (I) (II) (Ⅲ) 15%、5%、15%

Final Exam 10%

Term Project 30%

Outline


Digital speech processing

1.0 Introduction — A Brief Summary of Core Technologies and Current Status

References for 1.0

1.“Speech and Language Processing over the Web”, IEEE Signal Processing Magazine, May 2008

2 .“Voice Access of Global Information for Broadband Wireless: Technologies of Today and Challenges of Tomorrow”, Proceedings of IEEE, Jan 2001

3. “Conversational Interfaces: Advances and Challenges” , Proceedings of the IEEE, Aug 2000


Digital speech processing

W

X

x(t)

Feature Extraction

Pattern Matching

Decision Making

unknown speech signal

output word

feature vector sequence

y(t)

Y

Reference Patterns

Feature Extraction

training speech

Speech Recognition as a pattern recognition problem


Digital speech processing

Input Speech

Feature

Vectors

Linguistic Decoding

and

Search Algorithm

Output

Sentence

Front-end

Signal Processing

Language

Model

Acoustic

Model

Training

Speech

Corpora

Acoustic

Models

Language

Model

Construction

Text

Corpora

Lexical

Knowledge-base

Grammar

Lexicon

Basic Approach for Large Vocabulary Speech Recognition

  • A Simplified Block Diagram

  • Example Input Sentence

  • this is speech

  • Acoustic Models

  • (th-ih-s-ih-z-s-p-ih-ch)

  • Lexicon (th-ih-s) → this

  • (ih-z) → is

  • (s-p-iy-ch) → speech

  • Language Model(this) – (is) – (speech)

  • P(this) P(is | this) P(speech | this is)

  • P(wi|wi-1) bi-gramlanguage model

  • P(wi|wi-1,wi-2) tri-gram language model,etc


Digital speech processing

Speech Recognition Technologies, Applications and Problems

  • Word Recognition

    • voice command/instructions

  • Keyword Spotting

    • identifying the keywords out of a pre-defined keyword set from input voice utterances

  • Large Vocabulary Continuous Speech Recognition

    • entering longer texts

    • remote dictation/automatic transcription

  • Speaker Dependent/Independent/Adaptive

  • Acoustic Reception/Background Noise/Channel Distortion

  • Read/Spontaneous/Conversational Speech


Digital speech processing

Lexicon and Rules

Prosodic Model

Voice Unit Database

Output Speech Signal

Input Text

Text Analysis and Letter-to-sound Conversion

Prosody Generation

Signal

Processing

and

Concatenation

Text-to-speech Synthesis

  • Transforming any input text into corresponding speech signals

  • E-mail/Web page reading

  • Prosodic modeling

  • Basic voice units/rule-based, non-uniform units/corpus-based


Digital speech processing

  • Understanding Speaker’s Intention rather than Transcribing into Word Strings

  • Limited Domains/Finite Tasks

  • Grammatical Approaches (e.g. partial parsing)/Statistical Approaches (e.g. corpus-based by training)

  • Semantic Concepts/Key Phrases

acoustic models

phrase lexicon

concept set

phrase/concept language model

input utterance

understanding results

phrase graph

syllable lattice

Syllable Recognition

Key Phrase Matching

Semantic Decoding

concept graph

Prob (Ci | Ci-1, Ci-2)

Prob (phj | Ci)

  • An Example

  • utterance:請幫我查一下台灣銀行 的 電話號碼 是幾號?

  • key phrases: (查一下) - ( 台灣銀行) - (電話號碼)

  • concept: (inquiry) - (target) - (phone number)

Speech Understanding


Digital speech processing

Speaker Verification

  • Verifying the speaker as claimed

  • Applications requiring verification

  • Text dependent/independent

  • Integrated with other verification schemes

input speech

Feature Extraction

Verification

yes/no

Speaker Models


Digital speech processing

speech instruction

text instruction

我想找有關新政府組成的新聞?

text documents

d1

speech documents

d2

d1

d3

d2

總統當選人陳水扁今天早上…

d3

Voice-based Information Retrieval

  • Speech Instructions

  • Speech Documents (or Multi-media Documents including Speech Information)

  • Indexing Features/Relevance Evaluation

  • Recall/Precision Rates


Digital speech processing

Internet

Sentence Generation

and Speech Synthesis

Users

Output Speech

Response to the user

Databases

Discourse Context

Dialogue

Manager

Networks

User’s Intention

Dialogue Server

Input Speech

Speech Recognition and Understanding

Spoken Dialogue Systems

  • Almost all human-network interactions can be made by spoken dialogue

  • Speech understanding, speech synthesis, dialogue management

  • System/user/mixed initiatives

  • Reliability/efficiency, dialogue modeling/flow control

  • Transaction success rate/average number of dialogue turns


Spoken document understanding and organization

Unlike the Written Documents which are Better Structured and Easier to Index and Browse, Spoken Documents are just Audio Signals, or a Sequence of Words if Transcribed

— the user can’t listen to (or read carefully) each one from the beginning to the

end during browsing

— better approaches for understanding/organization of spoken documents becomes

necessary

Spoken Document Segmentation

— automatically segmenting a spoken document into short paragraphs, each with

a central topic

Spoken Document Summarization

— automatically generating a summary (in text or speech form) for each short

paragraph

Title Generation for Spoken Documents

— automatically generating a title (in text or speech form) for each short paragraph

Semantic Structuring of Spoken Documents

— construction of semantic structure of spoken documents into graphical hierarchies

Spoken Document Understanding and Organization


Multi lingual functionalities

Code-Switching Problem

English words/phrases inserted in spoken Chinese sentences as an example

人人都用Computers,家家都上Internet

the whole sentence switched from Chinese to English as an example

準備好了嗎?Let’s go!

Cross-language Network Information Processing

globalized network with multi-lingual content/users

cross-language network information processing with a certain input language

Dialects/Accents

hundreds of Chinese dialects as an example

code-switching problem─ Chinese dialects mixed with Mandarin (or plus English) as an example

Mandarin with a variety of strong accents as an example

Global/Local Languages

Language Dependent/Independent Technologies

Shared Acoustic Units/Integrated Linguistic Structures

Multi-lingual Functionalities


Digital speech processing

Lexicon

Network

Server

Clients

Distributed Speech Recognition (DSR) and Wireless Environment

  • An Example Partition of Speech Recognition Processes into Client/Sever

Client

Input Speech

Feature

Vectors

Linguistic Decoding

and

Search Algorithm

Output

Sentence

Front-end

Signal Processing

Language

Model

Acoustic

Model

Training

Speech

Corpora

Acoustic

Models

Language

Model

Construction

Text

Corpora

Lexical

Knowledge-base

Grammar

Server

  • encoded feature parameters transmitted in packets

  • Client/Server Structure


  • Distributed speech recognition dsr and wireless environment

    Wireless Environment

    examples: Personal Area Networks (Bluetooth, etc.), Wireless LAN (IEEE 802.11), Cellular (GSM, GPRS, 3G), etc.

    Link Level

    time-varying fading and noise characteristics

    time-varying signal level and signal-to-noise ratios

    bursty errors with much higher error rates

    much smaller and dynamic bandwidth, much lower

    and changing bit rates

    Transport Level

    TCP/IP: errors retransmission delay

    UDP/IP: errors real-time/no delay packet loss

    packets out of sequence

    Distributed Speech Recognition (DSR) and Wireless Environment


  • Login