Speech User Interface

Pervasive Information Access

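Motivation
• As devices become smaller, their input and output methods become correspondingly constrained.
  • Input: physical keyboards are limited in size; virtual keyboards have the same problem and also lack tactile feedback.
  • Output: screen size is limited (the largest phone screen on the market at the time of writing is the Samsung Note at 5.3 inches).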

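Application Examples
• Telephone voice systems (customer service hotlines)
• Text input
• In-car voice navigation
• Voice search
• Dialogue systems
• Voice memos
• Interfaces for visually impaired users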

Application Example: Voice Search
• For example: Google Voice Search

Application Example: Text Input
• Dragon Dictation http://itunes.apple.com/us/app/dragon-dictation/id341446764?mt=8

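Application Example: Dialogue Systems
• Siri: a virtual personal assistant based on speech recognition, released by Apple in October 2011 (official Apple video)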

Application Example: Voice Memos
• reQall

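Advantages of Speech Interfaces
• Input speed: a typical person can speak at up to 100 words per minute (provided recognition is accurate)
• The number of possible commands is virtually unlimited
• The rest of the body stays free for other tasks: a driver can chat with passengers or listen to music while driving
• Natural: speech is the primary way people communicate with each other (a result of evolution)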

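Limitations of Speech Interfaces
• Speech recognition is still imperfect
  • When the error rate exceeds 5%, the time spent detecting and correcting errors may be longer than typing on a keyboard
  • Recognition accuracy is easily degraded by noise
• Speech interfaces have no visible state
• Speech interfaces are hard to learn
  • How does the user know which commands to give?
  • How does the user know what the interface covers?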

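Complete Spoken Dialogue System Architecture
[Diagram] Input path: signal -> Automatic Speech Recognition -> words -> Natural Language Understanding -> logical form -> Dialogue Management (with Planning)
Output path: Dialogue Management -> Natural Language Generation -> words -> Text-to-speech -> signal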
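The component chain in the architecture slide above can be sketched as a sequence of function calls. This is only a toy illustration: every function name, the "weather" intent, and the hard-coded answer are assumptions made for the example, and the ASR and TTS stages are placeholders that pass text around instead of audio.

```python
def recognize(signal: str) -> list[str]:
    """ASR placeholder: the 'signal' is already text, standing in for audio."""
    return signal.lower().strip("?!. ").split()


def understand(words: list[str]) -> dict:
    """NLU: map the recognized words to a tiny logical form (an intent)."""
    if "weather" in words:
        return {"intent": "get_weather"}
    return {"intent": "unknown"}


def manage_dialogue(logical_form: dict) -> dict:
    """Dialogue management and planning: decide what the system should say."""
    if logical_form["intent"] == "get_weather":
        return {"act": "inform", "weather": "sunny"}  # made-up, hard-coded answer
    return {"act": "ask_rephrase"}


def generate(response: dict) -> str:
    """NLG: turn the chosen dialogue act into words."""
    if response["act"] == "inform":
        return f"The weather is {response['weather']}."
    return "Sorry, could you rephrase that?"


def synthesize(text: str) -> str:
    """TTS placeholder: returns a tagged string instead of a waveform."""
    return f"<audio: {text}>"


def run_turn(signal: str) -> str:
    """One user turn: signal -> words -> logical form -> words -> signal."""
    return synthesize(generate(manage_dialogue(understand(recognize(signal)))))
```

Calling `run_turn("What is the weather?")` walks one utterance through all five stages; a real system would replace each placeholder with an actual recognizer, parser, planner, and synthesizer.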

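Main Components
• Speech recognition
  • The computer must recognize (understand) the user's speech input
• Speech synthesis (text-to-speech, TTS)
  • The computer must be able to convert text into speech to communicate with the user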

Types of Speech Recognition
• Continuous vs. non-continuous speech
• Speaker-dependent vs. speaker-independent
• Spontaneous vs. read speech
• Keyword spotting vs. continuous recognition of spoken words
• Small vs. large vocabulary

Speech Recognition Technology
• Hidden Markov Model (HMM)
• Reference paper: L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, 1989
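To give a feel for how an HMM is used in decoding, here is a minimal sketch of the Viterbi algorithm, which finds the most likely hidden state sequence for a sequence of observations. The two states "A" and "B", the symbols "x" and "y", and all probabilities are invented toy values, not taken from the slides or from any real acoustic model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the given observations."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model (all numbers are assumptions for illustration only).
STATES = ["A", "B"]
START_P = {"A": 0.5, "B": 0.5}
TRANS_P = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.3, "B": 0.7}}
EMIT_P = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

decoded = viterbi(["x", "x", "y", "y"], STATES, START_P, TRANS_P, EMIT_P)
```

In a recognizer the hidden states would correspond to sub-word units and the observations to acoustic feature vectors; practical implementations also work with log probabilities to avoid numerical underflow on long inputs.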

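Evaluating Speech Recognition Systems
• The performance of a speech recognition system is evaluated with the word error rate (WER):

  ErrorRate = 100 * (Subs + Ins + Dels) / Nwords

  REF:  I  WANT  TO   GO  HOME  ***
  REC:  *  WANT  TWO  GO  HOME  NOW
  SC:   D  C     S    C   C     I

  (C = correct, S = substitution, I = insertion, D = deletion)

  100 * (1S + 1I + 1D) / 5 = 60%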
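The WER calculation above can be reproduced with a word-level edit distance. The sketch below is one common way to compute it; the function name `word_error_rate` is chosen for this example, and it returns only the total error count rather than the separate substitution, insertion, and deletion counts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: minimum word edits divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: minimum edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)


wer = word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW")
```

For the slide's example the minimum alignment has one deletion, one substitution, and one insertion against five reference words, so `wer` is 60.0. Note that WER can exceed 100% when the hypothesis contains many insertions.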

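Technical Challenges in Speech Recognition
• How can recognition accuracy be improved?
• How can noise interference be overcome?
• How should filler words, pauses, and interjections be handled?
• How can recognition be made faster?
  • Speed is no longer much of a problem on desktop and laptop computers, but there is still room for improvement on smartphones; the usual approach is to upload the speech to a server for processing and recognition.
• Segmentation (silly versus sill lea)
• Homophones (mail vs. male)
• Moving from speech recognition to semantic recognition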

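Speech Synthesis
• Also known as text-to-speech (TTS)
• The input text must first be analyzed (for example, word segmentation for Chinese) to determine the corresponding pronunciation and tones; the result is then passed to the waveform synthesis unit to generate speech.
• In general, waveform synthesis concatenates many pre-recorded speech segments stored in a database. Systems differ in the size of the speech units they store; storing phones and diphones requires the system to provide a large amount of storage space.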

Worked Example (NTHU MIR Lab)

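Chinese TTS Online Demos
• NTHU MIR Lab (National Tsing Hua University MIR Lab)
• NTU CSIE (National Taiwan University)
• GUTTS (National Taiwan University of Science and Technology)
• ITRI Information and Communications Research Laboratories
• iFLYTEK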

English TTS Online Demos
• AT&T Natural Voices
• Sample text: "Good evening, class. Today we are going to discuss an important type of human-computer interface: speech UI, also known as voice UI. We will demonstrate a TTS engine developed by AT&T, which, in my opinion, is the best TTS so far."

Speech Synthesis Technology
Raw text in
-> Text Analysis: Text Normalization, Part-of-Speech tagging, Homonym Disambiguation
-> Phonetic Analysis: Dictionary Lookup, Grapheme-to-Phoneme (LTS)
-> Prosodic Analysis: Boundary placement, Pitch accent assignment, Duration computation
-> Waveform synthesis
-> Speech out
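As a small illustration of the first stage of this pipeline, text normalization, the sketch below expands a few abbreviations and reads numbers digit by digit. The abbreviation table and the digit-by-digit reading are simplifying assumptions for this example; a real front end would use context to decide, for instance, whether "St." means "Street" or "Saint" and whether "42" is read as "forty-two".

```python
# Hypothetical lookup tables, kept tiny on purpose.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "Mr.": "Mister"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize_text(text: str) -> str:
    """Rewrite abbreviations and digit strings as speakable words."""
    spoken = []
    for token in text.split():
        if token in ABBREVIATIONS:
            spoken.append(ABBREVIATIONS[token])
        elif token.isdecimal():
            # Read each digit separately, e.g. "42" -> "four two".
            spoken.extend(DIGIT_NAMES[int(digit)] for digit in token)
        else:
            spoken.append(token)
    return " ".join(spoken)


normalized = normalize_text("Dr. Lee lives at 42 Main St.")
```

The output of this stage would then feed the later steps on the slide (dictionary lookup, grapheme-to-phoneme conversion, and prosodic analysis), which are not covered by this sketch.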

Waveform Synthesis Methods
• Concatenative synthesis: based on the concatenation (or stringing together) of segments of recorded speech.
• Formant synthesis: created using additive synthesis and an acoustic model with various fundamental frequency, voicing, and noise levels.
• Articulatory synthesis: synthesizing speech based on models of the human vocal tract.

Waveform Synthesis: Concatenative Synthesis
• Currently, all commercial speech synthesis systems use concatenative synthesis, which can be divided into three types:
  • Diphone synthesis
    • Units are diphones: from the middle of one phone to the middle of the next.
    • Why? The middle of a phone is a steady state.
    • Record one speaker saying each diphone.
  • Unit selection synthesis
    • Larger units (record 10 hours or more, so there are multiple copies of each unit).
    • Use search to find the best sequence of units.
  • Domain-specific synthesis: concatenates prerecorded words and phrases to create complete utterances.
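The "use search to find the best sequence of units" step in unit selection is often described as a dynamic-programming search that balances a target cost (how well a unit fits the desired sound) against a join cost (how smoothly adjacent units connect). The sketch below assumes a very simple cost model: the `Unit` fields, the pitch-difference join cost, and the candidate values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    start_pitch: float  # pitch (Hz) at the unit's left edge
    end_pitch: float    # pitch (Hz) at the unit's right edge
    target_cost: float  # mismatch between the unit and the desired target


def join_cost(left: Unit, right: Unit) -> float:
    """Assumed join cost: pitch discontinuity at the boundary."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(candidates: list[list[Unit]]) -> list[Unit]:
    """Pick one unit per position, minimizing total target plus join cost."""
    costs = [[u.target_cost for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for pos in range(1, len(candidates)):
        row_costs, row_back = [], []
        for unit in candidates[pos]:
            options = [costs[-1][k] + join_cost(prev, unit)
                       for k, prev in enumerate(candidates[pos - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            row_costs.append(options[k_best] + unit.target_cost)
            row_back.append(k_best)
        costs.append(row_costs)
        back.append(row_back)
    # Trace back from the cheapest unit at the final position.
    k = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    chosen = []
    for pos in range(len(candidates) - 1, -1, -1):
        chosen.append(candidates[pos][k])
        k = back[pos][k]
    chosen.reverse()
    return chosen


# Two recorded copies of each diphone (made-up values).
CANDIDATES = [
    [Unit("h-e#1", 100.0, 120.0, 0.0), Unit("h-e#2", 100.0, 150.0, 0.0)],
    [Unit("e-l#1", 150.0, 140.0, 0.0), Unit("e-l#2", 125.0, 130.0, 5.0)],
]
best_units = select_units(CANDIDATES)
```

Here the search prefers the pair whose boundary pitches match exactly, which is the intuition behind recording many copies of each unit: the more candidates there are, the more likely a smooth join exists.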

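Technical Challenges in Speech Synthesis
• How can text be segmented into words correctly? (Chinese natural language processing)
• How can the correct prosody be synthesized?
• With concatenative synthesis, how can the joins between syllables be made smoother?
• How can emotional expression be added to the speech?
• How can distinctive, easily recognizable voices be produced?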

Spoken Dialogue Systems
• Speech conversational system
• Siri: based on the U.S. Department of Defense's Cognitive Assistant that Learns and Organizes (CALO) project
• A speech-based virtual personal assistant
• http://en.wikipedia.org/wiki/Siri_(software)

Demo Video
• A conversation with Siri on the iPhone 4S

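Key Technologies
• Conversational Interface: the speech recognition core is provided by Nuance.
• Personal Context Awareness: technologies from the CALO project.
• Service Delegation: information search and service provision, with several companies participating.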

Data and Service Search
• OpenTable, Gayot, CitySearch, BooRah, Yelp, Yahoo Local, ReserveTravel, Localeze for restaurant and business questions and actions;
• Eventful, StubHub, and LiveKick for events and concert information;
• MovieTickets, RottenTomatoes and the New York Times for movie information and reviews;
• True Knowledge, Bing Answers, and Wolfram Alpha for factual question answering;
• Bing, Yahoo and Google for web search.

ChatterBot
• A chatbot
• For questions it cannot understand, the system responds in the style of a dialogue generator such as ELIZA.
• Siri meets ELIZA
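An ELIZA-style responder works by matching the user's sentence against simple patterns and echoing part of it back inside a canned template. The sketch below shows the general idea only; the two rules and the fallback sentence are invented for this example and are not ELIZA's original script or anything Siri is known to use.

```python
import re

# (pattern, response template) pairs; {0} is filled with the captured text.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
]
FALLBACK = "Please tell me more."


def eliza_reply(utterance: str) -> str:
    """Return a canned reflection of the utterance, or a generic fallback."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK


reply = eliza_reply("I need a vacation.")
```

No understanding is involved: the reply is built purely from surface patterns, which is why this approach suits the "cannot understand the question" case described above.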

Speech Interfaces: Practical Issues
• Major problems:
  • modes (no feedback)
    • certain commands only work when in specific states
  • deep hierarchies (also known as voice mail hell)
• Verbose feedback wastes time/patience
  • only confirm consequential things
  • use meaningful, short cues
• Interruption
  • half-duplex communication (i.e., no barge-in support)
• Too much speech on the part of the customer is tiring
• Speech takes up space in working memory
  • can cause problems when problem solving

Speech Interface Development Standards
• VoiceXML (VXML) is the W3C's standard XML format for specifying interactive voice dialogues between a human and a computer.
• Current version: VoiceXML 2.1
• VoiceXML 3.0 (working draft)
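To show roughly what a VoiceXML document looks like, the sketch below uses Python's standard `xml.etree.ElementTree` module to build a minimal one-prompt dialogue. The helper name `build_greeting_vxml` and the greeting text are assumptions for this example; the `vxml`, `form`, `block`, and `prompt` elements and the namespace come from the VoiceXML specification, but the result has not been checked against a VoiceXML validator or run on a voice browser.

```python
import xml.etree.ElementTree as ET

VXML_NAMESPACE = "http://www.w3.org/2001/vxml"


def build_greeting_vxml(prompt_text: str) -> str:
    """Build a VoiceXML document with one form that speaks one prompt."""
    root = ET.Element("vxml", {"version": "2.1", "xmlns": VXML_NAMESPACE})
    form = ET.SubElement(root, "form")      # a dialogue unit
    block = ET.SubElement(form, "block")    # executable content in the form
    prompt = ET.SubElement(block, "prompt")  # text the platform speaks via TTS
    prompt.text = prompt_text
    return ET.tostring(root, encoding="unicode")


document = build_greeting_vxml("Welcome to the speech interface demo.")
```

A realistic dialogue would add `field` elements with grammars to collect spoken input; this sketch covers only system output.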

Speech Interface Development Tools
• Speech recognition: CMUSphinx, an open source toolkit for speech recognition http://cmusphinx.sourceforge.net/
• Speech synthesis: Festvox http://festvox.org/index.html
• Speech interface: Microsoft Speech API (SAPI 5.3)
• Java Speech API

References
• X. Huang, A. Acero and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development, 2001.
• L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, 2010.
• Why is Siri Important?
