
DSR Front-end Extension for Tonal-language Recognition and Speech Reconstruction



  1. DSR Front-end Extension for Tonal-language Recognition and Speech Reconstruction. Aurora Group Meeting, April 2003. By IBM & Motorola

  2. Outline • Introduction • Part I – Terminal Side Algorithm Description • Part II – Server Side Algorithm Description • Part III – Results vs. Requirements • Algorithmic Requirements • Tonal Language Recognition Evaluation • Intelligibility Evaluation

  3. Introduction – Historical Snapshots • July 2000 – Speech reconstruction defined as one of the areas to be addressed by the committee • Feb. 2001 – Tonal language recognition added to speech reconstruction • July 2001 – New work item for extension of FE (WI-030) opened • April 2002 – Joint-development contract signed between IBM and Motorola • August 2002 – Work item for extension of AFE (WI-034) opened

  4. Introduction – System Overview [Block diagram] Terminal: speech in → ETSI standard DSR front-end → MFCC & log-E @ 4800 bps, and pitch & class estimation → pitch & class @ 800 bps, both sent over the channel. Server: pitch tracking and smoothing supplies tonal information to the DSR back-end and drives speech reconstruction → speech out.

  5. Outline • Introduction • Part I – Terminal Side Algorithm Description (You are here!) • Part II – Server Side Algorithm Description • Part III – Results vs. Requirements • Algorithmic Requirements • Tonal Language Recognition Evaluation • Intelligibility Evaluation

  6. Part I – Terminal Side Algorithm Description • XFE block diagram • XAFE block diagram • Voice activity detection • Low band noise detection • Pre-processing of speech signal • Pitch estimation • Voicing classification • Quantization of voicing class and pitch • Bit-stream formatting and error protection

  7. XFE Block Diagram [Block diagram] FE blocks: input speech → Offcom → ADC → framing → PE → W → FFT → MF → LOG → DCT → MFCC, with EC → logE; the MFCC and log-E then pass through feature compression, bit-stream formatting, and framing to the transmission channel. Extension blocks: VAD, LBND, PP, PITCH, and CLS produce the pitch P and voicing class VC. Abbreviations: EC – energy computation; logE – log energy measure computation; VAD – voice activity detection; LBND – low-band noise detection; PP – pre-processing; PITCH – pitch estimation; CLS – classification.

  8. XAFE Block Diagram [Block diagram] AFE blocks: Sin(n) → spectrum estimation → rest of the noise reduction blocks → MF. Extension blocks: SEC, VADVC, LBND, PP, PITCH, and CLS produce the pitch P and voicing class VC. Abbreviations: SEC – spectrum and energy computation; MF – mel-filtering; VADVC – voice activity detection for voicing classification; LBND – low-band noise detection; PP – pre-processing; PITCH – pitch estimation; CLS – classification.

  9. Voice Activity Detection Inputs – filter bank output (23 channels) Outputs – vad_flag, hangover_flag [Block diagram: a channel energy estimator feeds a channel SNR estimator, a spectral deviation estimator, and a peak-to-average ratio estimator; the channel SNR estimator drives a voice metric calculator and a signal SNR estimator; an update decision determiner controls the noise energy smoother and the noise energy estimate storage; the voice activity determiner produces vad_flag and hangover_flag.]

  10. Low Band Noise Detection Inputs – power spectrum, vad_flag, frame energy Output – lbn_flag Low band – below 380 Hz [Flowchart] If vad_flag == false and the frame energy E >= enrg_thld, find the maximum power in the low band and in the high band, form the low/high ratio, and filter it; lbn_flag is set to true when the filtered ratio exceeds ratio_thld and false otherwise.
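A minimal Python sketch of this decision, assuming illustrative threshold values and a simple first-order smoothing of the low/high ratio (the standard's constants and filter are not reproduced here):

```python
import numpy as np

def detect_low_band_noise(power_spectrum, freqs, vad_flag, frame_energy,
                          state, enrg_thld=1e4, ratio_thld=2.0, alpha=0.9):
    """Sketch of the low-band noise detector described above.
    Threshold values and the smoothing constant `alpha` are
    illustrative placeholders, not the standardized constants."""
    if vad_flag or frame_energy < enrg_thld:
        return state.get("lbn_flag", False), state
    low = power_spectrum[freqs < 380.0].max()    # max power below 380 Hz
    high = power_spectrum[freqs >= 380.0].max()  # max power above 380 Hz
    ratio = low / max(high, 1e-12)
    # First-order IIR smoothing of the low/high ratio
    state["ratio"] = alpha * state.get("ratio", ratio) + (1 - alpha) * ratio
    state["lbn_flag"] = bool(state["ratio"] > ratio_thld)
    return state["lbn_flag"], state
```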

  11. Pre-Processing of Speech Signal Inputs – input speech signal Sin, lbn_flag Outputs – low-pass filtered, down-sampled speech signal Slpds; high-pass filtered speech signal Sub [Flowchart: depending on lbn_flag (TRUE/FALSE), Sin is routed through low-pass filter #1, down-sampling, and low-pass filter #2 to produce Slpds, and through a high-pass filter to produce Sub.]

  12. Pitch Estimation Inputs – vad_flag, lbn_flag, low-pass filtered down-sampled speech signal, Fourier spectrum, power spectrum, spectral average, log-E Output – pitch period P (P = 0 for unvoiced frames) Frequency ranges (Hz): [200, 420], [100, 210], [52, 120]; given a stable track with frequency F0: [0.666*F0, 2.2*F0] intersected with the above three ranges [Flowchart] STFT and power spectrum of the low-pass filtered, down-sampled speech → F0 candidates generation → correlation calculation → pitch selection; if no pitch is found, select the next frequency range and repeat; once a pitch is found, update the history, convert the pitch, and output.

  13. Pitch Estimation • Find F0 among the common integer divisors of the spectral peak frequencies • Give preference to larger divisors [Figure: spectral peaks aligned on multiples of F0]
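To make the idea concrete, here is a hedged Python sketch (function and parameter names are ours, not from the standard) that scores candidate F0 values by how well they act as common divisors of the observed peak frequencies:

```python
import numpy as np

def score_f0_candidates(peak_freqs, peak_amps, f0_grid):
    """A peak at frequency f supports candidate f0 when f is close
    to an integer multiple of f0; the tolerance of 10% of f0 is an
    illustrative choice."""
    scores = np.zeros(len(f0_grid))
    for i, f0 in enumerate(f0_grid):
        for f, a in zip(peak_freqs, peak_amps):
            k = round(f / f0)                   # nearest harmonic number
            if k >= 1 and abs(f - k * f0) < 0.1 * f0:
                scores[i] += a                  # amplitude-weighted support
    return scores

# Peaks at harmonics of 140 Hz support both 70 Hz and 140 Hz equally,
# which is exactly why ties are broken toward the higher divisor.
peaks = np.array([140.0, 280.0, 420.0, 560.0])
print(score_f0_candidates(peaks, np.ones(4), np.array([70.0, 140.0, 280.0])))
```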

  14. Pitch Estimation • The utility function generalizes the concept of an integer divisor • The utility function is a superposition of components generated by the spectral peaks [Figure: one period of the influence function I(r)]

  15. Pitch Estimation [Figure: utility-function component generated by a peak of unit magnitude at 700 Hz, plotted over the range F0min to F0max]

  16. Pitch Estimation – F0 Candidates Generation and Correlation Calculation [Flowchart] • Process spectrum – double the resolution (double frame size), de-emphasize, and smooth the power spectrum • Pick peaks – pick local peaks, scale down high-frequency peaks, limit the number of peaks, refine locations and amplitudes, normalize • Build utility function – build the utility function and select at most two F0 candidates with high spectral scores, giving preference to higher frequencies and to frequencies near previous F0 estimates; convert the F0 candidates into the corresponding lags • Compute correlation – compute correlation scores at each lag using the speech segments having the highest energy and separated by that lag; pass the results to pitch selection

  17. Pitch Estimation – Pitch Selection Candidate acceptance classes (CS – correlation score, SS – spectral score, Ref – reference F0): • Class 1: (CS > 0.79 AND SS > 0.78) OR (SS > 0.68 AND SS + CS > 1.6) • Class 2: (CS > 0.7 AND SS > 0.7) AND (0.82*Ref < F0 < 1.22*Ref) • Class 3: CS > 0.85 OR SS > 0.82 [Flowchart] The F0 candidates are sorted; if a stable track exists, the reference is set to the stable pitch, if the pitch is continuous, the reference is set to the previous pitch, and the best class-1, class-2, or class-3 candidate is searched for over the list (with an extra check SS > 0.95 AND CS > 0.95 on marginal candidates); if no candidate qualifies, the unvoiced pitch (P = 0) is set.
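The three acceptance tests transcribe directly into code; a short Python sketch:

```python
def candidate_classes(cs, ss, f0, ref):
    """Direct transcription of the acceptance classes above
    (CS - correlation score, SS - spectral score, Ref - reference F0)."""
    classes = set()
    if (cs > 0.79 and ss > 0.78) or (ss > 0.68 and ss + cs > 1.6):
        classes.add(1)
    if (cs > 0.7 and ss > 0.7) and (0.82 * ref < f0 < 1.22 * ref):
        classes.add(2)
    if cs > 0.85 or ss > 0.82:
        classes.add(3)
    return classes
```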

  18. Voicing Classification Inputs – vad_flag, hangover_flag, input speech signal, high-pass filtered speech signal, frame energy, pitch period Outputs – voicing class (non-speech, unvoiced, mixed-voiced, or fully-voiced speech) [Flowchart] If vad_flag == false → VC = non-speech; else if pitch period == 0 → VC = unvoiced; else if (zcm >= zcm_thld || ef_ub <= ef_ub_thld || hangover_flag == true) → VC = mixed-voiced; else → VC = fully-voiced.
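The flowchart is a short decision cascade; a Python sketch (zcm is a zero-crossing measure, ef_ub an upper-band energy measure; the threshold values are not reproduced here):

```python
def classify_voicing(vad_flag, hangover_flag, pitch_period,
                     zcm, ef_ub, zcm_thld, ef_ub_thld):
    """Sketch of the voicing-class decision described above."""
    if not vad_flag:
        return "non-speech"
    if pitch_period == 0:
        return "unvoiced"
    if zcm >= zcm_thld or ef_ub <= ef_ub_thld or hangover_flag:
        return "mixed-voiced"
    return "fully-voiced"
```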

  19. Quantization of Voicing Class and Pitch • Class quantization • Pitch quantization – in each frame pair, the first frame's pitch period (19 – 140) is absolutely quantized using 7 bits; the second frame's pitch period is differentially quantized using 5 bits.
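A hedged Python sketch of the frame-pair scheme; the standard's codebooks are non-uniform, so uniform grids are used here purely for illustration:

```python
import numpy as np

def quantize_pitch_pair(p1, p2, p_min=19, p_max=140):
    """7-bit absolute index for the first frame's pitch period and a
    5-bit differential index for the second, on illustrative uniform
    grids (not the standardized codebooks)."""
    # 7-bit absolute quantization of the first frame's pitch period
    levels = np.linspace(p_min, p_max, 128)
    idx1 = int(np.argmin(np.abs(levels - p1)))
    # 5-bit differential quantization of the second frame around the first
    diffs = np.linspace(-16, 15, 32)          # illustrative difference grid
    idx2 = int(np.argmin(np.abs(levels[idx1] + diffs - p2)))
    return idx1, idx2

print(quantize_pitch_pair(60.0, 63.5))
```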

  20. Quantization of Voicing Class and Pitch

  21. Bit-Stream Formatting and Error Protection

  22. Bit-Stream Formatting and Error Protection

  23. Outline • Introduction • Part I – Terminal Side Algorithm Description • Part II – Server Side Algorithm Description (You are here!) • Part III – Results vs. Requirements • Algorithmic Requirements • Tonal Language Recognition Evaluation • Intelligibility Evaluation

  24. Part II – Server Side Algorithm Description • Bit-stream decoding and error mitigation • Speech reconstruction block diagram • Pitch tracking and smoothing • Cepstra de-equalization (XAFE) • Features transformation at 16 kHz sampling rate (XAFE) • Harmonic magnitudes reconstruction • Harmonic phases synthesis • Line spectrum to time-domain transformation • Overlap-add

  25. Bit-stream Decoding and Error Mitigation • Extract the pitch and voicing class indices and check the PC-CRC • Error-free frame pair – decode: • Decode the voicing class using the VC encoding table • First frame – the pitch index points to a quantization level • Second frame – decode the pitch using the pitch encoding table • Corrupt frame pair – keep receiving until an error-free pair is determined • Assign pitch and class parameters to the corrupt frames

  26. Pitch, Class, & logE Assignment for Corrupt Frames (B = length of the corrupt run between the last and first good frames) • B ≤ 2 – copy from the last/first good frame • 2 < B ≤ 12: • copy pitch and class from the last/first good frame • "fully-voiced" class → "mixed-voiced" class • logE(n) = max(logE(n − 1) − 2, 4.7) • B > 12: • class = "unvoiced", pitch = 0, logE = 4.7 [Figure: run of B corrupt frames between the last good and first good frames]
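The three rules collapse into a small function; a Python sketch with illustrative field names:

```python
def mitigate_corrupt_frame(n_bad, nearest_good, prev_logE):
    """Sketch of the corrupt-frame parameter assignment above.
    `n_bad` (B) is the length of the corrupt run and `nearest_good`
    the (pitch, voicing_class, logE) of the nearest good frame."""
    pitch, vclass, logE = nearest_good
    if n_bad <= 2:
        return pitch, vclass, logE                        # plain copy
    if n_bad <= 12:
        if vclass == "fully-voiced":
            vclass = "mixed-voiced"                       # demote the class
        return pitch, vclass, max(prev_logE - 2.0, 4.7)   # decay the energy
    return 0, "unvoiced", 4.7                             # long burst: floor
```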

  27. Speech Reconstruction Block Diagram [Block diagram: MFCC, logE, pitch, and voicing class flow through the blocks below to produce speech] Abbreviations: • PTS – pitch tracking & smoothing • HSI – harmonic structure initialization • CDE – cepstra de-equalization (XAFE only) • T16kHz – features transformation (XAFE at 16 kHz only) • HOCR – high order cepstra recovery • UPH – unvoiced phase synthesis • SFEQ – solving front-end equations • CTM – cepstra to magnitudes transformation • COMB – combined magnitudes estimate • APM – all-pole modeling • VPH – voiced phase synthesis • PF – postfiltering • LSTD – line spectrum to time domain transformation • OLA – overlap-add

  28. Pitch Tracking and Smoothing • 1st stage: • Handle short voiced segments • Find the most energetic set of similar pitch values (a track) and determine the reference pitch value • Do integer scaling • 2nd stage – correct outliers • 3rd stage – smoothing by a 5-tap symmetric filter (sketched below) • Voicing class correction: • Voiced → Unvoiced: voicing class = "unvoiced" • Unvoiced → Voiced: voicing class = "mixed-voiced" [Figure: sliding buffer of pitch values, oldest to most recent]
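A minimal Python sketch of the 3rd-stage filter; the tap values here are illustrative, not the standard's coefficients:

```python
import numpy as np

def smooth_pitch(track, taps=(0.1, 0.2, 0.4, 0.2, 0.1)):
    """5-tap symmetric FIR smoothing along the pitch track, with
    edge padding so the output has the same length as the input."""
    taps = np.asarray(taps)
    padded = np.pad(np.asarray(track, float), 2, mode="edge")
    return np.convolve(padded, taps, mode="valid")

print(smooth_pitch([100, 101, 99, 180, 100, 102]))  # the outlier is softened
```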

  29. Pitch Contours – Clean vs. Babble Noise [Plot: pitch samples vs. time (msec)]

  30. Pitch Contours – XAFE vs. XFE [Plot: pitch samples vs. time (msec)]

  31. Speech Synthesis Input/Output • Input: • 13 low order cepstra (LOC): C0, C1, …, C12 • Pitch period p8kHz, rescaled as p = p8kHz × out_sampling_rate / 8 kHz • Log-energy logE • Voicing class: fully-voiced, mixed-voiced, unvoiced ("unvoiced" + "non-speech") • Output – speech signal: • XFE: output sampling rate = input sampling rate (8, 11, or 16 kHz) • XAFE: output sampling rate = 8 kHz

  32. Harmonic Model of Speech Frame • Time-domain – sum of sinusoidal waves • Frequency domain – line spectrum
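The slide's formulas did not survive the transcript; a standard harmonic-model statement consistent with the two bullets, in LaTeX (the notation is ours, not the slide's):

```latex
% Time domain: the frame is a sum of sinusoids at multiples of f_0
s(n) = \sum_{k=1}^{K} A_k \cos\!\bigl(2\pi k f_0 n / f_s + \varphi_k\bigr)
% Frequency domain: a line spectrum with lines at the harmonic frequencies
S(f) = \sum_{k=1}^{K} A_k\, e^{\,j\varphi_k}\, \delta(f - k f_0)
```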

  33. Harmonic Structure Initialisation • Fully-voiced frame – voiced harmonics array • Unvoiced frame – unvoiced harmonics array • Mixed-voiced frame – voiced and unvoiced harmonic arrays

  34. Cepstra De-equalization – XAFE • Purpose: to reverse the AFE blind equalization of the cepstral coefficients C1, …, C12 • Applied to the quantized cepstra – a regularization factor of 0.999 guarantees stability

  35. AFE 16 kHz Features Transformation • Purpose: to restore plain MFCC and energy representing the [0 – 4 kHz] frequency band

  36. High Order Cepstra Recovery • Purpose – to estimate the high order cepstra (HOC) C13, …, C22, which are not transmitted from the client side • Increases the accuracy of harmonic magnitudes estimation • Implemented through a look-up table using pitch as the parameter (sketched below): • The pitch range is partitioned into sub-ranges • The representative HOC vector stored in the table for each sub-range has been obtained by averaging over a large speech database • A further refinement of the HOC is built into the magnitudes reconstruction procedure
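The look-up itself is a one-liner; a Python sketch with placeholder values (the standard's trained table and sub-range edges are not reproduced):

```python
import numpy as np

def recover_hoc(pitch, boundaries, hoc_table):
    """Pick the representative C13..C22 vector for the pitch
    sub-range containing `pitch`. `boundaries` are the sub-range
    edges and `hoc_table[i]` the vector for sub-range i."""
    i = int(np.searchsorted(boundaries, pitch))
    return hoc_table[i]

boundaries = np.array([40.0, 70.0, 110.0])   # illustrative sub-range edges
hoc_table = np.zeros((4, 10))                # 4 sub-ranges x (C13..C22)
print(recover_hoc(85.0, boundaries, hoc_table).shape)
```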

  37. Harmonic Magnitudes Reconstruction • Two independent estimates of the harmonic magnitudes are obtained by two methods: • Solving the front-end equations (SFEQ) • Cepstra to magnitudes transformation (CTM) • The two estimates are mixed together using a frequency- and pitch-dependent mixing ratio

  38. Solving Front-End Equations (SFEQ) • The front-end equation ties the harmonic parameters to the mel-filter bank outputs • Linearization – especially applicable for voiced frames

  39. SFEQ • 23 basis vectors are derived from the mel-filter weighting functions sampled at the harmonic frequencies [Figure: 8th mel-filter and its basis vector, pitch = 85.3 samples]

  40. SFEQ • The harmonic magnitudes vector is represented as a linear combination of the basis vectors • Front-end equations in matrix form and the corresponding least-squares equations (the slide's formulas are lost from the transcript; a sketch follows)
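A standard least-squares formulation consistent with the slide text, in LaTeX; the exact notation of the lost equations is an assumption:

```latex
% Magnitudes A as a linear combination of the 23 basis vectors b_j:
\mathbf{A} = B\,\mathbf{x}, \qquad B = [\mathbf{b}_1 \cdots \mathbf{b}_{23}]
% Front-end equations tying the magnitudes to the mel-filter outputs y:
W\,\mathbf{A} = \mathbf{y} \quad\Longrightarrow\quad (W B)\,\mathbf{x} = \mathbf{y}
% Least-squares solution:
\hat{\mathbf{x}} = \bigl((W B)^{\top} W B\bigr)^{-1} (W B)^{\top}\,\mathbf{y}
```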

  41. SFEQ • Built-in high order cepstra recovery for voiced frames – 3 iterations [Flowchart: LOC, HOC, and pitch enter Solve Equations → Compute Magnitudes → Compute HOC → updated HOC, looped three times; the final magnitudes form the SFEQ estimate ASFEQ]

  42. Cepstra to Magnitudes Transformation (CTM) • Modify the cepstra to compensate for the influence of pre-emphasis and the variable mel-channel width • Find the location of each mel-scaled harmonic frequency on the grid of mel-scaled channel centers

  43. CTM • Compute the IDCT coefficient corresponding to the (non-integer) grid index found above • Compute the estimate of the harmonic magnitudes from it (the slide's formulas are lost from the transcript; a sketch follows)
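A hedged reconstruction of the idea in LaTeX, assuming 23 mel channels and cepstra c0…c12; the index variable and the exact form are ours, not the slide's:

```latex
% IDCT evaluated at the non-integer channel index \lambda_k of
% harmonic k gives a log-magnitude estimate:
\log \hat{A}^{\mathrm{CTM}}_k
  = c_0 + 2 \sum_{n=1}^{12} c_n
    \cos\!\left( \frac{\pi n (\lambda_k + 0.5)}{23} \right)
```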

  44. Combined Magnitudes Estimate (COMB) – Scaling SFEQ Magnitudes • Unvoiced harmonics or short pitch period (p8kHz ≤ 55) – constant scaling factor • Long pitch period (p8kHz > 55) – frequency-dependent scaling factor (sketched below) [Plot: scaling factor SF vs. frequency (Hz), moving between SFLOW and SFHIGH across the 200 – 2500 Hz range]
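A Python sketch of the structure only; every factor value and the direction of the low/high transition are placeholders, since the slide's curve is not recoverable from the transcript:

```python
import numpy as np

def sfeq_scale(freqs_hz, pitch_8k, sf_const=1.0, sf_low=1.0, sf_high=0.5):
    """Constant scaling for unvoiced/short-pitch frames; a
    frequency-dependent factor, interpolated between SF_LOW and
    SF_HIGH over 200-2500 Hz, for long pitch periods (> 55 samples)."""
    freqs_hz = np.asarray(freqs_hz, float)
    if pitch_8k <= 55:
        return np.full_like(freqs_hz, sf_const)
    return np.interp(freqs_hz, [200.0, 2500.0], [sf_low, sf_high])
```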

  45. COMB – Mixing SFEQ and CTM Magnitudes • Unvoiced harmonics – fixed combination of the two estimates (formula lost from the transcript) • Voiced harmonics – pitch-dependent mixture ratio specified by a table as a function of p (XAFE)

  46. All-pole Modeling of Spectral Envelope for Voiced Harmonics (APM) [Flowchart] Magnitudes A → (if long pitch period: interpolate magnitudes) → inverse DFT → ACF → Levinson-Durbin → all-pole coefficients {an} → magnitudes synthesis → Anew (sketched below)
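A Python sketch of the chain up to the {an} coefficients; the model order, FFT size, and interpolation choice are illustrative, not the standard's values:

```python
import numpy as np

def levinson_durbin(r, order):
    """Textbook Levinson-Durbin recursion on the autocorrelation r."""
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err
        a[1:i + 1] += k * a[i - 1::-1][:i]
        err *= 1.0 - k * k
    return a

def allpole_from_magnitudes(mags, order=10, nfft=512):
    """Interpolated magnitudes -> power spectrum -> inverse DFT ->
    autocorrelation -> Levinson-Durbin -> {a_n}."""
    grid = np.linspace(0.0, 1.0, nfft // 2 + 1)
    anchors = np.linspace(0.0, 1.0, len(mags))
    spec = np.interp(grid, anchors, mags)          # dense magnitude envelope
    power = np.concatenate([spec, spec[-2:0:-1]]) ** 2
    acf = np.fft.ifft(power).real                  # autocorrelation sequence
    return levinson_durbin(acf[: order + 1], order)
```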

  47. Postfiltering (PF) • Purpose – formant emphasis for voiced frames • Weighting of the voiced harmonic magnitudes by a filter derived from the all-pole model parameters • The weights W(exp(-j2πfk)) are normalized, bounded, and applied to the voiced harmonics

  48. Voiced Phase Synthesis (VPH) • Three additive components: • A phase linear in frequency, providing alignment relative to the previous frame • The vocal tract phase, derived from the all-pole model parameters • A pre-stored vocal cords excitation phase, taken from a table
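In LaTeX, the three named components add up as follows; the symbols are ours, introduced only to make the decomposition explicit:

```latex
% Phase of harmonic k as the sum of the three slide components:
\varphi_k =
  \underbrace{k\,\omega_0\,n_0}_{\text{linear-in-frequency alignment}}
+ \underbrace{\arg H\!\left(e^{\,j k \omega_0}\right)}_{\text{vocal tract (all-pole)}}
+ \underbrace{\varphi^{\mathrm{exc}}_k}_{\text{stored excitation phase}}
```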

  49. Line Spectrum to Time Domain Transformation (LSTD) • Mixed-voiced frames – the low band (0 – 1200 Hz) voiced harmonics are combined with the high band (1200 Hz – FNyquist) unvoiced harmonics • Energy normalization (sketched below): • Simulate the (non-windowed) analysis frame spectrum by convolving the line spectrum with the Dirichlet kernel • Compute its energy Eout • Compute the scaling factor SC • Multiply the harmonic magnitudes by SC
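A Python sketch of the final two steps; the exact scaling formula is an assumption (square root of the energy ratio against the transmitted logE), as the slide does not show it:

```python
import numpy as np

def energy_normalize(mags, logE, E_out):
    """Scale the harmonic magnitudes so the frame energy matches the
    transmitted log-energy. Natural-log energy is assumed."""
    E_target = np.exp(logE)
    sc = np.sqrt(E_target / max(E_out, 1e-12))    # scaling factor SC
    return sc * np.asarray(mags)
```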

  50. LSTD • Synthesis of the output frame's discrete spectrum by convolving the line spectrum with the Fourier transform of the Hann window • Inverse FFT: Sout → sout (a sketch follows) [Flowchart: line spectrum → convolution with FT(Hann) → Sout → IFFT → sout → frame shift]
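A Python sketch of an equivalent construction: multiplying the time-domain harmonic sum by a Hann window equals convolving the line spectrum with the window's Fourier transform, and successive frames are then overlap-added. Function names and the frame/hop sizes are illustrative:

```python
import numpy as np

def synthesize_frame(mags, phases, f0_norm, frame_len=256):
    """Windowed harmonic synthesis; f0_norm is f0 / sampling rate."""
    n = np.arange(frame_len)
    frame = sum(a * np.cos(2 * np.pi * k * f0_norm * n + p)
                for k, (a, p) in enumerate(zip(mags, phases), start=1))
    return frame * np.hanning(frame_len)

def overlap_add(frames, hop):
    """Overlap-add of successive synthesized frames with a given hop."""
    out = np.zeros(hop * (len(frames) - 1) + len(frames[0]))
    for i, fr in enumerate(frames):
        out[i * hop: i * hop + len(fr)] += fr
    return out
```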
