- By
**emily** - Follow User

- 266 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Speaker Verification System Part B Final Presentation' - emily

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Speaker Verification SystemPart B Final Presentation

Performed by: Barak Benita & Daniel Adler

Instructor: Erez Sabbag

Implementation of a speaker verification algorithm on a DSP

The verification module will perform a real time authentication of the user based on sampled voice data.

The idea is to integrate the speaker verification model with other security and management models allowing them to grant access to resources based on the speakers voice verification.

The Project GoalIntroduction

Speaker verification is the process of automatically authenticating the speaker on the basis of individual information included in speech waves.

Speaker’s Identity (Reference)

Speaker

Verification

System

Speaker’s

Voice

Segment

Result [0:1]

System Overview

BT

Base Station

Access

Denied

My name is Bob!

LAN

Speaker

Verification

Unit

Server

BT

Base Station

LAN

System Description

The system is compound from TI’s C6701floating point DSP with the speaker verification algorithm on it. A user with a hand device (e.g. bluetooth on a PDA), will receive access to different resources ( door opening, file access, etc) based on a voice verification process.

The project implements only the speaker verification algorithm on the DSP and has input and output interfaces to interact with other devices (e.g. Bluetooth).

The DSP is encoded with the users voice signature. Each time user verification is needed, the algorithm compares the speakers voice with the signature.

3

System Block Diagram

DSP

Signature

parameters

Enrollment

Server

(training phase – building

A signature)

Codec

Verification Channel

Voice Channel

(optional)

Bluetooth

Radio

Interface

- Bluetooth
- unit

Codec

Bluetooth

Base station

Authorization

Server

LAN

Voice Channel

(optional)

Voice Channel

“My name is Bob”

5

Project Description:

- Part One:
- Literature review
- Algorithms selection
- MATLAB implementation
- Result analysis
- Part Two:
- Implementation of the chosen algorithm on a DSP

Speaker Verification Process

Analog Speech

Pre-Processing

Feature

Extraction

Pattern

Matching

Reference

Model

Decision

Result [0:1]

Implemented Algorithms: Feature Extraction Module – MFCC

MFCC (Mel Frequency Cepstral Coefficients) is the most common technique for feature extraction. MFCC tries to mimic the way our ears work by analyzing the speech waves linearly at low frequencies and logarithmically at high frequencies.

The idea acts as follows:

Spectrum

Mel Spectrum

Cepstrum

Mel Cepstrum

FFT

Mel-frequency

Wrapping

Windowed

PDS Frame

Reference

Model = Codebook

Pattern Matching

=

Distortion measure

Distortion Rate

Implemented Algorithms:Pattern Matching Modeling Module – Vector Quantization (VQ)In the enrolment part we build a codebook of the speaker according to the LBG (Linde, Buzo, Gray) algorithm, which creates an N size codebook from set of L feature vectors.

In the verification stage, we are measuring the distortion of the given sequence of the feature vectors to the reference codebook.

Implemented Algorithms: Decision

In VQ the decision is based on checking if the distortion rate is higher than a preset threshold: acceptance if distortion rate > t, else rejection.

In this project no decision model will be build, the output of the system will be based on the following score rate (values between 0 to 1), which indicates the suitability of the person to the reference model:

Score = exp (-mean distance)

Implementation Environment

- Hardware tools:
- TI DSP 6701 EVM board
- PC host station
- Software development tools:
- TI Code Composer
- Matlab 6.1
- Programming Languages:
- C
- Assembler
- Matlab

TI DSP 6701 EVM

- Why?
- Floating Point
- Designed Especially for Voice Applications
- Large Bank of On Chip Memory
- High level development (C)
- PCI Interface
- Why Not?
- Price
- Size
- Consumption

Program Workflow

Analog Speech (input)

DSP

Program

Pre-Processing

Feature

Extraction

MATLAB

Program

Reference

Model

Pattern

Matching

Decision

Result [0:1] (output)

Step By Step Implementation

- Pre-processing a ‘ones’ vector on the DSP and comparing it to the Matlab results
- Pre-processing an audio file and comparing to the Matlab results
- Feature extracting of the audio file (after pre-processing) and comparing to the Matlab results
- Pattern matching the feature vectors to a ‘ones’ codebook matrix and comparing to the Matlab results (running with the same codebook)
- Creating a real codebook from a reference speaker importing it to the DSP and comparing the running results of the DSP and the Matlab
- Verifying that the distances of the speakers from the codebook in the DSP program and in the Matlab program are the same

Creating the Assembler Lookup Files

- Creating the output data through Matlab functions (e.g. hamming(n))

- Saving the output in an assembler lookup table format

- Referencing the lookup table with a name that will be called from the C source code in the DSP project (as a function)

hamming = fopen('hamming.asm', 'wt', 'l');

fprintf(hamming, '; hamming.asm - single precision floating point table generated from MATLAB\n');

fprintf(hamming, '\t.def\t_hamming\n');

fprintf(hamming, '\t.sym\t_hamming, _hamming, 54, 2, %d,, %d\n', size, n);

fprintf(hamming, '\t.data\n');

fprintf(hamming, '_hamming:\n');

fprintf(hamming, '\t.word\t%tXh, %tXh, %tXh, %tXh\n', h);

fprintf(hamming, '\n');

fclose(hamming);

- Importing the file as an asm file (adding a file to the project) to the DSP project

; hamming.asm - single precision floating point table generated from MATLAB

.def _hamming

.sym _hamming, _hamming, 54, 2, 8192,, 256

.data

_hamming:

.word 3DA3D70Ah, 3DA4203Fh, 3DA4FBD3h, 3DA669A4h

.word 3DA86978h, 3DAAFB01h, 3DAE1DD8h, 3DB1D180h

.word 3DB61567h, 3DBAE8E1h, 3DC04B30h, 3DC63B7Dh

.word 3DCCB8DCh, 3DD3C24Bh, 3DDB56B1h, 3DE374E1h

.word 3DEC1B99h, 3DF5497Fh, 3DFEFD27h, 3E049A87h

.word 3E09F7D0h, 3E0F9597h, 3E1572FFh, 3E1B8F1Ch

h = hamming(n);

- Using the lookup table in the C source code

// ----- Windowing the filtered frame with Hamming ----

for (k=0 ; k < N ; k++){

for (j=0 ; j < N ; j++){

if (k - j < 0) break;

frame[k] += hamming[j]*filtered_frame[k-j];

}

}

Generation of assembly functions through Matlab

Generation of voice data file from a *.wav format file through Matlab waveread function

Hamming.asm

Generation of assembly functions through Matlab

Sari5fix.asm

Melbank.asm

Rdct.asm

Generation of assembly functions through Matlab

Codebook.asm

Binding All The PiecesAnalog Speech (input)

DSP

Program

C Code

Pre-Processing

Feature

Extraction

Pattern

Matching

Decision

Result [0:1] (output)

Software Modules

main

init

O(1)

extract_frame

O(n^2)

digitrev_index

O(n)

bitrev

O(n)

calc_dist

O(1)

hamming

O(1)

bitrev

O(n)

melbank

O(n)

cfftr2_dit

O(nlog(n))

Project Structure

Include

Files

board.h

codec.h

dma.h

intr.h

mcbsp.h

link.cmd

pci.h

regs.h

Libraries

verification.h

rts6700.lib

Source

bitrevf.asm

cfftr2.asm

codebook.asm

digitrev_index.c

hamming.asm

melbank.asm

rdct.asm

verification.c

Tested System

- The Tested System parameters:
- The tested algorithms and methods were the MFCC and VQ with the following parameters:
- Sampling Frequency: 11025Hz
- Feature Vector Size: 18
- Window Size: 256
- Offset Size: 128
- Codebook Size: 128
- Number of iterations for codebook creation: 25
- We compared between the Matlab and DSP results based on a codebook created from Daniel’s 60 seconds of random speech and random selection of different five seconds speakers.

Verifications

- The DSP results were compared to the Matlab simulation.
- We chose random speakers from the speakers DB with one reference codebook.
- For Example:
- PersonMATLABDSP
- Daniel 66.95% (0.4011) 66.95% (0.4011)
- Barak 44.01% (0.8206) 44.01% (0.8206)
- Ayelet 43.61% (0.8299) 43.61% (0.8299)
- Diego 53.97% (0.6166) 53.97% (0.6166)
- Adi 42.07% (0.8656) 42.07% (0.8656)

Conclusions

- The TI DSP 6701 EVM is capable of preforming speaker
- verification analysis and achieve high resolution results (as
- achieved in the Matlab)

- Speaker Verification algorithms are not mature enough to
- become a good biometric detection solution

- Code Composer is not stable and good enough to become an “easy
- to use” development environment

- A second phase project, which will implement a complete verification system should be build

Time Table – First Semester

14.11.01 – Project description presentation

15.12.01 – completion of phase A: literature review and algorithm selection

25.12.01 – Handing out the mid-term report

25.12.01 – Beginning of phase B: algorithm implementation in MATLAB

10.04.02 – Publishing the MATLAB results and selecting the algorithm that will be implemented on the DSP

Time Table – Second Semester

10.04.02 – Presenting the progress and planning of the project to the supervisor

17.04.02 – Finishing MATLAB Testing

17.04.02 – The beginning of the implementation on the DSP

07.11.02 – Project presentation and handing the project final report

Analog Speech

Anti aliasing filter to avoid aliasing during sampling. LPF [0, Fs/2]

LPF

Band Limited

Analog Speech

Analog to digital converter with frequency sampling (Fs) of [10,16]KHz

A/D

Digital Speech

Low order digital system to spectrally flatten the signal (in favor of vocal tract parameters), and make it less susceptible to later finite precision effects

First Order FIR

Pre-emphasized

Digital Speech

(PDS)

Frame Blocking

Frame blocking of the sampled signal. Each frame is of N samples overlapped with N-M samples of the previous frame. Frame rate ~ 100 Frames/Sec

N values: [200,300], M values: [100,200]

PDS Frames

Frame

Windowing

Using Hamming (or Hanning or Blackman) windowing in order to minimize the signal discontinuities at the beginning and end of each frame.

Windowed PDS Frames

Windowed PDS Frames

Set of Feature Vectors

[1, 2, … , N]

[1, 2, … , K]

Feature

Extraction

Extracting the features of speech from each frame and representing it in a vector (feature vector).

Pattern Matching Modeling (step 3)

- The pattern matching modeling techniques is divided into two sections:
- The enrolment part, in which we build the reference model of
- the speaker.
- The verifications (matching) part, where the users will be
- compared to this model.

Set of Feature Vectors

[1, 2, … , K]

Modeling

Speaker Model

This part is done outside the DSP and the DSP receives only the speaker model (calculated offline in a host).

In VQ the decision is based on checking if the distortion rate is higher than a preset threshold:

if distortion rate > t, Output = Yes,

else Output = No.

In HMM the decision is based on checking if the probability score is higher than a preset threshold:

if probability scores > t, Output = Yes,

else Output = No.

- Two reference models were generated (one male and one female),
- each model was trained in 3 different ways:
- repeating the same sentence for 15 seconds
- repeating the same sentence for 40 seconds
- reading random text for one minute
- The voice database is compound from 10 different speakers (5
- males and 5 females), each speaker was recorded in 3 ways:
- repeating the reference sentence once (5 seconds)
- repeating the reference sentence 3 times (15 seconds)
- speaking a random sentence for 5 seconds

Conclusions:

Window size of 330 and offset of 110 samples performs better than window size of 256 and offset of 128 samples

Conclusions:

Feature vector of 18 coeffs is better than feature vector of 12 coeffs

- Conclusions:
- Worst combinations:
- 5 seconds of fixed sentence for testing with an
- enrolment of 15 seconds of the same sentence.
- 5 seconds of fixed sentence for testing with an
- enrolment of 40 seconds of the same sentence.

- Best combinations:
- 15 seconds of fixed sentence for testing with an
- enrolment of 40 seconds of the same sentence.
- 15 seconds of fixed sentence for testing with an
- enrolment of 60 seconds of random sentences.
- 5 seconds of a random sentence with an
- enrolment of 60 seconds of random sentences.

The Best Results:

Additional verification results

- The DSP results were compared to the Matlab simulation.
- We chose random speakers from the speakers DB with one reference codebook.
- For Example:
- PersonMATLABDSP
- Alex 69.58% (0.3627) 69.58% (0.3627)
- Sari 61.66% (0.4835) 61.66% (0.4835)
- Roee 49.97% (0.6938) 49.97% (0.6938)
- Eran 54.75% (0.6023) 54.75% (0.6023)
- Hila 55.72% (0.5849) 55.72% (0.5849)

Download Presentation

Connecting to Server..