
Detection of Target Speakers in Audio Databases


Ivan Magrin-Chagnolleau *, Aaron E. Rosenberg **, and S. Parthasarathy **
*Rice University, Houston, Texas - **AT&T Labs Research, Florham Park, New Jersey
ivan@ieee.org - aer@research.att.com - sps@research.att.com

Problem and Definitions:
• Data: broadcast-band audio data from television news programs containing speech segments from a variety of speakers, plus segments containing mixed speech and music (typically commercials) and music only. Speech segments may be of variable quality and may be contaminated by music, speech, and/or noise backgrounds.
• Speaker detection task: locate and label segments of designated speakers (target speakers) in the data.
• Overall goal: aid information retrieval from large multimedia databases.
• Assumption: segmented and labeled training data exist for target speakers, other speakers, and other audio material.

Database:
• One-target-speaker detection:
  - subset aABC_NLI of the HUB4 database (ABC Nightline)
  - target speaker: Ted Koppel
  - 3 broadcasts for training the target model
  - 12 broadcasts for testing (26 to 35 minutes)
• Two-target-speaker detection:
  - subset bABC_WNN of the HUB4 database (ABC World News Now)
  - target speakers: Mark Mullen (T1) and Thalia Assuras (T2)
  - 3 broadcasts for training the target models
  - 16 broadcasts for testing (29 to 31 minutes)

Modeling:
• Feature vectors: 20 cepstral coefficients + 20 delta-cepstral coefficients.
• Gaussian mixture models: 64 mixtures with diagonal covariance matrices.
• Target speaker models: three 90 s segments of high-fidelity speech, extracted from 3 broadcasts and concatenated together.
• First background model (B1): eight 60 s segments of high-fidelity speech (4 females, 4 males) concatenated together (from aABC_NLI).
• Second background model (B2): three 90 s segments of non-speech data (music only 10%, noise only 10%, commercials 80%), extracted from 3 broadcasts and concatenated together (from aABC_NLI).
• Third background model (B3): 29 segments (293.5 s) of high-fidelity speech (10 females, 10 males) concatenated together (from cABC_WNT).
• Fourth background model (B4): 23 segments (561.2 s) of non-speech data (commercials + theme music), extracted from 2 broadcasts and concatenated together (from bABC_WNN).

Detection algorithm:
• Log-likelihood ratio between the target model and a background model, computed for each feature vector.
• Smoothed log-likelihood ratio: the frame-level ratio averaged over a sliding window spanning 1 s of feature vectors, computed every 0.2 s.
• Segmentation algorithm: the smoothed scores are converted into hypothesized target segments (the equations and flow chart were figures that did not survive this transcript).

Evaluation:
• Frame-level Miss Rate (FMIR): # labeled target frames not estimated as target frames / total # labeled target frames.
• Frame-level False Alarm Rate (FFAR): # estimated target frames labeled as non-target frames / total # labeled non-target frames.
• Frame-level COnfusion Rate (FCOR): # labeled target frames estimated as target frames of another speaker / total # labeled target frames (FCOR is a component of FMIR).
• Segment-level Miss Rate (SMIR): # missed segments / total # target segments.
• Segment-level False Alarm Rate (SFAR): # false alarm segments divided by the total duration of the broadcast.
• Segment-level COnfusion Rate (SCOR): # confusion segments divided by the total duration of the broadcast.

Quality categories:
• high-fidelity: high-fidelity speech with no background.
• clean: all quality categories with no background.
• allspeech: all quality categories with or without background.
• alldata: the previous category plus all the untranscribed portions.

Results:
• Results of the one-target-speaker detection experiments (results table not preserved in this transcript).
• Results of the two-target-speaker detection experiments for the alldata category, using B3 and B4 for the background models (results table not preserved in this transcript).

Conclusion:
• A method for estimating target speaker segments in multi-speaker audio data using a simple sequential decision technique has been developed.
• The method does not require segregating speech and audio data, and does not require other speakers in the data to be modeled explicitly.
• The method works best for uniform-quality speaker segments with durations greater than 2 seconds.
• Approximately 70% of target speaker segments with durations of 2 seconds or greater are detected correctly, accompanied by approximately 5 false-alarm segments per hour.

Future directions:
• Use more than one model for each target speaker.
• Use more background models.
• Study the performance as a function of the smoothing parameters and the segmentation algorithm parameters.
• Use a new post-processor to find the best path through a speaker lattice.

Note 1: This work was done while the first author was with AT&T Labs Research.
Note 2: The first author would like to thank Rice University for funding his conference participation.
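The target and background models are 64-mixture Gaussian mixture models with diagonal covariance matrices. A minimal sketch of how a diagonal-covariance GMM scores one feature vector (pure Python, with tiny hypothetical toy parameters rather than the authors' trained 64-mixture models):

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector x under a diagonal-covariance GMM.

    weights: mixture weights (sum to 1)
    means, variances: per-mixture lists of per-dimension parameters
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log of w * N(x; mu, diag(var)), dimensions factorize for diagonal covariance
        log_n = -0.5 * sum(
            math.log(2 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_n)
    # log-sum-exp over mixtures for numerical stability
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Toy 2-mixture, 2-dimensional example (illustrative only)
ll = gmm_log_likelihood([0.0, 0.0],
                        weights=[0.5, 0.5],
                        means=[[0.0, 0.0], [3.0, 3.0]],
                        variances=[[1.0, 1.0], [1.0, 1.0]])
```

In the poster's setup the vectors are 40-dimensional (20 cepstral + 20 delta-cepstral coefficients) and one such model is trained per target speaker and per background condition.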
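The detection algorithm smooths the frame-level log-likelihood ratio over a sliding window spanning about 1 s, evaluated every 0.2 s, then segments the smoothed score sequence. A sketch under assumed parameters (10-ms frames, a simple threshold of 0, and a minimum run length standing in for the 2-second constraint; the poster's actual segmentation procedure was shown in a figure that did not survive the transcript):

```python
def smoothed_llr(llr, win=100, step=20):
    """Average the frame-level log-likelihood ratio over a sliding window.

    With 10-ms frames, win=100 spans 1 s and step=20 yields one score
    every 0.2 s (frame rate and exact sizes are assumptions).
    """
    out = []
    for start in range(0, max(len(llr) - win + 1, 1), step):
        chunk = llr[start:start + win]
        out.append(sum(chunk) / len(chunk))
    return out

def detect_segments(scores, threshold=0.0, min_len=2):
    """Threshold smoothed scores and keep runs of at least min_len points.

    A simplified stand-in for the poster's segmentation algorithm.
    Returns (start, end) index pairs over the smoothed-score sequence.
    """
    segments, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments
```

For example, `detect_segments([-1, 1, 1, 1, -1, 1, -1])` keeps the three-point run and discards the isolated positive score.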
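The frame-level rates (FMIR, FFAR, FCOR) can be computed directly from reference and hypothesized per-frame speaker labels. A sketch following the definitions above, assuming labels are target-speaker names or None for non-target frames (the label encoding is an assumption, not the authors' scoring tool):

```python
def frame_rates(ref, hyp, targets):
    """Frame-level miss, false-alarm, and confusion rates.

    ref, hyp: per-frame labels, each a target-speaker name or None.
    targets: set of target-speaker names.
    FCOR counts frames of one target hypothesized as another target,
    and is therefore a component of FMIR, as in the poster.
    """
    target_frames = sum(1 for r in ref if r in targets)
    nontarget_frames = len(ref) - target_frames
    miss = conf = fa = 0
    for r, h in zip(ref, hyp):
        if r in targets:
            if h != r:
                miss += 1        # labeled target frame not recovered as that target
                if h in targets:
                    conf += 1    # recovered, but as another target speaker
        elif h in targets:
            fa += 1              # non-target frame claimed as a target
    fmir = miss / target_frames if target_frames else 0.0
    ffar = fa / nontarget_frames if nontarget_frames else 0.0
    fcor = conf / target_frames if target_frames else 0.0
    return fmir, ffar, fcor
```

In the two-target experiments (T1 and T2), `targets` would be `{"T1", "T2"}`, so a T1 frame hypothesized as T2 counts toward both FMIR and FCOR.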
