Speech Intelligibility Theory

Lecture 1.2. THEORY OF SPEECH INTELLIGIBILITY

GENERAL THEORY OF SPEECH INTELLIGIBILITY • Speech messages are transmitted by direct transmission of the speech signal, by parametric methods, or by phonemic methods. • A phoneme (from Ancient Greek φώνημα, "sound") is the minimal unit of the sound structure of a language. • Direct transmission of the speech signal can be carried out over analogue, pulse or digital communication channels. In an analogue channel the signal is a harmonic oscillation in which one parameter (amplitude, frequency or phase) varies in accordance with the speech signal.

The most widespread method of measuring speech intelligibility when testing a receiving channel is the articulation method. Sound, syllable, word and phrase articulation are distinguished. They are determined using a special standard set of speech materials. Sound and syllable articulation are called speech intelligibility; word and phrase articulation are called comprehensibility.

Syllable intelligibility is determined most often, i.e. the intelligibility of sound combinations that carry no semantic meaning and are formed according to certain rules. • To obtain reliable results, the sample size is chosen in accordance with probability theory. The percentage of correctly received syllables is called the coefficient of syllable intelligibility S, and it is also used as a criterion of transmission quality over a telephone channel.
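The definition of S and the remark about sample size can be sketched in a few lines. This is only an illustration: the function names, the nonsense syllables in the usage note, and the choice of the normal approximation to the binomial distribution (with z = 1.96 for roughly 95% confidence) are assumptions, not part of the lecture.

```python
import math


def syllable_intelligibility(sent, received):
    """Return S: the percentage of syllables that were received correctly."""
    if not sent or len(sent) != len(received):
        raise ValueError("lists must be non-empty and of equal length")
    correct = sum(1 for s, r in zip(sent, received) if s == r)
    return 100.0 * correct / len(sent)


def sample_size(expected_s, margin, z=1.96):
    """Estimate how many syllables are needed to measure S within +/- margin.

    expected_s and margin are in percent. Assumes independent trials and the
    normal approximation to the binomial distribution (an assumption here).
    """
    p = expected_s / 100.0
    m = margin / 100.0
    return math.ceil(z * z * p * (1.0 - p) / (m * m))
```

For example, syllable_intelligibility(["ba", "ko", "tu", "mi"], ["ba", "ko", "du", "mi"]) gives 75.0, and under these assumptions sample_size(50, 5) suggests about 385 syllables in the worst case of S near 50%.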

A one-to-one functional dependence exists between sound intelligibility W, syllable intelligibility S, word intelligibility D and phrase intelligibility P. Knowing one of these quantities, for example S, one can therefore find any other from the tabulated dependences. The dependence of syllable intelligibility S on word intelligibility D is shown in Fig.

Speech intelligibility can also be calculated analytically. • Speech messages are distorted in communication channels by factors that distort their spectral structure. • These factors include frequency distortion, nonlinear distortion and interference, and, in systems with single-sideband modulation, a carrier recovery error. Nonlinear distortion can give rise to higher harmonics in the speech signal spectrum.
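A short worked example shows why nonlinear distortion produces higher harmonics. The polynomial channel model and the symbols a_1, a_2, a_3, A and \omega are assumptions introduced for illustration only. Passing a single tone through such a channel gives:

```latex
x(t) = A\cos\omega t, \qquad y = a_1 x + a_2 x^2 + a_3 x^3
\\[4pt]
y(t) = \frac{a_2 A^2}{2}
     + \left(a_1 A + \frac{3 a_3 A^3}{4}\right)\cos\omega t
     + \frac{a_2 A^2}{2}\cos 2\omega t
     + \frac{a_3 A^3}{4}\cos 3\omega t
```

The quadratic term contributes a constant offset and a second harmonic, and the cubic term contributes a third harmonic, none of which were present in the input tone.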

The human ear does not respond to the phase of oscillations, so no restrictions are imposed on the phase-frequency response of a telephone communication channel. • Speech intelligibility depends mainly on the amplitude-frequency response of the communication channel.

COMPRESSION OF SPEECH MESSAGES • Speech signals possess considerable redundancy, which allows them to be compressed without loss of information. This is achieved by transforming the speech signal using direct compression methods and functional transformation methods.

Direct compression is carried out by reducing the dynamic range (amplitude compression), the width of the frequency spectrum (frequency compression), or the duration of the signal (time compression).

Amplitude compression of a speech signal is intended to reduce the dynamic range of the radio signal in the transmitting device. On the receiving side the inverse operation, expansion, is applied to the received signal, i.e. the compressed dynamic range is restored to its initial value. The compressor and the expander that perform these operations are together called a compander. Companding (occasionally called compansion) is a method of mitigating the detrimental effects of a channel with limited dynamic range.

The gain (transmission coefficient) of the compressor depends on the level of the signal at its input: the higher the signal level, the lower the gain. The maximum and minimum signal levels at the compressor output therefore move closer together. Because the level of weak signals at the transmitter output is raised, the signal-to-noise ratio increases, which improves speech intelligibility. Lowering the maximum level of the useful signal reduces the probability that the amplifying devices are driven into limiting (clipping). This also helps to improve the quality of speech transmission.
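The level-dependent behaviour described above can be illustrated with a compressor/expander pair. The lecture does not name a specific compression law; the sketch below assumes the widely used mu-law characteristic with mu = 255 and samples normalised to the range [-1, 1], purely as an example.

```python
import math

MU = 255.0  # assumed compression parameter; not specified in the lecture


def compress(x, mu=MU):
    """Compress a sample x in [-1, 1]: weak signals get more gain than strong ones."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """Inverse operation on the receiving side: restore the original dynamic range."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for level in (0.01, 0.1, 0.5, 1.0):
        c = compress(level)
        # The effective gain c / level falls as the input level rises.
        print(f"in={level:<5} out={c:.4f} gain={c / level:.2f} restored={expand(c):.4f}")
```

With this characteristic a weak input of 0.01 is raised far more than a strong input of 0.5, so the two levels end up much closer together at the compressor output, and expand() undoes the operation at the receiver.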

VOCODER • A vocoder (a combination of the words voice and encoder) is an analysis/synthesis system, mostly used for speech, in which the input is passed through a multiband filter, the output of each band is passed through an envelope follower, the control signals from the envelope followers are transmitted, and the decoder applies these (amplitude) control signals to the corresponding filters in the (re)synthesizer.
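The analysis/synthesis chain described above can be sketched as follows. This is a minimal illustration, not a reference design: the second-order band-pass filter, the quality factor q, the envelope time constant tau, the band centre frequencies and the carrier signal are all assumptions chosen for simplicity.

```python
import math


def bandpass(x, f0, fs, q=4.0):
    """Second-order band-pass filter centred on f0 Hz (unity gain at f0)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def envelope(x, fs, tau=0.01):
    """Envelope follower: rectify, then smooth with a one-pole low-pass filter."""
    k = 1.0 - math.exp(-1.0 / (tau * fs))
    env, out = 0.0, []
    for xn in x:
        env += k * (abs(xn) - env)
        out.append(env)
    return out


def analyze(speech, centers, fs):
    """Encoder: one slowly varying envelope per band; only these are transmitted."""
    return [envelope(bandpass(speech, f0, fs), fs) for f0 in centers]


def synthesize(envelopes, carrier, centers, fs):
    """Decoder: filter a carrier through the same bands and scale by the envelopes."""
    out = [0.0] * len(carrier)
    for env, f0 in zip(envelopes, centers):
        band = bandpass(carrier, f0, fs)
        for n, (e, b) in enumerate(zip(env, band)):
            out[n] += e * b
    return out
```

In use, analyze() would run at the transmitter and synthesize() at the receiver, with both sides sharing the same list of band centres. The carrier (for example a pulse train or noise generated locally at the receiver) should have the same length as the analysed speech.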

It was originally developed as a speech coder for telecommunications applications in the 1930s, the idea being to code speech for transmission. Its primary use in this fashion is for secure radio communication, where voice has to be encrypted and then transmitted. The advantage of this method of "encryption" is that the original signal itself is not sent, only the envelopes of the bandpass filter outputs.

The receiving unit needs to be set up in the same channel configuration to resynthesize a version of the original signal spectrum. The vocoder as both hardware and software has also been used extensively as an electronic musical instrument. • The vocoder is related to, but essentially different from, the computer algorithm known as the "phase vocoder".

Whereas the vocoder analyzes speech, transforms it into electronically transmitted information, and recreates it, the voder (from Voice Operating Demonstrator) generates synthesized speech by means of a console with fifteen touch-sensitive keys and a foot pedal, basically consisting of the "second half" of the vocoder, but with manual filter controls, needing a highly trained operator.

Vocoder theory • The human voice consists of sounds generated by the opening and closing of the glottis by the vocal cords, which produces a periodic waveform with many harmonics. This basic sound is then filtered by the nose and throat (a complicated resonant piping system) to produce differences in harmonic content (formants) in a controlled way, creating the wide variety of sounds used in speech. There is another set of sounds, known as the unvoiced and plosive sounds, which are created or modified by the mouth in different fashions.

