The process of speech production and perception in Human Beings

Presentation Transcript


  1. The process of speech production and perception in Human Beings

  2. Speech Generation • The production (generation) process begins when the talker formulates a message in his or her mind to transmit to the listener via speech. • In the machine counterpart: • First step: the message is formed as printed text. • Next step: the message is converted into a language code.

  3. Speech Generation • After the language code is chosen, the talker must execute a series of neuromuscular commands that cause the vocal cords to vibrate such that the proper sequence of speech sounds is created. • The neuromuscular commands must simultaneously control the movement of the lips, jaw, tongue, and velum.

  4. Speech Perception • Once the speech signal has been generated and propagated to the listener, the speech perception (recognition) process begins. • First the listener processes the acoustic signal along the basilar membrane in the inner ear, which provides a running spectral analysis of the incoming signal.

  5. Sound perception • The audible frequency range for humans is approximately 20 Hz to 20 kHz. • The three distinct parts of the ear are the outer ear, the middle ear, and the inner ear. • Outer ear: includes the pinna and the auditory canal. It helps direct an incident sound wave into the middle ear, and it filters and modifies the captured sound.

  6. Outer ear: • The perceived sound is sensitive to the pinna’s shape. • Changing the pinna’s shape alters the perceived sound quality as well as the background noise. • After passing through the ear canal, the sound wave strikes the eardrum, which is part of the middle ear.

  7. Middle ear • Eardrum: oscillates at the same frequency as the sound wave. • Movements of this membrane are transmitted through a system of small bones called the ossicular system. • The ossicular system passes the vibrations on to the cochlea and provides an efficient form of impedance matching between the air and the cochlea.

  8. Inner ear • The cochlea contains two membranes: Reissner’s membrane and the basilar membrane. • When vibrations enter the cochlea they stimulate 20 000 to 30 000 stiff hairs on the basilar membrane. • These hairs in turn vibrate and generate electrical signals that travel to the brain, where they are perceived as sound.

  9. [Figure: the path of sound through the ear. Labels: pinna, auditory canal, tympanic membrane, ossicular system, cochlea, basilar membrane]

  10. Speech Perception • A neural transduction process converts the spectral signal into activity signals on the auditory nerve. • The neural activity is converted into a language code in the brain. • Finally the message comprehension (understanding of meaning) is achieved.

  11. Any coding technique could be used to transmit the acoustic waveform from the talker to the listener.

  12. The Speech-Production Process [Figure: longitudinal cross-section of the vocal tract, which begins at the opening of the vocal cords and ends at the lips]

  13. The Speech-Production Process • The vocal tract consists of: • Pharynx: the connection from the esophagus to the mouth • Mouth, or oral cavity • The total length is about 17 cm. • The cross-sectional area, determined by the positions of the tongue, lips, jaw, and velum, varies from zero (complete closure) to about 20 cm². • The nasal tract begins at the velum and ends at the nostrils.
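To get a feel for what a 17 cm tract length implies, a common textbook simplification (not stated in the slides) treats the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / 4L. The sketch below assumes a speed of sound of 340 m/s; the function name and defaults are illustrative only.

```python
def uniform_tube_resonances(length_m=0.17, speed_of_sound=340.0, count=3):
    """Resonances (Hz) of an idealized tube closed at one end, open at the other.

    Assumption: a uniform cross-section, which a real vocal tract never has.
    """
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, count + 1)
    ]


resonances = uniform_tube_resonances()
print([round(f) for f in resonances])
```

Under these assumptions the resonances are spaced roughly 1 kHz apart; moving the tongue, lips, and jaw changes the area along the tract and shifts them away from these idealized values.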

  14. The Speech-Production Process • Velum: a trap-door like mechanism at the back of the mouth cavity. • When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce nasal sounds of speech.

  15. The Speech-Production Process

  16. The Speech-Production Process

  17. The Speech-Production Process • The lungs and the associated muscles excite the vocal mechanism. • Muscle force pushes air out of the lungs and through the bronchi and trachea. • When the vocal cords are tensed, the air flow causes them to vibrate, producing so-called voiced speech sounds. • When the vocal cords are relaxed, sound is produced instead by turbulent air flow through a constriction in the vocal tract or by the sudden release of pressure built up behind a closure (unvoiced sounds). • Speech is produced as a sequence of sounds.

  18. Representing Speech In The Time and Frequency Domains The speech signal is a slowly time-varying signal.

  19. Representing Speech In The Time and Frequency Domains Events in speech are classified according to the state of the vocal cords: • Silence (S): no speech is produced. • Unvoiced (U): the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature. • Voiced (V): the vocal cords are tensed and therefore vibrate periodically when air flows from the lungs, so the resulting speech waveform is quasi-periodic (the pulses are not strictly periodic, but vary slightly from cycle to cycle).
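The three-way S/U/V labelling can be sketched with two simple frame measurements: short-time energy (low for silence) and zero-crossing rate (high for noise-like unvoiced sounds, low for quasi-periodic voiced sounds). This is one common heuristic, not a method the slides prescribe; the thresholds, frame length, 8 kHz sampling rate, and synthetic test signal are all assumptions chosen for illustration.

```python
import numpy as np


def classify_frames(signal, frame_len=400, energy_floor=1e-4, zcr_split=0.25):
    """Label each non-overlapping frame as 'S', 'U', or 'V'."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        # Fraction of adjacent sample pairs whose signs differ.
        zcr = float(np.mean(np.signbit(frame[1:]) != np.signbit(frame[:-1])))
        if energy < energy_floor:
            labels.append("S")   # almost no energy: silence
        elif zcr > zcr_split:
            labels.append("U")   # noise-like: unvoiced
        else:
            labels.append("V")   # low-frequency, periodic: voiced
    return labels


# Synthetic stand-ins for the three classes (assumed 8 kHz sampling rate).
rate = 8000
t = np.arange(800) / rate                        # 100 ms per segment
rng = np.random.default_rng(0)
silence = np.zeros(t.size)
unvoiced = 0.1 * rng.standard_normal(t.size)     # random, aperiodic
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)       # periodic at 120 Hz
labels = classify_frames(np.concatenate([silence, unvoiced, voiced]))
print(labels)
```

Real speech needs thresholds tuned to the recording conditions, and the boundaries between the three classes are often not sharp.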

  20. Representing Speech In The Time and Frequency Domains • Spectral representation: an alternative way of characterizing the speech signal and representing the information associated with the sounds. • Sound spectrogram (the most popular representation): a three-dimensional representation that portrays speech intensity in different frequency bands over time.
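A spectrogram of this kind can be sketched as a short-time Fourier transform: the signal is cut into overlapping windowed frames and the magnitude spectrum of each frame becomes one time slice. The frame length, hop size, Hann window, and 8 kHz test tone below are assumptions for illustration, not values from the slides.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Return a (frames x frequency bins) array of magnitudes in dB."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small offset avoids log(0) in silent frames.
    return 20.0 * np.log10(magnitude + 1e-10)


rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone, 1 second
spec = spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=0)))
print(spec.shape, peak_bin * rate / 256)
```

The three dimensions are the frame index (time), the bin index (frequency), and the stored dB value (intensity); a plotting library would normally render the last as gray level or color.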

  21. Speech Sounds and Features Front vowels: i (IY), I (IH), e (EH), æ (AE)

  22. Phonetics Phonetics (from the Greek word φωνή, phone = sound/voice) is the study of the sounds of the voice. It is concerned with the actual properties of speech sounds (phones) as well as those of non-speech sounds, and with their production, audition, and perception.

  23. Phonetics has three main branches: • articulatory phonetics, concerned with the positions and movements of the lips, tongue, vocal tract and folds and other speech organs in producing speech • acoustic phonetics, concerned with the properties of the sound waves and how they are received by the inner ear • auditory phonetics, concerned with speech perception, principally how the brain forms perceptual representations of the input it receives.

  24. • Phoneme: (linguistics) one of a small set of speech sounds that are distinguished by the speakers of a particular language • Syllable: a unit of spoken language larger than a phoneme

  25. Speech Sounds and Features The vowels: • The vowel sounds are an interesting class of sounds in English. • Practical speech-recognition systems rely heavily on vowel recognition to achieve high performance. • If we omit the vowel letters from a written sentence, the resulting text is still easy to decode. • But if we omit the consonant letters, the resulting text is not decodable.

  26. Speech Sounds and Features They noted significant improvements in the company’s image, supervision, their working conditions, benefits and opportunities for growth
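The omission experiment on this sentence is easy to reproduce. The sketch below strips vowel letters or consonant letters from the opening of the example sentence; treating only a, e, i, o, u as vowel letters (so "y" counts as a consonant) is an assumption, and the helper names are illustrative.

```python
VOWEL_LETTERS = set("aeiouAEIOU")  # assumption: "y" is treated as a consonant


def drop_vowels(text):
    """Remove vowel letters, keeping consonants, spaces, and punctuation."""
    return "".join(ch for ch in text if ch not in VOWEL_LETTERS)


def drop_consonants(text):
    """Remove consonant letters, keeping vowels, spaces, and punctuation."""
    return "".join(
        ch for ch in text if not ch.isalpha() or ch in VOWEL_LETTERS
    )


sentence = "They noted significant improvements in the company's image"
print(drop_vowels(sentence))
print(drop_consonants(sentence))
```

The first line printed can usually still be read with some effort, while the second gives the reader almost nothing to work with.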

  27. Speech Sounds and Features • In speaking, vowels are produced by exciting an essentially fixed vocal tract shape with quasi-periodic pulses of air caused by the vibration of the vocal cords. • The way in which the cross-sectional area varies along the vocal tract determines the resonance frequencies of the tract and thereby the sound that is produced.
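The excitation-plus-resonance picture above can be sketched by passing a pulse train through a resonant filter. The example is heavily simplified: it uses strictly periodic unit pulses and a single two-pole resonator where a real vocal tract has several resonances, and the 100 Hz pitch, 500 Hz resonance, 80 Hz bandwidth, and 8 kHz sampling rate are assumed values.

```python
import numpy as np


def pulse_train(rate, pitch_hz, duration_s):
    """Idealized glottal source: one unit pulse per pitch period."""
    source = np.zeros(int(rate * duration_s))
    source[::int(rate / pitch_hz)] = 1.0
    return source


def resonator(signal, rate, center_hz, bandwidth_hz):
    """Two-pole filter standing in for a single vocal tract resonance."""
    r = np.exp(-np.pi * bandwidth_hz / rate)      # pole radius
    theta = 2.0 * np.pi * center_hz / rate        # pole angle
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    out = np.zeros(len(signal))
    for n in range(len(signal)):
        out[n] = signal[n]
        if n >= 1:
            out[n] += a1 * out[n - 1]
        if n >= 2:
            out[n] += a2 * out[n - 2]
    return out


rate = 8000
vowel_like = resonator(pulse_train(rate, 100, 1.0), rate, 500, 80)
spectrum = np.abs(np.fft.rfft(vowel_like))
peak_hz = int(np.argmax(spectrum)) * rate / len(vowel_like)
print(peak_hz)
```

The source contributes harmonics at multiples of the pitch, and the filter decides which of them are emphasized; changing `center_hz` while keeping the same source mimics changing the vocal tract shape.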

  28. Speech Sounds and Features • The vowel sound produced is determined by the position of the tongue. • The positions of the jaw, lips, and to a small extent, the velum, also influence the result of the sound.

  29. Speech Sounds and Features • Vowels are long in duration compared with consonant sounds. • They are spectrally well defined. • They are easily and reliably recognized, both by machines and by humans.

  30. Speech Sounds and Features A convenient and simplified way of classifying vowel articulatory configuration is in terms of tongue hump position (front, mid, back) and tongue hump height (high, mid, low)

  31. Speech Sounds and Features The concept of a “typical” vowel sound is unreasonable in light of the variability of vowel pronunciation among men, women and children with different regional accents and other variable characteristics.

  32. Diphthong • In phonetics, a diphthong (Greek δίφθογγος, "diphthongos", literally "with two sounds") is a vowel combination involving a quick but smooth movement from one vowel to another. • It is often interpreted by listeners as a single vowel sound or phoneme. • While "pure" vowels, or monophthongs, are said to have one target tongue position, diphthongs have two target tongue positions.

  33. Pure vowels are represented in the International Phonetic Alphabet by one symbol: English "sum" as [səm], for example. • Diphthongs are represented by two symbols, for example English "same" as [seɪm], where the two vowel symbols are intended to represent approximately the beginning and ending tongue positions.

  34. Diphthongs in British English: • [əʊ] as in hope • [aʊ] as in house • [aɪ] as in kite • [eɪ] as in same • [juː] as in few • [ɔɪ] as in join • [ɪə] as in fear • [ɛə] as in hair • [ʊə] as in poor

  35. Speech Sounds and Features Semivowels: (a vowel-like sound that serves as a consonant) • The group of sounds consisting of /w/, /l/, /r/, and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature. • They are similar in nature to the vowels and diphthongs. • Example: in ‘how’, the w sounds close to the ‘oo’ in ‘boot’.

  36. The English word "well" would sound the same if it were spelled "uell" or "ooell". This must be considered the other way round: the fact that the spellings "uell" or "ooell" might also be pronounced with /w/ doesn’t mean that /w/ is a vowel as in boot. The written form has nothing to do with phonetics, so set it aside and you’ll find that /w/ is a consonant and /u(:)/ a vowel.

  37. Speech Sounds and Features Nasal Consonants • The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway. • The velum is lowered so that air flows through the nasal tract, with sound being radiated at the nostrils.

  38. • The oral cavity, although constricted toward the front, is still acoustically coupled to the pharynx. • The mouth serves as a resonant cavity that traps acoustic energy at certain natural frequencies.

  39. Voiced consonant • A voiced consonant is a sound made as the vocal cords vibrate, as opposed to a voiceless consonant, where the vocal cords are relaxed. See phonation for a continuum of degrees of tension in the vocal cords. • Examples of voiced/voiceless pairs of consonants: • [b] / [p] • [d] / [t] • [g] / [k]

  40. Voiced consonant If you place your fingers on your voice box, you can feel a buzz when you pronounce zzzz, but not when you pronounce ssss. That buzz is the vibration of your vocal cords. Except for this, the sounds [s] and [z] are practically identical, with the same use of tongue and lips.

  41. Unvoiced consonant • In phonetics, a voiceless consonant is a consonant that doesn't have voicing. That is, it is produced without vibration of the vocal cords. • Voiceless consonants are usually articulated more strongly than their voiced counterparts, because in voiced consonants, the energy used in pronunciation is split between the laryngeal vibration and the oral articulation. • Ex. peculiar and particular.

  42. Speech Sounds and Features Voiced Fricatives: • The voiced fricatives are /v/, /th/, /z/, and /zh/. • Their unvoiced counterparts are /f/, /θ/, /s/, and /sh/, respectively. • For voiced fricatives the vocal cords are vibrating. • Because the vocal tract is constricted (narrowed) at some point forward of the glottis, the air flow becomes turbulent in the neighborhood of the constriction.

  43. Speech Sounds and Features Voiced Stops • The voiced stop consonants /b/, /d/, and /g/ are transient, noncontinuant sounds produced by building up pressure behind a total constriction somewhere in the oral tract and then suddenly releasing the pressure. • For /b/ the constriction is at the lips; for /d/ it is at the back of the teeth; and for /g/ it is near the velum. • Unvoiced stops: the unvoiced stop consonants are /p/, /t/, and /k/.

  44. Approaches to Automatic Speech Recognition by Machine Three approaches: • The acoustic-phonetic approach. • The pattern-recognition approach. • The artificial intelligence approach.

  45. Approaches to Automatic Speech Recognition by Machine • The smallest linguistically distinct unit in speech is called a phoneme. It has no meaning on its own, but it makes it possible to discriminate between different words. • Speech signals are sequences of linguistically separable units (phonemes). • Transitions between phonemes are mostly relatively smooth. • A phone is the physical sound that is produced when a phoneme is uttered. • One phoneme can be pronounced in different ways; each of the similar variants of a single phoneme is called an allophone.

  46. Approaches to Automatic Speech Recognition by Machine The acoustic-phonetic approach is based on the theory of acoustic phonetics. • First step: segmentation and labeling of the speech signal. • Second step: determining a valid word from the sequence of phonetic labels.
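The two steps can be sketched on toy data. The example assumes that some front end has already assigned a phonetic label to every frame; the labels, the two-word lexicon, and the function names are hypothetical, and real systems score many competing segmentations rather than performing a single exact lookup.

```python
from itertools import groupby

# Hypothetical toy lexicon mapping phoneme sequences to words.
LEXICON = {
    ("s", "ey", "m"): "same",
    ("s", "ah", "m"): "sum",
}


def segment_and_label(frame_labels):
    """Step 1: merge runs of identical frame labels into labeled segments."""
    return tuple(label for label, _run in groupby(frame_labels))


def decode_word(frame_labels, lexicon=LEXICON):
    """Step 2: look the segment sequence up in the lexicon (None if absent)."""
    return lexicon.get(segment_and_label(frame_labels))


frames = ["s", "s", "s", "ey", "ey", "ey", "ey", "m", "m"]
word = decode_word(frames)
print(segment_and_label(frames), word)
```

A single mislabeled frame is enough to make this lookup fail, which hints at why segmentation and labeling are the hard part of the approach in practice.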

  47. Approaches to Automatic Speech Recognition by Machine The pattern-recognition approach to speech recognition is basically one in which the speech patterns are used directly without explicit feature determination and segmentation • Step one: training of speech patterns • Step two: recognition of pattern via pattern comparison • Speech “knowledge” is brought into the system via the training procedure
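The train-then-compare loop can be sketched with stored reference patterns and a distance measure. Dynamic time warping is used here as the comparison because it tolerates differences in speaking rate; that choice, the one-dimensional toy "features", and all names are assumptions, since the slides do not name a specific comparison algorithm.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(cost[-1, -1])


def train(labelled_patterns):
    """Step one: store the training patterns as references for each word."""
    templates = {}
    for word, pattern in labelled_patterns:
        templates.setdefault(word, []).append(pattern)
    return templates


def recognize(pattern, templates):
    """Step two: return the word whose closest reference is nearest."""
    return min(
        templates,
        key=lambda w: min(dtw_distance(pattern, ref) for ref in templates[w]),
    )


templates = train([("up", [1, 2, 3, 4]), ("down", [4, 3, 2, 1])])
result = recognize([1, 1, 2, 3, 3, 4], templates)   # a slower "up"
print(result)
```

No phonetic knowledge is coded anywhere in this sketch: whatever the recognizer knows about speech is contained in the stored templates.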

  48. Approaches to Automatic Speech Recognition by Machine The pattern-recognition approach is the method of choice for speech recognition for three reasons: • Simplicity of use. The method is easy to understand, it is rich in mathematical and communication-theory justification for the individual procedures used in training and decoding, and it is widely used and understood. • Robustness and invariance to different speech vocabularies, users, feature sets, pattern-comparison algorithms, and decision rules. • Proven high performance.
