Computers as Listeners and Speakers
November 4, 2013
It's usually easy to distinguish male from female voices, since most women speak and sing at higher frequencies than most men. If you do a more technical analysis of speech signals, you see that all speech information is contained in the audio frequency band below 20 kHz. Further experimentation shows that intelligible speech is contained in frequencies between 300 and 3400 Hz, with most of the amplitude contained between 80 and 260 Hz.
The frequency range of intelligible speech was very important in the definition of the telephone system, since you wouldn't want to spend money on high frequency components that weren't required. Telephone research also led to the first form of digital encoding of speech in a system called the vocoder, patented in 1939 by Bell Labs acoustical engineer Homer Dudley.[1]
In what was a tour de force in the era of vacuum tube electronics, Dudley used a bank of audio filters to determine the amplitude of the speech signal in each band. These amplitudes were encoded as digital data for transmission to a remote bank of oscillators that reconstructed the signal.
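The analyze-transmit-resynthesize idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not Dudley's circuit: the sample rate, the four band centers, and the 8-bit amplitude codes are all hypothetical, and each band filter is replaced by a simple correlation of the frame with a sine and cosine at the band center.

```python
import math

SAMPLE_RATE = 8000                        # samples per second (assumed)
BAND_CENTERS_HZ = [250, 500, 1000, 2000]  # hypothetical band centers
LEVELS = 255                              # 8-bit amplitude codes (assumed)


def analyze(frame):
    """Return one integer amplitude code per band for a frame of samples."""
    codes = []
    for freq in BAND_CENTERS_HZ:
        w = 2 * math.pi * freq / SAMPLE_RATE
        # Correlate the frame with a cosine and a sine at the band center;
        # this stands in for a band filter followed by a level detector.
        re = sum(x * math.cos(w * k) for k, x in enumerate(frame))
        im = sum(x * math.sin(w * k) for k, x in enumerate(frame))
        amplitude = 2 * math.hypot(re, im) / len(frame)
        codes.append(min(LEVELS, round(amplitude * LEVELS)))
    return codes


def synthesize(codes, n_samples):
    """Rebuild a frame from the codes with one sine oscillator per band."""
    frame = []
    for k in range(n_samples):
        sample = 0.0
        for freq, code in zip(BAND_CENTERS_HZ, codes):
            w = 2 * math.pi * freq / SAMPLE_RATE
            sample += (code / LEVELS) * math.sin(w * k)
        frame.append(sample)
    return frame


# A 20 ms test frame: a single 500 Hz tone of amplitude 0.8.
N = 160
tone = [0.8 * math.sin(2 * math.pi * 500 * k / SAMPLE_RATE) for k in range(N)]
codes = analyze(tone)            # the "digital data" sent down the line
rebuilt = synthesize(codes, N)   # what the remote oscillators produce
```

Only four small integers per frame cross the channel instead of 160 samples, which is where the compression comes from. The sketch discards phase, so the rebuilt frame matches the original in band amplitudes rather than sample by sample.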
The vocoder allowed compression and multiplexing of many voice channels over a single submarine cable. Encryption of the digital data allowed secure voice communications, a technique used during World War II.
The vocoder operated on band-limited signals, independently of their origin. It was overkill as far as speech signals are concerned, since human speech is contained in definite frequency bands called formants (see figure). Formants arise from the way that human speech is generated. Air flow through the larynx produces an excitation signal that excites resonances in the vocal tract.
Knowledge of the way that human speech is created allowed development of a speech synthesis technique called formant synthesis, which is modeled on the physical production of sound in the human vocal tract. This was best developed as linear predictive coding (LPC), successfully implemented by Texas Instruments in its LPC integrated circuits. Texas Instruments used these chips in its Speak & Spell toy. My e-book reader has a very good text-to-speech feature with both male and female speakers.
It should come as no surprise that research in artificial speech production has led to methods for speech recognition. The Wikipedia list of speech recognition software includes quite a few implementations, including the ever-popular Siri, Google Voice Search, and a number of free and open-source software (FOSS) packages.
Some early voice recognition software improved reliability by having a single user speak words from a selected dictionary to calibrate the system to that user's voice. Modern applications try as much as possible to hide the "computer" part of computing from the user, so this is no longer done. As an episode of The Big Bang Theory shows, such voice recognition has its flaws, even in a single-speaker environment. Is speech recognition of multiple speakers in a conversation even possible with today's technology?
Humans have no trouble with the task of identifying speakers in a group conversation, so how hard would it be for a computer to do the same? A team of computer scientists in the Spoken Language Systems Group at MIT's Computer Science and Artificial Intelligence Laboratory has tackled this problem, which is termed "speaker diarization."[3-5] Speaker diarization is the automatic determination of how many speakers there are, and which of them speaks when. It would be useful for indexing and annotating audio and video recordings.
A sonic representation of a single speaker involves the analysis of more than 2,000 different speech sounds, such as the vowel sounds represented in the spectrogram figure. Each of these can be adequately represented by about sixty variables, so the diarization problem reduces to a search of a parameter space of more than 100,000 dimensions (2,000 × 60 = 120,000). Since you would like to avoid always needing to do diarization on a supercomputer, you need a way to reduce the complexity of the problem.
As an analogy for how such a simplification is achieved, consider the cumulative miles traveled by a train as a function of time. If we just consider the raw data, we would have a two-dimensional graph of miles (y) vs time (x), represented by a straight line. If we execute a mathematical transformation to rotate the graph so that the line lies on the x-axis, then all the variation happens along the x-axis, and we eliminate one of the two dimensions. The MIT research team's approach is to find the "lines" in the parameter space that encode most of the variation.
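The train analogy is easy to try numerically. The following is a minimal sketch of the analogy only, not of the MIT method; the 60 mph speed and the sample times are hypothetical values chosen for illustration.

```python
import math


def rotate_onto_x_axis(points):
    """Rotate 2-D points about the origin so that the line from the
    origin to the last point ends up lying along the x-axis."""
    x_end, y_end = points[-1]
    angle = math.atan2(y_end, x_end)          # slope angle of the line
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a)
            for x, y in points]


speed_mph = 60.0                              # hypothetical constant speed
hours = [0.0, 0.5, 1.0, 1.5, 2.0]             # hypothetical sample times
track = [(t, speed_mph * t) for t in hours]   # (time, miles) raw data

rotated = rotate_onto_x_axis(track)
# After the rotation every y value is zero to within rounding error,
# so one coordinate per point carries all of the variation.
one_dimensional = [x for x, _ in rotated]
```

In the real problem the "line" is not known in advance and must be estimated from the data, but the payoff is the same: fewer coordinates to search.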
Spectrograms of the average female (left) and male (right) voicing of vowels. These are the English vowel sounds 'eh' (bet), 'ee' (see), 'ah' (father), 'oh' (note), and 'oo' (boot). Note the overall lower frequencies of the male voice, as well as the slower male cadence. (Fig. 1 of ref. 2, licensed under a Creative Commons License.)
Stephen Shum, a graduate student of Electrical Engineering and Computer Science at MIT, and the lead author of the paper describing the technique, found that a 100-dimension approximation of the parameter space was an adequate representation. In any given conversation, not all speech sounds are used, so a single recording might need just three variables to classify all speakers.
Shum's system starts with an assumption that there are fifteen speakers, and it uses an iterative process to reduce the number by merging close clusters until the actual number of speakers is reached. The technique was tested with the multi-speaker CallHome telephone corpus.
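The merge-until-done idea can be illustrated with a generic agglomerative clustering loop. This is a sketch under stated assumptions and not the algorithm of the paper: the centroid-distance criterion, the stopping threshold, and the three-variable test points are all hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n
                 for i in range(len(points[0])))


def distance(a, b):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_close_clusters(clusters, threshold):
    """Repeatedly merge the two clusters with the nearest centroids,
    stopping when no pair is closer than `threshold` (an assumed rule)."""
    clusters = [list(c) for c in clusters]    # work on a copy
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]                       # j > i, so index i stays valid
    return clusters


# Hypothetical data: three speakers, each seen as five nearby points in
# a three-variable space, giving fifteen starting clusters.
speaker_centers = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
start = []
for cx, cy, cz in speaker_centers:
    for k in range(5):
        start.append([(cx + 0.1 * k, cy + 0.05 * k, cz)])

merged = merge_close_clusters(start, threshold=3.0)
speaker_count = len(merged)
```

Starting with too many clusters and merging downward means the number of speakers does not need to be known in advance; it falls out of the stopping rule.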
A representation of the cluster analysis of multiple speakers. (Still image by Stephen Shum from a YouTube Video.)
References:
[1] Homer W. Dudley, "Signal transmission," US Patent No. 2,151,091, March 21, 1939.
[2] Daniel E. Re, Jillian J. M. O'Connor, Patrick J. Bennett and David R. Feinberg, "Preferences for Very Low and Very High Voice Pitch in Humans," PLoS ONE, vol. 7, no. 3 (March 5, 2012), Article No. e32719.
[3] Stephen H. Shum, Najim Dehak, Réda Dehak and James R. Glass, "Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10 (October 2013), pp. 2015-2028.
[4] Larry Hardesty, "Automatic speaker tracking in audio recordings," MIT Press Release, October 18, 2013.
[5] YouTube Video, Clustering method of Speech Recognition, Stephen Shum, October 8, 2013. The algorithm groups together the points that are associated with a single speaker.
[6] Web Site of MIT Spoken Language Systems Group.