Anthropological Science
Online ISSN : 1348-8570
Print ISSN : 0918-7960
ISSN-L : 0918-7960
Reviews
Non-invasive documentation of primate voice production using electroglottography
CHRISTIAN T. HERBST, JACOB C. DUNN

2018, Volume 126, Issue 1, pp. 19-27

Abstract

Electroglottography (EGG) is a low-cost, non-invasive method for documenting laryngeal sound production during vocalization. The EGG signal represents relative vocal fold contact area and thus delivers physiological evidence of vocal fold vibration. While the method has received much attention in human voice research over the last five decades, it has seen very little application in other mammals. Here, we give a concise overview of mammalian vocal production principles. We explain how mammalian voice production physiology and the dynamics of vocal fold vibration can be documented qualitatively and quantitatively with EGG, and we summarize and discuss key issues from research with humans. Finally, we review the limited number of studies applying EGG to non-human mammals, both in vivo and in vitro. The potential of EGG for non-invasive assessment of non-human primate vocalization is demonstrated with novel in vivo data of Cebus albifrons and Ateles chamek vocalization. These examples illustrate the great potential of EGG as a new minimally invasive tool in primate research, which can provide important insight into the ‘black box’ that is vocal production. A better understanding of vocal fold vibration across a range of taxa can provide us with a deeper understanding of several important elements of speech evolution, such as the universality of vocal production mechanisms, the independence of source and filter, the evolution of vocal control, and the relevance of non-linear phenomena.

Mammalian voice production

Vocal communication in non-human primates has long been of interest to both academic researchers and the broader public. This interest exists for two principal reasons. Firstly, we have an intrinsic curiosity about how these animals, which are so closely related to us, communicate with one another. Secondly, and perhaps more importantly, understanding vocal communication in our primate relatives provides us with important insight into the evolution of human speech, a topic that has fascinated humans for centuries.

Speech does not fossilize, and studying the evolution of this complex, yet critical aspect of human behavior using proxies from the fossil record (e.g. the shape of the skull or hyoid (Fitch, 2000a; Nishimura, 2003)) has led to little consensus (Fitch, 2000b; Hauser et al., 2002). A more powerful approach is provided by the comparative method, the primary tool used by Darwin to analyze evolutionary phenomena (Darwin, 1859, 1871). Comparative analyses use data from extant species to draw inferences about extinct ancestors and evolutionary processes. Several important advances in our understanding of the evolution of speech have been made using comparative data (Ghazanfar and Hauser, 1999; Fitch, 2000b; Fitch et al., 2016; Ghazanfar et al., 2012; Takahashi et al., 2013).

Humans, non-human primates, most other mammals (Herbst et al., 2012), and even birds (Elemans et al., 2015) produce sound according to a universal physical principle, described by the myoelastic aerodynamic (MEAD) theory (van den Berg, 1958; Titze, 1980). Steady airflow, coming from the lungs, is converted into a sequence of airflow pulses by the passively vibrating vocal folds (or other laryngeal or syringeal tissue), resulting in self-sustained oscillation. The acoustic pressure waveform generated by this sequence of flow pulses excites the vocal tract, which filters the pulses acoustically, and the result is radiated from the mouth (and/or the nose) (Story, 2002). The latter phenomenon, involving the individual contributions of the laryngeal sound source and the vocal tract to the quality of the emitted sound, has been described in the source–filter theory of sound production (Fant, 1960) and its non-linear extension (Flanagan, 1968; Titze, 2008). The source–filter theory thus predicts that both the laryngeal sound source and the vocal tract have distinct influences on the generated sound.
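The separation of source and filter described above can be illustrated numerically. The following is a minimal sketch under strong simplifying assumptions: the source is an idealized train of unit impulses (one per glottal cycle), the vocal tract is reduced to a single damped resonance at 1000 Hz, and there is no source-filter interaction. All function names and parameter values are hypothetical and chosen for illustration only.

```python
import math

def pulse_train(fo, fs, n):
    """Idealized source: one unit impulse per glottal cycle."""
    period = round(fs / fo)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonance(freq, bandwidth, fs, length):
    """Impulse response of one damped resonance (a single idealized formant)."""
    return [math.exp(-math.pi * bandwidth * k / fs)
            * math.sin(2 * math.pi * freq * k / fs)
            for k in range(length)]

def convolve(x, h):
    """Direct convolution of source x with filter h, truncated to len(x)."""
    y = [0.0] * len(x)
    for i, xi in enumerate(x):
        if xi:  # only the impulses contribute
            for k, hk in enumerate(h):
                if i + k < len(y):
                    y[i + k] += xi * hk
    return y

def magnitude_at(signal, freq, fs):
    """Magnitude of the discrete Fourier sum of the signal at one frequency."""
    re = sum(s * math.cos(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    return math.hypot(re, im)

def strongest_harmonic(signal, fo, fs, n_harmonics=15):
    """Frequency (Hz) of the strongest of the first n_harmonics of fo."""
    harmonics = [fo * k for k in range(1, n_harmonics + 1)]
    return max(harmonics, key=lambda f: magnitude_at(signal, f, fs))

# Assumed values: 8 kHz sampling rate, 0.2 s of signal, one formant at 1000 Hz.
fs, n = 8000, 1600
formant = resonance(1000, 100, fs, 160)
peak_low_fo = strongest_harmonic(convolve(pulse_train(200, fs, n), formant), 200, fs)
peak_high_fo = strongest_harmonic(convolve(pulse_train(250, fs, n), formant), 250, fs)
```

In this toy setting, changing the source (fo of 200 Hz versus 250 Hz) changes the spacing of the harmonics, while the strongest spectral region stays at the resonance of the filter.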

This universal sound production mechanism is facilitated by common basic vocal anatomy among mammals (certain specializations, such as air sacs or vocal membranes, notwithstanding (e.g. Charlton et al., 2013; Dunn et al., 2015)). Similarities in vocal anatomy allow for homologous functionality of sound output.

The mammalian vocal organ comprises three subsystems: the respiratory system, the larynx, and the supraglottal vocal tract. On a physical level, these subsystems constitute the power source, the sound source, and the sound modifiers, respectively (Howard and Murphy, 2007). Table 1 summarizes the most basic characteristics of the emitted vocal sound, and how they are controlled through the three voice subsystems. Note that the overview given in Table 1 is a gross simplification of a complex system. A more comprehensive discussion is provided in Herbst (2017).

Table 1. Greatly simplified model of vocal sound quality control in humans. Note that a wide variety of synergies and secondary effects exist, e.g. the positive correlation between subglottal pressure and fundamental frequency, or the enhancement of high-frequency components via proper vocal-tract adjustments in singing.

| Feature | Voice component | Control |
|---|---|---|
| Sound intensity | Power source | Tracheal/subglottal air pressure |
| Fundamental frequency (fo) | Sound source | Vocal fold length and tension |
| Degree of high-frequency energy | Sound source | Vocal fold geometry, morphology and adduction |
| ‘Breathiness’ (i.e. noise components) | Sound source | Vocal fold adduction, vocal fold geometry (pathologies/lesions) |
| Formant structure, vowel color (in humans) | Vocal tract | Vocal-tract anatomy, articulation (tongue, jaw opening, lips, vertical larynx position) |

Bioacoustic research in non-humans typically only focuses on three main parameters of the generated sound (out of the five listed in Table 1):

  1. Fundamental frequency (fo), i.e. the repetition rate of the tissue vibrations constituting the sound source, has been suggested to be an indicator for interspecific (Fletcher, 2005) and intraspecific body size (Seyfarth and Cheney, 1986); an indicator of sexual dimorphism (Fouquet et al., 2016); a cue to mate quality, motivations, and emotions; and for individual recognition of conspecifics. The lack of periodicity (resulting in irregular/chaotic signals) may be an indicator for physical condition, status, or motivation (Wilden et al., 1998; Fitch et al., 2002).
  2. The intensity of the radiated sound, typically measured via the sound pressure level (SPL), may be an indicator for age, body size, and breeding status (Sanvito and Galimberti, 2003), as well as for physical condition and motivation (Wyman et al., 2008).
  3. Finally, the formant structure of the radiated sound (determined by the convolution of the spectral properties of the laryngeal sound source and the spectral characteristics of the supraglottal vocal tract, centrally influenced by the vocal tract’s resonances) plays a central role in vocal communication. The average frequency spacing (Reby and McComb, 2003) between the individual formants is an indicator of the vocalizing animal’s vocal tract length (Fitch, 1997), which has been shown to be a cue to body size (Reby et al., 2005; Charlton et al., 2012).

These and many other important studies, mainly focusing on acoustic output, have provided important insights into the signaling function of mammalian vocal communication. In contrast, little is known about the actual voice production process in non-human mammals. The respective physical and physiological mechanisms and functional constraints of laryngeal sound generation have not yet been comprehensively investigated, mostly due to experimental difficulties in vivo.
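The relation between formant spacing and vocal tract length mentioned in the list above can be made concrete with a short calculation. The sketch below assumes the common simplification of the vocal tract as a uniform tube, closed at the glottis and open at the mouth, whose resonances lie at odd multiples of c/(4L), so that neighboring formants are spaced by c/(2L). The function names, the speed of sound of 350 m/s, and the example formant values are illustrative assumptions, not data from the cited studies, which may estimate the spacing differently.

```python
def mean_formant_spacing(formants_hz):
    """Average spacing between neighboring formants, in Hz."""
    ordered = sorted(formants_hz)
    gaps = [hi - lo for lo, hi in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

def vocal_tract_length_m(formants_hz, speed_of_sound=350.0):
    """Vocal tract length (m) from formant spacing, uniform-tube assumption."""
    return speed_of_sound / (2.0 * mean_formant_spacing(formants_hz))

# Hypothetical formants of an idealized tube: 500, 1500, 2500 Hz.
example_vtl = vocal_tract_length_m([500.0, 1500.0, 2500.0])
```

For these idealized formants the spacing is 1000 Hz, which corresponds to a tube length of 17.5 cm under the stated assumptions.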

The typical bioacoustic research paradigm treats the vocal production system as a ‘black box.’ Only sound output is analyzed, and the underlying voice production mechanisms are inferred hypothetically, based on empirical knowledge from humans. The physiological and physical framework is thus bypassed, which limits the understanding gained from such research and can lead to inappropriate conclusions about sound generation in the species studied. It is therefore important to understand the physiological and physical vocal production mechanisms of vocalizations at the sound source level. This is particularly important in non-human primates, when asking questions about vocal production with a view to understanding the evolution of human speech.

Electroglottography: method

Direct investigation of the laryngeal sound source is best accomplished via laryngeal endoscopy. Several imaging techniques exist, the foremost being videostroboendoscopy (Bless et al., 2009), videokymography (Svec and Schutte, 2012), and high-speed video (HSV) endoscopy (Deliyski and Hillman, 2010). These methods are, however, invasive, and even in humans only 90–95% of the population tolerate the procedure (Markus Hess, personal communication). In non-human primates and other mammals, the method is virtually impossible to apply in vivo, some experiments in situ with anesthetized animals notwithstanding (Berke et al., 1987; Döllinger et al., 2005).

A non-invasive, relatively low-cost alternative is electroglottography (EGG). This method was introduced by Fabre in 1957 as a bio-impedance measurement (Fabre, 1957). A high-frequency, low-voltage current is passed between two electrodes placed on either side of the thyroid cartilage at the level of the vocal folds. Changes in the vocal fold contact area (VFCA) during vocal fold vibration result in admittance variations, and the resulting EGG signal is proportional to the relative VFCA (Hampala et al., 2016).

The landmarks within a stereotypical EGG signal from human normophonic speech are illustrated in Figure 1 (taken from Hampala et al., 2016):

  (a) initial contact of the lower vocal fold margins;
  (b) initial contact of the upper vocal fold margins;
  (c) maximum vocal fold contact reached (glottis not necessarily fully closed);
  (d) de-contacting phase initiated by separation of the lower vocal fold margins;
  (e) upper margins start to separate; and
  (f) glottis is open, the contact area is minimal.

Figure 1

Schematic illustration of EGG waveform for one glottal cycle (Baken and Orlikoff, 2000; Hampala et al., 2016) (see text).

The EGG signal thus constitutes indirect physiological evidence of vocal fold vibration dynamics. There is no noteworthy influence of vocal tract acoustics, and no influence at all from room acoustics or ambient background noise, which makes the method ideal for recordings outside of laboratory conditions, where no sound-proofed environment is available. Because vocal tract acoustics leave essentially no trace in the signal, EGG is not suitable for assessing vowels in human speech or formant structures in any mammalian vocalization. It is, however, very well suited to assessing the fo of the generated sound. This is illustrated in Figure 2A, B, and G, where fundamental frequency data from simultaneously recorded acoustic and EGG signals are compared.
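Because the EGG signal repeats once per vocal fold vibration cycle, fo can be estimated from it with any standard periodicity detector. The following minimal sketch uses plain autocorrelation on a synthetic, EGG-like test signal. It is not the algorithm used for Figure 2; the function names, the search range of 50 to 1000 Hz, and the test signal are assumptions made for illustration.

```python
import math

def synthetic_egg(fo, fs, duration):
    """Toy EGG-like signal: a fundamental plus a weaker second harmonic."""
    n = int(fs * duration)
    return [math.sin(2 * math.pi * fo * i / fs)
            + 0.3 * math.sin(4 * math.pi * fo * i / fs + 0.5)
            for i in range(n)]

def estimate_fo(signal, fs, fo_min=50.0, fo_max=1000.0):
    """Estimate fo (Hz) as fs divided by the lag of maximum autocorrelation."""
    mean = sum(signal) / len(signal)
    x = [s - mean for s in signal]
    min_lag = int(fs / fo_max)
    max_lag = int(fs / fo_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized (biased) autocorrelation: longer lags sum fewer
        # products, which favors the first period over its multiples.
        value = sum(a * b for a, b in zip(x, x[lag:]))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag

# Assumed example: 0.1 s of a 200 Hz signal sampled at 8 kHz.
fo_estimate = estimate_fo(synthetic_egg(200.0, 8000, 0.1), 8000)
```

Real recordings would typically be analyzed in short overlapping frames, and a more robust pitch detector may be preferable for noisy or strongly irregular signals.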

Figure 2

Glissando (gradual fo variation) produced on vowel /a/ by a human female in laboratory conditions (sound-treated room, negligible background noise) during simultaneous recording of acoustic and EGG signal. Around t = 4 s, an abrupt change of laryngeal mechanism occurred. (A, B) Narrow-band spectrograms of acoustic and EGG signal, respectively, with the calculated fo superimposed as orange circles and cyan triangles, respectively; (C, D) EGG and dEGG wavegrams (see text); (E, F) representative EGG waveform and first derivative (dEGG), extracted at t = 3 s and t = 5 s, respectively. The arrows across panels E and F indicate the positive and negative peaks in the dEGG waveform, respectively; (G) scatter plot of fo from acoustic vs. EGG signal. A linear regression fit through the data points resulted in a perfect correlation (R² = 1).

Informed quantitative interpretation of the EGG waveform can produce deeper insights into the voice production mechanics of the analyzed vocalization. An example is given in Figure 2C–F. The graphs displayed in Figure 2C and D are so-called wavegrams, a recently introduced visualization technique for EGG signals (Herbst et al., 2010) (source code available from http://homepage.univie.ac.at/christian.herbst/index.php?page=wavegram). For wavegram generation, the analyzed signal is segmented into individual glottal vibratory cycles. For each cycle, both the duration and the EGG signal amplitude are normalized, and the normalized amplitude is linearly color-coded (low amplitudes in light color, high amplitudes in dark color). The resulting color strips are vertically oriented and then consecutively plotted along the x-axis from left to right, representing the overall time of the analyzed signal. The y-axis is mapped onto normalized intra-cycle progress, and the z-axis (encoded as color) shows the time-varying relative VFCA as recorded by EGG. Wavegrams can be constructed either from the EGG signal (Figure 2C) or from its first mathematical derivative (dEGG, i.e. the rate of change of the relative VFCA; see Figure 2D).
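The wavegram construction steps described above (cycle segmentation, normalization of duration and amplitude, one column per cycle) can be sketched as follows. This is an illustrative reimplementation, not the published wavegram code: cycle boundaries are found here by simple upward mean crossings, which is an assumption, and the function names and the choice of 50 points per cycle are hypothetical. The color mapping and plotting are omitted.

```python
import math

def find_cycle_starts(signal):
    """Indices where the signal crosses its mean in the upward direction."""
    mean = sum(signal) / len(signal)
    return [i for i in range(1, len(signal))
            if signal[i - 1] < mean <= signal[i]]

def resample_linear(values, n_points):
    """Linearly interpolate values onto n_points evenly spaced positions."""
    out = []
    for k in range(n_points):
        pos = k * (len(values) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

def wavegram(signal, n_points=50):
    """Return one normalized column (list of floats in 0..1) per glottal cycle."""
    starts = find_cycle_starts(signal)
    columns = []
    for begin, end in zip(starts, starts[1:]):
        # Duration normalization: every cycle gets the same number of points.
        cycle = resample_linear(signal[begin:end], n_points)
        # Amplitude normalization: every cycle spans the range 0..1.
        lo, hi = min(cycle), max(cycle)
        span = (hi - lo) or 1.0
        columns.append([(v - lo) / span for v in cycle])
    return columns

# Assumed test signal: ten periods of a sinusoid, 40 samples per period.
test_signal = [math.sin(2 * math.pi * i / 40 + 0.3) for i in range(400)]
columns = wavegram(test_signal)
```

Each returned column corresponds to one vertical color strip. Plotting the columns side by side, with the normalized value mapped to color, yields the wavegram layout described in the text.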

The vocalization analyzed in Figure 2 is a glissando (i.e. a gradual variation of fo) sung by an amateur soprano singer. During the glissando, the soprano unintentionally introduced a so-called vocal ‘register break,’ i.e. an abrupt variation of sound spectral characteristics, caused by an abrupt variation in the mechanism of vocal fold vibration. This register break, occurring around t = 4 s, is clearly evident in the EGG waveforms extracted at t = 3 s and t = 5 s (Figure 2E and F, respectively). The duration of contact increased from about 34% of the glottal cycle before the register break to about 67% after the register break. In this example, the contact duration was calculated from the positive and negative peaks in the dEGG signal, which have been shown to be roughly (but not precisely) representative of the instants of glottal closure and opening (Herbst et al., 2014). This relative contact duration and its development over time can also be recognized in the dEGG wavegram (Figure 2D), indicated by the horizontal dark and light lines.

The relative contact duration, typically termed the EGG contact quotient (Orlikoff, 1991), can also be calculated algorithmically, either (a) based on positive and negative peaks in the dEGG signal (as described in the previous paragraph); or (b) with threshold-based or hybrid approaches. Algorithmic calculation makes a larger corpus of data accessible to statistical appraisal, though the results should be interpreted with great care (Herbst et al., 2017). Note in particular that the different methods of estimating the contact duration yield different values for the same signal (Sapienza et al., 1998; Henrich et al., 2004; Herbst and Ternström, 2006; Kankare et al., 2012).
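The two families of contact quotient estimators can be compared on a single synthetic glottal cycle. The sketch below is a simplified illustration, not a validated implementation. It assumes that larger signal values mean more vocal fold contact, that the input holds exactly one cycle, and that the threshold is 25% of the cycle's peak-to-peak amplitude. The function names, the threshold level, and the hand-made cycle are hypothetical.

```python
def cq_degg(cycle):
    """Contact quotient from the dEGG extrema of one glottal cycle."""
    degg = [b - a for a, b in zip(cycle, cycle[1:])]  # first difference
    contacting = degg.index(max(degg))     # steepest rise: contact begins
    decontacting = degg.index(min(degg))   # steepest fall: contact ends
    return ((decontacting - contacting) % len(cycle)) / len(cycle)

def cq_threshold(cycle, level=0.25):
    """Contact quotient as the fraction of samples above a relative threshold."""
    lo, hi = min(cycle), max(cycle)
    cutoff = lo + level * (hi - lo)
    return sum(1 for v in cycle if v > cutoff) / len(cycle)

# Hand-made cycle of 100 samples: open phase, fast rise, plateau, slower fall.
cycle = ([0.0] * 20 + [0.1, 0.6, 0.9] + [1.0] * 30
         + [0.9, 0.4, 0.3, 0.1] + [0.0] * 43)

cq_from_degg = cq_degg(cycle)            # 0.33 for this cycle
cq_from_threshold = cq_threshold(cycle)  # 0.35 for this cycle
```

Even on this clean synthetic cycle the two estimators disagree slightly, which illustrates why contact quotient values obtained with different methods should not be compared directly.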

The EGG contact quotient is the main quantitative EGG analysis parameter used in human voice research (Howard, 1995; Verdolini et al., 1998; Schutte and Miller, 2001; Henrich et al., 2005; Guzmán et al., 2016). Other, less frequently applied and less rigorously evaluated quantitative EGG analysis parameters are the speed quotient and the relative contact rise time (Orlikoff, 1991), amongst others.

The EGG contact quotient is in most cases roughly equivalent to the closed quotient as derived from high-speed video endoscopy or glottal airflow analysis (Echternach et al., 2010; La and Sundberg, 2014), but important exceptions exist where the EGG contact quotient is meaningless and should not be computed at all (Herbst et al., 2017). Work with human singers suggests that the EGG contact quotient may be used to partly infer glottal configuration and thus activation of intrinsic laryngeal muscles in certain cases, but the relation is, unfortunately, not straightforward (Herbst et al., 2011, 2017).

For the example shown in Figure 2 it can be hypothesized, based on the observed EGG contact quotients, that the first part (t = 0 s to ~4 s) was produced in the so-called ‘falsetto register’ (sometimes called laryngeal mechanism M2 (Henrich, 2006)), while the second part was produced in the so-called ‘chest register’ (laryngeal mechanism M1). During phonation in the chest register the thyroarytenoid muscle is typically more contracted as compared to the falsetto register (Hirano et al., 1969; Chhetri et al., 2012).

Electroglottography: application in non-human primates

In humans, EGG has regularly been applied in voice research, clinical work, and singing voice pedagogy for about 40 years, and this millennium has seen a considerable increase in related scientific publications. This is most likely owing to the attractiveness of EGG as a low-cost, non-invasive method. In contrast, application of the method to non-human mammals has been extremely rare. Most studies involving non-human mammals (mostly dogs, but also sheep and cows (Berke et al., 1989; Alipour et al., 1996; Verdolini et al., 1998; Alipour and Jaiswal, 2008, 2009)) have been conducted in vitro or in anesthetized animals as a substitute for the human model (most likely for ethical reasons, to avoid having to investigate human larynges), targeting medical and basic voice science questions in humans.

Only recently has EGG been used in vitro in several studies with excised larynges (Garcia and Herbst, 2018), specifically targeting questions of animal bioacoustics in a number of non-human primates (Herbst et al., 2012; Garcia et al., 2017), as well as prototypical application of the method in birds (Elemans et al., 2015; Rasmussen et al., in preparation). In vivo, the methodology has, to the knowledge of the authors, only been applied twice before. In a pioneering investigation involving two adult female Sykes’ monkeys (Cercopithecus albogularis) (Brown and Cannito, 1995), the authors suggested that acoustic variation between sound emissions was principally due to different underlying laryngeal modes of vocalization. In a very recent study, EGG was applied as the central method of data acquisition with an operant conditioning approach (Koda et al., in preparation), studying an adult female Japanese macaque (Macaca fuscata) (Herbst et al., 2016, in preparation). That latter work was complemented with in vitro data from an excised Japanese macaque larynx. It provides a first SPL-calibrated documentation of three of the animal’s call types (coo, growl, and chirp), showing that the Japanese macaque voice range is comparable to that of humans 7–10 years old. EGG evidence suggested that growls, coos, and chirps were produced by distinct laryngeal vibratory mechanisms, analogous to what is known from human vocal registers (recall Figure 2). EGG data also revealed that the investigated Japanese macaque most likely varied the degree of vocal fold adduction in vivo, resulting in variations of the spectral characteristics within the emitted coo calls, ranging from ‘breathy’ (sound containing broadband noise components) to ‘non-breathy.’ This is again analogous to what is found in humans (recall Table 1), further corroborating the notion that humans and non-human primates share a universal physical and physiological sound-production mechanism (Herbst et al., in preparation).

Here, we present first qualitative insights into another recent in vivo EGG study on non-human primates, conducted at La Senda Verde Wildlife Sanctuary in Bolivia (further publications are forthcoming, e.g. Herbst and Dunn, 2018). Spontaneous vocalizations of 12 semidomesticated New World monkeys from six species were simultaneously documented with acoustic and EGG recordings. The two data acquisition strategies (recording either during spontaneous voluntary vocalization or during temporary immobilization by veterinary staff) are illustrated in Figure 3.

Figure 3

Illustration of in vivo EGG data acquisition in New World monkeys at La Senda Verde Wildlife Sanctuary, Bolivia. Two investigative paradigms were employed. (A) During spontaneous voluntary interaction between the monkeys and a local animal keeper, EGG electrodes were applied manually (left panel). The communicating animal keeper was seated in front of a glass door, behind which the recording equipment was situated (middle panel). The microphone was attached to the outer frame of the glass door, at a known distance to the monkey. (B) In cases where voluntary interaction was not possible, the monkeys were gently immobilized for a short duration (maximum 5 min) by two animal keepers, and a third animal keeper applied the EGG electrodes manually (right panel). SPL-calibrated measurements were possible through a known mouth-to-microphone distance (microphone attached to the edge of the brown table in the right panel). In most situations, the animals were highly vocal during the short period of immobilization.

Drawing from the large corpus of data we collected, an example of the causal dependency between vocal fold vibration and sound generation in a white-fronted capuchin (Cebus albifrons) is illustrated in Figure 4. The acoustic excitation event per glottal cycle, indicated by the negative peak in the acoustic signal, occurs around the instant of vocal fold contacting (marked with vertical red lines in Figure 4). This phenomenon is also typically seen in humans (Fant, 1979).

Figure 4

Typical EGG waveform (bottom: three glottal cycles displayed) and corresponding acoustic signal (top graph) recorded from a white-fronted capuchin, time-shifted by 1.06 ms to compensate for the acoustic delay of the microphone signal, as introduced by a mouth-to-microphone distance of 30 cm and an estimated vocal tract length of 6 cm. Two events of acoustic excitation (the negative minima in the acoustic signal) are indicated with vertical red lines, corresponding to the instants of maximum increase of vocal fold contact within the EGG signal.
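The time shift quoted in the caption of Figure 4 follows from the total sound path between the vocal folds and the microphone. The short calculation below assumes a speed of sound of 340 m/s, a value chosen here because it reproduces the stated 1.06 ms; the source does not state the value actually used. The function names and the sampling rate of 44.1 kHz in the second step are likewise hypothetical.

```python
def acoustic_delay_ms(mouth_to_mic_m, vocal_tract_m, speed_of_sound=340.0):
    """Travel time (ms) of sound from the vocal folds to the microphone."""
    return 1000.0 * (mouth_to_mic_m + vocal_tract_m) / speed_of_sound

def delay_in_samples(delay_ms, fs):
    """Number of whole samples by which to shift the acoustic signal."""
    return round(delay_ms * fs / 1000.0)

# Values from the caption: 30 cm mouth-to-microphone distance, 6 cm vocal tract.
delay = acoustic_delay_ms(0.30, 0.06)       # about 1.06 ms
shift = delay_in_samples(delay, fs=44100)   # assumed sampling rate
```

Shifting the acoustic signal earlier by this amount aligns each acoustic excitation event with the glottal cycle in the EGG signal that produced it.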

The superior potential of EGG, as compared to acoustic recordings, for understanding laryngeal sound generation is exemplified in Figure 5. A vocalization produced by a 3-year-old spider monkey (Ateles chamek) was documented with both acoustic and EGG recordings. When considering the acoustic signal alone, the acoustic waveform and the acoustic spectrogram of the highlighted segments would suggest regular (t = 154–172 ms) and irregular or chaotic (t = 355–374 ms) vocal fold vibration, respectively. However, when looking at the EGG signal it becomes evident that both segments are actually subharmonic (period-doubling) in nature. The EGG evidence reflects the physiological ‘ground truth’ at the laryngeal level, while the acoustic signal is likely contaminated by background noise.
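Period doubling of the kind described above can be checked numerically: in a period-doubled signal, consecutive cycles alternate in shape or amplitude, so the waveform repeats more faithfully after two cycles than after one. The following minimal sketch tests this on synthetic signals. It is not the analysis used for Figure 5; the function names, the decision margin of 0.05, and the test signals are assumptions, and the cycle length in samples is taken as known.

```python
import math

def synth_cycles(n_cycles, period, amplitudes):
    """Sinusoidal cycles whose amplitude follows a repeating pattern."""
    return [amplitudes[(i // period) % len(amplitudes)]
            * math.sin(2 * math.pi * (i % period) / period)
            for i in range(n_cycles * period)]

def normalized_autocorr(x, lag):
    """Correlation between the signal and a copy of itself shifted by lag."""
    a, b = x[:-lag], x[lag:]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den

def is_period_doubled(signal, period, margin=0.05):
    """True if the signal repeats clearly better after two cycles than one."""
    mean = sum(signal) / len(signal)
    x = [s - mean for s in signal]
    one_cycle = normalized_autocorr(x, period)
    two_cycles = normalized_autocorr(x, 2 * period)
    return two_cycles - one_cycle > margin

# Assumed test signals: ten cycles of 40 samples each.
regular = synth_cycles(10, 40, [1.0])        # every cycle identical
doubled = synth_cycles(10, 40, [1.0, 0.6])   # alternating strong and weak cycles
```

Applied to an EGG signal, which is largely free of background noise, such a check is likely to be more reliable than the same check on a noisy acoustic recording.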

Figure 5

Different interpretation of laryngeal dynamics based on acoustic and EGG signal. (A, B) Simultaneously acquired acoustic and EGG waveforms of female spider monkey vocalization; (C, D) narrow-band spectrograms of acoustic and EGG signals. The highlighted sequences (154–172 and 355–374 ms, respectively) are illustrated in more detail in panels E–H; (E, F) detailed acoustic and EGG waveforms of segment extracted at 154–172 ms; (G, H) detailed acoustic and EGG waveforms of segment extracted at 355–374 ms.

Summary

In this text we have given an overview of EGG, describing the method and its current application in humans and non-human mammals. We have documented three investigative paradigms for EGG data acquisition in non-human primates: operant conditioning, voluntary communication, and short periods of immobilization. We have demonstrated that, in comparison to acoustic recordings, EGG can be very useful for gaining deeper insights into the vocal production mechanism in non-human primates. Indeed, in some cases, EGG allows interpretations that would not be possible from analysis of acoustic recordings alone, including (the likely common) cases when acoustic data are contaminated with noise. The method thus enables a better understanding of the entire sound-production ‘loop,’ from emotional impetus/communicative situation, muscular action, physical sound production, to sound output.

We argue that EGG is a powerful new minimally invasive tool that can provide important insight into the ‘black box’ that is vocal production. A better understanding of vocal fold vibration across a range of taxa can lead to better comprehension of several important elements of speech evolution, such as the universality of vocal production mechanisms, the independence of source and filter, the evolution of vocal control, and the relevance of non-linear phenomena. Future studies should apply EGG in vivo and/or in vitro across a range of species in order to improve our knowledge about vocal production in non-humans. Such comparative studies are likely to provide important insight into the evolution of human speech.

Acknowledgments

This publication has been partially supported by an ‘APART’ grant, awarded to C.T.H. by the Austrian Academy of Sciences, and supported by the Research Units for Exploring Future Horizons of Kyoto University.

References
 
© 2018 The Anthropological Society of Nippon