Journal of the Japanese Association for Digital Humanities
Online ISSN: 2188-7276
2022, Volume 6, Issue 1, Pages 11–20

Computational Analysis of a Horror Film Trailer Soundtrack
Nick Redfern
Abstract

Computational film analysis combines statistics, data science, information visualization, and computer science with the analytical approaches and domain knowledge of the study of film to frame and answer questions about form and style in the cinema. In this article I illustrate the visualization and analysis of the soundtrack of the trailer for the Korean horror film Into the Mirror (2003), using the spectrogram and the normalized aggregated power envelope to show how the sound design of the trailer evolves over time and functions at different scales. I address a range of practical considerations relevant to the methods demonstrated here, relating to the use of mono versus stereo audio files, the use of different sampling rates, and audio normalization. A supporting website with the code for the statistical programming language R used in this analysis is available at https://rpubs.com/nr62_rp33/CFA-into-the-mirror.

Sound in the cinema shapes the viewer’s experience of a film in multiple ways. It can anticipate narrative events through the creation of mood; carry thematic structures across different sequences; reinforce events occurring on screen by emphasizing elements of the visual and adding to the interpretation of what is seen; influence the experience of pace and tempo in a sequence; shape the meaning given to a scene; and guide the viewer’s emotional experience of a film. Audio content is therefore a rich source of information about narrative structure, action, and emotion in motion pictures. However, computational approaches to the analysis of film style are underdeveloped when it comes to the analysis of sound in the cinema. Brotto (2013, 155) observed that computational methods for exploring film style lacked a method for analyzing sound. Despite an abundance of freely available audio analysis tools developed in the fields of bioacoustics, soundscape ecology, and computational music analysis, few (Ma et al. 2021; Moncrieff, Dorai, and Venkatesh 2002; Pfeiffer and Srinivasan 2002; Redfern 2020a, 2020b, 2021) have attempted to apply these technologies to the analysis of motion picture soundtracks.

In this article I demonstrate the visualization and analysis of the soundtrack of the trailer for the Korean horror film Into the Mirror (2003) within the framework of computational film analysis to produce a critical narrative of the trailer’s soundtrack. I use the packages tuneR (v. 1.3.3; Ligges et al. 2018) and seewave (v. 2.1.5; Sueur, Aubin, and Simonis 2008) for the statistical programming language R (v. 4.0.0; R Core Team 2020), all of which are freely available and open source. Anyone specifically interested in audio analysis in R will find Jerome Sueur’s Sound Analysis and Synthesis with R (Sueur 2018) useful. Finally, I discuss practical considerations relating to sampling rates, mono versus stereo soundtracks, and normalization and to what degree these factors will have an impact on the analytical process.

Computational film analysis

Roger Scruton (1997, 396–98) describes the process of analyzing a work of art as building a bridge from the formal structure of a work to the aesthetic experience it affords through the construction of a “critical narrative.” Such a narrative enables us to experience what is present in a work of art, to experience things differently and, thereby, to help us enjoy a work of art. The aim of computational film analysis is to construct a critical narrative of the form and style of a motion picture. It does so by combining the methods and tools of statistics, data science, information visualization, and computer science with the analytical approaches and domain knowledge of the study of film to make empirical statements about the art of cinema.

Computational film analysis frames questions about form and style in the cinema, which are then put into operation in research, by defining the object(s) of analysis, the concepts in which the researcher is interested, the variables that represent the concepts, and how those variables will be measured. This requires the selection of an appropriate set of quantitative methods for data collection, analysis, and reporting, and the selection of a set of computational tools to implement those methods. Applying computational tools to data will result in a set of outputs (including statistical summaries, data visualizations, and models) that must be interpreted in the context of the quantitative methods selected. Interpreting these results will supply answers to the research questions, making it possible to judge the meaning of those results and satisfy the researcher’s curiosity about cinematic form and style. Computational film analysis thus begins and ends with domain knowledge about the cinema. The questions we ask and the answers we construct belong to the study of film, and it is only in this context that those questions and answers have any meaning. Moving through the analytical process, we first see the representation of a film become more abstract as it is turned into data capable of being processed by a computer; then that representation becomes more concrete as we construct a critical narrative of a film’s form and style that brings us back to the artwork itself.

Alan Marsden (2016) writes that computational approaches to aesthetic analysis impose limits on how we can think about a “text” (whether that be a novel, film, piece of music, or any other work) because there must be a definite input to the analysis. The analyst must therefore accept a certain “ontological rigidity” of a film because the text must take the form of binary code with every bit determined prior to analysis. Marsden (2016, 20) argues that, provided the data is recognized as standing for the text at this stage, there is no reason to locate the analysis anywhere other than the film itself and the analysis does not depend on inputs that are extrinsic to the film. However, this does not mean that the results of an analysis are inevitable because, Marsden argues, there is no single definitive representation of an artwork that can be generated from an input. The analytical outputs of computational approaches are more fluid than the inputs because the researcher must make decisions about which aspects of a film are selected to answer their research question and which projections adequately represent those aspects. Consequently, there is not necessarily any reason “to believe that the particular structure in the output of the analytical process has a privileged status among the myriad other possible ways in which the input data could be structured” (Marsden 2016, 20).

What follows is an application of the computational analysis framework to the analysis of sound in a motion picture to show how a film trailer employs sound to promote a film—by creating a sufficiently frightening experience to entice prospective viewers—and communicates marketing information. I focus on the structure of the soundtrack and the presence and form of affective events, employing the spectrogram and the normalized aggregated power envelope as methods of discovering, analyzing, and communicating features of the trailer’s soundtrack to construct a critical narrative of how the soundtrack is organized, how it functions, and how it evolves over time.

Media Processing

Obtaining the Soundtrack

The raw material for analysis is the soundtrack of the trailer for the 2003 Korean horror film Into the Mirror. Film trailers may be available as a DVD extra or can be accessed online. In this case I exported the trailer for Into the Mirror from the PAL DVD (25 fps) as an MP4 file. I loaded the trailer into DaVinci Resolve (v. 16.1.2.026) and rendered the trailer’s audio as a stereo 16-bit linear PCM wave file sampled at 48 kHz.

When using trailers from an online source such as YouTube, it will be necessary to remove the MPAA rating tag screen from the beginning of the trailer (if present) before exporting the audio. Some trailer channels on YouTube add promotional material to the end of a trailer and this will also need to be removed.

Stereo Files to Mono

Figure 1 plots the waveforms of both channels of the soundtrack, displaying the change in amplitude over time. It is, however, hard to identify features beyond the amplitude of the wave from this type of plot. It will therefore be useful to apply methods capable of visualizing more complex information to understand the structure. These methods, described in the next section, require a mono input, and so once the soundtrack has been loaded into R we need to convert it to mono by averaging the left and right channels of the stereo wave file (see the sketch after figure 1). The practicalities of using stereo and mono files to analyze sound design in the cinema are discussed below.

Figure 1. The waveform of the soundtrack of the trailer for Into the Mirror (2003).
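In R, these steps take only a few lines using tuneR. The following is a minimal sketch, assuming the rendered audio has been saved as into_the_mirror.wav (the filename is a placeholder, not one given in the article):

    library(tuneR)

    # Load the stereo 16-bit linear PCM wave file rendered from the trailer
    trailer <- readWave("into_the_mirror.wav")

    # Plot the waveforms of the left and right channels (cf. fig. 1)
    plot(trailer)

    # Convert to mono by averaging the left and right channels
    trailer_mono <- mono(trailer, which = "both")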

Data Processing

This section describes the methods employed to visualize the soundtrack of Into the Mirror’s trailer that will be used in the formal analysis in the next section.

The Short-Time Fourier Transform

The first stage applies the short-time Fourier transform (STFT) to the mono version of the soundtrack to produce a 2D time-frequency representation of the signal called a spectrogram.

Any audio signal, no matter how complex, can be represented as the sum of a series of sinusoids. A mathematical function called a Fourier transform is used to break down a signal into its component sinusoids and produce a frequency domain representation of a signal called a spectrum. This spectrum shows the amplitudes of a signal at discrete frequency values between 0 Hz and the Nyquist frequency, which is equal to half the sampling rate of the signal. For detailed discussions of Fourier analysis of audio signals, see Goodwin (2008) and Müller (2015).

Figure 2.A displays segments of two signals, each of one second in duration, with frequency components at 329.63 Hz and 493.88 Hz corresponding to the notes E4 and B4, respectively. The waveform of the left-hand signal shows it is a superposition of both frequencies, with both notes present throughout the duration of the signal. The right-hand signal has the same frequency components, but they are not present at the same time: this waveform shifts frequency from E4 to B4 after 0.5 seconds. Figure 2.B shows the spectra of these waves with amplitude peaks at their component frequencies. From this example it is evident that a Fourier transform cannot distinguish between stationary and non-stationary signals; that is, it cannot see the difference between a stationary signal, in which the statistical properties of the signal are consistent over time (the left-hand wave in fig. 2.A), and a non-stationary signal, with components at the same frequencies but whose statistical properties vary with time (the right-hand wave in fig. 2.A).

The STFT overcomes this problem by dividing a signal into a series of windows and calculating the Fourier transform for each window. The result is a Fourier transform of the signal localized in time dependent upon the shape (rectangle, Hann, etc.), size (the number of samples within a window), and overlap of the window used. Movie soundtracks are aperiodic signals and are prone to spectral leakage when applying the Fourier transform, with signal energy smeared across a wide range of frequencies. Multiplying the signal by a window function reduces the effect of spectral leakage by forcing the data within the window to become periodic. There are numerous window functions to choose from. The Hann window is a weighted cosine taper with good frequency resolution when reducing spectral leakage in signals with unknown audio content.
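seewave can generate the window functions it applies during the STFT. As a small illustration (a sketch assuming only that the seewave package is installed), the Hann taper of the length used later in this article can be produced and inspected with ftwindow():

    library(seewave)

    # Generate a Hann (weighted cosine) window of length 2,048 and plot its taper
    w <- ftwindow(wl = 2048, wn = "hanning")
    plot(w, type = "l", xlab = "Sample", ylab = "Weight")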

The spectrogram describes how the magnitudes of the individual frequencies comprising a signal vary over time. Time is displayed on the x axis, frequency on the y axis, and the amplitude of a given frequency band within a time window is represented as color. Figure 2.C presents the spectrograms and shows that the two signals have different temporal structures. The size of the window determines the temporal and frequency resolution of the spectrogram. The temporal resolution is equal to the size of the window divided by the sampling rate and the frequency resolution is the sampling rate divided by the window size. Overlapping the windows produces a smoother spectrogram and a more accurate representation of the continuous evolution of the frequencies of a signal.

Figure 2. (A) Segments of two one-second signals with frequency components at 329.63 Hz (E4) and 493.88 Hz (B4). The left-hand signal is a stationary signal constructed as a superposition of both frequencies. The right-hand signal is non-stationary and changes frequency after 0.5s. (B) The spectra of the signals. (C) The spectrograms of the signals.
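The two test signals can be reconstructed from the description above. Here is a sketch using plain numeric vectors, making no assumptions about how the original figures were generated:

    library(seewave)

    f <- 48000                     # sampling rate (Hz); an assumed value
    t <- seq(0, 1 - 1/f, by = 1/f) # one second of sample times
    e4 <- sin(2 * pi * 329.63 * t) # E4
    b4 <- sin(2 * pi * 493.88 * t) # B4

    stationary <- (e4 + b4) / 2                   # both notes present throughout
    nonstationary <- c(e4[t < 0.5], b4[t >= 0.5]) # E4 for 0.5 s, then B4

    spec(stationary, f = f)        # spectrum (cf. fig. 2.B)
    spectro(nonstationary, f = f)  # spectrogram (cf. fig. 2.C)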

Time and frequency comprise a conjugate pair of variables and as such are subject to an uncertainty principle: it is not possible to localize both time and frequency with absolute precision. Increasing the window size leads to an increase in the frequency resolution of the spectrogram at the expense of temporal resolution, so that we can say with greater precision which frequencies are changing but cannot determine accurately when those frequencies change. Likewise, decreasing the size of the window will increase the temporal resolution at the expense of frequency resolution, with the result that we have more precise knowledge about when the frequencies change even though we cannot say accurately which frequencies are changing. There is inevitably a trade-off to be made between temporal and frequency resolution, and the purpose of the analysis will determine which features of the spectrogram we will want to examine.
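The trade-off can be made concrete with a little arithmetic at a 48 kHz sampling rate (the candidate window sizes below are illustrative):

    sr <- 48000              # sampling rate (Hz)
    wl <- c(512, 2048, 8192) # candidate window sizes (samples)
    data.frame(window = wl,
               time_res_s = wl / sr,  # larger windows give coarser time resolution
               freq_res_hz = sr / wl) # but finer frequency resolution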

Figure 3 shows the spectrogram of the soundtrack of the Into the Mirror trailer. The window size is 2,048 to give a temporal resolution of 2,048/48,000 = 0.0427 seconds per window, and windows are overlapped by 50%. The frequency resolution is 48,000/2,048 = 23.44 Hz per band. Each column in the spectrogram is a Fourier transform of length equal to half the window length (2,048/2 = 1,024).

Figure 3. The spectrogram of the soundtrack of Into the Mirror, calculated with Hann windows of length 2,048 overlapped by 50%.
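A spectrogram like figure 3 can be produced with seewave’s spectro() function using the parameters just described; display options such as the color palette are left at their defaults in this sketch:

    library(seewave)

    # Spectrogram with Hann windows of length 2,048 overlapped by 50%
    spectro(trailer_mono, f = trailer_mono@samp.rate,
            wl = 2048, ovlp = 50, wn = "hanning")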

The Time Contour Plot

The spectrogram in figure 3 is made up of columns of values representing the amplitude of the different frequencies for that time window of the soundtrack. A time contour plot is generated by summing the amplitude values in each column of the time-frequency matrix comprising the spectrogram to produce an aggregate power envelope, which is then normalized to a unit area and treated as an amplitude probability mass function (Cortopassi 2006). The resulting time contour plot of the normalized aggregated power envelope against time shows how the amplitude of a soundtrack evolves over the course of the trailer.

The number of short-time spectra in the spectrogram in figure 3 is equal to the number of samples in the soundtrack divided by the hop length of the window, where the hop length is the window size minus the number of overlapped samples (with 50% overlap, the hop length is 2,048/2 = 1,024). Rounding up, this gives a total number of short-time spectra of 7,119,360/1,024 = 6,953. This is the same number of points at which the normalized aggregated power envelope is calculated, and so the plot of the time contour will be noisy. To clarify the overall structure of the time contour we can fit a loess trendline to the data. Loess regression is a non-parametric method that fits multiple regressions locally rather than globally (Cleveland 1979). The smoothness of the curve is determined by the span of the localized subset of data to which the trendline is fitted and the degree of the polynomials used for each local regression.
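A sketch of the time contour computation as described above. Note that spectro() returns amplitudes in decibels, so they are converted back to linear power before summing; this conversion step and the loess span are my assumptions rather than settings confirmed by the article:

    # STFT with plot = FALSE returns a list with $time, $freq, and $amp (dB)
    stft <- spectro(trailer_mono, f = trailer_mono@samp.rate,
                    wl = 2048, ovlp = 50, wn = "hanning", plot = FALSE)

    # Sum the power in each column and normalize the envelope to unit area
    power <- colSums(10^(stft$amp / 10))
    tc <- power / sum(power)

    # Time contour on a logarithmic y axis (cf. fig. 4)
    plot(stft$time, tc, type = "l", log = "y",
         xlab = "Time (s)", ylab = "Normalized aggregated power")

    # Loess trendline to clarify the overall structure (span is an assumption)
    trend <- loess(tc ~ stft$time, span = 0.1)
    lines(stft$time, predict(trend), col = "red")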

Figure 4.A plots the time contour for the soundtrack of the Into the Mirror trailer. It is easier to see the structure of the time contour if we plot the normalized aggregated power values on a logarithmic axis. Figure 4.B plots the audio event in figure 4.A between 30.1 and 68 seconds on a linear axis and figure 4.C plots the data for the same period on a logarithmic axis. When the time contour is plotted on a logarithmic y axis, exponential features appear as straight lines, indicating nonlinear sound mixing in which the amplitude increases or decays exponentially. In figure 4.C we see that this section of the film includes exponentially increasing power from 30.1 to 52.7 seconds followed by exponentially decreasing power from 57.2 to 68 seconds.

The time contour plot is a direct and economical method of visualizing and analyzing motion picture soundtracks that favors a bottom-up approach to audio analysis, enabling analysts to identify interesting features in a soundtrack. It shows us the temporal structure at the macro scale of a soundtrack, giving us an overall sense of the evolution of sound energy in a film. It allows us to segment the narrative structure of the film by identifying meso-level features associated with specific sequences that are intended to have an emotional effect on the viewer, and to place these in relationship to one another and in their broader communicative context. It draws our attention to micro-level features and the functions they serve at transient moments in a film. The time contour plot therefore allows us to understand the connections between the various features of a soundtrack both horizontally, by highlighting relationships between those features in temporal sequence, and vertically, by emphasizing relationships between features at different scales, to gain both a broad and a deep understanding of the sound design in film. It allows us to differentiate between types of audio events based on the shape of their envelope. Is the attack phase of an event, when the energy steadily increases to its peak level, a step edge or a slope? Is there a period in which the sound level is sustained? Does the decay phase show a steady decrease in sound energy that is slow, with a slope, or rapid, with a step edge? The characteristic shape of an envelope is often associated with specific functions and provides a direct visual connection between the way in which sound is organized and what it does in a film that can be compared within and across films.

Figure 4. (A) The time contour plot of the normalized aggregated power envelope for the soundtrack of the trailer for Into the Mirror (2003). Sections are marked by roman numerals, subsections by capital letters, and local events by lowercase letters. The audio event between 31 and 68 seconds is plotted on (B) linear and (C) logarithmic axes.

Analyzing the Soundtrack of the Trailer for Into the Mirror

Into the Mirror is a supernatural horror film about Woo Young-Min, a former detective investigating a series of inexplicable deaths in a department store where he now works, all of which involve mirrors. The trailer for this film devotes relatively little time to establishing character or narrative and is focused entirely on the creation of a frightening experience for the viewer. The trailer contains only two lines of dialogue and depends entirely on the sound design and music to create the required emotional content for the trailer of a horror film. From figure 4.A we see that the soundtrack of the trailer for Into the Mirror comprises three sections: the first is associated with a neutral emotional state in which nothing has yet happened (0–30.1 seconds; section I in fig. 4.A); the second features the emotional and narrative content of the film (30.1–107.3 seconds; section II), with an overall exponential increase in energy, and is itself made up of three mid-scale segments; and the third presents promotional information to the viewer (107.3–148.3 seconds; section III).

After a peak that accompanies the corporate logo of the distributor (0–5.7 seconds), the trailer focuses on Choi Mi-Jung, an employee of the department store. This first section of the soundtrack is divided into two parts. The first part follows Choi as she moves around the store to the soundtrack of quiet piano music in the background. It features a peak at 15.1 seconds which accompanies two jump cuts fragmenting a continuous shot of Choi walking, labeled as event (a) in figure 4.A. This is a conventional jump scare with step-edge attack and release. After this moment, the soundtrack increases in volume as the piano returns to dominate the soundtrack. The next part of the trailer comprises the feature excerpted in figures 4.B and 4.C. It begins at 30.1 seconds, when the music fades out as Choi stands before her reflection in a mirror and the only sound we can hear is that of a muffled, steady heartbeat. This feature is an example of a non-linear ticking riser in which sound energy increases exponentially over time (i.e., it appears as a straight line when we plot the time contour on a logarithmic axis) and has a rhythmic pulse present throughout. The micro-level event that runs from 36.8 to 38.2 seconds punctuates this sequence when Choi Mi-Jung’s ID badge falls out of frame (event [b]), which draws attention to this visual element of the trailer. Again, both the attack and release have a step-edge evolution. This feature can also be seen in the spectrogram in figure 3 when the amplitude suddenly increases for frequencies across the range of 0 to 8 kHz, indicating that the power of the soundtrack increases across a broad frequency range. This micro-scale feature is synchronized with movement on screen and serves to draw attention to important visual information (the badge falling out of frame) of which the viewer needs to be aware. A key function of sound in this trailer is to draw attention to the visual components in the trailer as aural punctuation. The makers of the trailer are willing to place a micro-level feature within a larger meso-level structure to fulfill that function, even if it means temporarily suspending the overall sound design for the sequence and interrupting the exponential increase in amplitude evident in the time contour plot. After this peak, Choi bends down to pick up her badge but her reflection does not move, and the pace of the heartbeat increases until 57.2 seconds. Then the step-edge attack of what sound like distorted screams is followed by the nonlinear decay of the envelope to 68 seconds, as Choi’s body hits the ground and the title of the movie is revealed for the first time (event [c]). The ordering of narrative information in this first part of the trailer stays quite close to the film, but the sound design is utterly different: none of the elements of the trailer’s soundtrack (the piano, the heartbeat, or the sudden increases in frequency range) are featured in the film Into the Mirror itself.

In Redfern (2020a), I showed that trailers for US horror films devote the first third of their running time to establishing character and narrative before moving on to the creation of a monomaniac version of the film characterized by heightened emotional intensity, which Jensen (2014) describes as the key feature of how trailers seek to attract an audience. In the case of the Into the Mirror trailer this relationship is reversed, with the first part of the trailer creating emotion and ambience before the main character of the film is introduced. Unlike trailers for films like Insidious (2011) or Sinister (2012) that use nonlinear ticking risers to build to an emotional climax toward the end of their running time, the trailer for Into the Mirror places this feature in its first third, albeit to achieve the same objective. The function of this meso-level feature is the same in both contexts, although the placement is quite different.

From 68 seconds the trailer shifts attention to Woo and his investigation. Section II.B includes two attacks on people at the department store. In the first attack, on a man in an elevator, the trailer again uses sound to draw attention to the visual, synchronizing a tone at approximately 2.54 kHz with the movement of a shadowy figure across a CCTV screen at 76.6 seconds. At 85.2 seconds the heartbeat is briefly reintroduced to the soundtrack just before another victim meets their end, which coincides with the peak at 87.6 seconds (event [d]). The attack of this event is slower than that of the earlier death of Choi, building up a sense of dread in the moment, and it has a step-edge release as the trailer shifts to the next scene, when Woo discovers the body. The shape of the envelope for frightening events thus changes as the viewer’s level of knowledge about the film changes, relying less on the shock of the jump scare with a characteristic step-edge attack and more on the viewer’s anticipation of the event associated with a slope attack. The reintroduction of the heartbeat reinforces the expectation that something bad is going to happen based on the viewer’s experience of the trailer so far. The use of the heartbeat in this trailer demonstrates the value of using familiar—even clichéd—sounds in trailers. They provide the viewer with an aural frame of reference for a film they have yet to experience. The power then rises slowly in section II.C, from 91.5 seconds to a peak at 103.8 seconds (event [e]), as the soundtrack links together a montage sequence made up of narratively unconnected scenes from the film.

It is apparent when watching the trailer that the amplitude of the soundtrack changes over time, but it is only when visualizing the soundtrack using the normalized aggregated power envelope that we begin to see the relationships between features that we might otherwise have considered sonically unrelated. The overall increase in power across section II of the trailer is exponential, with the trend over time in section II.C aligning with the trend between 30.1 and 57.2 seconds in section II.A. This indicates that the trailer’s sound design operates at different scales, with sound organizing viewers’ understanding of local events (for example, Choi in the bathroom, the deaths of later victims) while also fitting into the overall structure of section II to create the sense of heightened emotional intensity required to entice a potential audience. This large-scale relationship links features that are separated by over half a minute of the trailer by the intervening section II.B, which exhibits declining energy from 57.2 seconds to 85.4 seconds, and which serves the promotional (i.e., announcing the title of the film) and narrational (i.e., introducing the characters) functions of the trailer. That the overall evolution of power in section II is exponential only becomes apparent when we plot the normalized power envelope on a logarithmic scale. The nonlinear evolution of power over time is not clear from the waveforms in figure 1 or the spectrogram in figure 3; nor is it clear from listening to the soundtrack itself that the amplitude of the soundtrack evolves in this way. The different shapes of the local affective events do not affect the overall evolution, and these events are used to transition between subsections. This can be seen in the use of the slope release to move from section II.A to section II.B as the trailer shifts between its emotional and promotional functions, and in the step-edge attack as the trailer moves from the narrational back to the exponential trajectory of the emotional function.

The third section of the trailer begins at 107.3 seconds after the peak in sound energy and includes the final three features in the soundtrack. These features are associated with the trailer’s promotional functions, presenting the main actors (107.3–118.6s; section III.A) followed by the logo of the production company (121–131.1s; section III.B), the title card, and the website promoting the film (134.2–144.4s; section III.C), with sections separated by moments of silence over black screens. These events are structured and sequenced in such a way as to maintain the viewer’s interest. The penultimate audio event in the trailer increases sound energy slowly as the production company logo appears on screen and then quickly falls away to silence as a mysterious hand reaches out of the blackness of the screen. The final event has a step-edge attack. It uses the dynamic contrast between the silence after the previous peak, which primes the viewer for a scare by creating anticipation, and the step-edge attack that makes the viewer jump and reengage their attention. The key marketing material of the film is then held on screen over the course of the slope decay of the event. After 144.4 seconds the energy of the soundtrack falls away to silence as the trailer runs out to blackness.

Practical Considerations

The standard format for film audio is a stereo mix with a sampling rate of 48 kHz. However, for the purposes of analysis we may wish to change the format of the audio with which we are working. For example, I used a mono wave object sampled at 48 kHz as an input when visualizing the soundtrack as a spectrogram and time contour plot. In this section, I discuss practical considerations for audio analysis relating to the use of mono or stereo tracks, the sampling rate, and the normalization of audio files.

Mono versus Stereo

To calculate the spectrogram and time contour plot I used a mono wave object produced by averaging across the samples of the left and right channels of the stereo soundtrack of the Into the Mirror trailer. Converting a stereo wave object to mono by averaging samples across channels inevitably results in some loss of information, but it is easier to work with a single channel, and there is little purpose in using stereo wave objects if there is little additional information to be gained. Analyzing the spectrogram and time contour plots of two channels of a stereo signal will only serve to double the amount of work for the analyst without adding any new knowledge if those channels are not meaningfully different from one another. What counts as “meaningful” is for the analyst to decide on a case-by-case basis. There are situations in which analyzing the different tracks of a soundtrack is essential. This is especially the case when the different tracks contain directional information that will be lost by mixing and rendering to a mono track, resulting in the loss of a key element of the sound design, for example when surround-sound technologies or binaural recording techniques are used.

The seewave package includes functions that allow us to compare wave objects, which can assist in determining whether the use of a mono wave object will result in an unacceptable loss of information about a soundtrack. First, we get an overall comparison that combines information about the envelopes and spectra. The result is a unitless number with range [0, 1] that is the L1-norm (i.e., sum of the absolute differences between waves, also known as the Manhattan distance) describing the relative distance between the two wave objects. The difference between the wave objects of the left and right channels of the stereo soundtrack is 0.006, indicating that these two channels are not identical but that there is only a small difference between them. This distance measure is the product of the difference between the envelopes and the difference between the spectra of the two wave objects. We can evaluate these features individually. The difference between channels’ spectra is based on the mean spectrum. Figure 5.A plots the envelopes of the left and right channels and figure 5.B plots the mean spectrum of each channel. The relative distance between the envelopes is 0.255 and the relative distance between the spectra is 0.023. We can conclude that there are only minor differences between the envelopes and the mean spectra of the channels, and the plots in figure 5 reinforce this conclusion.

Figure 5. The (A) envelopes and (B) mean spectra of the left and right channels of the stereo soundtrack of the trailer for Into the Mirror (2003) sampled at 48 kHz.
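These comparisons can be sketched with seewave’s diffwave(), diffenv(), and diffspec() functions, which I take to be the measures described above (diffwave() combines the envelope and spectrum differences); channel extraction uses tuneR:

    library(tuneR)
    library(seewave)

    left <- channel(trailer, which = "left")
    right <- channel(trailer, which = "right")

    # Overall relative distance between the two wave objects
    diffwave(left, right)

    # Relative distance between the amplitude envelopes (cf. fig. 5.A)
    diffenv(left, right)

    # Relative distance between the mean spectra (cf. fig. 5.B)
    diffspec(meanspec(left, plot = FALSE), meanspec(right, plot = FALSE))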

Figure 6 plots the cumulative time contours for the mono, stereo left, and stereo right wave objects sampled at 48 kHz. Again, we see there are trivial differences between the left and right channels of the stereo soundtrack. There is also only a minor difference between these channels and the mono version of the soundtrack at the same sampling rate: the mono soundtrack has slightly lower energy than the two channels of the stereo soundtrack between 31 and 57 seconds, where averaging has reduced the overall level because of the slight difference between the left and right channels. This is not a major difference, and there is no reason to believe that any key information about the sound design in this trailer is overlooked when using the mono soundtrack as the basis for analysis.

Figure 6. The cumulative time contours for the mono, stereo left, and stereo right wave files sampled at 48 kHz.
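The cumulative contours in figure 6 can be sketched by accumulating each normalized envelope; here for the mono version, with tc from the earlier time contour sketch (repeating the same steps on the left and right channels and overlaying them with lines() would reproduce the comparison):

    # Cumulative time contour: running sum of the normalized aggregated power
    plot(stft$time, cumsum(tc), type = "l",
         xlab = "Time (s)", ylab = "Cumulative normalized power")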

Sampling rate

The analogue-to-digital conversion of a sound (recorded as voltage or current) to data (stored as binary digits) transforms the continuous-time signal of the sound wave to the discrete-time signal of its digital representation. This process is called sampling, with each sample recording the amplitude of a sound at a point in time. The sampling rate is the number of samples recorded per second (measured in hertz) and determines the range of frequencies captured by a digital audio file. The standard sampling rate of audio for video is 48,000 Hz (48 kHz).

To assess the impact of different sampling rates on the analysis of film audio we can compare the soundtrack to the Into the Mirror trailer at 48 kHz and 22.05 kHz. For simplicity, we will compare mono wave objects, downsampling the 48 kHz mono soundtrack to produce a version with a sampling rate of 22.05 kHz. Figure 7.A is the spectrogram of the wave object sampled at 48 kHz and figure 7.B is the spectrogram of the wave object sampled at 22.05 kHz. Both spectrograms were calculated using Hann windows of size 2,048 overlapped by 50%. Because the Nyquist frequency is equal to half the sampling rate, the range of frequencies represented by the spectrograms is different for different sampling rates. For a signal with a sampling rate of 48 kHz, a spectrogram can represent a frequency range from 0 to 24 kHz; but only a frequency range of 0 to 11.025 kHz can be represented for a signal with a sampling rate of 22.05 kHz. There will inevitably be a loss of information when sampling at the lower rate. However, there is little energy above 11.025 kHz in figure 7.A, and so the loss of information is minimal and would not lead to incorrect analyses of the soundtrack. In fact, from figures 5.B and 7.A we see that there is almost no energy above 16 kHz, and so a sampling rate of 32 kHz may be optimal. The normal range of human hearing is 20 Hz to 20 kHz, but for most adults the effective upper limit is approximately 16 kHz, as our range of hearing decreases with age.

Figure 7. Spectrograms of the soundtrack to the trailer for Into the Mirror (2003) sampled at (A) 48 kHz and (B) 22.05 kHz.
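A sketch of producing the lower-rate version with seewave’s resamp(), which interpolates the signal to a new sampling rate (tuneR’s downsample() is an alternative):

    # Downsample the mono soundtrack from 48 kHz to 22.05 kHz
    trailer_22 <- resamp(trailer_mono, g = 22050, output = "Wave")

    # Spectrogram at the lower rate: frequencies above 11.025 kHz are lost
    spectro(trailer_22, f = 22050, wl = 2048, ovlp = 50, wn = "hanning")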

As the spectrograms for the different versions of the soundtrack have different numbers of short-time spectra (6,953 and 3,194, respectively), we can de-normalize the aggregated power envelopes in each time contour plot by multiplying the value of the time contour by the number of data points, which is equal to the number of short-time spectra, to make a direct comparison between the time contour plots at different sampling rates. This removes the scaling effect of normalizing the envelope. Figure 8 plots the results and shows that there are only minor differences between the time contour plots at different sampling rates. What differences are apparent appear to be in the aggregated power of the envelope and not in its temporal structure. Using the lower sampling rate would not have altered the conclusions arrived at through analysis of the soundtrack. The choice of sampling rate must be considered in the design of any analysis, but 22.05 kHz, 32 kHz, and 48 kHz all yield satisfactory results.

Figure 8. Time contours of the aggregated power envelopes for the soundtrack of the trailer for Into the Mirror (2003) sampled at 48 kHz and 22.05 kHz.
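De-normalizing is a single multiplication per contour. A sketch, assuming tc_48 and tc_22 hold the normalized aggregated power envelopes at the two sampling rates (hypothetical names):

    # Undo the unit-area scaling so the two contours are directly comparable
    denorm_48 <- tc_48 * length(tc_48) # 6,953 short-time spectra at 48 kHz
    denorm_22 <- tc_22 * length(tc_22) # 3,194 short-time spectra at 22.05 kHz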

Normalization

Some of the R functions employed here have default settings that apply peak normalization to an audio file. Peak normalization applies a constant amount of gain to a signal to bring the peak amplitude to a target level, which in the examples above is 0.0 dB. The time contour plot for a soundtrack is calculated as the sum of the energy in each column of a spectrogram. This sum will be identical whether or not peak normalization is used, because normalization will rescale the amplitude without altering the dynamic range of a soundtrack, as the same amount of gain is applied to the entire signal. Normalizing the soundtrack will therefore have no impact on the outcome of the analysis.
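This claim is easy to verify numerically: a constant gain g scales the power summed in every spectrogram column by g², and the factor cancels when the envelope is normalized to unit area. A sketch, reusing the power vector from the earlier time contour computation:

    # Constant gain cancels out after unit-area normalization
    g <- 0.5 # an arbitrary gain
    all.equal(power / sum(power), (g^2 * power) / sum(g^2 * power)) # TRUE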

Conclusion

In this article, I have demonstrated a method for computational analysis of film audio using freely available packages for R and illustrated how they can be used for closely analyzing the soundtrack of the trailer for the Korean horror film Into the Mirror. I have also addressed a range of practical considerations for researchers to consider when designing an analysis and applying these methods. The power of the methods described here is descriptive, analytical, and communicative. It is descriptive in that, using representations of a soundtrack produced by applying statistical methods directly to that soundtrack, we are able to quote directly (and even reproduce) a part of a film, overcoming the “unavailability” of nonliterary arts (see Klevan 2011, 70–71). It is analytical in facilitating a bottom-up approach to film analysis that allows us to identify potentially interesting features of a soundtrack to form the basis of an analysis, and in enabling us to understand structures and relationships between the different parts of a soundtrack that we are not aware of when watching a film (e.g., the use of nonlinear mixing in a soundtrack). Finally, it is communicative in enabling us to share easily understood representations of a complex and continuously evolving element of film style. Grajeda and Beck’s observation that sound studies focuses on a “notoriously elusive object of study” (2008, 110) is easily overcome by the ease and immediacy with which sound can be represented using the methods described here.

Supplementary material

A website containing the code used in the analysis presented here accompanies this article for those who would like to use the methods described here in their own research into film sound. The supporting tutorial is pitched at a level where even if you have not used R before, you should be able to understand enough about the process, software, and methods to begin analyzing audio files. The website (last updated February 15, 2022) can be accessed at https://rpubs.com/nr62_rp33/CFA-into-the-mirror.

References
 
© Japanese Association for Digital Humanities

This article is provided under a Creative Commons Attribution 4.0 International license.
https://creativecommons.org/licenses/by/4.0/deed.ja