Summary
To improve the compression of audio files, the psychoacoustic nature of the ear has been researched in considerable depth. The ear can only hear acoustic pressure variations in the range of 20Hz to 20kHz, and the threshold of hearing varies with frequency.
As sound enters the cochlea, the ear filters it by frequency and assigns each component to its corresponding critical band. Simultaneous masking occurs when two audible sounds are played at the same time: the more intense sound can mask the other. Exposure to an acoustic stimulus raises the hearing threshold, which can mask the second sound if it is not loud enough. The amount of masking depends upon the frequency and the amplitude of the masking sound.
Temporal masking is where a sound is masked by an auditory signal that is present either directly before or directly after it. Post masking, where a sound is obscured immediately following the masker, is caused primarily by fatigue of the auditory nerve, which raises threshold levels. Research has shown that other factors, such as frequency selectivity and adaptation, also have an effect.
These characteristics can be used in audio engineering. By using computer algorithms to apply masking filters to music and other audio, the frequency components that can be discarded or compressed can be determined. This compression removes data from the audio file, so that more audio fits into the same storage capacity whilst audible quality is maintained.
Introduction
The growth of digital technology has led to a need for greater storage capacity whilst at the same time improving performance and quality. This is especially true in audio engineering. Early digital audio formats were uncompressed and bulky, meaning that only a small number of songs could be fitted onto the storage medium. For example, a CD can only hold approximately 80 minutes of audio. These formats contain the raw data: they are not compressed and do not take advantage of the psychoacoustic properties of the ear. By learning how the ear works, it is possible to write computer algorithms that eliminate inaudible sounds from the audio. This significantly reduces the corresponding file size and forms the basis of MP3, ATRAC and WMA compression.
Aim
To research the psychoacoustic characteristics of the ear and identify its masking properties.
Physiology of the ear
Sound can be defined as the variation of acoustic pressure over time caused by a vibrating body that creates propagating pressure waves in the surrounding air (Pressnitzer and McAdams).
For a sound to be heard, our ears must be able to code its features. This coding is performed by the physiology of the peripheral auditory system. Sounds which are too weak cannot be heard. The sound intensity required for hearing is called the absolute hearing threshold, which varies with frequency and can be seen in Figure 1.
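The frequency dependence of the absolute hearing threshold can be illustrated numerically. The sketch below is not taken from the sources cited here; it assumes a widely used closed-form approximation of the threshold in quiet, commonly attributed to Terhardt, and the function name is hypothetical. It is intended only to show the general shape of the curve (high at low frequencies, lowest at a few kilohertz, rising steeply towards the upper limit of hearing).

```python
import math


def absolute_threshold_db(freq_hz: float) -> float:
    """Approximate threshold in quiet, in dB SPL, for a pure tone.

    Assumption: uses the closed-form approximation commonly attributed
    to Terhardt; freq_hz must be greater than zero.
    """
    f_khz = freq_hz / 1000.0
    return (
        3.64 * f_khz ** -0.8
        - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
        + 1e-3 * f_khz ** 4
    )


if __name__ == "__main__":
    # Print the approximate threshold at a few frequencies in the audible range.
    for freq in (100, 1000, 3300, 10000, 15000):
        print(f"{freq:>6} Hz: {absolute_threshold_db(freq):6.1f} dB SPL")
```

Under this approximation a tone at 100Hz must be considerably more intense than a tone near 3kHz before it is heard, which is why a coder can treat quiet low and very high frequency components as candidates for removal.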
The human ear is sensitive to a range of acoustic pressure frequencies. These pressure variations cause a set of mechanical and electrical responses which ultimately lead to hearing. The audible frequency range is 20Hz to 20kHz. The upper limit decreases with age and most adults cannot hear above 16kHz. In the middle sensitivity range the ear is able to perceive changes of pitch of 2Hz. The ear has an enormous intensity range, with a lower limit of 0dB and an upper limit restricted to the point at which damage to the ear occurs (120dB for short exposure, 80dB for long exposure) (Department of Neurophysiology, University of Wisconsin). An audio file can therefore be compressed by removing the frequencies which cannot be heard.
When a temporal acoustic wave enters the inner ear, the basilar membrane, located in the cochlea, decomposes the signal into frequency bands. These frequency sub bands are called critical bands and each corresponds to a section of the cochlea of around 1.3mm. The bandwidth of these critical bands is approximately 100Hz and is constant up to 500Hz. For each band after 500Hz the bandwidth increases by approximately 20%. This allows human hearing to be modelled as a series of tuned filters that perform frequency selection and analysis on sounds (Zeng 1998). There is, however, still ongoing discussion amongst researchers regarding the definition of the critical bands. A psychoacoustic unit, the bark, was introduced. A bark is defined as the width of one critical band. Figure 2 shows the relationship between the frequency and number of barks (Czyzewski et al., 2003).
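The relationship between frequency and barks can be sketched in code. The example below does not come from the sources cited above; it assumes the analytical approximations commonly attributed to Zwicker and Terhardt for the bark scale and for the critical bandwidth, and both function names are hypothetical. It should be read as an illustration of the trend rather than a definitive model.

```python
import math


def hz_to_bark(freq_hz: float) -> float:
    """Approximate critical-band rate in barks for a frequency in Hz.

    Assumption: Zwicker and Terhardt style arctangent approximation.
    """
    return (
        13.0 * math.atan(0.00076 * freq_hz)
        + 3.5 * math.atan((freq_hz / 7500.0) ** 2)
    )


def critical_bandwidth_hz(freq_hz: float) -> float:
    """Approximate width in Hz of the critical band centred on freq_hz.

    Assumption: Zwicker and Terhardt style power-law approximation.
    """
    f_khz = freq_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


if __name__ == "__main__":
    # Show how the bark value and bandwidth grow with frequency.
    for freq in (100, 500, 1000, 5000, 20000):
        print(
            f"{freq:>6} Hz: {hz_to_bark(freq):5.2f} bark, "
            f"bandwidth about {critical_bandwidth_hz(freq):6.0f} Hz"
        )
```

With these approximations the bandwidth stays close to 100Hz at low frequencies and widens steadily above roughly 500Hz, and the audible range of 20Hz to 20kHz spans roughly 24 to 25 barks.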
Masking
Masking occurs when some components of audible signals cannot be heard due to the influence of another sound. Masking plays an integral part in the human hearing system. It is used not only for audio compression but also for the perceptual reduction of noise. There are two types of masking: masking which occurs simultaneously with another sound, and masking which occurs either immediately before or immediately after the masking sound.
Simultaneous or frequency masking arises when sounds are played at the same time. The frequency selectivity of the ear into critical bands has an influence. When played separately, two sounds at 320Hz and 330Hz can be easily distinguished, but when played together the two sounds cannot be separately perceived. This is because they both excite the same region in the cochlea. Masking occurs because the stimulus of acoustic pressure effectively increases the absolute hearing threshold of the frequencies around it, which can be seen in Figure 3.
The amplitude and the frequency of the sound both have an effect on its masking properties. For higher amplitudes the cone of masking covers a greater frequency range and so has a greater capacity for masking. Lower volume sounds located in the cone, which would be heard on their own, are now masked. The slope of the masking threshold cone is steeper on its low frequency side, meaning that frequencies above the masker are more readily masked than those below it. If the signal has multiple notes and frequencies, a global masking threshold can be calculated as a function of frequency, with only sounds above this threshold being relevant. All sounds under the threshold can be discarded by the coding, which therefore produces a compressed audio file (Haritaoglu, 1996). This type of masking is seen commonly in everyday life. For example, when a car drives by, it makes conversation difficult. This is on a much greater scale than is used in audio engineering but the principle is fundamentally the same.
Temporal masking is masking which occurs when the obscured sound is played before or after the masking sound. This is non-simultaneous masking, occurring in the time domain. Masking after a strong sound is called forward or post masking and can last up to 200ms. Pre masking, or backwards masking, is where a sound is masked by a stronger sound which appears after it. Its duration is relatively short, approximately 20ms. As Figure 4 shows, the ear's response to a sound extends for a short time before and a short time after the sound is heard, which causes the temporal masking. This phenomenon is exploited by audio compression technologies to reduce the number of frequency components that must be stored, and therefore the file size.
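The calculation of a global masking threshold can be sketched as follows. This is a minimal illustration under stated assumptions and is not the method of any particular codec or of the sources cited here: it assumes Schroeder's spreading function to model the cone of masking on the bark scale, the approximations for the bark scale and threshold in quiet used in common psychoacoustic models, a hypothetical fixed offset of 10dB between a masker and the threshold it produces, and a simple maximum to combine maskers. All names are hypothetical.

```python
import math

# Assumption: the masked threshold sits this many dB below the masker level.
MASK_OFFSET_DB = 10.0


def hz_to_bark(freq_hz):
    """Approximate critical-band rate in barks (assumed approximation)."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)


def absolute_threshold_db(freq_hz):
    """Approximate threshold in quiet in dB SPL (assumed approximation)."""
    f_khz = freq_hz / 1000.0
    return 3.64 * f_khz ** -0.8 - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2) + 1e-3 * f_khz ** 4


def spreading_db(delta_bark):
    """Schroeder spreading function: attenuation in dB at a distance of
    delta_bark (probe minus masker) from the masker. It falls off more
    steeply below the masker than above it."""
    x = delta_bark + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)


def global_threshold_db(probe_hz, maskers):
    """Threshold at probe_hz given maskers as (freq_hz, level_db) pairs."""
    candidates = [absolute_threshold_db(probe_hz)]
    for masker_hz, level_db in maskers:
        delta = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
        candidates.append(level_db - MASK_OFFSET_DB + spreading_db(delta))
    return max(candidates)


def is_audible(freq_hz, level_db, maskers):
    """True if the component lies above the global threshold."""
    return level_db > global_threshold_db(freq_hz, maskers)


if __name__ == "__main__":
    maskers = [(1000.0, 70.0)]  # one 70 dB tone at 1 kHz
    print(is_audible(1100.0, 40.0, maskers))  # close to the masker
    print(is_audible(4000.0, 40.0, maskers))  # far from the masker
```

In this sketch a 40dB component at 1100Hz falls inside the cone of the 70dB masker and could be discarded, whereas the same component at 4000Hz lies well outside the cone and would have to be kept.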
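The timing windows described above can be expressed as a simple check. The sketch below uses the approximate durations given in the text (20ms for pre masking and 200ms for post masking); the function name and the treatment of the windows as hard limits are assumptions made for illustration. It only indicates whether a sound falls within a window where temporal masking is possible. Whether the sound is actually masked also depends on the relative levels and frequencies, which are not modelled here.

```python
from typing import Optional

PRE_MASK_MS = 20.0    # approximate pre masking window from the text
POST_MASK_MS = 200.0  # approximate upper limit of post masking from the text


def temporal_masking_type(
    probe_ms: float, masker_start_ms: float, masker_end_ms: float
) -> Optional[str]:
    """Classify where a short probe sound falls relative to a masker.

    Returns "simultaneous", "pre", "post", or None if the probe lies
    outside every masking window.
    """
    if masker_start_ms <= probe_ms <= masker_end_ms:
        return "simultaneous"
    if 0 < masker_start_ms - probe_ms <= PRE_MASK_MS:
        return "pre"
    if 0 < probe_ms - masker_end_ms <= POST_MASK_MS:
        return "post"
    return None


if __name__ == "__main__":
    # Masker lasting from 100 ms to 300 ms; probes at several instants.
    for probe in (70, 90, 200, 400, 501):
        print(probe, temporal_masking_type(probe, 100, 300))
```

The asymmetry between the two windows is visible in the example: a probe 30ms before the masker is unaffected, while a probe 100ms after it can still fall within the post masking window.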
There are many factors which may contribute to forward masking. Experimental studies have found that the responsiveness of the auditory nerve is reduced after exposure to acoustic stimuli. During such stimuli the absolute hearing threshold is increased, which may mask other sounds. This effect lasts between 100ms and 200ms after the exposure to the stimuli, thus producing forward masking (Meddis and O’Mard, 2005). Other factors may influence the masking effect including:
- Peripheral frequency selectivity (Duifhuis 1973). When excited by an acoustic stimulus, oscillations in the hairs in the cochlea remain after the stimulus has passed, decaying exponentially. During this period of continuing oscillation, masking of similar frequencies can occur (Crawford and Fettiplace 1980).
- Adaptation in the auditory nerve (Smith 1979). The auditory nerve fibres respond to bursts of constant sound intensity by producing a maximum firing rate at onset, followed by an adaptation to the sound within 150ms. This adaptation may mask other auditory nerve responses.
Relkin and Turner (1988) suggest that the masking effect is not caused by a reduction in auditory nerve responsiveness alone. In their studies it was found that this does not solely cause an increase in the hearing threshold. They suggested that the increase in the auditory perceptual threshold occurs at a later part in the processing system.
Backwards masking is most commonly explained as a phenomenon that occurs at high levels of the auditory system and is related to short term memory processes. Duifhuis’ theory that the excitation patterns produced in the ear by the two sounds interact can also be applied here. Dolan and Small tested the theory in 1984, with results that supported it.
Conclusion
The psychoacoustic nature of the ear has been identified, and stems primarily from the frequency filtering nature of the cochlea. This filtering enables both simultaneous and temporal masking. An auditory signal can be made inaudible due to the effects of another, more intense sound that occurs before, during or after it. These properties can be researched experimentally to understand the exact characteristics of hearing and to build computer algorithms from these masking models. These algorithms can then be used to compress audio data by eliminating sounds which are inaudible, in order to maximize storage capacity.