Myths and realities related to high-quality digital audio formats

jwatte

On competent recordings of normal program material, with excellent equipment, nobody has shown that they can consistently tell the difference between redbook (regular CD audio, at a 44.1 kHz sampling rate and 16-bit word length) and formats with a higher sampling rate or more bits per sample.

However, you can construct program material or test cases designed to expose the weaknesses of redbook, where you can hear the difference. For example, redbook gives you about 95 dB of signal-to-noise (assuming 3 dB of dithering, which is the minimum you can get away with to mask quantization). If you master the CD and set the playback volume so that the peaks hit +120 dB SPL, that leaves the noise floor at around 25 dB SPL. If you start with the quiet signals and then crescendo to the loud ones, pure tones like a flute will sound quantized, or alternatively noisy, assuming a mostly transparent system. Meanwhile, a 20-bit format gets the magical extra ~24 dB needed to cover 0 to 120 dB, and a 24-bit format gets more still.
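To make that arithmetic concrete, here is a minimal sketch. It assumes the usual textbook figure for an ideal quantizer driven by a full-scale sine (6.02 dB per bit plus 1.76 dB) and simply subtracts the 3 dB dither penalty; the function names are mine and purely illustrative.

```python
def usable_range_db(bits, dither_penalty_db=3.0):
    # Ideal N-bit quantizer SNR for a full-scale sine (6.02*N + 1.76 dB),
    # minus an assumed dither penalty (3 dB, as in the text).
    return 6.02 * bits + 1.76 - dither_penalty_db


def noise_floor_spl(bits, peak_spl_db=120.0):
    # Where the noise floor lands if 0 dBFS is played back at peak_spl_db.
    return peak_spl_db - usable_range_db(bits)


for bits in (16, 20, 24):
    print(f"{bits}-bit: range {usable_range_db(bits):.1f} dB, "
          f"floor {noise_floor_spl(bits):.1f} dB SPL")
```

Under these assumptions 16 bits comes out at roughly 95 dB, which puts the noise floor in the mid-20s dB SPL when peaks sit at 120 dB SPL, and 20 bits buys about 24 dB more.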

Similarly, the Nyquist theorem states that you can re-create any signal that is fully bandlimited to frequency F, given a perfect reconstruction filter and a sampling frequency greater than 2*F. However, there are two catches:

  • the reconstruction filter has to be perfect
  • the signal has to be stationary
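For reference, the "perfect" reconstruction the theorem talks about is sinc interpolation over infinitely many samples. Here is a rough sketch of a truncated version, which is one way a real filter falls short of the ideal. The 1 kHz test tone, the 2000-sample buffer, and the function names are arbitrary choices of mine for illustration.

```python
import math


def sinc(x):
    # Normalized sinc: sin(pi*x) / (pi*x), with sinc(0) = 1.
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def reconstruct(samples, fs, t):
    # Whittaker-Shannon interpolation, truncated to the samples we have.
    # The ideal version sums over infinitely many samples.
    return sum(s * sinc(fs * t - n) for n, s in enumerate(samples))


fs = 44100.0   # redbook sampling rate
f = 1000.0     # test tone, well below fs / 2
samples = [math.sin(2 * math.pi * f * n / fs) for n in range(2000)]

t = 1000.5 / fs  # halfway between two samples, mid-buffer
approx = reconstruct(samples, fs, t)
exact = math.sin(2 * math.pi * f * t)
error = abs(approx - exact)
print(f"reconstructed {approx:.6f}, exact {exact:.6f}, error {error:.2e}")
```

Even for a steady sine the finite sum only approximates the true value between samples; the residual is what is left over from chopping off the tails of the sinc.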

The meaning of the second limitation is often lost on people making these comparisons. Music is not stationary. Depending on what level of transient response/timing you require, the "effective" Nyquist bandwidth may be as much as 30% less than the theoretical limit once you account for non-stationary signals (technically, the bandwidth limitation goes all the way down to the square root of the Nyquist frequency, but I don't pretend that anyone can actually hear those corner cases).

Hence, the formats with more bits per sample do work around a real, very occasionally perceptible problem with redbook audio. Meanwhile, the higher-sampling-rate formats have traditionally worked around problems in filter design and delivery, although there is some theoretical benefit to human hearing from going up to 96 kHz. To cover 0 to 120 dB with 3 dB of dithering, and to cover the entire human hearing range with dynamic program material, I would estimate the actual requirements as something like 20 bits at 64 kHz. However, because of history and whatnot, we've ended up with 24 bits and 96 kHz. That is a fine standard, because it has enough overkill that your hardware can be slightly "off" and you still get a good reconstruction, within the limits of human hearing.
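As a back-of-the-envelope check on the sampling-rate half of that estimate, here is a small sketch that applies the worst-case 30% derating mentioned earlier to a few sampling rates. The 20 kHz hearing limit is my assumption (the usual round number), and the derating is a rough figure, not a measured one.

```python
HEARING_LIMIT_HZ = 20000.0  # assumed upper limit of human hearing


def nyquist_hz(sample_rate_hz):
    # Theoretical bandwidth: half the sampling rate.
    return sample_rate_hz / 2.0


def derated_bandwidth_hz(sample_rate_hz, derating=0.30):
    # "Effective" bandwidth after the worst-case 30% reduction from the text.
    return nyquist_hz(sample_rate_hz) * (1.0 - derating)


for rate in (44100, 64000, 96000):
    bw = derated_bandwidth_hz(rate)
    verdict = "covers" if bw >= HEARING_LIMIT_HZ else "falls short of"
    print(f"{rate} Hz: effective ~{bw:.0f} Hz, {verdict} 20 kHz")
```

With these numbers 44.1 kHz ends up around 15.4 kHz in the worst case, 64 kHz lands a bit above 20 kHz, and 96 kHz has room to spare.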

And, by the way, a music playback system that can actually do 0 to 120 dB (or even 15 to 120 dB, needed to show off 16-bit vs deeper formats) is hard and expensive to build, not to mention that you'll need a very well-insulated room to show it off. But, as I said, it can be done, especially if you start with the soft signals and then build to the loud signals, with a constant volume setting (0 dBFS == +120 dB SPL).