[pulseaudio-discuss] Resampler quality evaluation results

Sun Aug 24 11:53:18 PDT 2014

I have finished the first stage of my work on resampler quality evaluation.

The scripts are here: https://gitorious.org/psy-eval/psy-eval/
The results are here: https://imgur.com/a/jtIEj

Note: they are valid only for 44100 -> 48000 Hz resampling. But that's 
the common case.

TL;DR summary: it makes sense to change the default resampler quality 
from the current "speex-float-1" value to "speex-float-3", or even to 
"speex-float-5" on capable machines; otherwise the distortion is 
sometimes noticeable. Also, speex-float-{3,5} are similar to what 
proprietary OSes offer.

The work is based on the question: does a human listener notice the 
distortion introduced by a resampler? To answer that, I used a 
psychoacoustical model publicly available at the following URL:

http://www.mp3-tech.org/programmer/docs/6_Heusdens.pdf

I chose this paper because it is short, and because its model is 
simple, newer than the PEAQ monster, needs no special treatment of noise 
vs tones, and provides one number as the answer. I have also already 
used it in dcaenc. From that paper, Eq. (5) is the equation that we 
need. We put in the power of the signal and of the distortion at each 
frequency, and get a single number out. If this number is less than 1, 
the distortion is not audible. If it is greater than 1, the distortion 
is audible. As that number turns out to be a ratio of powers, it can 
also be converted to dB with the usual 10 * log10(D(m,s)) formula.
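
As a small illustration of that conversion and of the threshold at 1 (that is, at 0 dB), here is a minimal sketch. The function names are made up for this example and are not taken from the scripts.

```python
import math


def audibility_db(d):
    """Convert the model output D (a ratio of powers) to decibels."""
    if d <= 0.0:
        # No distortion at all: infinitely far below the threshold.
        return float("-inf")
    return 10.0 * math.log10(d)


def is_audible(d):
    """The distortion is considered audible when D exceeds 1 (0 dB)."""
    return d > 1.0
```

For example, D = 10 maps to 10 dB above the audibility threshold, and D = 0.5 to about -3 dB, which is below it.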

The paper takes the following factors into account:

  * absolute threshold of hearing,
  * perceptual masking of nearby frequencies by a tone,
  * temporal masking.

I have removed the temporal masking from the model by omitting L̂ from 
Eq. (5), because it is not relevant in the resampler-evaluation case, as 
users can play arbitrarily-long tones.

So, given the formula, we need something to feed it as input. The idea is:

  * Generate a test wav file (with wavegen.py).
  * Play it through the resampler.
  * Capture the output as a wav file.
  * Analyze the result (with resampler_plots.py).

To capture the resampler output, two techniques were used.

For PulseAudio resamplers, we can create a null sink, play a wav file 
with paplay and record the result with parecord through its monitor. 
Unfortunately, parecord inserts some garbage at the beginning.

For resamplers built into third-party operating systems, a patched QEMU 
was used. The patch deliberately cripples the emulated HD Audio card, so 
that it accepts only 48 kHz, forcing the guest to resample. The 
resampled output was captured using QEMU_AUDIO_DRV=wav. Some other 
environment variables have to be set so that QEMU itself does not 
resample and to reduce the chance of dropouts in the recording.
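
The PulseAudio capture path described above could be driven roughly as follows. This is only a sketch of one possible command sequence, not the exact invocation I used: the sink name and file names are placeholders, and the resampler under test is assumed to be selected separately (for example through the resample-method setting in daemon.conf). The helper only builds the argument lists, so nothing is executed until you pass them to subprocess yourself.

```python
def pulse_capture_commands(input_wav, output_wav,
                           sink_name="resampler_test", rate=48000):
    """Build argv lists for a null-sink capture (hypothetical helper)."""
    # A null sink running at the target rate, so that the sink input
    # created by paplay has to be resampled to it.
    load_sink = ["pactl", "load-module", "module-null-sink",
                 "sink_name=%s" % sink_name, "rate=%d" % rate]
    # Record from the sink's monitor source at the same rate, so that
    # no second resampling step happens on the recording side.
    record = ["parecord", "--device=%s.monitor" % sink_name,
              "--rate=%d" % rate, "--file-format=wav", output_wav]
    # Play the test file into the null sink.
    play = ["paplay", "--device=%s" % sink_name, input_wav]
    return load_sink, record, play
```

The recorder would be started first (for example with subprocess.Popen), then paplay run to completion, and the recorder stopped afterwards. The garbage at the beginning of the recording still has to be trimmed.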

Patch:

--- qemu/hw/audio/hda-codec.c   2014-07-06 18:46:20.764429441 +0600
+++ qemu/hw/audio/hda-codec.c   2014-08-20 21:58:32.661701409 +0600
@@ -114,7 +114,7 @@

  #define QEMU_HDA_ID_VENDOR  0x1af4
  #define QEMU_HDA_PCM_FORMATS (AC_SUPPCM_BITS_16 |       \
-                              0x1fc /* 16 -> 96 kHz */)
+                              0x040 /* 48 kHz only */)
  #define QEMU_HDA_AMP_NONE    (0)
  #define QEMU_HDA_AMP_STEPS   0x4a



The test signal is a TPDF-dithered 16-bit sine wave with a linearly 
changing frequency. This way, we can know the frequency of the signal 
given only a timestamp. The scripts can detect the frequency/time slope 
automatically and extrapolate it into the area where the resampler 
(rightfully or not) suppresses the signal.
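
To make the test signal concrete, here is a minimal standard-library sketch of such a sweep. It is not the actual wavegen.py code: the function names, the choice of 32766 as the full-scale value (to leave headroom for the dither) and the fixed random seed are assumptions made for this example.

```python
import math
import random
import struct
import wave

FULL_SCALE = 32766  # assumed: leaves 1 LSB of headroom for the dither


def tpdf_chirp(f_start, f_end, duration, rate=44100, seed=0):
    """Return 16-bit samples of a TPDF-dithered linear frequency sweep."""
    rng = random.Random(seed)
    n = int(round(duration * rate))
    slope = (f_end - f_start) / duration  # Hz per second
    samples = []
    for i in range(n):
        t = i / rate
        # The phase is the integral of the instantaneous frequency.
        phase = 2.0 * math.pi * (f_start * t + 0.5 * slope * t * t)
        ideal = FULL_SCALE * math.sin(phase)
        # The difference of two uniform variables has a triangular
        # PDF spanning (-1, 1) LSB.
        dither = rng.random() - rng.random()
        quantized = math.floor(ideal + dither + 0.5)
        samples.append(max(-32768, min(32767, quantized)))
    return samples


def frequency_at(t, f_start, f_end, duration):
    """Instantaneous frequency of the sweep at time t (seconds)."""
    return f_start + (f_end - f_start) * t / duration


def write_wav(path, samples, rate=44100):
    """Write mono 16-bit little-endian samples to a wav file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

Because the frequency is a linear function of time, frequency_at() is all that is needed to map a timestamp in the recording back to the input frequency, once the start offset and the slope are known.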

So, for each portion of the resampled wave, we know the signal 
frequency. Ideally, this frequency component should have the same 
amplitude as the input if it is below half of the new sample rate, and 
zero amplitude otherwise. Also, there should be no other frequency 
components. So, the conclusion is quite obvious: treat the reproduced 
part of that component as the signal, and all others (plus the missing 
part of the main component) as a distortion.
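
The split described above can be sketched as follows for a short segment in which the frequency is treated as constant. This is an illustration under stated assumptions, not the code from resampler_plots.py: the function name is made up, the main component is estimated by correlating with a sine and a cosine at the known frequency, the deviation of the main component from its ideal amplitude is converted to power as an amplitude difference, and the per-frequency powers that the model needs are collapsed into two totals for brevity.

```python
import math


def split_signal_and_distortion(segment, freq, rate, ideal_amplitude):
    """Return (signal_power, distortion_power) for one segment."""
    n = len(segment)
    w = 2.0 * math.pi * freq / rate
    # Single-bin DFT: project the segment onto cos and sin at freq.
    cos_part = 2.0 / n * sum(x * math.cos(w * i)
                             for i, x in enumerate(segment))
    sin_part = 2.0 / n * sum(x * math.sin(w * i)
                             for i, x in enumerate(segment))
    out_amplitude = math.hypot(cos_part, sin_part)
    # Everything that is not the main component is distortion.
    residual_power = sum(
        (x - cos_part * math.cos(w * i) - sin_part * math.sin(w * i)) ** 2
        for i, x in enumerate(segment)) / n
    # Ideal output: unchanged below the new Nyquist frequency, else zero.
    target = ideal_amplitude if freq < rate / 2.0 else 0.0
    reproduced = min(out_amplitude, target)
    # Assumption: the deviation of the main component from the ideal
    # (normally its missing part) counts as a sine of that amplitude.
    deviation = abs(out_amplitude - target)
    signal_power = reproduced ** 2 / 2.0
    distortion_power = residual_power + deviation ** 2 / 2.0
    return signal_power, distortion_power
```

The projection is exact only when the segment spans a whole number of periods; for other lengths a window function would be needed, which is left out here.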

The plots with "Limited bandwidth counts as distortion" written below 
them were made under that definition. They display the audibility of all 
distortions, as defined above, as a function of the input sine wave 
frequency, for a selection of resamplers. The sine wave is assumed to be 
at full amplitude, which corresponds (by a common convention in 
psychoacoustical models) to 92 dB SPL. Note: do not listen at this 
volume. It is harmful. But it is also the worst case for the 
psychoacoustical model.
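
For reference, the level convention above amounts to the following mapping from sine amplitude to sound pressure level. This is a minimal sketch; the constant and function names are made up for this example.

```python
import math

FULL_SCALE_SPL_DB = 92.0  # a full-amplitude sine is taken to be 92 dB SPL


def amplitude_to_spl_db(amplitude, full_scale=1.0):
    """Map a sine amplitude to dB SPL under the 92 dB SPL convention."""
    return FULL_SCALE_SPL_DB + 20.0 * math.log10(amplitude / full_scale)
```

A distortion component 80 dB below full scale (amplitude 0.0001) would thus sit at 12 dB SPL, which is the level the model compares against the threshold of hearing and the masking by the main tone.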

Also, audibility of the distortions inherent in a TPDF-dithered 16-bit 
input is shown as "quantization noise" on the same plots. As you see, 
16-bit input and TPDF dithering do not result in audible distortions.

Unfortunately, there is a bug in the win81 plots: Windows Media Player 
attenuates the file by 6 dB by default, and my scripts compensate for 
that but, in doing so, also amplify the quantization noise. I am too 
lazy to fix 
this today. Please shift the whole win81-wmp curve down by 6 dB, and 
you'll hopefully get an approximately correct result.

As you can see, some resamplers allegedly create audible distortions for 
high-frequency inputs. That's expected: to offer good attenuation of 
unrepresentable frequencies (those above either the old or the new 
Nyquist frequency), they need to somewhat attenuate representable ones. 
This 
attenuation is counted as a distortion, and it indeed can be noticed if 
one is offered a direct comparison of resamplers that put the cut-off 
frequency in different places. All that is needed is a high-frequency 
sine wave that is attenuated, although ideally it shouldn't be 
attenuated. Obviously, nobody listens to such sine waves, so this is an 
artifact of the method.

This artifact is somewhat ignorable for 44100 -> 48000 Hz conversion, as 
it doesn't prevent one from creating a resampler that never introduces 
audible distortions (example: speex-float-5). However, it is expected to 
become a problem if one considers the VoIP use case, with lower sample 
rates, and lower transition frequencies.

As an attempt to work around the problem, I have also plotted audibility 
of the distortion vs input signal frequency without treating this 
attenuation of the main tone as a distortion. Look for "Limited 
bandwidth does not count as distortion" below the plot.

As you can see, under the old problematic definition, the following 
resamplers are indistinguishable from a perfect one (i.e. audibility of 
distortions never goes above 0 dB): speex-float-5, soxr-mq, 
src-sinc-medium-quality, and their better variants from the 
corresponding families.

Under the new definition of distortion, the following resamplers also 
become perfect: soxr-lq, src-sinc-fastest, macosx, wine. And maybe 
win81-wmp if I remeasure it.

It's quite sad that the current default in PulseAudio was influenced by 
the needs of low-power embedded devices at the measurable expense of the 
sound quality on the typical desktop. Now, with plots, figures and 
knowledge in hand, we can fix it.

I'll leave other metrics, different sample rates, and evaluation of 
distortions introduced into typical music and speech for my talk at the 
audio mini conference.

P.S. The following resamplers are not on the plots:

src-zero-order-hold: exactly the same as trivial.
speex-float-4: very very similar to speex-float-3. Not perfect.
speex-float-2: worse than speex-float-1.

Please ignore them.

-- 
Alexander E. Patrakov