[pulseaudio-discuss] Resampler quality evaluation results
Alexander E. Patrakov
patrakov at gmail.com
Sun Aug 24 11:53:18 PDT 2014
I have finished the first stage of my work on resampler quality evaluation.
The scripts are here: https://gitorious.org/psy-eval/psy-eval/
The results are here: https://imgur.com/a/jtIEj
Note: they are valid only for 44100 -> 48000 Hz resampling. But that's
the common case.
TL;DR summary: it makes sense to change the default resampler quality
from the current "speex-float-1" value to "speex-float-3" or even
"speex-float-5" on capable machines, otherwise the distortion is
sometimes noticeable. And, speex-float-{3,5} are similar to what
proprietary OSes offer.
The work is based on the question: does a human listener notice the
distortion introduced by a resampler? To answer that, I used a
psychoacoustical model publicly available at the following URL:
http://www.mp3-tech.org/programmer/docs/6_Heusdens.pdf
The paper was chosen because it is short, and because its model is
simple, newer than the PEAQ monster, does not need special treatment of
noise vs tones, and provides one number as the answer. I have also
already used it in dcaenc. From that paper, Eq. (5) is the equation that we
need. We put the power of signal and distortion at each frequency in,
and get a single number out. If this number is less than 1, the
distortion is not audible. If it is greater than 1, then the distortion
is audible. As that number turns out to be a ratio of powers, it can
also be converted to dB with the usual 10 * log10(D(m,s)) formula.
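
As a small illustration of that conversion, here is a minimal sketch in
Python. The function names are mine and are not taken from the psy-eval
scripts; the only facts used are that D is a ratio of powers and that
D = 1 (0 dB) is the audibility threshold.

```python
import math


def audibility_db(d):
    """Convert the model output D (a ratio of powers) to decibels."""
    if d <= 0:
        raise ValueError("D must be positive")
    return 10.0 * math.log10(d)


def is_audible(d):
    """The distortion is predicted to be audible when D exceeds 1 (0 dB)."""
    return d > 1.0


for d in (0.1, 1.0, 4.0):
    print(f"D = {d}: {audibility_db(d):+.1f} dB, audible: {is_audible(d)}")
```

On the plots, a curve that stays below 0 dB therefore means "never
audible according to the model".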
The paper takes the following factors into account:
* absolute threshold of hearing,
* perceptual masking of nearby frequencies by a tone,
* temporal masking.
I have removed the temporal masking from the model by omitting L̂ from
Eq. (5), because it is not relevant in the resampler-evaluation case, as
users can play arbitrarily-long tones.
So, given the formula, we need to feed something as input. The idea is:
* Generate a test wav file (with wavegen.py).
* Play it through the resampler.
* Capture the output as a wav file.
* Analyze the result (with resampler_plots.py).
To capture the resampler output, two techniques were used.

For PulseAudio resamplers, we can create a null sink, play a wav file
with paplay and record the result with parecord through its monitor.
Unfortunately, parecord inserts some garbage at the beginning.

For resamplers built into third-party operating systems, a patched QEMU
was used. The patch deliberately cripples the emulated HD Audio card, so
that it accepts only 48 kHz, forcing the guest to resample. The
resampled output was captured using QEMU_AUDIO_DRV=wav. Some other
environment variables have to be set so that QEMU itself does not
resample and to reduce the chance of dropouts in the recording.
Patch:
--- qemu/hw/audio/hda-codec.c 2014-07-06 18:46:20.764429441 +0600
+++ qemu/hw/audio/hda-codec.c 2014-08-20 21:58:32.661701409 +0600
@@ -114,7 +114,7 @@
 #define QEMU_HDA_ID_VENDOR 0x1af4
 #define QEMU_HDA_PCM_FORMATS (AC_SUPPCM_BITS_16 | \
-                              0x1fc /* 16 -> 96 kHz */)
+                              0x040 /* 48 kHz only */)
 #define QEMU_HDA_AMP_NONE (0)
 #define QEMU_HDA_AMP_STEPS 0x4a
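
For the PulseAudio half of the capture (null sink, paplay, parecord on
the monitor), the commands could look roughly like the sketch below. It
only builds and prints the command lines and runs nothing. The sink name
and file names are placeholders of mine, and the exact options used for
the published results are not stated here, so treat them as assumptions.
As far as I know, the resampler under test is selected separately, via
resample-method in daemon.conf.

```python
import shlex


def capture_commands(input_wav, output_wav, sink_name="resample_test",
                     rate=48000):
    """Return argv lists for the null-sink capture; nothing is executed."""
    # A null sink running at the target rate, so that the daemon has to
    # resample the 44100 Hz input to 48000 Hz.
    load_sink = ["pactl", "load-module", "module-null-sink",
                 f"sink_name={sink_name}", f"rate={rate}"]
    # Record from the sink's monitor source at the sink rate, so that the
    # recording side does not resample a second time.
    record = ["parecord", f"--device={sink_name}.monitor",
              f"--rate={rate}", "--file-format=wav", output_wav]
    # Play the test file into the null sink.
    play = ["paplay", f"--device={sink_name}", input_wav]
    return [load_sink, record, play]


for argv in capture_commands("sweep-44100.wav", "sweep-48000.wav"):
    print(shlex.join(argv))
```

In practice parecord has to be started before paplay and stopped after
it, and the garbage at the beginning of the recording has to be skipped
before the analysis.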
The test signal is a TPDF-dithered 16-bit sine wave with a linearly
changing frequency. This way, we can know the frequency of the signal
given only a timestamp. The scripts can detect the frequency/time slope
automatically and extrapolate it into the area where the resampler
(rightfully or not) suppresses the signal.
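
To make the test signal concrete, here is a minimal pure-Python sketch of
a TPDF-dithered 16-bit linear sweep. It is not wavegen.py: the function
names, the 0.99 amplitude (chosen to leave room for the dither) and the
fixed seed are my assumptions for illustration.

```python
import math
import random


def tpdf_chirp(f_start, f_end, duration, rate=44100, amplitude=0.99, seed=0):
    """Return (samples, slope) for a TPDF-dithered 16-bit linear sweep."""
    rng = random.Random(seed)
    slope = (f_end - f_start) / duration  # frequency change in Hz per second
    samples = []
    for n in range(int(round(duration * rate))):
        t = n / rate
        # The phase is the integral of the frequency f_start + slope * t.
        phase = 2.0 * math.pi * (f_start * t + 0.5 * slope * t * t)
        value = amplitude * 32767.0 * math.sin(phase)
        # TPDF dither: the difference of two independent uniform values has
        # a triangular distribution with a peak of +/- 1 LSB.
        dither = rng.random() - rng.random()
        samples.append(max(-32768, min(32767, round(value + dither))))
    return samples, slope


def frequency_at(t, f_start, slope):
    """Frequency of the sweep at time t, known from the timestamp alone."""
    return f_start + slope * t


samples, slope = tpdf_chirp(20.0, 20000.0, 0.5)
print(len(samples), slope, frequency_at(0.25, 20.0, slope))
```

Because the frequency is a linear function of time, fitting a straight
line to (time, detected frequency) pairs in the recording is enough to
recover the slope and extend it past the point where the tone disappears.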
So, for each portion of the resampled wave, we know the signal
frequency. Ideally, this frequency component should have the same
amplitude as the input if it is below half of the new sample rate, and
zero amplitude otherwise. Also, there should be no other frequency
components. So, the conclusion is quite obvious: treat the reproduced
part of that component as the signal, and all others (plus the missing
part of the main component) as a distortion.
Under that definition, the plots that say "Limited bandwidth counts as
distortion" below them were made. They display audibility of all
distortions, as defined above, as a function of the input sine wave
frequency, for a selection of resamplers. The sine wave is assumed to be
at the full amplitude, which corresponds (as it is a common convention
in psychoacoustical models) to 92 dB SPL. Note: do not listen at this
volume. It is harmful. But it is also the worst case for the
psychoacoustical model.
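
For reference, the level convention amounts to the following mapping. It
is a sketch with names of my own; the only input from the text above is
that a full-amplitude sine wave is taken to be 92 dB SPL.

```python
import math

FULL_SCALE_SPL_DB = 92.0  # level assigned to a full-amplitude sine wave


def amplitude_to_spl_db(amplitude, full_scale=1.0):
    """Map a sine amplitude to dB SPL under the full-scale convention."""
    # Amplitude ratios use 20 * log10, unlike the power ratios above.
    return FULL_SCALE_SPL_DB + 20.0 * math.log10(amplitude / full_scale)


print(amplitude_to_spl_db(1.0))  # full amplitude
print(amplitude_to_spl_db(0.5))  # half amplitude, about 6 dB lower
```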
Also, audibility of the distortions inherent in a TPDF-dithered 16-bit
input is shown as "quantization noise" on the same plots. As you see,
16-bit input and TPDF dithering do not result in audible distortions.
Unfortunately, there is a bug in the win81 plots, because Windows Media
Player by default attenuates the file by 6 dB, and my scripts compensate
for that, but also amplify the quantization noise. I am too lazy to fix
this today. Please shift the whole win81-wmp curve down by 6 dB, and
you'll hopefully get an approximately correct result.
As you can see, some resamplers allegedly create audible distortions for
high-frequency inputs. That's expected: to offer good attenuation of
unrepresentable frequencies (those above either the old or the new
Nyquist frequency), they need to somewhat attenuate representable ones. This
attenuation is counted as a distortion, and it indeed can be noticed if
one is offered a direct comparison of resamplers that put the cut-off
frequency in different places. All that is needed is a high-frequency
sine wave that is attenuated, although ideally it shouldn't be
attenuated. Obviously, nobody listens to such sine waves, so this is an
artifact of the method.
This artifact is somewhat ignorable for 44100 -> 48000 Hz conversion, as
it doesn't prevent one from creating a resampler that never introduces
audible distortions (example: speex-float-5). However, it is expected to
become a problem if one considers the VoIP use case, with lower sample
rates, and lower transition frequencies.
As an attempt to work around the problem, I have also plotted audibility
of the distortion vs input signal frequency without treating this
attenuation of the main tone as a distortion. Look for "Limited
bandwidth does not count as distortion" below the plot.
As you can see, under the old problematic definition, the following
resamplers are indistinguishable from a perfect one (i.e. audibility of
distortions never goes above 0 dB): speex-float-5, soxr-mq,
src-sinc-medium-quality, and their better variants from the
corresponding families.
Under the new definition of distortion, the following resamplers also
become perfect: soxr-lq, src-sinc-fastest, macosx, wine. And maybe
win81-wmp if I remeasure it.
It's quite sad that the current default in PulseAudio was influenced by
the needs of low-power embedded devices at the measurable expense of the
sound quality on the typical desktop. Now, with plots, figures and
knowledge in hand, we can fix it.
I'll leave other metrics, different sample rates, and evaluation of
distortions introduced into typical music and speech for my talk at the
audio mini conference.
P.S. The following resamplers are not on the plots:
src-zero-order-hold: exactly the same as trivial.
speex-float-4: very very similar to speex-float-3. Not perfect.
speex-float-2: worse than speex-float-1.
Please ignore them.
--
Alexander E. Patrakov