[Pixman] [PATCH 7/7] utils.c: Increase acceptable deviation to 0.0064 in pixel_checker_t

Mon Feb 18 16:13:15 PST 2013

On Tue, 12 Feb 2013 22:17:12 +0100
sandmann at cs.au.dk (Søren Sandmann) wrote:

> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
> 
> >> > Moreover, anyone using r5g6b5 format is most likely either memory or
> >> > performance constrained, so they would not particularly appreciate the
> >> > more accurate, but slower conversion between a8r8g8b8 and r5g6b5.
> >> 
> >> It's not an academic discussion btw. If we add dithering, the difference
> >> between shifting and rounding becomes very obvious. Here are two images,
> >> both containing a gradient rendered three different ways: once onto
> >> r5g6b5 without dithering, once onto a8r8g8b8 without dithering, and once
> >> with dithering onto r5g6b5.
> >> 
> >> In the first image, bitshifting is used:
> >> 
> >>     http://people.freedesktop.org/~sandmann/dither-shift.png
> >> 
> >> In the second, rounding is used:
> >> 
> >>     http://people.freedesktop.org/~sandmann/dither-round.png
> >> 
> >> In the first image, there is an obvious darkening in the dithered
> >> gradient. In the second, the difference is visible, but fairly
> >> subtle. Even the undithered gradient, while ugly in both cases, is
> >> rendered visibly more faithfully with rounding.
> >
> > Are you using http://en.wikipedia.org/wiki/Ordered_dithering ?
> > When adding a threshold map to pixels, the image gets a bit lighter.
> > Wouldn't dropping the lower bits actually compensate this?
> 
> The error in converting from [0,255] to [0,63] through bitshift is not a
> consistent darkening; it is a darkening of dark values and a lightening
> of light values. For example 0xf8 / 255.0 = 0.973, but 0xf8 gets
> converted to 62 which corresponds to 62/63.0 = 0.984.
> 
> Here is a graph of the error from bit shifting:
> 
>     http://people.freedesktop.org/~sandmann/shift-error.png
> 
> and here is the graph for rounding:
> 
>     http://people.freedesktop.org/~sandmann/round-error.png
> 
> If the intervals in question were [0,256] and [0,64], the correct
> conversion would be a division by 4, and so a truncating shift would
> produce a darkening. However, for the intervals [0,255] and [0,63] the
> right conversion is a division by 255/63.0 = 4.0476190476190474, so a
> division by 4 produces a slightly-too-large value which is then
> compensated for by the truncation, producing the error shown in the
> graph above.
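
The numbers above are easy to reproduce. A minimal Python sketch for the
8-bit to 6-bit case follows; the helper names are mine and are not
pixman functions, and the rounding variant is just round-to-nearest of
v8 * 63 / 255 written in integer arithmetic:

```python
def shift_8_to_6(v8):
    # Truncating conversion: drop the two low bits.
    return v8 >> 2

def round_8_to_6(v8):
    # Round-to-nearest of v8 * 63 / 255, i.e. floor(v8 * 63 / 255 + 1/2),
    # computed with integers only.
    return (v8 * 126 + 255) // 510

def error(v8, v6):
    # Signed error on the normalized [0, 1] scale; negative means darker.
    return v6 / 63.0 - v8 / 255.0

shift_err = [error(v, shift_8_to_6(v)) for v in range(256)]
round_err = [error(v, round_8_to_6(v)) for v in range(256)]

print("0xf8 -> %d" % shift_8_to_6(0xf8))
print("shift error range: %+.4f .. %+.4f" % (min(shift_err), max(shift_err)))
print("round error range: %+.4f .. %+.4f" % (min(round_err), max(round_err)))
```

The shift error changes sign across the range (dark values get darker,
light values get lighter), while the rounding error stays within half a
quantization step.
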
> 
> > Basically, is the darkening really a problem specifically with
> > conversion and not with dithering?
> 
> The specific dither algorithm I used was a variant of ordered dither
> using the dither matrix from GdkRGB
> 
>    http://git.gnome.org/browse/gtk+/tree/gdk/gdkrgb.c?id=2.24.15#n968
> 
> which is a 128 x 128 table containing 256 copies of the values from 0 to
> 63 arranged in a blue-noise pattern.

If we care about performance, a 128 x 128 table might not be the best
choice (because of possible L1 cache misses when accessing it). Do
smaller tables result in significantly worse-looking images?

> In order to avoid biasing the values by the dithering itself, I
> subtracted 32 from the dither before shifting it into the lower bits, so
> that it would have a mean value of (close to) 0.
> 
> You are right that for this particular gradient, the combination of
> simply adding the dither without subtracting 32, followed by a bitshift
> produces a better result:
> 
>     http://people.freedesktop.org/~sandmann/dither-add-shift.png
> 
> But this is because this gradient is dark so the bitshift has a
> darkening effect. For a light gradient, adding followed by shift
> produces a lightening effect:
> 
>     http://people.freedesktop.org/~sandmann/dither-add-shift-light.png
> 
> where subtracting by 32 and rounding still produces the right colors:
> 
>     http://people.freedesktop.org/~sandmann/dither-round-light.png
> 
> All of the images were created like this:
> 
> 1. The undithered and dithered gradients were rendered onto a 565 image
>    with either shifting or rounding.
> 
> 2. The 565 image was SRCed to an 8888 surface with either replication or
>    rounding
> 
> 3. An undithered gradient was rendered onto the 8888 surface.
> 
> So the images also include the effect of rounding vs. bit replication
> for upconversion.
> 
> [ Aside about dithering: Theoretically, dithering should be done by
> adding noise uniformly distributed over [-q/2, q/2] where q is the
> quantization step. That is, the really right formula is this:
> 
>     s6 = floor (((s8 / 255.0) + (d/63.0 - 0.5) * (1/63.0)) * 63.0 + 0.5)
> 
> where the dither signal is scaled precisely rather than shifted.
> 
> An approximation of that formula is here:
> 
>    http://people.freedesktop.org/~sandmann/dither-perfect.png
> 
> (only an approximation because it converts to 8 bit before converting to
> 5/6 bits), which can be compared to the rounded version:
> 
>    http://people.freedesktop.org/~sandmann/dither-round.png
> 
> The 'perfect' variant is slightly too light for lighter colors, but
> matches better at the darker end. It may be that to get an exact match,
> a gamma adjustment should be applied to the dither signal. ]
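
To make sure I read the formula in the aside correctly, here is a small
Python sketch of it for a single 8-bit channel value and a single dither
matrix entry in 0..63. The function name and the final clamp are my own
additions (the clamp only matters at the extreme ends of the range):

```python
import math

def dither_8_to_6(s8, d):
    # s8: 8-bit source value (0..255); d: dither matrix entry (0..63).
    # The dither is scaled to [-0.5, 0.5] quantization steps, where one
    # step is 1/63 on the normalized [0, 1] scale, and added before rounding.
    noise = (d / 63.0 - 0.5) * (1 / 63.0)
    s6 = math.floor((s8 / 255.0 + noise) * 63.0 + 0.5)
    # Assumption: clamp, because the extreme dither values can push the
    # sum just outside [0, 63].
    return max(0, min(63, s6))

# Averaged over all dither values, the output should stay close to the
# ideal (unquantized) value s8 * 63 / 255.
values = [dither_8_to_6(100, d) for d in range(64)]
print(sum(values) / 64.0, 100 * 63 / 255.0)
```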

OK, but I'm still not quite sure how this more accurate
r5g6b5 <-> x8r8g8b8 conversion & dithering fits into the 16bpp rendering
pipeline. If many intermediate color depth conversions happen during
processing, the final image quality is not going to be very good
anyway. It is easier to just do all intermediate rendering in
a8r8g8b8/x8r8g8b8 format and perform a dithered conversion of the
final picture to r5g6b5 as the last step (if the target image or
framebuffer uses a 16bpp color format).

But if all the intermediate rendering is done at 16bpp (to save memory
bandwidth), then it may make sense to just apply a few tweaks or hacks
here and there, like the patch from Jeff Muizelaar:

    http://lists.freedesktop.org/archives/pixman/2012-July/002174.html

It would clearly be cutting some corners. But if performance is the
primary goal regardless of the quality reduction (where it does not
bother the users), then this might be a good thing to do. Also, as
gradients are by far the worst offenders contributing to banding
artefacts, directly generating 16bpp dithered gradients and possibly
caching them looks like a useful tweak.

At the risk of stating the obvious again, a more correct but somewhat
slower pipeline might not be what software developers using the pixman
library for 16bpp graphics actually want.

But if the more correct pipeline you propose can really demonstrate
acceptable performance on the hardware where r5g6b5 is still relevant,
then I'm all for it.

> >> In the first image, bitshifting is used:
> >> 
> >>     http://people.freedesktop.org/~sandmann/dither-shift.png
> >> 
> >> In the second, rounding is used:
> >> 
> >>     http://people.freedesktop.org/~sandmann/dither-round.png
> >> 
> > Also we would prefer a lossless r5g6b5 -> r8g8b8 -> r5g6b5
> > conversion round-trip. Replicating the high bits and then
> > dropping them when converting back meets this requirement.
> > Doing correct rounding may be also fine, but this needs
> > to be confirmed.
> 
> Here is a python program that can verify this:
> 
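>     def round_trip (n_bits):
>         # Largest value representable in n_bits, as a float
>         m = (1 << n_bits) - 1.0
>         for i in range (0, (1 << n_bits)):
>             # n_bits -> 8 bits -> n_bits, rounding to nearest each time
>             v8 = int ((i / m) * 255.0 + 0.5)
>             vl = int ((v8 / 255.0) * m + 0.5)
>
>             assert vl == i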
> 
>     for j in range (1, 9):
>         round_trip (j)
> 
> There is also a straightforward argument that a low-bit value will be
> converted to the closest 8 bit value, which in turn will be converted
> back to the closest low-bit value, and that has to be the same as the
> original because the distance between low-bit values is bigger than
> between high-bit values.

Agreed, looks reasonable.
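
For comparison, the replicate-then-drop round trip I mentioned can be
checked the same way. This is only a sketch: `expand` and `truncate` are
hypothetical helper names, and the shift amounts assume 4 <= n_bits <= 8,
which covers the 5-bit and 6-bit channels of r5g6b5:

```python
def expand(v, n_bits):
    # Widen an n_bits value to 8 bits by replicating its high bits
    # into the newly created low bits (valid for 4 <= n_bits <= 8).
    return ((v << (8 - n_bits)) | (v >> (2 * n_bits - 8))) & 0xff

def truncate(v8, n_bits):
    # Narrow an 8-bit value back to n_bits by dropping the low bits.
    return v8 >> (8 - n_bits)

for n_bits in (5, 6):
    for i in range(1 << n_bits):
        assert truncate(expand(i, n_bits), n_bits) == i
```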

> >> The question I'm trying to answer is how much deviation should be
> >> considered acceptable. The answer is unlikely to be: "We got it
> >> precisely right back when the bitexact test suite was added",
> >> especially, as you pointed out, there are places where we could improve
> >> both performance and accuracy. That goes for r5g6b5 too btw. For
> >> over_8888_0565(), this:
> >> 
> >>        s + DIV_63 ((255 - a) * d)
> >> 
> >> would likely be both faster and more accurate than
> >> 
> >>        s + DIV_255 ((255 - a) * ((d << 2) | (d >> 4)))
> >
> > Yes, that's exactly this case and also over_n_8_0565() which are most
> > important. With the NEON code and excessive performance already
> > saturating memory bandwidth in many cases, it is easy to ignore this
> > optimization, but for ARMv6 it may be beneficial.
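
The accuracy part of this claim can be estimated for the 6-bit green
channel with a rough Python sketch. The assumptions are mine: DIV_63 and
DIV_255 are modelled as rounding integer divisions, the 6-bit to 8-bit
expansion is taken to be bit replication, (d << 2) | (d >> 4), and only
the destination term is compared against the exact real-valued result:

```python
def div_round(x, n):
    # Rounding integer division; stands in for DIV_63 / DIV_255 here.
    return (x + n // 2) // n

def dest_term_direct(a, d6):
    # (255 - a) * d / 63, computed directly from the 6-bit value.
    return div_round((255 - a) * d6, 63)

def dest_term_expanded(a, d6):
    # Expand to 8 bits by bit replication first, then divide by 255.
    d8 = (d6 << 2) | (d6 >> 4)
    return div_round((255 - a) * d8, 255)

def max_error(term):
    # Largest deviation from the exact 8-bit value (255 - a) * d6 / 63.
    return max(abs(term(a, d6) - (255 - a) * d6 / 63.0)
               for a in range(256) for d6 in range(64))

print("direct:   %.3f" % max_error(dest_term_direct))
print("expanded: %.3f" % max_error(dest_term_expanded))
```
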
> 
> The rounding conversion from 8 bit to 6 bits can be done like this:
> 
>     (253 * g8 + 512) >> 10
> 
> which on NEON can be done with a multiplication and a rounding shift. In
> the worst case of src_8888_0565, which is a pure conversion, only three
> more instructions would be required, which I doubt would be enough to
> push it over the memory bandwidth limit. I think DSPr2 also has rounding
> shift instructions.
> 
> But the impact on ARMv6 may certainly be more severe. It would be
> interesting to try to quantify that impact.
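
The multiply-and-shift form can be compared against exact
round-to-nearest over all 256 inputs. In this sketch the function names
are mine, and the reference is floor(g8 * 63 / 255 + 1/2) written with
integers to keep floating point out of the comparison:

```python
def round_8_to_6_exact(g8):
    # Round-to-nearest of g8 * 63 / 255 in exact integer arithmetic.
    return (g8 * 126 + 255) // 510

def round_8_to_6_fast(g8):
    # Multiplication followed by a rounding shift, as quoted above.
    return (253 * g8 + 512) >> 10

mismatches = [g8 for g8 in range(256)
              if round_8_to_6_fast(g8) != round_8_to_6_exact(g8)]
print("mismatches:", mismatches)
```
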
> 
> > As for x86, I believe that r5g6b5 format is not in use anymore.
> 
> If phones or tablets with Atom chips start appearing, I suppose that
> might change.

If I understand it correctly, Android is gradually changing to
prefer 32bpp color depth:

    http://www.curious-creature.org/2010/12/04/gingerbread-and-32-bits-windows/
    http://www.curious-creature.org/2010/12/08/bitmap-quality-banding-and-dithering/

Atom chips do not seem to suffer from the very limited memory bandwidth
of the early ARM chips. Modern ARM chips have also improved
significantly, and the Exynos5 in the ARM Chromebook is particularly
good (12.8GB/s of theoretical memory bandwidth, of which ~6GB/s is
really available to the CPU). I expect that r5g6b5 does not have much
time left even in mobile devices and will eventually disappear for real.

> > Sure, that's good to have both new tests and the fixes for PDF
> > operators.
> >
> > Still bit-exact testing may have some more life left in it. I'll try
> > to explain why. Looks like nowadays ARM SoCs tend to have dedicated
> > hardware accelerators for 2D graphics. This includes Exynos4
> > (ODROID-U2), Exynos5 (ARM Chromebook), OMAP5 and also Allwinner A10
> > (Mele A1000 / MK802 / cubieboard). Not to try taking them into use
> > would be a waste.
> >
> > It is debatable where exactly this 2D hardware acceleration is better
> > to be plugged in the end (as a pixman backend, a thin wrapper around
> > pixman, cairo or just X11 DDX driver). However pixman test suite
> > is quite useful for the correctness validation. It just needs to be
> > extended to do random tests similar to blitters-test, but with
> > multiple images, randomly switching between 2D acceleration and CPU
> > rendering and also executing sets of random operations on random
> > triples of images (including mask). This extra test complexity is
> > needed to stress asynchronous completion of operations and cache
> > coherency. But if doing sets of multiple compositing operations,
> > then the precision expectations for each final pixel may be quite
> > difficult to set. If we expect bit-exact results, then everything
> > is simple. The only difficulty is that the results for the rendering
> > via 2D accelerator actually happen to differ from pixman. And
> > because the 2D accelerator hardware can't be really changed and is
> > not very configurable, it is pixman that can be adjusted to still
> > keep bit-exact results in such tests.
> 
> You mean having a CRC32 value for each type of hardware?

Having multiple CRC32 values is surely inconvenient (the separate CRC32
values for 4-bit vs. 7-bit vs. 8-bit bilinear interpolation are an
example of how bad it gets).

But for testing purposes it may be useful to have a special build of
pixman, which disables everything but the generic C implementation. If
this C implementation is tuned to do operations on pixels in exactly
the same way as the hardware accelerator in question, then it can be
used as a reference for bit-exact comparisons.

> Part of my motivation for doing tolerance based tests is that that would
> also be useful for validating the correctness of Render in the X server,
> but making sure that the output doesn't change unexpectedly is also
> useful.

With tolerance based tests, there is still a gray area where we cannot
distinguish a subtle bug from an acceptable precision difference.
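
A toy illustration of that gray area, in Python. It is not how
pixel_checker_t is implemented: the helper names are hypothetical,
pixels are plain lists of 8-bit channel values, and the 0.0064 deviation
is simply taken from the subject line of this patch:

```python
import zlib

def bit_exact(reference, result):
    # Bit-exact check in the style of comparing CRC32 values of pixel data.
    return zlib.crc32(bytes(reference)) == zlib.crc32(bytes(result))

def within_tolerance(reference, result, deviation=0.0064):
    # Tolerance check: every channel, normalized to [0, 1], may differ
    # from the reference by at most `deviation`.
    return all(abs(r / 255.0 - x / 255.0) <= deviation
               for r, x in zip(reference, result))

reference = [0, 100, 200, 255]
off_by_one = [0, 101, 200, 255]   # a rounding difference, or a real bug?

print(bit_exact(reference, off_by_one))         # the CRC32 values differ
print(within_tolerance(reference, off_by_one))  # 1/255 is below 0.0064
```

The bit-exact check flags the off-by-one result while the tolerance
check accepts it, and neither one can tell which of the two
explanations applies.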

Even when checking the X server, we could in some cases identify/guess
the real algorithm used for certain compositing operations and also run
some bit-exact tests instead of just treating it as a black box and
relying on tolerance based tests only.

This is somewhat similar to how DoS attackers can predict the behaviour
of hash algorithms (knowing the fine details of the target system may
provide certain opportunities, for good or for bad):

    http://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/

-- 
Best regards,
Siarhei Siamashka