[Pixman] [PATCH 7/7] utils.c: Increase acceptable deviation to 0.0064 in pixel_checker_t

Tue Feb 5 19:16:24 PST 2013

On Sat, 02 Feb 2013 21:23:04 +0100
sandmann at cs.au.dk (Søren Sandmann) wrote:

> Siarhei Siamashka <siarhei.siamashka at gmail.com> writes:
>
> >> pixel 0x03c0, the true floating point value of the resulting green
> >> channel is:
> >> 
> >>    0xc3 / 255.0 + (1.0 - 0x0f / 255.0) * (0x0f / 63.0) = 0.9887955
> >> 
> >> but when compositing 8 bit values, where the 6-bit green channel is
> >> converted to 8 bit through bit replication, the 8-bit result is:
> >> 
> >>    0xc3 + ((255 - 0x0f) * 0x3c + 127) / 255 = 251
> >> 
> >> which corresponds to a real value of 0.984314. The difference from the
> >> true value is 0.004482 which is bigger than the acceptable deviation
> >> of 0.004. So, if we were to compute all the CONJOINT/DISJOINT
> >> operators in floating point, or otherwise make them more accurate, the
> >> acceptable deviation could be set at 0.0045.
> >> 
> >> If we were doing the 6-bit conversion with rounding:
> >> 
> >>    (x / 63.0 * 255.0 + 0.5)
> >> 
> >> instead of bit replication, the deviation in this particular case
> >> would be only 0.0005, so we may want to consider this at some
> >> point.
> >
> > This has been also discussed here:
> >
> >     http://comments.gmane.org/gmane.comp.graphics.pixman/1891
> >
> > Though the bit replication when converting to 8-bit is not so bad.
> > Dropping lower bits when converting back introduces a bigger error.
> >
> > Anyway, if I remember correctly, the accuracy loss has been well known
> > since the time when bitexact testing was introduced. Other than using
> > less accurate but faster conversion approximations, currently there
> > is also an assumption that separate "fetch -> combine -> store" steps
> > must provide exactly the same results as the fast path functions doing
> > the same operations in one go. This restriction surely inhibits
> > performance and accuracy. Certain platforms (ARM11 and MIPS32) should
> > be able to improve performance a bit if we go away from bitexact
> > correctness testing and allow more freedom in implementations. So this
> > patchset indeed looks rather useful.
> >
> > However I think that we may need to come to an agreement on the primary
> > purpose of the 8-bit pipeline, especially now that we also have a
> > floating point pipeline. In my opinion, the 8-bit integer pipeline
> > should always favour performance over accuracy in the case of doubt.
> 
> I agree that the primary purpose of the 8-bit pipeline is
> performance. If performance didn't matter, we could just use floating
> point for everything. But clearly we can't allow arbitrary deviation
> from the exact computation, so the question has to be how much deviation
> is acceptable.
> 
> > Moreover, anyone using r5g6b5 format is most likely either memory or
> > performance constrained, so they would not particularly appreciate the
> > more accurate, but slower conversion between a8r8g8b8 and r5g6b5.
> 
> It's not an academic discussion btw. If we add dithering, the difference
> between shifting and rounding becomes very obvious. Here are two images,
> both containing a gradient rendered three different ways: once onto
> r5g6b5 without dithering, once onto a8r8g8b8 without dithering, and once
> with dithering onto r5g6b5.
> 
> In the first image, bitshifting is used:
> 
>     http://people.freedesktop.org/~sandmann/dither-shift.png
> 
> In the second, rounding is used:
> 
>     http://people.freedesktop.org/~sandmann/dither-round.png
> 
> In the first image, there is an obvious darkening in the dithered
> gradient. In the second, the difference is visible, but fairly
> subtle. Even the undithered gradient, while ugly in both cases, is
> rendered visibly more faithfully with rounding.

Are you using ordered dithering
(http://en.wikipedia.org/wiki/Ordered_dithering)?
Adding a threshold map to the pixels makes the image a bit lighter.
Wouldn't dropping the lower bits actually compensate for this?

Basically, is the darkening really a problem specifically with
conversion and not with dithering?

Also, we would prefer a lossless r5g6b5 -> r8g8b8 -> r5g6b5
conversion round-trip. Replicating the high bits into the low
bits when expanding, and then dropping the low bits when
converting back, meets this requirement. Correct rounding may
also be fine, but this needs to be confirmed.
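
The round-trip property is small enough to check exhaustively. A
minimal sketch in C follows. The helper names (expand_replicate,
expand_round, truncate_to_bits, count_roundtrip_failures) are made up
for illustration and are not pixman functions, and converting back is
assumed to be a plain shift that drops the low bits:

```c
#include <stdint.h>

/* Expand an n-bit channel (bits = 5 or 6) to 8 bits by bit replication. */
static uint8_t expand_replicate(uint8_t v, int bits)
{
    return (uint8_t)((v << (8 - bits)) | (v >> (2 * bits - 8)));
}

/* Expand with rounding to nearest: round(v * 255 / max), max = 2^bits - 1. */
static uint8_t expand_round(uint8_t v, int bits)
{
    unsigned max = (1u << bits) - 1;
    return (uint8_t)((v * 510u + max) / (2u * max));
}

/* Convert back to n bits by dropping the low bits (assumed here). */
static uint8_t truncate_to_bits(uint8_t v8, int bits)
{
    return (uint8_t)(v8 >> (8 - bits));
}

/* Count the n-bit values that do not survive expand -> truncate. */
static int count_roundtrip_failures(uint8_t (*expand)(uint8_t, int), int bits)
{
    int failures = 0;
    for (unsigned v = 0; v < (1u << bits); v++)
        if (truncate_to_bits(expand((uint8_t)v, bits), bits) != v)
            failures++;
    return failures;
}
```

Calling count_roundtrip_failures() with either expansion function and
with bits set to 5 and 6 covers all channels of r5g6b5.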

> > There are also other libraries and alternative solutions out
> > there. The competition between different mobile browsers and UI
> > toolkits for the embedded systems seems to be heavily focused on
> > performance. Every little bit is relevant.
> 
> Well, we could start doing division by 255 in this way:
> 
>        (a * b + 0xff) >> 8

The code from pixelflinger seemed to use something similar:

    http://lists.freedesktop.org/archives/pixman/2012-September/002276.html

> The error is not that severe, and it would be a little bit faster than
> (t = a * b + 0x80, (t + (t >> 8)) >> 8).
> 
> If we had started out doing divisions in the above way, would we now be
> debating whether the additional shift instruction in the
> 
>        (t + (t >> 8)) >> 8
> 
> formula would be worth the higher precision?
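
For reference, the two division-by-255 variants quoted above can be
compared against a rounded division over all 8-bit inputs. This is
only a sketch. The function and struct names are hypothetical, and
the reference is assumed to be (a * b + 127) / 255, which rounds to
nearest because 255 is odd:

```c
/* Reference: a * b / 255 rounded to nearest (no ties, since 255 is odd). */
static unsigned mul_div255_exact(unsigned a, unsigned b)
{
    return (a * b + 127) / 255;
}

/* Two-shift form: t = a * b + 0x80, (t + (t >> 8)) >> 8. */
static unsigned mul_div255_accurate(unsigned a, unsigned b)
{
    unsigned t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
}

/* Single-shift form: (a * b + 0xff) >> 8. */
static unsigned mul_div255_fast(unsigned a, unsigned b)
{
    return (a * b + 0xff) >> 8;
}

struct div255_stats {
    int mismatches;     /* number of (a, b) pairs that differ from the reference */
    int max_abs_error;  /* largest absolute difference, in 8-bit steps */
};

/* Compare f against the reference for every pair of 8-bit inputs. */
static struct div255_stats compare_with_exact(unsigned (*f)(unsigned, unsigned))
{
    struct div255_stats s = { 0, 0 };
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            int diff = (int)f(a, b) - (int)mul_div255_exact(a, b);
            if (diff < 0)
                diff = -diff;
            if (diff != 0)
                s.mismatches++;
            if (diff > s.max_abs_error)
                s.max_abs_error = diff;
        }
    }
    return s;
}
```

Passing mul_div255_accurate or mul_div255_fast to compare_with_exact()
shows how often, and by how much, each variant departs from the
rounded result.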

But again, if we are generally interested in a lossless conversion
round-trip (this time for unmultiplying/premultiplying alpha), then
the current pixman code implies unmultiplication as:

    x = (255 * x_premultiplied + (a / 2)) / a

And this resulting "x" value can be quite conveniently losslessly
premultiplied again:

    x_premultiplied = (x * a + 127) / 255

If we have less precision in the calculations, then some off-by-one
errors may show up and make a lossless round-trip much more
challenging with direct computations. And table-based conversion
(involving large tables, not just reciprocals) is going to be slower.
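
The two formulas above can be checked over every valid input. A
minimal sketch in C, with hypothetical function names, assuming
non-zero alpha and a premultiplied value that never exceeds alpha:

```c
/* x = (255 * x_premultiplied + (a / 2)) / a; requires a != 0 and xp <= a. */
static unsigned unpremultiply(unsigned xp, unsigned a)
{
    return (255 * xp + a / 2) / a;
}

/* x_premultiplied = (x * a + 127) / 255 */
static unsigned premultiply(unsigned x, unsigned a)
{
    return (x * a + 127) / 255;
}

/* Count (a, xp) pairs that do not survive unpremultiply -> premultiply. */
static int count_alpha_roundtrip_failures(void)
{
    int failures = 0;
    for (unsigned a = 1; a < 256; a++)
        for (unsigned xp = 0; xp <= a; xp++)
            if (premultiply(unpremultiply(xp, a), a) != xp)
                failures++;
    return failures;
}
```

Swapping a cheaper multiply into premultiply() and re-running the
count is one way to see whether a given approximation breaks the
round-trip.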

> The question I'm trying to answer is how much deviation should be
> considered acceptable. The answer is unlikely to be: "We got it
> precisely right back when the bitexact test suite was added",
> especially, as you pointed out, there are places where we could improve
> both performance and accuracy. That goes for r5g6b5 too btw. For
> over_8888_0565(), this:
> 
>        s + DIV_63 ((255 - a) * d)
> 
> would likely be both faster and more accurate than
> 
>        s + DIV_255 ((255 - a) * ((d << 2) | (d >> 4)))

Yes, exactly this case and also over_n_8_0565() are the most
important ones. The NEON code is already fast enough to saturate
memory bandwidth in many cases, so it is easy to ignore this
optimization there, but for ARMv6 it may be beneficial.
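
To make the accuracy side concrete, here is a sketch of the green
channel for both variants, using the pixel values from the example at
the top of the thread (s = 0xc3, a = 0x0f, d = 0x0f). The helpers
div_255() and div_63() are hypothetical stand-ins for DIV_255 and
DIV_63 and are assumed to round to nearest. None of these names come
from pixman:

```c
/* Hypothetical rounded divisions (255 and 63 are odd, so there are no ties). */
static unsigned div_255(unsigned x) { return (x + 127) / 255; }
static unsigned div_63(unsigned x)  { return (x + 31) / 63; }

/* s: 8-bit source green, a: 8-bit source alpha, d: 6-bit destination green. */
static unsigned over_green_replicate(unsigned s, unsigned a, unsigned d)
{
    unsigned d8 = (d << 2) | (d >> 4);   /* 6 -> 8 bits by bit replication */
    return s + div_255((255 - a) * d8);
}

static unsigned over_green_div63(unsigned s, unsigned a, unsigned d)
{
    return s + div_63((255 - a) * d);    /* scale the 6-bit value directly */
}

/* Floating point result on the 0..255 scale. */
static double over_green_exact(unsigned s, unsigned a, unsigned d)
{
    return s + (255.0 - a) * d / 63.0;
}

/* Worst deviation from the floating point result, as a fraction of 1.0.
 * s is added exactly in both variants, so it is held at 0 here. */
static double max_abs_error(unsigned (*over)(unsigned, unsigned, unsigned))
{
    double worst = 0.0;
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned d = 0; d < 64; d++) {
            double err = over(0, a, d) - over_green_exact(0, a, d);
            if (err < 0.0)
                err = -err;
            if (err > worst)
                worst = err;
        }
    }
    return worst / 255.0;
}
```

This only looks at accuracy; whether the DIV_63 form is also faster
on a given CPU would have to be measured separately.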

As for x86, I believe that r5g6b5 format is not in use anymore.

> > And while we are talking about this, bilinear interpolation precision
> > is also somewhat related here (the choice of 7-bit vs. 4-bit) and
> > whether we can avoid doing correct rounding for it or not.
> 
> To use the tolerance based tests as implemented by do_composite(), I
> think both the reference and the test subject have to use the same
> subsampling precision.
> 
> (Dithered rounding could also be used here, btw.)
> 
> > On the other hand, the floating point pipeline is a good place to
> > implement sRGB, accurate format conversions and the other nice things.
> > In other words, it can favour accuracy over performance.
> 
> In my view, the floating point pipeline should eventually implement
> everything with high accuracy so that it can be used both as a reference
> for a tolerance based test suite, and as a fallback for operations that
> don't have fast paths. I have a start on that here:
> 
>     http://cgit.freedesktop.org/~sandmann/pixman/log/?h=float-imp
> 
> Trying to verify that that branch fixes the a2r10g10b10->a8r8g8b8
> precision loss is what prompted this patch set and some upcoming fixes
> for the PDF operators.

Sure, it's good to have both the new tests and the fixes for the PDF
operators.

Still, bit-exact testing may have some more life left in it. I'll try
to explain why. Nowadays ARM SoCs tend to have dedicated hardware
accelerators for 2D graphics. This includes Exynos4 (ODROID-U2),
Exynos5 (ARM Chromebook), OMAP5 and also Allwinner A10
(Mele A1000 / MK802 / cubieboard). Not trying to make use of them
would be a waste.

It is debatable where exactly this 2D hardware acceleration is best
plugged in (as a pixman backend, as a thin wrapper around pixman, in
cairo, or just in the X11 DDX driver). However, the pixman test suite
is quite useful for correctness validation. It just needs to be
extended to do random tests similar to blitters-test, but with
multiple images, randomly switching between 2D acceleration and CPU
rendering, and also executing sets of random operations on random
triples of images (including a mask). This extra test complexity is
needed to stress asynchronous completion of operations and cache
coherency. But when doing sets of multiple compositing operations,
the precision expectations for each final pixel may be quite
difficult to set. If we expect bit-exact results, then everything
is simple. The only difficulty is that the results of rendering
via the 2D accelerator actually happen to differ from pixman. And
because the 2D accelerator hardware can't really be changed and is
not very configurable, it is pixman that can be adjusted to still
keep bit-exact results in such tests.

-- 
Best regards,
Siarhei Siamashka