[Mesa-dev] Problems with accuracy of coeffs_init + attribs_update

Thu Oct 29 08:31:09 PDT 2015

Am 29.10.2015 um 12:27 schrieb Oded Gabbay:
> Hi Roland, Jose
> 
> I wanted to bring a problem I found to your attention, and discuss
> about it and ways to solve it.
> 
> I'm working on regressions of piglit gpu.py between x86-64 and
> ppc64le, when running with llvmpipe.
> 
> One of the regressions manifests itself in 2 tests,
> clip-distance-bulk-copy and clip-distance-itemized-copy
There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which
fails the same (and it's easier to debug...).

> 
> What happens in ppc64le is that *some* of the interpolated per-vertex
> values (clip-distance values) that constitute the input into the
> fragment shader, are not calculated according to the required accuracy
> of the test (which is an error/distance of less than 1e-6)
> 
> It took a while, but I believe I found the reason. In general, when
> running on ppc64le, the code path that is taken at
> lp_build_interp_soa_init() is doing  bld->simple_interp = FALSE, which
> means using coeffs_init() + attribs_update(). That's in contrast with
> x86-64, where bld->simple_interp = TRUE, which means the code will use
> coeffs_init_simple() + attribs_update_simple()
Note this isn't really about x86-64 per se. If you don't have AVX on a
x86-64 box, the other path will be taken too (and the test fail as well).

> 
> In specific, the problem lies in how the coeffs are calculated inside
> coeffs_init. To simply put it, they are only *actually* calculated
> correctly for the first quad. The coeffs are later *updated* for the
> other quads, using dadx/dady and dadq.
> 
> ------
> side-note:
> I believe you even documented it in coeffs_init():
>       /*
>        * a = a0 + (x * dadx + y * dady)
>        * a0aos is the attrib value at top left corner of stamp
>        */
> ------
> 
> The root cause for that, I believe, is because coeffs_init does *not*
> take into account the different x/y offsets for the other quads. In
> contrast, on x86-64, the code uses a function called calc_offset(),
> which does take the different x/y offsets into account.
> 
> Please look at this definition (from lp_bld_interp.c) :
> 
> static const unsigned char quad_offset_x[16] = {0, 1, 0, 1, 2, 3, 2,
> 3, 0, 1, 0, 1, 2, 3, 2, 3};
> static const unsigned char quad_offset_y[16] = {0, 0, 1, 1, 0, 0, 1,
> 1, 2, 2, 3, 3, 2, 2, 3, 3};
> 
> Now, look at how coeffs_init() uses them:
> 
>    for (i = 0; i < coeff_bld->type.length; i++) {
>       LLVMValueRef nr = lp_build_const_int32(gallivm, i);
>       LLVMValueRef pixxf = lp_build_const_float(gallivm, quad_offset_x[i]);
>       LLVMValueRef pixyf = lp_build_const_float(gallivm, quad_offset_y[i]);
>       pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, "");
>       pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, "");
>    }
> 
> So, in ppc64le, where coeff_bld->type.length == 4, we *always* take
> the first four values from the above arrays. The code *never*
> calculates the coeffs based on the other offsets.
> 
> Admittedly, that's *almost* always works and I doubt you will see a
> difference in display. However, in the tests I mentioned above, it
> creates, for some pixels, a bigger error than what is allowed.
Well, I'm not sure what error actually is allowed... GL of course is
vague (the test just seems to be built so the error allowed actually
passes on real hw).

I don't think there's an actual bug with the code. What it does in
coeffs_init, is simply calculate the dadx + dady values from one pixel
to the next (within a quad).
Then, in attrib_update, it simply uses these values (which are constant
for all quads, as the position of the pixels within a quad doesn't
change) to calculate the actual dadq value based on the actual pixel
offsets, and add that to the (also done in coeffs_init) starting value
of each quad.

So, the code is correct. However, it still gives different results
compared to the other method, and the reason is just float math. This
method does two interpolations (one for the start of the quad, the other
per pixel), whereas the other method does just one. And this is why the
results are different, because the results just aren't exact.

> 
> The first thing that came into my mind was to change, in
> lp_build_interp_soa_init():
> 
> "if (coeff_type.length > 4)"
> to this:
> "if (coeff_type.length >= 4)"
> 
> When I did that, I saw it fixes the above tests and didn't cause any
> regressions on ppc64le and/or on x86-64.
> 
> However, I wanted to ask you guys if you remember why you allowed only
> vectors with size above 4 to use the "simple" route ?
> Was there any performance hit ?
Some history, initially there actually only was the two-stage path (well
actually it was slightly different as back then the pixel shader wasn't
even executed in a loop but the idea was the same). The one-stage path
was actually done when extending to 256bit vectors (or maybe when
introducing the fs loop, I forgot...) because it was easier to code.
Albeit both paths were fixed to work both with 128bit and 256bit
vectors. And indeed, the reason why one is used over the other is just
performance now (the two-stage approach has a higher initial cost but
lower per-loop cost, so it's faster when the loop is run four times, but
not when it's only run two times).
I was actually just thinking about nuking the two-stage approach
yesterday as it's quite annoying the results are different depending on
the vector size. But IIRC the performance difference was quite
measurable sometimes.

> 
> I think that to fix coeffs_init()+attribs_update() is to basically
> re-design them. Otherwise, I would have tried to fix them first.

Well, even the one-stage approach is basically crap. The reason is that
the start value is always based on the top left corner of the screen
(coords 0,0 or actually 0.5,0.5). This means if you have some triangle
far away from the screen origin, but with steep gradients, the errors
will be huge (but there's no piglit tests which would hit this).
Modern hw has abandoned this way of interpolation by using barycentric
interpolation. But it's quite complex to do fast (the hw integrates this
into rasterization, since the barycentric weights are the same as the
edge functions).
Albeit it would be possible to improve this particular problem without
using barycentric interpolation (could for instance use one corner of
the triangle as start value instead of 0,0 but adds complexity as well).
But the coeffs_init/update functions as such are unfixable as it's just
float math precision issues (well you could try using fma instead of
mul+add, use double precision, ...) but the fundamental issue that it's
just less precise doing two-step interpolation instead of one-step remains.

> 
> In any case, I would love to hear your feedback and to discuss this further.
> 

Do you see any specific problem outside the 3 failing piglit tests?
Because as far as I can tell, the error allowed in the tests was really
just chosen so it works on ordinary hw
(interface-vs-unnamed-to-fs-unnamed actually has higher allowed error
than the other two).
So, I was reluctant to change it just because of these piglits, but this
isn't set in stone...

Roland