[Mesa-dev] Problems with accuracy of coeffs_init + attribs_update

Thu Oct 29 11:44:28 PDT 2015

On Thu, Oct 29, 2015 at 5:31 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 29.10.2015 um 12:27 schrieb Oded Gabbay:
>> Hi Roland, Jose
>>
>> I wanted to bring a problem I found to your attention, and discuss
>> about it and ways to solve it.
>>
>> I'm working on regressions of piglit gpu.py between x86-64 and
>> ppc64le, when running with llvmpipe.
>>
>> One of the regressions manifests itself in 2 tests,
>> clip-distance-bulk-copy and clip-distance-itemized-copy
> There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which
> fails the same (and it's easier to debug...).
>
Yes, you are correct of course. It was also fixed by my change.

>
>>
>> What happens in ppc64le is that *some* of the interpolated per-vertex
>> values (clip-distance values) that constitute the input into the
>> fragment shader, are not calculated according to the required accuracy
>> of the test (which is an error/distance of less than 1e-6)
>>
>> It took a while, but I believe I found the reason. In general, when
>> running on ppc64le, the code path that is taken at
>> lp_build_interp_soa_init() is doing  bld->simple_interp = FALSE, which
>> means using coeffs_init() + attribs_update(). That's in contrast with
>> x86-64, where bld->simple_interp = TRUE, which means the code will use
>> coeffs_init_simple() + attribs_update_simple()
> Note this isn't really about x86-64 per se. If you don't have AVX on a
> x86-64 box, the other path will be taken too (and the test fail as well).
>
makes sense.

>>
>> In specific, the problem lies in how the coeffs are calculated inside
>> coeffs_init. To simply put it, they are only *actually* calculated
>> correctly for the first quad. The coeffs are later *updated* for the
>> other quads, using dadx/dady and dadq.
>>
>> ------
>> side-note:
>> I believe you even documented it in coeffs_init():
>>       /*
>>        * a = a0 + (x * dadx + y * dady)
>>        * a0aos is the attrib value at top left corner of stamp
>>        */
>> ------
>>
>> The root cause for that, I believe, is because coeffs_init does *not*
>> take into account the different x/y offsets for the other quads. In
>> contrast, on x86-64, the code uses a function called calc_offset(),
>> which does take the different x/y offsets into account.
>>
>> Please look at this definition (from lp_bld_interp.c) :
>>
>> static const unsigned char quad_offset_x[16] = {0, 1, 0, 1, 2, 3, 2,
>> 3, 0, 1, 0, 1, 2, 3, 2, 3};
>> static const unsigned char quad_offset_y[16] = {0, 0, 1, 1, 0, 0, 1,
>> 1, 2, 2, 3, 3, 2, 2, 3, 3};
>>
>> Now, look at how coeffs_init() uses them:
>>
>>    for (i = 0; i < coeff_bld->type.length; i++) {
>>       LLVMValueRef nr = lp_build_const_int32(gallivm, i);
>>       LLVMValueRef pixxf = lp_build_const_float(gallivm, quad_offset_x[i]);
>>       LLVMValueRef pixyf = lp_build_const_float(gallivm, quad_offset_y[i]);
>>       pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, "");
>>       pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, "");
>>    }
>>
>> So, in ppc64le, where coeff_bld->type.length == 4, we *always* take
>> the first four values from the above arrays. The code *never*
>> calculates the coeffs based on the other offsets.
>>
>> Admittedly, that's *almost* always works and I doubt you will see a
>> difference in display. However, in the tests I mentioned above, it
>> creates, for some pixels, a bigger error than what is allowed.
> Well, I'm not sure what error actually is allowed... GL of course is
> vague (the test just seems to be built so the error allowed actually
> passes on real hw).
>
> I don't think there's an actual bug with the code. What it does in
> coeffs_init, is simply calculate the dadx + dady values from one pixel
> to the next (within a quad).
> Then, in attrib_update, it simply uses these values (which are constant
> for all quads, as the position of the pixels within a quad doesn't
> change) to calculate the actual dadq value based on the actual pixel
> offsets, and add that to the (also done in coeffs_init) starting value
> of each quad.
>
> So, the code is correct. However, it still gives different results
> compared to the other method, and the reason is just float math. This
> method does two interpolations (one for the start of the quad, the other
> per pixel), whereas the other method does just one. And this is why the
> results are different, because the results just aren't exact.
>
Yeah, I agree. I also wouldn't say it is a bug per-se, but it is a
method of calculation that produces less accurate results then the
other method (the simple path).

>
>>
>> The first thing that came into my mind was to change, in
>> lp_build_interp_soa_init():
>>
>> "if (coeff_type.length > 4)"
>> to this:
>> "if (coeff_type.length >= 4)"
>>
>> When I did that, I saw it fixes the above tests and didn't cause any
>> regressions on ppc64le and/or on x86-64.
>>
>> However, I wanted to ask you guys if you remember why you allowed only
>> vectors with size above 4 to use the "simple" route ?
>> Was there any performance hit ?
> Some history, initially there actually only was the two-stage path (well
> actually it was slightly different as back then the pixel shader wasn't
> even executed in a loop but the idea was the same). The one-stage path
> was actually done when extending to 256bit vectors (or maybe when
> introducing the fs loop, I forgot...) because it was easier to code.
> Albeit both paths were fixed to work both with 128bit and 256bit
> vectors. And indeed, the reason why one is used over the other is just
> performance now (the two-stage approach has a higher initial cost but
> lower per-loop cost, so it's faster when the loop is run four times, but
> not when it's only run two times).
> I was actually just thinking about nuking the two-stage approach
> yesterday as it's quite annoying the results are different depending on
> the vector size. But IIRC the performance difference was quite
> measurable sometimes.
>
That's what I thought.
I'm willing to test the performance on H/W if you can point me to a
relevant test.
By running glxgears, I can see the FPS drops by about 5 FPS (from 605
to 600), when using the simple method.
Also, it seemed to me, but I didn't measure it, that running piglit
gpu.py was slower by some margin.

>>
>> I think that to fix coeffs_init()+attribs_update() is to basically
>> re-design them. Otherwise, I would have tried to fix them first.
>
> Well, even the one-stage approach is basically crap. The reason is that
> the start value is always based on the top left corner of the screen
> (coords 0,0 or actually 0.5,0.5). This means if you have some triangle
> far away from the screen origin, but with steep gradients, the errors
> will be huge (but there's no piglit tests which would hit this).
> Modern hw has abandoned this way of interpolation by using barycentric
> interpolation. But it's quite complex to do fast (the hw integrates this
> into rasterization, since the barycentric weights are the same as the
> edge functions).
> Albeit it would be possible to improve this particular problem without
> using barycentric interpolation (could for instance use one corner of
> the triangle as start value instead of 0,0 but adds complexity as well).
> But the coeffs_init/update functions as such are unfixable as it's just
> float math precision issues (well you could try using fma instead of
> mul+add, use double precision, ...) but the fundamental issue that it's
> just less precise doing two-step interpolation instead of one-step remains.
>
>
>>
>> In any case, I would love to hear your feedback and to discuss this further.
>>
>
> Do you see any specific problem outside the 3 failing piglit tests?
> Because as far as I can tell, the error allowed in the tests was really
> just chosen so it works on ordinary hw
> (interface-vs-unnamed-to-fs-unnamed actually has higher allowed error
> than the other two).
> So, I was reluctant to change it just because of these piglits, but this
> isn't set in stone...
>
> Roland
>
No, I couldn't see any other problem that relates to this, as I
understood the error was so small (the results of the interpolation
were off by about 5e-7 then x86-64).
I also run some OpenGL demo that uses gl_ClipDistance and I didn't see
any visual problem.

However, I would hate to keep the situation as is, meaning the test
passes on x86-64 and fails on ppc64le.
If we agree that there is no actual bug in the code, then another
possibility is to change the failure criteria. Because we know the
maximum deviation between the two methods is 5e-7, we can add 5e-7 (or
0.5e-6) to the criteria of the three above mentioned tests. That would
make them pass even with the two-stages approach.

What do you think ?

      Oded