[Mesa-dev] Problems with accuracy of coeffs_init + attribs_update

Mon Nov 2 05:46:01 PST 2015

On Thu, Oct 29, 2015 at 9:37 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 29.10.2015 um 19:44 schrieb Oded Gabbay:
>> On Thu, Oct 29, 2015 at 5:31 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> Am 29.10.2015 um 12:27 schrieb Oded Gabbay:
>>>> Hi Roland, Jose
>>>>
>>>> I wanted to bring a problem I found to your attention, and discuss
>>>> about it and ways to solve it.
>>>>
>>>> I'm working on regressions of piglit gpu.py between x86-64 and
>>>> ppc64le, when running with llvmpipe.
>>>>
>>>> One of the regressions manifests itself in 2 tests,
>>>> clip-distance-bulk-copy and clip-distance-itemized-copy
>>> There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which
>>> fails the same (and it's easier to debug...).
>>>
>> Yes, you are correct of course. It was also fixed by my change.
>>
>>>
>>>>
>>>> What happens in ppc64le is that *some* of the interpolated per-vertex
>>>> values (clip-distance values) that constitute the input into the
>>>> fragment shader, are not calculated according to the required accuracy
>>>> of the test (which is an error/distance of less than 1e-6)
>>>>
>>>> It took a while, but I believe I found the reason. In general, when
>>>> running on ppc64le, the code path that is taken at
>>>> lp_build_interp_soa_init() is doing  bld->simple_interp = FALSE, which
>>>> means using coeffs_init() + attribs_update(). That's in contrast with
>>>> x86-64, where bld->simple_interp = TRUE, which means the code will use
>>>> coeffs_init_simple() + attribs_update_simple()
>>> Note this isn't really about x86-64 per se. If you don't have AVX on a
>>> x86-64 box, the other path will be taken too (and the test fail as well).
>>>
>> makes sense.
>>
>>>>
>>>> In specific, the problem lies in how the coeffs are calculated inside
>>>> coeffs_init. To simply put it, they are only *actually* calculated
>>>> correctly for the first quad. The coeffs are later *updated* for the
>>>> other quads, using dadx/dady and dadq.
>>>>
>>>> ------
>>>> side-note:
>>>> I believe you even documented it in coeffs_init():
>>>>       /*
>>>>        * a = a0 + (x * dadx + y * dady)
>>>>        * a0aos is the attrib value at top left corner of stamp
>>>>        */
>>>> ------
>>>>
>>>> The root cause for that, I believe, is because coeffs_init does *not*
>>>> take into account the different x/y offsets for the other quads. In
>>>> contrast, on x86-64, the code uses a function called calc_offset(),
>>>> which does take the different x/y offsets into account.
>>>>
>>>> Please look at this definition (from lp_bld_interp.c) :
>>>>
>>>> static const unsigned char quad_offset_x[16] = {0, 1, 0, 1, 2, 3, 2,
>>>> 3, 0, 1, 0, 1, 2, 3, 2, 3};
>>>> static const unsigned char quad_offset_y[16] = {0, 0, 1, 1, 0, 0, 1,
>>>> 1, 2, 2, 3, 3, 2, 2, 3, 3};
>>>>
>>>> Now, look at how coeffs_init() uses them:
>>>>
>>>>    for (i = 0; i < coeff_bld->type.length; i++) {
>>>>       LLVMValueRef nr = lp_build_const_int32(gallivm, i);
>>>>       LLVMValueRef pixxf = lp_build_const_float(gallivm, quad_offset_x[i]);
>>>>       LLVMValueRef pixyf = lp_build_const_float(gallivm, quad_offset_y[i]);
>>>>       pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, "");
>>>>       pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, "");
>>>>    }
>>>>
>>>> So, in ppc64le, where coeff_bld->type.length == 4, we *always* take
>>>> the first four values from the above arrays. The code *never*
>>>> calculates the coeffs based on the other offsets.
>>>>
>>>> Admittedly, that's *almost* always works and I doubt you will see a
>>>> difference in display. However, in the tests I mentioned above, it
>>>> creates, for some pixels, a bigger error than what is allowed.
>>> Well, I'm not sure what error actually is allowed... GL of course is
>>> vague (the test just seems to be built so the error allowed actually
>>> passes on real hw).
>>>
>>> I don't think there's an actual bug with the code. What it does in
>>> coeffs_init, is simply calculate the dadx + dady values from one pixel
>>> to the next (within a quad).
>>> Then, in attrib_update, it simply uses these values (which are constant
>>> for all quads, as the position of the pixels within a quad doesn't
>>> change) to calculate the actual dadq value based on the actual pixel
>>> offsets, and add that to the (also done in coeffs_init) starting value
>>> of each quad.
>>>
>>> So, the code is correct. However, it still gives different results
>>> compared to the other method, and the reason is just float math. This
>>> method does two interpolations (one for the start of the quad, the other
>>> per pixel), whereas the other method does just one. And this is why the
>>> results are different, because the results just aren't exact.
>>>
>> Yeah, I agree. I also wouldn't say it is a bug per-se, but it is a
>> method of calculation that produces less accurate results then the
>> other method (the simple path).
>>
>>>
>>>>
>>>> The first thing that came into my mind was to change, in
>>>> lp_build_interp_soa_init():
>>>>
>>>> "if (coeff_type.length > 4)"
>>>> to this:
>>>> "if (coeff_type.length >= 4)"
>>>>
>>>> When I did that, I saw it fixes the above tests and didn't cause any
>>>> regressions on ppc64le and/or on x86-64.
>>>>
>>>> However, I wanted to ask you guys if you remember why you allowed only
>>>> vectors with size above 4 to use the "simple" route ?
>>>> Was there any performance hit ?
>>> Some history, initially there actually only was the two-stage path (well
>>> actually it was slightly different as back then the pixel shader wasn't
>>> even executed in a loop but the idea was the same). The one-stage path
>>> was actually done when extending to 256bit vectors (or maybe when
>>> introducing the fs loop, I forgot...) because it was easier to code.
>>> Albeit both paths were fixed to work both with 128bit and 256bit
>>> vectors. And indeed, the reason why one is used over the other is just
>>> performance now (the two-stage approach has a higher initial cost but
>>> lower per-loop cost, so it's faster when the loop is run four times, but
>>> not when it's only run two times).
>>> I was actually just thinking about nuking the two-stage approach
>>> yesterday as it's quite annoying the results are different depending on
>>> the vector size. But IIRC the performance difference was quite
>>> measurable sometimes.
>>>
>> That's what I thought.
>> I'm willing to test the performance on H/W if you can point me to a
>> relevant test.
>> By running glxgears, I can see the FPS drops by about 5 FPS (from 605
>> to 600), when using the simple method.
>> Also, it seemed to me, but I didn't measure it, that running piglit
>> gpu.py was slower by some margin.
> glxgears is a bit _too_ simple...
> Something simple which uses "traditional" mul+add shader (2 interpolated
> attributes, one of them for texture coords) or doing something similarly
> simple might be a good test. gloss demo, or openarena - desktop
> compositors would of course be nice, but I don't think they come with
> benchmark mode...
>
> Roland
>

Hi Roland,
I run a couple of tests on ppc64le machine and my haswell laptop with
LP_NATIVE_VECTOR_WIDTH=128:

- glxspheres, on ppc64le
   - original code:        4.892745317 frames/sec 5.460303857 Mpixels/sec
   - with my change:   4.932083873 frames/sec 5.504205571 Mpixels/sec
   - difference is 0.8% *for* the change

- glxspheres, on x86-64
   - original code:        20.16418809 frames/sec 22.50323395 Mpixels/sec
   - with my change:   20.31328989 frames/sec 22.66963152 Mpixels/sec
   - difference is 0.74% *for* the change

- glmark2, on ppc64le:
  - original code:        score of 58
  - with my change:   score of 57

- glmark2, on x86-64:
  - original code:        score of 175
  - with my change:   score of 167
  - difference of 4.5% *against* the change

- phoronix gui-toolkits (don't know how relevant it is), only on ppc64le
  - very similar results, some tests better with original code, some are worse

Unfortunately, couldn't manage to run OpenArena on my machines with
latest mesa and llvmpipe. Kept getting an error of: "Your system
currently is not capable of hardware accelerated 3D. Therefore
openarena cannot run."

What do you think ?
Are those benchmarks comprehensive enough ?
It seems there is almost no impact on ppc64le, while on x86-64 there
is a minor impact.
If you have something specific I can run, I will be happy to do it.

        Oded

>
>>
>>>>
>>>> I think that to fix coeffs_init()+attribs_update() is to basically
>>>> re-design them. Otherwise, I would have tried to fix them first.
>>>
>>> Well, even the one-stage approach is basically crap. The reason is that
>>> the start value is always based on the top left corner of the screen
>>> (coords 0,0 or actually 0.5,0.5). This means if you have some triangle
>>> far away from the screen origin, but with steep gradients, the errors
>>> will be huge (but there's no piglit tests which would hit this).
>>> Modern hw has abandoned this way of interpolation by using barycentric
>>> interpolation. But it's quite complex to do fast (the hw integrates this
>>> into rasterization, since the barycentric weights are the same as the
>>> edge functions).
>>> Albeit it would be possible to improve this particular problem without
>>> using barycentric interpolation (could for instance use one corner of
>>> the triangle as start value instead of 0,0 but adds complexity as well).
>>> But the coeffs_init/update functions as such are unfixable as it's just
>>> float math precision issues (well you could try using fma instead of
>>> mul+add, use double precision, ...) but the fundamental issue that it's
>>> just less precise doing two-step interpolation instead of one-step remains.
>>>
>>>
>>>>
>>>> In any case, I would love to hear your feedback and to discuss this further.
>>>>
>>>
>>> Do you see any specific problem outside the 3 failing piglit tests?
>>> Because as far as I can tell, the error allowed in the tests was really
>>> just chosen so it works on ordinary hw
>>> (interface-vs-unnamed-to-fs-unnamed actually has higher allowed error
>>> than the other two).
>>> So, I was reluctant to change it just because of these piglits, but this
>>> isn't set in stone...
>>>
>>> Roland
>>>
>> No, I couldn't see any other problem that relates to this, as I
>> understood the error was so small (the results of the interpolation
>> were off by about 5e-7 then x86-64).
>> I also run some OpenGL demo that uses gl_ClipDistance and I didn't see
>> any visual problem.
>>
>> However, I would hate to keep the situation as is, meaning the test
>> passes on x86-64 and fails on ppc64le.
>> If we agree that there is no actual bug in the code, then another
>> possibility is to change the failure criteria. Because we know the
>> maximum deviation between the two methods is 5e-7, we can add 5e-7 (or
>> 0.5e-6) to the criteria of the three above mentioned tests. That would
>> make them pass even with the two-stages approach.
>>
>> What do you think ?
>>
>>       Oded
>>
>