[Mesa-dev] Problems with accuracy of coeffs_init + attribs_update

Oded Gabbay oded.gabbay at gmail.com
Mon Nov 2 14:22:11 PST 2015


On Mon, Nov 2, 2015 at 7:08 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 02.11.2015 um 14:46 schrieb Oded Gabbay:
>> On Thu, Oct 29, 2015 at 9:37 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> Am 29.10.2015 um 19:44 schrieb Oded Gabbay:
>>>> On Thu, Oct 29, 2015 at 5:31 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>>> Am 29.10.2015 um 12:27 schrieb Oded Gabbay:
>>>>>> Hi Roland, Jose
>>>>>>
>>>>>> I wanted to bring a problem I found to your attention, and to discuss
>>>>>> it and ways to solve it.
>>>>>>
>>>>>> I'm working on regressions of piglit gpu.py between x86-64 and
>>>>>> ppc64le, when running with llvmpipe.
>>>>>>
>>>>>> One of the regressions manifests itself in 2 tests,
>>>>>> clip-distance-bulk-copy and clip-distance-itemized-copy
>>>>> There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which
>>>>> fails the same (and it's easier to debug...).
>>>>>
>>>> Yes, you are correct of course. It was also fixed by my change.
>>>>
>>>>>
>>>>>>
>>>>>> What happens on ppc64le is that *some* of the interpolated per-vertex
>>>>>> values (clip-distance values) that constitute the input to the
>>>>>> fragment shader are not calculated to the accuracy required by the
>>>>>> test (an error/distance of less than 1e-6).
>>>>>>
>>>>>> It took a while, but I believe I found the reason. In general, when
>>>>>> running on ppc64le, the code path taken at
>>>>>> lp_build_interp_soa_init() sets bld->simple_interp = FALSE, which
>>>>>> means using coeffs_init() + attribs_update(). That's in contrast with
>>>>>> x86-64, where bld->simple_interp = TRUE, which means the code will use
>>>>>> coeffs_init_simple() + attribs_update_simple().
>>>>> Note this isn't really about x86-64 per se. If you don't have AVX on an
>>>>> x86-64 box, the other path will be taken too (and the test fails as well).
>>>>>
>>>> makes sense.
>>>>
>>>>>>
>>>>>> Specifically, the problem lies in how the coeffs are calculated inside
>>>>>> coeffs_init(). To put it simply, they are only *actually* calculated
>>>>>> correctly for the first quad. The coeffs are later *updated* for the
>>>>>> other quads, using dadx/dady and dadq.
>>>>>>
>>>>>> ------
>>>>>> side-note:
>>>>>> I believe you even documented it in coeffs_init():
>>>>>>       /*
>>>>>>        * a = a0 + (x * dadx + y * dady)
>>>>>>        * a0aos is the attrib value at top left corner of stamp
>>>>>>        */
>>>>>> ------
>>>>>>
>>>>>> The root cause, I believe, is that coeffs_init() does *not*
>>>>>> take into account the different x/y offsets of the other quads. In
>>>>>> contrast, on x86-64, the code uses a function called calc_offset(),
>>>>>> which does take the different x/y offsets into account.
>>>>>>
>>>>>> Please look at this definition (from lp_bld_interp.c) :
>>>>>>
>>>>>> static const unsigned char quad_offset_x[16] =
>>>>>>    {0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1, 2, 3, 2, 3};
>>>>>> static const unsigned char quad_offset_y[16] =
>>>>>>    {0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3};
>>>>>>
>>>>>> Now, look at how coeffs_init() uses them:
>>>>>>
>>>>>>    for (i = 0; i < coeff_bld->type.length; i++) {
>>>>>>       LLVMValueRef nr = lp_build_const_int32(gallivm, i);
>>>>>>       LLVMValueRef pixxf = lp_build_const_float(gallivm, quad_offset_x[i]);
>>>>>>       LLVMValueRef pixyf = lp_build_const_float(gallivm, quad_offset_y[i]);
>>>>>>       pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, "");
>>>>>>       pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, "");
>>>>>>    }
>>>>>>
>>>>>> So, on ppc64le, where coeff_bld->type.length == 4, we *always* take
>>>>>> the first four values from the above arrays. The code *never*
>>>>>> calculates the coeffs based on the other offsets.
>>>>>>
>>>>>> Admittedly, that *almost* always works and I doubt you will see a
>>>>>> difference in display. However, in the tests I mentioned above, it
>>>>>> creates, for some pixels, a bigger error than what is allowed.
>>>>> Well, I'm not sure what error actually is allowed... GL of course is
>>>>> vague (the test just seems to be built so that the allowed error
>>>>> actually passes on real hw).
>>>>>
>>>>> I don't think there's an actual bug in the code. What it does in
>>>>> coeffs_init is simply calculate the dadx + dady values from one pixel
>>>>> to the next (within a quad).
>>>>> Then, in attrib_update, it simply uses these values (which are constant
>>>>> for all quads, as the position of the pixels within a quad doesn't
>>>>> change) to calculate the actual dadq value based on the actual pixel
>>>>> offsets, and adds that to the starting value of each quad (also
>>>>> computed in coeffs_init).
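>>>>>
>>>>> In scalar form, that is (a minimal C sketch of the idea, with invented
>>>>> coefficient values -- not the actual gallivm IR builders):
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>    /* intra-quad pixel offsets, constant for every quad */
>>>>>    static const float pix_x[4] = {0, 1, 0, 1};
>>>>>    static const float pix_y[4] = {0, 0, 1, 1};
>>>>>    /* invented plane equation: a = a0 + x * dadx + y * dady */
>>>>>    const float a0 = 100.0f, dadx = 0.125f, dady = 0.25f;
>>>>>
>>>>>    for (int quad = 0; quad < 4; quad++) { /* 4 quads of a 4x4 stamp */
>>>>>       float qoffx = (quad & 1) * 2.0f;    /* quad origin in stamp  */
>>>>>       float qoffy = (quad >> 1) * 2.0f;
>>>>>       /* the per-quad step (the dadq part), giving the start value */
>>>>>       float astart = a0 + qoffx * dadx + qoffy * dady;
>>>>>       for (int i = 0; i < 4; i++) {
>>>>>          /* plus the constant intra-quad deltas from coeffs_init */
>>>>>          float a = astart + pix_x[i] * dadx + pix_y[i] * dady;
>>>>>          printf("quad %d pixel %d: a = %f\n", quad, i, a);
>>>>>       }
>>>>>    }
>>>>>    return 0;
>>>>> }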
>>>>>
>>>>> So, the code is correct. However, it still gives different results
>>>>> compared to the other method, and the reason is just float math. This
>>>>> method does two interpolations (one for the start of the quad, the
>>>>> other per pixel), whereas the other method does just one, so the
>>>>> rounding happens at different points and the results aren't bit-exact.
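>>>>>
>>>>> A scalar sketch of how the two orderings can round differently
>>>>> (values invented; the delta is typically a few ULPs and may even be
>>>>> zero for lucky inputs):
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(void)
>>>>> {
>>>>>    const float a0 = 0.3f, dadx = 1.7e-3f, dady = -2.9e-3f;
>>>>>    const float x = 123.5f, y = 77.5f;   /* absolute pixel position   */
>>>>>    const float qx = 122.5f, qy = 76.5f; /* start of the pixel's quad */
>>>>>
>>>>>    /* one-stage: interpolate straight from the start value */
>>>>>    float one = a0 + x * dadx + y * dady;
>>>>>
>>>>>    /* two-stage: first to the quad start, then within the quad;
>>>>>     * same math, but rounded at different intermediate points */
>>>>>    float start = a0 + qx * dadx + qy * dady;
>>>>>    float two = start + (x - qx) * dadx + (y - qy) * dady;
>>>>>
>>>>>    printf("one-stage %.9g, two-stage %.9g, delta %g\n",
>>>>>           one, two, one - two);
>>>>>    return 0;
>>>>> }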
>>>>>
>>>> Yeah, I agree. I also wouldn't say it is a bug per-se, but it is a
>>>> method of calculation that produces less accurate results than the
>>>> other method (the simple path).
>>>>
>>>>>
>>>>>>
>>>>>> The first thing that came to my mind was to change, in
>>>>>> lp_build_interp_soa_init():
>>>>>>
>>>>>> "if (coeff_type.length > 4)"
>>>>>> to this:
>>>>>> "if (coeff_type.length >= 4)"
>>>>>>
>>>>>> When I did that, I saw it fixes the above tests and didn't cause any
>>>>>> regressions on ppc64le and/or on x86-64.
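>>>>>>
>>>>>> i.e. the effect on the path selection would be (this is a paraphrase,
>>>>>> not the verbatim surrounding code):
>>>>>>
>>>>>>    /* previously only >4-wide (256bit) vectors took the simple path */
>>>>>>    if (coeff_type.length >= 4)
>>>>>>       bld->simple_interp = TRUE;
>>>>>>    else
>>>>>>       bld->simple_interp = FALSE;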
>>>>>>
>>>>>> However, I wanted to ask you guys if you remember why you allowed only
>>>>>> vectors with size above 4 to use the "simple" route?
>>>>>> Was there any performance hit?
>>>>> Some history: initially there was actually only the two-stage path (well,
>>>>> actually it was slightly different, as back then the pixel shader wasn't
>>>>> even executed in a loop, but the idea was the same). The one-stage path
>>>>> was actually done when extending to 256bit vectors (or maybe when
>>>>> introducing the fs loop, I forgot...) because it was easier to code.
>>>>> Albeit both paths were fixed to work both with 128bit and 256bit
>>>>> vectors. And indeed, the reason why one is used over the other is just
>>>>> performance now (the two-stage approach has a higher initial cost but
>>>>> lower per-loop cost, so it's faster when the loop is run four times, but
>>>>> not when it's only run two times).
>>>>> I was actually just thinking yesterday about nuking the two-stage
>>>>> approach, as it's quite annoying that the results are different
>>>>> depending on the vector size. But IIRC the performance difference was
>>>>> quite measurable sometimes.
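>>>>>
>>>>> (As a purely illustrative cost model, with invented numbers: say the
>>>>> two-stage path costs 8 units of setup plus 2 per loop iteration, and
>>>>> the one-stage path 2 units of setup plus 4 per iteration. With 128bit
>>>>> vectors the fs loop runs 4 times (16 pixels / 4 lanes): 8 + 4*2 = 16
>>>>> vs. 2 + 4*4 = 18, so two-stage wins; with 256bit vectors it runs only
>>>>> twice: 8 + 2*2 = 12 vs. 2 + 2*4 = 10, and one-stage wins.)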
>>>>>
>>>> That's what I thought.
>>>> I'm willing to test the performance on H/W if you can point me to a
>>>> relevant test.
>>>> Running glxgears, I can see the FPS drop by about 5 FPS (from 605
>>>> to 600) when using the simple method.
>>>> Also, it seemed to me, but I didn't measure it, that running piglit
>>>> gpu.py was slower by some margin.
>>> glxgears is a bit _too_ simple...
>>> Something simple which uses a "traditional" mul+add shader (2 interpolated
>>> attributes, one of them for texture coords), or something similarly
>>> simple, might be a good test. gloss demo, or openarena - desktop
>>> compositors would of course be nice, but I don't think they come with
>>> benchmark mode...
>>>
>>> Roland
>>>
>>
>> Hi Roland,
>> I ran a couple of tests on a ppc64le machine and my Haswell laptop with
>> LP_NATIVE_VECTOR_WIDTH=128:
>>
>> - glxspheres, on ppc64le
>>    - original code:        4.892745317 frames/sec 5.460303857 Mpixels/sec
>>    - with my change:   4.932083873 frames/sec 5.504205571 Mpixels/sec
>>    - difference is 0.8% *for* the change
>>
>> - glxspheres, on x86-64
>>    - original code:        20.16418809 frames/sec 22.50323395 Mpixels/sec
>>    - with my change:   20.31328989 frames/sec 22.66963152 Mpixels/sec
>>    - difference is 0.74% *for* the change
>>
>> - glmark2, on ppc64le:
>>   - original code:        score of 58
>>   - with my change:   score of 57
>>
>> - glmark2, on x86-64:
>>   - original code:        score of 175
>>   - with my change:   score of 167
>>   - difference of 4.5% *against* the change
>>
>> - phoronix gui-toolkits (don't know how relevant it is), only on ppc64le
>>   - very similar results, some tests better with original code, some are worse
>>
>> Unfortunately, I couldn't manage to run OpenArena on my machines with the
>> latest mesa and llvmpipe. I kept getting the error: "Your system
>> currently is not capable of hardware accelerated 3D. Therefore
>> openarena cannot run."
> Sounds like it couldn't load the driver - IIRC it will refuse to fall
> back to indirect rendering. Perhaps using the wrong binary (32bit vs.
> 64bit) without having both 32bit and 64bit drivers loadable?
>
So airlied helped me and I was able to run OpenArena (problem was with
/usr/share/opengl-games-utils/opengl-game-functions.sh)

Here are the results:
ppc64le:
- original code:      3398 frames 1719.0 seconds 2.0 fps 255.0/505.9/2773.0/0.0 ms
- with my change: 3398 frames 1690.4 seconds 2.0 fps 241.0/497.5/2563.0/0.2 ms
- 29 seconds faster with single stage (about 2%)

x86-64:
- original code:      3398 frames 239.6 seconds 14.2 fps 38.0/70.5/719.0/14.6 ms
- with my change: 3398 frames 244.4 seconds 13.9 fps 38.0/71.9/697.0/14.3 ms
- 0.3 fps slower with single stage (about 2%)

Bottom line, nothing too exciting.

> But in any case, IIRC the 4.5% deficit you got with glmark2 is roughly
> what I got when I did some measurements ages ago with some other
> benchmarks. Not a complete dealbreaker but still quite significant. OTOH
> we don't really care all that much about 4-wide vectors on x86-64
> anymore. Interesting, though, that it doesn't really make a difference on
> ppc64 (last I checked, I believe there were actually significant
> differences in register spilling overall with the different interpolation
> methods on x86-64, and stuff like that could change from backend to
> backend as well as with different llvm versions).
>
>>
>> What do you think?
>> Are those benchmarks comprehensive enough?
>> It seems there is almost no impact on ppc64le, while on x86-64 there
>> is a minor impact.
>> If you have something specific I can run, I will be happy to do it.
>
>
> Sounds like it might be worth just always enabling the "simple" path,
> just to get rid of the different rendering results. Albeit at this point
> I'd definitely like to keep the old code around.
>
> Roland

OK, so I'll send a patch.

Thanks,

   Oded
>
>>
>>         Oded
>>
>>>
>>>>
>>>>>>
>>>>>> I think that fixing coeffs_init()+attribs_update() would basically
>>>>>> require re-designing them. Otherwise, I would have tried to fix them
>>>>>> first.
>>>>>
>>>>> Well, even the one-stage approach is basically crap. The reason is that
>>>>> the start value is always based on the top left corner of the screen
>>>>> (coords 0,0 or actually 0.5,0.5). This means if you have some triangle
>>>>> far away from the screen origin, but with steep gradients, the errors
>>>>> will be huge (but there's no piglit tests which would hit this).
>>>>> Modern hw has abandoned this way of interpolation by using barycentric
>>>>> interpolation. But it's quite complex to do fast (the hw integrates this
>>>>> into rasterization, since the barycentric weights are the same as the
>>>>> edge functions).
>>>>> Albeit it would be possible to improve this particular problem without
>>>>> using barycentric interpolation (one could for instance use one corner
>>>>> of the triangle as the start value instead of 0,0, but that adds
>>>>> complexity as well).
>>>>> But the coeffs_init/update functions as such are unfixable, as it's
>>>>> just float math precision issues (well, you could try using fma instead
>>>>> of mul+add, use double precision, ...); the fundamental issue remains
>>>>> that two-step interpolation is just less precise than one-step.
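>>>>>
>>>>> (For the fma idea, a minimal scalar sketch using C99 fmaf() -- the
>>>>> function names are hypothetical, and the jit code would use the
>>>>> corresponding LLVM intrinsic instead:
>>>>>
>>>>>    #include <math.h>
>>>>>
>>>>>    float interp_plain(float a0, float dadx, float dady,
>>>>>                       float x, float y)
>>>>>    {
>>>>>       /* rounds after every multiply and add */
>>>>>       return a0 + x * dadx + y * dady;
>>>>>    }
>>>>>
>>>>>    float interp_fma(float a0, float dadx, float dady,
>>>>>                     float x, float y)
>>>>>    {
>>>>>       /* each fmaf rounds only once, after the fused multiply-add */
>>>>>       return fmaf(y, dady, fmaf(x, dadx, a0));
>>>>>    }
>>>>>
>>>>> That removes some of the intermediate rounding, but the fundamental
>>>>> two-step vs. one-step difference stays.)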
>>>>>
>>>>>
>>>>>>
>>>>>> In any case, I would love to hear your feedback and to discuss this further.
>>>>>>
>>>>>
>>>>> Do you see any specific problem outside the 3 failing piglit tests?
>>>>> Because as far as I can tell, the error allowed in the tests was really
>>>>> just chosen so it works on ordinary hw
>>>>> (interface-vs-unnamed-to-fs-unnamed actually has higher allowed error
>>>>> than the other two).
>>>>> So, I was reluctant to change it just because of these piglits, but this
>>>>> isn't set in stone...
>>>>>
>>>>> Roland
>>>>>
>>>> No, I couldn't see any other problem that relates to this, as the error
>>>> was so small (the results of the interpolation were off by about 5e-7
>>>> compared to x86-64).
>>>> I also ran some OpenGL demo that uses gl_ClipDistance and I didn't see
>>>> any visual problem.
>>>>
>>>> However, I would hate to keep the situation as is, meaning the test
>>>> passes on x86-64 and fails on ppc64le.
>>>> If we agree that there is no actual bug in the code, then another
>>>> possibility is to change the failure criteria. Because we know the
>>>> maximum deviation between the two methods is 5e-7, we can add 5e-7 (or
>>>> 0.5e-6) to the criteria of the three above-mentioned tests. That would
>>>> make them pass even with the two-stage approach.
>>>>
>>>> What do you think?
>>>>
>>>>       Oded
>>>>
>>>
>

