[Mesa-dev] Problems with accuracy of coeffs_init + attribs_update

Thu Oct 29 12:37:25 PDT 2015

Am 29.10.2015 um 19:44 schrieb Oded Gabbay:
> On Thu, Oct 29, 2015 at 5:31 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>> Am 29.10.2015 um 12:27 schrieb Oded Gabbay:
>>> Hi Roland, Jose
>>>
>>> I wanted to bring a problem I found to your attention, and discuss
>>> about it and ways to solve it.
>>>
>>> I'm working on regressions of piglit gpu.py between x86-64 and
>>> ppc64le, when running with llvmpipe.
>>>
>>> One of the regressions manifests itself in 2 tests,
>>> clip-distance-bulk-copy and clip-distance-itemized-copy
>> There's actually a third one, interface-vs-unnamed-to-fs-unnamed, which
>> fails the same (and it's easier to debug...).
>>
> Yes, you are correct of course. It was also fixed by my change.
> 
>>
>>>
>>> What happens in ppc64le is that *some* of the interpolated per-vertex
>>> values (clip-distance values) that constitute the input into the
>>> fragment shader, are not calculated according to the required accuracy
>>> of the test (which is an error/distance of less than 1e-6)
>>>
>>> It took a while, but I believe I found the reason. In general, when
>>> running on ppc64le, the code path that is taken at
>>> lp_build_interp_soa_init() is doing  bld->simple_interp = FALSE, which
>>> means using coeffs_init() + attribs_update(). That's in contrast with
>>> x86-64, where bld->simple_interp = TRUE, which means the code will use
>>> coeffs_init_simple() + attribs_update_simple()
>> Note this isn't really about x86-64 per se. If you don't have AVX on a
>> x86-64 box, the other path will be taken too (and the test fail as well).
>>
> makes sense.
> 
>>>
>>> In specific, the problem lies in how the coeffs are calculated inside
>>> coeffs_init. To simply put it, they are only *actually* calculated
>>> correctly for the first quad. The coeffs are later *updated* for the
>>> other quads, using dadx/dady and dadq.
>>>
>>> ------
>>> side-note:
>>> I believe you even documented it in coeffs_init():
>>>       /*
>>>        * a = a0 + (x * dadx + y * dady)
>>>        * a0aos is the attrib value at top left corner of stamp
>>>        */
>>> ------
>>>
>>> The root cause for that, I believe, is because coeffs_init does *not*
>>> take into account the different x/y offsets for the other quads. In
>>> contrast, on x86-64, the code uses a function called calc_offset(),
>>> which does take the different x/y offsets into account.
>>>
>>> Please look at this definition (from lp_bld_interp.c) :
>>>
>>> static const unsigned char quad_offset_x[16] = {0, 1, 0, 1, 2, 3, 2,
>>> 3, 0, 1, 0, 1, 2, 3, 2, 3};
>>> static const unsigned char quad_offset_y[16] = {0, 0, 1, 1, 0, 0, 1,
>>> 1, 2, 2, 3, 3, 2, 2, 3, 3};
>>>
>>> Now, look at how coeffs_init() uses them:
>>>
>>>    for (i = 0; i < coeff_bld->type.length; i++) {
>>>       LLVMValueRef nr = lp_build_const_int32(gallivm, i);
>>>       LLVMValueRef pixxf = lp_build_const_float(gallivm, quad_offset_x[i]);
>>>       LLVMValueRef pixyf = lp_build_const_float(gallivm, quad_offset_y[i]);
>>>       pixoffx = LLVMBuildInsertElement(builder, pixoffx, pixxf, nr, "");
>>>       pixoffy = LLVMBuildInsertElement(builder, pixoffy, pixyf, nr, "");
>>>    }
>>>
>>> So, in ppc64le, where coeff_bld->type.length == 4, we *always* take
>>> the first four values from the above arrays. The code *never*
>>> calculates the coeffs based on the other offsets.
>>>
>>> Admittedly, that's *almost* always works and I doubt you will see a
>>> difference in display. However, in the tests I mentioned above, it
>>> creates, for some pixels, a bigger error than what is allowed.
>> Well, I'm not sure what error actually is allowed... GL of course is
>> vague (the test just seems to be built so the error allowed actually
>> passes on real hw).
>>
>> I don't think there's an actual bug with the code. What it does in
>> coeffs_init, is simply calculate the dadx + dady values from one pixel
>> to the next (within a quad).
>> Then, in attrib_update, it simply uses these values (which are constant
>> for all quads, as the position of the pixels within a quad doesn't
>> change) to calculate the actual dadq value based on the actual pixel
>> offsets, and add that to the (also done in coeffs_init) starting value
>> of each quad.
>>
>> So, the code is correct. However, it still gives different results
>> compared to the other method, and the reason is just float math. This
>> method does two interpolations (one for the start of the quad, the other
>> per pixel), whereas the other method does just one. And this is why the
>> results are different, because the results just aren't exact.
>>
> Yeah, I agree. I also wouldn't say it is a bug per-se, but it is a
> method of calculation that produces less accurate results then the
> other method (the simple path).
> 
>>
>>>
>>> The first thing that came into my mind was to change, in
>>> lp_build_interp_soa_init():
>>>
>>> "if (coeff_type.length > 4)"
>>> to this:
>>> "if (coeff_type.length >= 4)"
>>>
>>> When I did that, I saw it fixes the above tests and didn't cause any
>>> regressions on ppc64le and/or on x86-64.
>>>
>>> However, I wanted to ask you guys if you remember why you allowed only
>>> vectors with size above 4 to use the "simple" route ?
>>> Was there any performance hit ?
>> Some history, initially there actually only was the two-stage path (well
>> actually it was slightly different as back then the pixel shader wasn't
>> even executed in a loop but the idea was the same). The one-stage path
>> was actually done when extending to 256bit vectors (or maybe when
>> introducing the fs loop, I forgot...) because it was easier to code.
>> Albeit both paths were fixed to work both with 128bit and 256bit
>> vectors. And indeed, the reason why one is used over the other is just
>> performance now (the two-stage approach has a higher initial cost but
>> lower per-loop cost, so it's faster when the loop is run four times, but
>> not when it's only run two times).
>> I was actually just thinking about nuking the two-stage approach
>> yesterday as it's quite annoying the results are different depending on
>> the vector size. But IIRC the performance difference was quite
>> measurable sometimes.
>>
> That's what I thought.
> I'm willing to test the performance on H/W if you can point me to a
> relevant test.
> By running glxgears, I can see the FPS drops by about 5 FPS (from 605
> to 600), when using the simple method.
> Also, it seemed to me, but I didn't measure it, that running piglit
> gpu.py was slower by some margin.
glxgears is a bit _too_ simple...
Something simple which uses "traditional" mul+add shader (2 interpolated
attributes, one of them for texture coords) or doing something similarly
simple might be a good test. gloss demo, or openarena - desktop
compositors would of course be nice, but I don't think they come with
benchmark mode...

Roland


> 
>>>
>>> I think that to fix coeffs_init()+attribs_update() is to basically
>>> re-design them. Otherwise, I would have tried to fix them first.
>>
>> Well, even the one-stage approach is basically crap. The reason is that
>> the start value is always based on the top left corner of the screen
>> (coords 0,0 or actually 0.5,0.5). This means if you have some triangle
>> far away from the screen origin, but with steep gradients, the errors
>> will be huge (but there's no piglit tests which would hit this).
>> Modern hw has abandoned this way of interpolation by using barycentric
>> interpolation. But it's quite complex to do fast (the hw integrates this
>> into rasterization, since the barycentric weights are the same as the
>> edge functions).
>> Albeit it would be possible to improve this particular problem without
>> using barycentric interpolation (could for instance use one corner of
>> the triangle as start value instead of 0,0 but adds complexity as well).
>> But the coeffs_init/update functions as such are unfixable as it's just
>> float math precision issues (well you could try using fma instead of
>> mul+add, use double precision, ...) but the fundamental issue that it's
>> just less precise doing two-step interpolation instead of one-step remains.
>>
>>
>>>
>>> In any case, I would love to hear your feedback and to discuss this further.
>>>
>>
>> Do you see any specific problem outside the 3 failing piglit tests?
>> Because as far as I can tell, the error allowed in the tests was really
>> just chosen so it works on ordinary hw
>> (interface-vs-unnamed-to-fs-unnamed actually has higher allowed error
>> than the other two).
>> So, I was reluctant to change it just because of these piglits, but this
>> isn't set in stone...
>>
>> Roland
>>
> No, I couldn't see any other problem that relates to this, as I
> understood the error was so small (the results of the interpolation
> were off by about 5e-7 then x86-64).
> I also run some OpenGL demo that uses gl_ClipDistance and I didn't see
> any visual problem.
> 
> However, I would hate to keep the situation as is, meaning the test
> passes on x86-64 and fails on ppc64le.
> If we agree that there is no actual bug in the code, then another
> possibility is to change the failure criteria. Because we know the
> maximum deviation between the two methods is 5e-7, we can add 5e-7 (or
> 0.5e-6) to the criteria of the three above mentioned tests. That would
> make them pass even with the two-stages approach.
> 
> What do you think ?
> 
>       Oded
>