[Mesa-dev] Low interpolation precision for 8 bit textures using llvmpipe

Mon Apr 15 16:39:46 UTC 2019

On 15.04.19 at 13:55, Dominik Drees wrote:
> On 4/12/19 5:32 PM, Roland Scheidegger wrote:
>> On 12.04.19 at 14:34, Dominik Drees wrote:
>>> Hi Roland!
>>>
>>> On 4/11/19 8:18 PM, Roland Scheidegger wrote:
>>>> What version of mesa are you using?
>>> The original results were generated using version 19.0.2 (from the arch
>>> linux repositories), but I got the same results using the current git
>>> version (98934e6aa19795072a353dae6020dafadc76a1e3).
>> Alright, both of these would use the GALLIVM_PERF var.
>>
>>>> The debug flags were changed a while ago (so that those perf tweaks can
>>>> be disabled on release builds too), it needs to be either:
>>>> GALLIVM_PERF=no_rho_approx,no_brilinear,no_quad_lod
>>>> or easier
>>>> GALLIVM_PERF=no_filter_hacks (which disables these 3 things above
>>>> together)
>>>>
>>>> Although all of that only really affects filtering with mipmaps (not
>>>> sure if you do?).
>>> Using GALLIVM_PERF does not make a difference either, but that should
>>> be expected because I'm not using mipmaps, just "regular" linear
>>> filtering (GL_LINEAR).
>>>>
>>>>
>>>> (more below)
>>> See my responses below as well.
>>>>
>>>>
>>>> On 11.04.19 at 18:00, Dominik Drees wrote:
>>>>> Running with the suggested flags in the environment does not change the
>>>>> result for the test case I described below. The results with and without
>>>>> the environment variables set are pixel-wise equal.
>>>>>
>>>>> By the way, if this is of interest: For GL_NEAREST sampling the results
>>>>> from hardware and llvmpipe are equal as well.
>>>>>
>>>>> Best,
>>>>> Dominik
>>>>>
>>>>> On 4/11/19 4:36 PM, Ilia Mirkin wrote:
>>>>>> llvmpipe takes a number of shortcuts in the interest of speed which
>>>>>> cause inaccurate texturing. Try running with
>>>>>>
>>>>>> GALLIVM_DEBUG=no_rho_approx,no_brilinear,no_quad_lod
>>>>>>
>>>>>> and see if the issue still occurs.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>      -ilia
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 11, 2019 at 8:30 AM Dominik Drees <dominik.drees at wwu.de>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hello, everyone!
>>>>>>>
>>>>>>> I have a question regarding the interpolation precision of llvmpipe.
>>>>>>> Feel free to redirect me to somewhere else if this is not the right
>>>>>>> place to ask. Consider the following scenario: In a fragment shader we
>>>>>>> are sampling from a 16x16, 8 bit texture with values between 0 and 3
>>>>>>> using linear interpolation. Then we write white to the screen if the
>>>>>>> sampled value is > 1/255 and black otherwise. The output looks very
>>>>>>> different when rendered with llvmpipe compared to the result
>>>>>>> produced by
>>>>>>> rendering hardware (for both intel (mesa i965) and nvidia (proprietary
>>>>>>> driver)).
>>>>>>>
>>>>>>> I've uploaded exemplary output images here
>>>>>>> (https://imgur.com/a/D1udpez)
>>>>>>>
>>>>>>>
>>>>>>> and the corresponding fragment shader here
>>>>>>> (https://pastebin.com/pa808Req).
>>>>>>>
>>>> The shader looks iffy to me, how do you use that vec4 in the if clause?
>>>>
>>>>
>>>>>>>
>>>>>>> My hypothesis is that llvmpipe (in contrast to hardware) only uses
>>>>>>> 8 bit
>>>>>>> for the interpolation computation when reading from 8 bit textures and
>>>>>>> thus loses precision in the lower bits. Is that correct? If so, does
>>>>>>> anyone know of a workaround?
>>>>
>>>> So, in theory it is indeed possible the results are less accurate with
>>>> llvmpipe (I believe all recent hw does rgba8 filtering with more than 8
>>>> bit precision).
>>>> For formats fitting into rgba8, we have a fast path in llvmpipe
>>>> (gallivm) for the lerp, which unpacks the 8bit values into 16bit values,
>>>> does the lerp with that and packs back to 8 bit. The result is
>>>> accurately rounded there (to 8 bit) but only for 1 lerp step - for a 2d
>>>> texture there are 3 of those (one per direction, and a final one
>>>> combining the result). And yes this means the filtered result only has 8
>>>> bits.
>>> Do I understand you correctly in that for the 2D case, the results of
>>> the first two lerps (done in 16 bit) are converted to 8 bit, then
>>> converted back to 16 bit for the final (second stage) lerp?
>> Yes. Even the final lerp is converted back to 8 bit before being finally
>> converted to float. (In theory we could avoid this for the final lerp,
>> but this would need some refactoring, since the last lerp isn't always
>> the same - if you have mipmaps for instance there's yet another lerp in
>> the end between the results of each mip.)
>>
>>
>>>
>>> If so and if I'm understanding this correctly, for 2D (i.e., a 2-stage
>>> linear interpolation) we potentially have an error in the order of one
>>> bit for the final 8 bit value due to the intermediate 16->8->16
>>> conversion. For sampling from a 3D texture (i.e., a 3-stage linear
>>> interpolation) the effect would be amplified: The extra stage could
>>> cause an error with a magnitude of two bits of the final 8 bit result
>>> (if I'm doing the math in my head correctly).
>> I'd have to think about this some more, but I don't think the error
>> really accumulates like this (the results aren't multiplied, the
>> interpolation factors are independent from the previous results).
> Looking at the implementation (and from your explanation below) I now
> realize that my mental model of how the interpolation works in fixed
> point arithmetic was wildly oversimplified. My apologies!
>>
>>>
>>> Is there any (conceptual) reason why the result of a one dimensional
>>> interpolation step is reduced back to 8 bits before the second stage
>>> interpolation? Would avoiding these conversions not actually be faster
>>> (in addition to the improved accuracy)?
>> So this gets a bit complicated...
>> We do actually pack the results back (into 16x8bit values) and unpack
>> them (into 2 8x16bit values) for the next lerp. There are pretty trivial
>> sse2 instructions for this, so this is very fast. You are right that
>> this particular step could be skipped, although the logic gets more
>> complicated (we've experimented with that) and it won't really make
>> things faster in practice.
>> The real problem however is that even if you do that, this doesn't avoid
>> the actual issue. Which is that for doing lerp with fixed point, you
>> have a mul which can overflow. We're using a 16bit vectorized mul (so
>> 8x16bit * 8x16bit -> 8x16bit result). The values we multiply are only 8
>> bits so the upper bits of the 16bit values are 0, which ensures the
>> result stays in 16bit. But we have to throw away the lower 8 bits before
>> we can continue otherwise the next mul would overflow.
>> Now it's entirely possible to avoid this - don't throw away the low bits
>> (hence keep the value as 8.8 fixed point), and then for next lerp step
>> you just have to calculate the high bits of the result too (and
>> shift/add things back into place so you have 8.8 numbers again and not
>> 16.16 ones). But this filtering path is supposed to be fast - it's
>> essentially the most important thing we optimize for (such filtering
>> with rgba8 formats is very common for screen composition etc.), and the
>> code would also get more complex. Feel free though to experiment with it
>> (lp_build_lerp2d is a good start, the normalized paths there).
>> There _might_ be another way to solve this actually: d3d10 (not sure
>> about newer versions, and gl tends to be fuzzy around these things)
>> requires 6 bits for the fractional part of the calculated texel coords,
>> and likewise 6 bits for fractional part of LOD. This means that we
>> possibly could do the lerps with effectively 8.2 (for intermediate
>> results) and 6.0 (for interpolation factors) values, making the mul
>> non-overflowing too. I have no idea though if this would actually
>> satisfy the tests (of course interpolation weight precision would be
>> lower), and I think it would also be some rather complex code change,
>> these normalized lerps are optimized to death with a million tricks used...
> I have actually started to tinker with the code a bit before your mail
> arrived and saw that it's not really straightforward to remove
> intermediate packing and unpacking. Anyway, I'm not really familiar with
> fixed point arithmetic in detail (as evidenced by my previous email...)
> and of course overlooked the overflow issue. I see now how avoiding the
> 16->8->16 conversion would not do anything for the interpolation
> accuracy. This really isn't my area of expertise, and I actually learned
> something. So thank you very much for your detailed explanation.
So, in theory doing this with higher precision should be possible
without too many extra instructions. Basically a second mul (sse2 has an
instruction for giving back the high bits of a 16x16 mul, although it
may be tricky to get llvm to do it efficiently, without the use of
intrinsics) would need to be used, and combined back into the result
(we'd have a 16 bit x 8 bit mul effectively, so the high 8 bits of the
low result together with the low 8 bits of the high result gives the new
16 bit result).
But indeed, this is not straightforward, and certainly would need quite
some extra code (of course for the first-step lerps, we really would
need different code than for later lerps). And of course all
intermediate packing would need to go.

Roland


>>
>>
>>
>>>>
>>>> I do believe you should not rely on implementations having more accuracy
>>>> - as far as I know the filtering we do is conformant there (it is tricky
>>>> to do better using the fast path).
>>> In principle you are correct. In our regression tests we actually have
>>> (per test) configurable thresholds for maximum pixel distance/maximum
>>> number of differing pixels/neighborhood search radius etc. We could just
>>> increase these thresholds, but would risk missing some regressions that
>>> (for example) only affect a very small portion of the screen. For the
>>> larger part of our test suite llvmpipe actually works quite well within
>>> the established limits.
>>> For some other cases where we render a relatively small 8 bit 3D volume
>>> the differences basically trampled the previously set thresholds and
>>> were quite visible to the naked eye.
>>>
>>>>
>>>> There would be code to actually do filtering with full float precision,
>>>> although there's no way to reach it with rgba8 formats unless you change
>>>> the code (if you want to try out the theory, look at
>>>> lp_bld_sample_soa.c, lp_build_sample_soa_code() determines whether to
>>>> use the fast (aos) filtering path (use_aos, determined mostly by
>>>> util_format_fits_8unorm()). If you set this to false it will use the
>>>> full float filtering path. (FWIW I was actually thinking a while ago we
>>>> should force this path when there's only 1 channel, albeit I never got
>>>> around to test (benchmark) it - this is because the AoS filtering path
>>>> is really optimized for rgba8 formats, and if you only have 1 channel
>>>> it's quite possible float filtering is actually faster, since this
>>>> handles the channels individually.)
>>>> I guess though if the full float precision filtering is useful in
>>>> general, we could add that to GALLIVM_PERF.
>>> Forcing float precision indeed fixes the test case described below and
>>> our volume rendering regression tests! If this cannot be fixed in
>>> general I would be very happy about an option to force float precision
>>> via GALLIVM_PERF. FWIW, with forced float precision running our test
>>> suite is actually faster (~6 minutes) than "stock" master (~6:40), but
>>> these numbers may be highly biased, of course.
>> So, if these are really 8bit (not 4x8bit) textures (or you use only one
>> channel of the result) then I wouldn't be surprised if it's faster with
>> floats. (As I mentioned before, we considered using float filtering for
>> this case, albeit only the former case, we can't really detect which
>> channels are actually used here.)
>> This may also be dependent on your hw - in particular if your cpu
>> supports AVX but not AVX2 the float filtering path tends to gain some
>> ground (because AVX can't handle integers, hence we split the values
>> into 2x128bit halves rather than processing as 256bit for AoS filtering).
> You are spot on: The processed data indeed consists mostly of one
> channel 8 bit or 16 bit 3D textures and my processor does in fact
> support AVX, but not AVX2. Obviously no one should take decisions based
> on these numbers.
Well it is a valid combination. My suspicion is for 1 channel formats,
it is probably a good idea to use SoA filtering regardless, even if the
cpu only supports sse, or supports avx2.


>>
>> This answer got a bit lengthy...
>>
>> In any case, I think a new option for GALLIVM_PERF sounds reasonable as
>> a quick fix...
> I have created a quick patch for a GALLIVM_PERF option and validated
> that it works for my use case. I created a merge request in the mesa
> gitlab: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/659
> 
> Best,
> Dominik
> 
> 
>>
>> Roland
>>
>>
>>
>>
>>>
>>> Best,
>>> Dominik
>>>>
>>>> Roland
>>>>
>>>>
>>>>
>>>>
>>>>>>>
>>>>>>> A little bit of background about the use case: We are trying to
>>>>>>> move the
>>>>>>> CI of Voreen
>>>>>>> (https://www.uni-muenster.de/Voreen/)
>>>>>>>
>>>>>>> to the Gitlab-CI
>>>>>>> running in docker without any hardware dependencies. Using llvmpipe
>>>>>>> for
>>>>>>> our regression tests works in principle, but shows significant
>>>>>>> differences in the raycasting rendering of an 8-bit-per-voxel dataset.
>>>>>>> (The effect is of course less visible than the constructed example
>>>>>>> case
>>>>>>> linked above, but still quite noticeable for a human.)
>>>>>>>
>>>>>>> Any help or pointers would be appreciated!
>>>>>>>
>>>>>>> Best,
>>>>>>> Dominik
>>>>>>>
>>>>>>> -- 
>>>>>>> Dominik Drees
>>>>>>>
>>>>>>> Department of Computer Science
>>>>>>> Westfaelische Wilhelms-Universitaet Muenster
>>>>>>>
>>>>>>> email: dominik.drees at wwu.de
>>>>>>> web:
>>>>>>> https://www.wwu.de/PRIA/personen/drees.shtml
>>>>>>>
>>>>>>>
>>>>>>> phone: +49 251 83 - 38448
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mesa-dev mailing list
>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>