[Mesa-dev] Low interpolation precision for 8 bit textures using llvmpipe

Fri Apr 12 15:32:29 UTC 2019

On 12.04.19 at 14:34, Dominik Drees wrote:
> Hi Roland!
> 
> On 4/11/19 8:18 PM, Roland Scheidegger wrote:
>> What version of mesa are you using?
> The original results were generated using version 19.0.2 (from the Arch
> Linux repositories), but I got the same results using the current git
> version (98934e6aa19795072a353dae6020dafadc76a1e3).
Alright, both of these would use the GALLIVM_PERF var.

>> The debug flags were changed a while ago (so that those perf tweaks can
>> be disabled on release builds too), it needs to be either:
>> GALLIVM_PERF=no_rho_approx,no_brilinear,no_quad_lod
>> or easier
>> GALLIVM_PERF=no_filter_hacks (which disables these 3 things above
>> together)
>>
>> Although all of that only really affects filtering with mipmaps (not
>> sure if you do?).
> Using GALLIVM_PERF does not make a difference either, but that should
> be expected because I'm not using mipmaps, just "regular" linear
> filtering (GL_LINEAR).
>>
>>
>> (more below)
> See my responses below as well.
>>
>>
>> On 11.04.19 at 18:00, Dominik Drees wrote:
>>> Running with the suggested flags in the environment does not change the
>>> result for the test case I described below. The results with and without
>>> the environment variables set are pixel-wise equal.
>>>
>>> By the way, in case this is of interest: for GL_NEAREST sampling the results
>>> from hardware and llvmpipe are equal as well.
>>>
>>> Best,
>>> Dominik
>>>
>>> On 4/11/19 4:36 PM, Ilia Mirkin wrote:
>>>> llvmpipe takes a number of shortcuts in the interest of speed which
>>>> cause inaccurate texturing. Try running with
>>>>
>>>> GALLIVM_DEBUG=no_rho_approx,no_brilinear,no_quad_lod
>>>>
>>>> and see if the issue still occurs.
>>>>
>>>> Cheers,
>>>>
>>>>     -ilia
>>>>
>>>>
>>>>
>>>> On Thu, Apr 11, 2019 at 8:30 AM Dominik Drees <dominik.drees at wwu.de>
>>>> wrote:
>>>>>
>>>>> Hello, everyone!
>>>>>
>>>>> I have a question regarding the interpolation precision of llvmpipe.
>>>>> Feel free to redirect me to somewhere else if this is not the right
>>>>> place to ask. Consider the following scenario: In a fragment shader we
>>>>> are sampling from a 16x16, 8 bit texture with values between 0 and 3
>>>>> using linear interpolation. Then we write white to the screen if the
>>>>> sampled value is > 1/255 and black otherwise. The output looks very
>>>>> different when rendered with llvmpipe compared to the result
>>>>> produced by
>>>>> rendering hardware (for both intel (mesa i965) and nvidia (proprietary
>>>>> driver)).
>>>>>
>>>>> I've uploaded example output images here
>>>>> (https://imgur.com/a/D1udpez)
>>>>>
>>>>>
>>>>> and the corresponding fragment shader here
>>>>> (https://pastebin.com/pa808Req).
>>>>>
>> The shader looks iffy to me, how do you use that vec4 in the if clause?
>>
>>
>>>>>
>>>>> My hypothesis is that llvmpipe (in contrast to hardware) only uses
>>>>> 8 bit
>>>>> for the interpolation computation when reading from 8 bit textures and
>>>>> thus loses precision in the lower bits. Is that correct? If so, does
>>>>> anyone know of a workaround?
>>
>> So, in theory it is indeed possible the results are less accurate with
>> llvmpipe (I believe all recent hw does rgba8 filtering with more than 8
>> bit precision).
>> For formats fitting into rgba8, we have a fast path in llvmpipe
>> (gallivm) for the lerp, which unpacks the 8bit values into 16bit values,
>> does the lerp with that and packs back to 8 bit. The result is
>> accurately rounded there (to 8 bit) but only for 1 lerp step - for a 2d
>> texture there are 3 of those (one per direction, and a final one
>> combining the result). And yes this means the filtered result only has 8
>> bits.
> Do I understand you correctly in that for the 2D case, the results of
> the first two lerps (done in 16 bit) are converted to 8 bit, then
> converted back to 16 bit for the final (second stage) lerp?
Yes. Even the final lerp is converted back to 8 bit before being finally
converted to float. (In theory we could avoid this for the final lerp,
but this would need some refactoring, since the last lerp isn't always
the same - if you have mipmaps for instance there's yet another lerp in
the end between the results of each mip.)
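
To make the effect concrete, here is a rough Python model of the
behaviour described above. It is not the actual gallivm code: it assumes
an 8 bit interpolation weight in [0, 256] and round-to-nearest after
every lerp, and the function names are made up for illustration.

```python
def lerp8(a, b, w):
    """Blend two 8 bit values; w is an assumed 8 bit weight in [0, 256].

    The weighted sum is rounded to nearest and cut back to 8 bits.
    """
    return (a * (256 - w) + b * w + 128) >> 8


def bilinear8(t00, t10, t01, t11, wx, wy):
    """Three lerps, each rounded back to 8 bits (the modelled fast path)."""
    top = lerp8(t00, t10, wx)
    bottom = lerp8(t01, t11, wx)
    return lerp8(top, bottom, wy)


def bilinear_float(t00, t10, t01, t11, fx, fy):
    """Reference: the same three lerps without intermediate rounding."""
    top = t00 + (t10 - t00) * fx
    bottom = t01 + (t11 - t01) * fx
    return top + (bottom - top) * fy


if __name__ == "__main__":
    # Texels 0 and 3, sampled about 40% of the way across (102/256).
    print(bilinear8(0, 3, 0, 3, 102, 128))       # result in 8 bit units
    print(bilinear_float(0, 3, 0, 3, 0.4, 0.5))  # result in 8 bit units
```

With texels 0 and 3 and a weight of roughly 0.4, the float reference
gives about 1.2 while the 8 bit model gives exactly 1, so a "> 1/255"
comparison in the shader flips from white to black.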

> 
> If so and if I'm understanding this correctly, for 2D (i.e., a 2-stage
> linear interpolation) we potentially have an error in the order of one
> bit for the final 8 bit value due to the intermediate 16->8->16
> conversion. For sampling from a 3D texture (i.e., a 3-stage linear
> interpolation) the effect would be amplified: The extra stage could
> cause an error with a magnitude of two bits of the final 8 bit result
> (if I'm doing the math in my head correctly).
I'd have to think about this some more, but I don't think the error
really accumulates like this (the results aren't multiplied, the
interpolation factors are independent from the previous results).
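
One way to check this is a small brute-force comparison. The sketch
below uses a simplified model (8 bit weights in [0, 256],
round-to-nearest after each lerp), not the real gallivm code, and all
names are hypothetical. In this model each rounding is off by at most
half an 8 bit step, and a lerp of two inputs that are each off by at
most e is itself off by at most e, so the worst case should grow by 0.5
per stage: at most 0.5, 1.0 and 1.5 steps for 1D, 2D and 3D.

```python
import itertools


def lerp8(a, b, w):
    """Lerp with an assumed 8 bit weight, rounded back to 8 bits."""
    return (a * (256 - w) + b * w + 128) >> 8


def lerp_exact(a, b, w):
    """Same lerp with the same weight, but without any rounding."""
    return a + (b - a) * (w / 256)


def filter_nd(texels, weights, lerp):
    """Reduce 2**n texels with n weights, one axis per stage."""
    values = list(texels)
    for w in weights:
        values = [lerp(values[i], values[i + 1], w)
                  for i in range(0, len(values), 2)]
    return values[0]


def max_error(dims, texel_values=(0, 3), weight_steps=range(0, 257, 32)):
    """Largest |rounded - exact| (in 8 bit steps) over a coarse search grid."""
    worst = 0.0
    for texels in itertools.product(texel_values, repeat=2 ** dims):
        for weights in itertools.product(weight_steps, repeat=dims):
            rounded = filter_nd(texels, weights, lerp8)
            exact = filter_nd(texels, weights, lerp_exact)
            worst = max(worst, abs(rounded - exact))
    return worst


if __name__ == "__main__":
    for dims in (1, 2, 3):
        print(dims, max_error(dims))
```

The search grid is deliberately coarse (texel values 0 and 3 only), so
the printed numbers are what this grid happens to find, not a proven
maximum; the 0.5-per-stage figure is the bound for this model.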

> 
> Is there any (conceptual) reason why the result of a one dimensional
> interpolation step is reduced back to 8 bits before the second stage
> interpolation? Would avoiding these conversions not actually be faster
> (in addition to the improved accuracy)?
So this gets a bit complicated...
We do actually pack the results back (into 16x8bit values) and unpack
them (into 2 8x16bit values) for the next lerp. There's pretty trivial
sse2 instructions for this, so this is very fast. You are right that
this particular step could be skipped, although the logic gets more
complicated (we've experimented with that) and it won't really make
things faster in practice.
The real problem however is that even if you do that, this doesn't avoid
the actual issue. Which is that for doing lerp with fixed point, you
have a mul which can overflow. We're using a 16bit vectorized mul (so
8x16bit * 8x16bit -> 8x16bit result). The values we multiply are only 8
bits so the upper bits of the 16bit values are 0, which ensures the
result stays in 16bit. But we have to throw away the lower 8 bits before
we can continue otherwise the next mul would overflow.
Now it's entirely possible to avoid this - don't throw away the low bits
(hence keep the value as 8.8 fixed point), and then for next lerp step
you just have to calculate the high bits of the result too (and
shift/add things back into place so you have 8.8 numbers again and not
16.16 ones). But this filtering path is supposed to be fast - it's
essentially the most important thing we optimize for (such filtering
with rgba8 formats is very common for screen composition etc.), and the
code would also get more complex. Feel free though to experiment with it
(lp_build_lerp2d is a good start, the normalized paths there).
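
The overflow argument in numbers, as a small Python sketch.
truncating_mul16 is a hypothetical stand-in for one lane of a 16bit
vector multiply that keeps only the low 16 bits of the product (which is
what pmullw does); it is not taken from the gallivm sources.

```python
def max_product_bits(value_bits, weight_bits):
    """Bits needed for the largest product of two unsigned operands."""
    max_value = (1 << value_bits) - 1
    max_weight = (1 << weight_bits) - 1
    return (max_value * max_weight).bit_length()


def truncating_mul16(a, b):
    """Model of one 16 bit multiply lane: only the low 16 bits survive."""
    return (a * b) & 0xFFFF


if __name__ == "__main__":
    # 8 bit value times 8 bit weight: fits the 16 bit lane exactly.
    print(max_product_bits(8, 8))
    print(truncating_mul16(0xFF, 0xFF) == 0xFF * 0xFF)

    # 8.8 fixed point value (16 bits) times 8 bit weight: needs a wider
    # result, so the lane silently drops the high bits.
    print(max_product_bits(16, 8))
    print(truncating_mul16(0xFF00, 0xFF) == 0xFF00 * 0xFF)
```

Keeping the low 8 bits between lerps therefore means computing the high
half of the product as well and recombining the two halves, which is the
extra work described above.
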
There _might_ be another way to solve this actually: d3d10 (not sure
about newer versions, and gl tends to be fuzzy around these things)
requires 6 bits for the fractional part of the calculated texel coords,
and likewise 6 bits for fractional part of LOD. This means that we
possibly could do the lerps with effectively 8.2 (for intermediate
results) and 6.0 (for interpolation factors) values, making the mul
non-overflowing too. I have no idea though if this would actually
satisfy the tests (of course interpolation weight precision would be
lower), and I think it would also be some rather complex code change,
these normalized lerps are optimized to death with a million tricks used...
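
A very rough Python sketch of what an 8.2 by 6.0 lerp could look like.
Everything here is an assumption for illustration: the function names,
the weight range of 0..64, and the rounding. It says nothing about
whether such a scheme would pass conformance tests.

```python
def to_8_2(texel8):
    """Promote an 8 bit texel to 8.2 fixed point (two fractional bits)."""
    return texel8 << 2


def lerp_8_2(a, b, w6):
    """Lerp two 8.2 values with an assumed 6 bit weight w6 in [0, 64].

    The accumulator has to stay within 16 bits, which is the point of
    trading weight precision for intermediate precision.
    """
    acc = a * (64 - w6) + b * w6 + 32
    assert acc <= 0xFFFF, "would overflow a 16 bit lane"
    return acc >> 6  # result is again in 8.2 fixed point


def bilinear_8_2(t00, t10, t01, t11, wx6, wy6):
    """Bilinear filter that keeps two fractional bits between lerps."""
    top = lerp_8_2(to_8_2(t00), to_8_2(t10), wx6)
    bottom = lerp_8_2(to_8_2(t01), to_8_2(t11), wx6)
    return lerp_8_2(top, bottom, wy6)


if __name__ == "__main__":
    # Texels 0 and 3, sampled about 40% of the way across (26/64).
    result = bilinear_8_2(0, 3, 0, 3, 26, 32)
    print(result, result / 4)  # raw 8.2 value and value in 8 bit units
```

For texels 0 and 3 and a weight of 26/64 this model returns 5 in 8.2
fixed point, i.e. 1.25 in 8 bit units, so the two extra fractional bits
survive to the end as long as the final result is not packed back to 8
bits.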

>>
>> I do believe you should not rely on implementations having more accuracy
>> - as far as I know the filtering we do is conformant there (it is tricky
>> to do better using the fast path).
> In principle you are correct. In our regression tests we actually have
> (per test) configurable thresholds for maximum pixel distance/maximum
> number of differing pixels/neighborhood search radius etc. We could just
> increase these thresholds, but would risk missing some regressions that
> (for example) only affect a very small portion of the screen. For the
> larger part of our test suite llvmpipe actually works quite well within
> the established limits.
> For some other cases where we render a relatively small 8 bit 3D volume
> the differences basically trampled the previously set thresholds and
> were quite visible to the naked eye.
> 
>>
>> There is code to do filtering with full float precision, although
>> there's no way to reach it with rgba8 formats unless you change the
>> code. If you want to try out the theory, look at lp_bld_sample_soa.c:
>> lp_build_sample_soa_code() determines whether to use the fast (AoS)
>> filtering path (use_aos, determined mostly by
>> util_format_fits_8unorm()). If you set this to false it will use the
>> full float filtering path. (FWIW I was actually thinking a while ago we
>> should force this path when there's only 1 channel, albeit I never got
>> around to test (benchmark) it - this is because the AoS filtering path
>> is really optimized for rgba8 formats, and if you only have 1 channel
>> it's quite possible float filtering is actually faster, since this
>> handles the channels individually.)
>> I guess though if the full float precision filtering is useful in
>> general, we could add that to GALLIVM_PERF.
> Forcing float precision indeed fixes the test case described below and
> our volume rendering regression tests! If this cannot be fixed in
> general I would be very happy about an option to force float precision
> via GALLIVM_PERF. FWIW, with forced float precision running our test
> suite is actually faster (~6 minutes) than "stock" master (~6:40), but
> these timings may be highly biased, of course.
So, if these are really 8bit (not 4x8bit) textures (or you use only one
channel of the result) then I wouldn't be surprised if it's faster with
floats. (As I mentioned before, we considered using float filtering for
this case, albeit only the former case, we can't really detect which
channels are actually used here.)
This may also be dependent on your hw - in particular if your cpu
supports AVX but not AVX2 the float filtering path tends to gain some
ground (because AVX has no 256bit integer instructions, hence we split
the values into 2x128bit halves rather than processing them as 256bit
for AoS filtering).

This answer got a bit lengthy...

In any case, I think a new option for GALLIVM_PERF sounds reasonable as
a quick fix...

Roland

> 
> Best,
> Dominik
>>
>> Roland
>>
>>
>>
>>
>>>>>
>>>>> A little bit of background about the use case: We are trying to
>>>>> move the
>>>>> CI of Voreen
>>>>> (https://www.uni-muenster.de/Voreen/)
>>>>>
>>>>> to the Gitlab-CI
>>>>> running in docker without any hardware dependencies. Using llvmpipe
>>>>> for
>>>>> our regression tests works in principle, but shows significant
>>>>> differences in the raycasting rendering of an 8-bit-per-voxel dataset.
>>>>> (The effect is of course less visible than the constructed example
>>>>> case
>>>>> linked above, but still quite noticeable for a human.)
>>>>>
>>>>> Any help or pointers would be appreciated!
>>>>>
>>>>> Best,
>>>>> Dominik
>>>>>
>>>>> -- 
>>>>> Dominik Drees
>>>>>
>>>>> Department of Computer Science
>>>>> Westfaelische Wilhelms-Universitaet Muenster
>>>>>
>>>>> email: dominik.drees at wwu.de
>>>>> web:
>>>>> https://www.wwu.de/PRIA/personen/drees.shtml
>>>>>
>>>>>
>>>>> phone: +49 251 83 - 38448
>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>
>>>>>
>>>
>>>
>>
>