[Mesa-dev] [PATCH] st/mesa: expose EXT_framebuffer_multisample_blit_scaled if MSAA is supported

Paul Berry stereotype441 at gmail.com
Tue Jul 16 20:05:40 PDT 2013


On 16 July 2013 17:52, Grigori Goronzy <greg at chown.ath.cx> wrote:

> On 17.07.2013 02:05, Marek Olšák wrote:
>
>> No, it's not faster, but it's not slower either.
>>
>> Now that I think about it, I can't come up with a good shader-based
>> algorithm for the resolve operation.
>>
>> I don't think Christoph's approach that an MSAA texture can be viewed
>> as a larger single-sample texture is correct, because the physical
>> locations of the samples in memory usually do not correspond to the
>> sample locations the 3D engine used for rasterization, so fetching a
>> texel from the larger texture at (x,y) physical coordinates won't
>> always return the closest rasterized sample at those coordinates. Also
>> the bilinear filter would be horrible in this case, because it only
>> takes 4 samples per pixel.
>>
>> Now let's consider implementing the scaled resolve operation in the
>> shader by texelFetch-ing all samples and using a bilinear filter. For
>> Nx MSAA, there would be N*4 texel fetches per pixel; in comparison,
>> separate resolve+blit needs only N+4 texel fetches per pixel. In
>> addition to that, the resolve is a special fixed-function blending
>> operation and the fragment shader is not even executed. See? Separate
>> resolve+blit beats everything.
>>
>>
> AFAICS the point of the spec is that it allows cheaper approximations that
> don't use all texels and it allows the implementation to avoid writes to a
> temp texture, both to save memory bandwidth. I am not sure if it is
> reasonably possible to do this (without causing aliasing). How does scaled
> blit on Intel hardware perform compared to resolve+blit? Maybe it helps on
> bandwidth-constrained GPU configurations.
>

It definitely helps on bandwidth-constrained GPU configurations such as
Intel's, though I don't remember exactly how much.  There's also a quality
improvement in certain use cases.  For example, imagine a game that renders
at 1280x720 with 4x MSAA, and then blits the result to a 1920x1080 display
(I believe this is a typical use case for this extension).  Because of the
multisampling, the 4x MSAA buffer contains approximately 2560x1440 worth of
resolution at triangle edges (of course, there's only 1280x720 worth of
texture detail, since the fragment shader runs only once per pixel).  If
you first resolve that down to 1280x720 and then blit up to 1920x1080, you
lose that extra resolution.  If instead you do a direct blit using
EXT_framebuffer_multisample_blit_scaled, and the implementation is smart,
then you get essentially full 1920x1080 resolution at triangle edges, and
you still get a reasonable amount of antialiasing, without paying the cost
of a full 1920x1080 MSAA render.
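
In API terms the whole thing is a one-line change to the blit call.  A
minimal sketch in C, assuming the 1280x720 4x MSAA FBO is bound as the
read framebuffer and the 1920x1080 destination as the draw framebuffer:

    /* One-step scaled resolve.  The extension relaxes the usual rule
     * that a blit from a multisampled buffer must not scale, by
     * allowing the two SCALED_RESOLVE filters below. */
    glBlitFramebuffer(0, 0, 1280, 720,     /* source rect (4x MSAA) */
                      0, 0, 1920, 1080,    /* destination rect      */
                      GL_COLOR_BUFFER_BIT,
                      GL_SCALED_RESOLVE_FASTEST_EXT); /* or _NICEST_EXT */

The resolve+blit path instead needs an intermediate single-sampled FBO
and two glBlitFramebuffer calls.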

Based on the "NVIDIA Implementation Details" section at the end of the
EXT_framebuffer_multisample_blit_scaled spec, and based on Anuj's
experiments as he was doing Mesa's implementation for Intel, I believe
Christoph is exactly correct, at least concerning what happens in the
Nvidia proprietary drivers.  Nvidia reinterprets the 1280x720 4x MSAA
renderbuffer as a 2560x1440 single-sampled renderbuffer, and then does a
standard blit, with either bilinear filtering (in the case of
SCALED_RESOLVE_FASTEST_EXT) or anisotropic filtering (in the case of
SCALED_RESOLVE_NICEST_EXT).  It's a slight cheat, since the true sample
locations don't form a regular grid, but in practice it introduces a small
enough error to be negligible.
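
Conceptually (leaving aside that the hardware presumably does the
remapping with fixed-function addressing rather than shader math), the
reinterpretation is just index arithmetic.  A sketch in C, with a
hypothetical sample numbering:

    /* Map a coordinate (X, Y) in the virtual 2560x1440 single-sampled
     * view to a (pixel, sample) pair in the 1280x720 4x MSAA image,
     * treating each pixel's four samples as a 2x2 quad.  The sample
     * numbering here is made up, and since the true sample positions
     * are not a regular grid, this is exactly the "slight cheat". */
    static void virtual_to_msaa(int X, int Y, int *px, int *py, int *s)
    {
        *px = X / 2;
        *py = Y / 2;
        *s  = (Y & 1) * 2 + (X & 1);    /* sample index 0..3 */
    }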

The algorithm Anuj implemented for Mesa's Intel driver uses the
SCALED_RESOLVE_FASTEST_EXT version of the "NVIDIA Implementation Details"
algorithm.  It wasn't easy since Intel hardware doesn't store a 1280x720 4x
MSAA texture using a 2560x1440 physical layout in memory*.  So we couldn't
just point the sampler at the buffer and tell it to pretend that it was
single-sampled.  Instead we had to have the shader do some math to figure
out which 4 texels to fetch from the multisampled texture, and then do the
bilinear filtering manually.  But it was still worth it, since the Intel
hardware is bandwidth constrained, so the cost of doing the bilinear
filtering in the shader was negligible.

(*Technically, on Gen6, a 1280x720 4x MSAA buffer actually *is* laid out
using a 2560x1440 physical layout in memory, but the interleaving pattern
is strange, so we still have to do the bilinear filtering in the shader
anyhow).
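
To make that concrete, here is the same math as a CPU-side sketch in
plain C.  This is not Anuj's actual shader, just an illustration of the
manual bilinear step, over a hypothetical array layout with one float
per sample:

    #include <math.h>   /* floorf */

    /* Fetch one sample from a W x H 4x MSAA image stored as
     * src[y][x][sample] (hypothetical layout), addressed through the
     * virtual 2W x 2H single-sampled grid (2x2 quad per pixel). */
    static float fetch_virtual(const float *src, int W, int H,
                               int X, int Y)
    {
        X = X < 0 ? 0 : (X > 2 * W - 1 ? 2 * W - 1 : X);
        Y = Y < 0 ? 0 : (Y > 2 * H - 1 ? 2 * H - 1 : Y);
        return src[((Y / 2) * W + (X / 2)) * 4 + (Y & 1) * 2 + (X & 1)];
    }

    /* Scaled resolve of one output pixel: sample the virtual grid at
     * (u, v), in virtual-texel units, with manual bilinear filtering.
     * This is the "4 texels plus weighting" the shader does by hand. */
    static float scaled_resolve(const float *src, int W, int H,
                                float u, float v)
    {
        float x = u - 0.5f, y = v - 0.5f;   /* texel centers at +0.5 */
        int   X0 = (int)floorf(x), Y0 = (int)floorf(y);
        float fx = x - (float)X0, fy = y - (float)Y0;

        float t00 = fetch_virtual(src, W, H, X0,     Y0);
        float t10 = fetch_virtual(src, W, H, X0 + 1, Y0);
        float t01 = fetch_virtual(src, W, H, X0,     Y0 + 1);
        float t11 = fetch_virtual(src, W, H, X0 + 1, Y0 + 1);
        return (t00 * (1.0f - fx) + t10 * fx) * (1.0f - fy)
             + (t01 * (1.0f - fx) + t11 * fx) * fy;
    }

Each destination pixel (dx, dy) then just evaluates this at
u = (dx + 0.5) * 2560 / 1920, v = (dy + 0.5) * 1440 / 1080.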


>
> In terms of memory bandwidth per pixel, resolve+blit needs N reads and 1
> write for the resolve step and 1 read for the blit step. If we assume 100%
> hit rate for the texture cache, scaled blit needs only N reads and that's
> it. So in theory it may work. OTOH, compressed colorbuffers and fast clear
> that are used by r600g should reduce actual bandwidth requirements for the
> resolve step a lot. And we cannot take advantage of the compression when
> we're sampling from colorbuffers. I probably just answered this myself:
> resolve+blit is easier and better at least on Radeon hardware. :)


It sounds like that's the key question in terms of performance: can you
take advantage of compression when sampling from colorbuffers?  On Intel we
have to do 4 texel fetches per output pixel regardless of whether the color
buffer is compressed, but since we're bandwidth limited and the texture
lookup is aggressively cached, we still get all the benefit of color buffer
compression anyhow.