[Mesa-dev] [PATCH] st/mesa: expose EXT_framebuffer_multisample_blit_scaled if MSAA is supported

Tue Jul 16 17:52:32 PDT 2013

On 17.07.2013 02:05, Marek Olšák wrote:
> No, it's not faster, but it's not slower either.
>
> Now that I think about it, I can't come up with a good shader-based
> algorithm for the resolve operation.
>
> I don't think Christoph's approach that an MSAA texture can be viewed
> as a larger single-sample texture is correct, because the physical
> locations of the samples in memory usually do not correspond to the
> sample locations the 3D engine used for rasterization. So fetching a
> texel from the larger texture at (x,y) physical coordinates won't
> always return the closest rasterized sample at those coordinates. Also
> the bilinear filter would be horrible in this case, because it only
> takes 4 samples per pixel.
>
> Now let's consider implementing the scaled resolve operation in the
> shader by texelFetch-ing all samples and using a bilinear filter. For
> Nx MSAA, there would be N*4 texel fetches per pixel; in comparison,
> separate resolve+blit needs only N+4 texel fetches per pixel. In
> addition to that, the resolve is a special fixed-function blending
> operation and the fragment shader is not even executed. See? Separate
> resolve+blit beats everything.
>
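
To put numbers on the fetch counts quoted above, here is a minimal
sketch in Python. The function names and the fixed 4-tap bilinear
footprint are assumptions made for illustration; nothing here is Mesa
code, and it only counts fetches, ignoring the fixed-function resolve
path mentioned above.

```python
def fetches_single_pass(n_samples: int, taps: int = 4) -> int:
    """Shader-based scaled resolve: each bilinear tap reads all N samples."""
    return n_samples * taps


def fetches_resolve_then_blit(n_samples: int, taps: int = 4) -> int:
    """Separate passes: N fetches to resolve, then one bilinear (4-tap) blit."""
    return n_samples + taps


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n}x MSAA: single pass = {fetches_single_pass(n)}, "
              f"resolve+blit = {fetches_resolve_then_blit(n)}")
```

Under these assumptions the gap grows with the sample count: 8 vs 6
fetches at 2x, 16 vs 8 at 4x, and 32 vs 12 at 8x.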

AFAICS the point of the spec is that it allows cheaper approximations 
that don't use all texels and it allows the implementation to avoid 
writes to a temp texture, both to save memory bandwidth. I am not sure 
if it is reasonably possible to do this (without causing aliasing). How 
does scaled blit on Intel hardware perform compared to resolve+blit? 
Maybe it helps on bandwidth-constrained GPU configurations.

In terms of memory bandwidth per pixel, resolve+blit needs N reads and 1 
write for the resolve step and 1 read for the blit step. If we assume 
100% hit rate for the texture cache, scaled blit needs only N reads and 
that's it. So in theory it may work. OTOH, compressed colorbuffers and 
fast clear that are used by r600g should reduce actual bandwidth 
requirements for the resolve step a lot. And we cannot take advantage of 
the compression when we're sampling from colorbuffers. I probably just 
answered this myself: resolve+blit is easier and better at least on 
Radeon hardware. :)
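
For what it's worth, the per-pixel accounting above can be sketched in
a few lines of Python. This is only the idealized model from this mail:
it assumes a 100% texture cache hit rate for the scaled blit, ignores
colorbuffer compression and fast clear, and leaves out the final
destination write that both paths need. The function names are made up
for illustration.

```python
def traffic_resolve_then_blit(n_samples: int) -> dict:
    """Resolve reads N samples and writes 1 temp texel; the blit reads it back."""
    return {"reads": n_samples + 1, "writes": 1}


def traffic_scaled_blit(n_samples: int) -> dict:
    """Scaled blit reads the N samples directly; no temp texture is written."""
    return {"reads": n_samples, "writes": 0}


def total(traffic: dict) -> int:
    """Total memory transfers per pixel (reads plus writes)."""
    return traffic["reads"] + traffic["writes"]


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n}x MSAA: resolve+blit = {total(traffic_resolve_then_blit(n))}, "
              f"scaled blit = {total(traffic_scaled_blit(n))}")
```

In this model the scaled blit saves a constant two transfers per pixel
(one write and one read of the temp texture), so the relative saving
shrinks as N grows: 6 vs 4 at 4x, but only 10 vs 8 at 8x.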

Grigori