On 16 July 2013 17:52, Grigori Goronzy <greg@chown.ath.cx> wrote:
> On 17.07.2013 02:05, Marek Olšák wrote:
>> No, it's not faster, but it's not slower either.
>>
>> Now that I think about it, I can't come up with a good shader-based
>> algorithm for the resolve operation.
>>
>> I don't think Christoph's approach (viewing an MSAA texture as a
>> larger single-sample texture) is correct, because the physical
>> locations of the samples in memory usually do not correspond to the
>> sample locations the 3D engine used for rasterization. So fetching a
>> texel from the larger texture at physical coordinates (x, y) won't
>> always return the closest rasterized sample at those coordinates.
>> Also, the bilinear filter would be horrible in this case, because it
>> only takes 4 samples per pixel.
>>
>> Now let's consider implementing the scaled resolve operation in the
>> shader by texelFetch-ing all samples and using a bilinear filter.
>> For Nx MSAA, there would be N*4 texel fetches per pixel; in
>> comparison, a separate resolve+blit needs only N+4 texel fetches per
>> pixel. In addition, the resolve is a special fixed-function blending
>> operation for which the fragment shader is not even executed. See?
>> Separate resolve+blit beats everything.
>
> AFAICS the point of the spec is that it allows cheaper approximations
> that don't use all texels, and it allows the implementation to avoid
> writes to a temporary texture, both to save memory bandwidth. I am
> not sure whether that is reasonably possible without causing
> aliasing. How does a scaled blit on Intel hardware perform compared
> to resolve+blit? Maybe it helps on bandwidth-constrained GPU
> configurations.

It definitely helps on bandwidth-constrained GPU configurations such
as Intel's, though I don't remember exactly how much. There's also a
quality improvement in certain use cases. For example, imagine a game
that renders at 1280x720 with 4x MSAA and then blits the result to a
1920x1080 display (I believe this is a typical use case for this
extension). Because of the multisampling, the 4x MSAA buffer contains
approximately 2560x1440 worth of resolution at triangle edges (though
only 1280x720 worth of texture detail, since the fragment shader runs
just once per pixel). If you first resolve that down to 1280x720 and
then blit up to 1920x1080, you lose the extra edge resolution. If
instead you do a direct blit using
EXT_framebuffer_multisample_blit_scaled and the implementation is
smart, you get essentially full 1920x1080 resolution at triangle
edges, and still a reasonable amount of antialiasing, without paying
the cost of a full 1920x1080 MSAA render.

Based on the "NVIDIA Implementation Details" section at the end of the
EXT_framebuffer_multisample_blit_scaled spec, and on Anuj's
experiments while he was doing Mesa's implementation for Intel, I
believe Christoph is exactly right, at least as far as the NVIDIA
proprietary drivers are concerned. NVIDIA reinterprets the 1280x720 4x
MSAA renderbuffer as a 2560x1440 single-sampled renderbuffer and then
does a standard blit, with either bilinear filtering (for
SCALED_RESOLVE_FASTEST_EXT) or anisotropic filtering (for
SCALED_RESOLVE_NICEST_EXT). It's a slight cheat, since the true sample
locations don't form a regular grid, but in practice the error it
introduces is small enough to be negligible.

The algorithm Anuj implemented for Mesa's Intel driver is the
SCALED_RESOLVE_FASTEST_EXT variant of the "NVIDIA Implementation
Details" algorithm. It wasn't easy, since Intel hardware doesn't store
a 1280x720 4x MSAA texture using a 2560x1440 physical layout in
memory*, so we couldn't just point the sampler at the buffer and tell
it to pretend it was single-sampled. Instead, the shader has to do
some math to figure out which 4 texels to fetch from the multisampled
texture, and then do the bilinear filtering manually. It was still
worth it: Intel hardware is bandwidth-constrained, so the cost of
doing the bilinear filtering in the shader was negligible.

(*Technically, on Gen6 a 1280x720 4x MSAA buffer actually *is* laid
out using a 2560x1440 physical layout in memory, but the interleaving
pattern is strange, so we still have to do the bilinear filtering in
the shader anyhow.)
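
To make that concrete, the per-pixel math looks roughly like the
fragment shader below. This is only a sketch: it assumes 4x MSAA and a
hypothetical 2x2 sample-to-texel interleaving, and it omits edge
clamping; the actual mapping depends on the hardware's sample layout.

    #version 150

    // Sketch: scaled resolve of a 4x MSAA texture by treating it as
    // a single-sampled texture of twice the width and height.
    uniform sampler2DMS src;   // 4x MSAA source
    uniform vec2 dst_size;     // destination size in pixels
    out vec4 color;

    // Assumed interleaving: sample s of pixel p sits at virtual
    // texel 2*p + (s & 1, s >> 1).  Real layouts differ, which is
    // exactly the "slight cheat" described above.
    vec4 fetch_virtual(ivec2 v)
    {
        ivec2 p = v >> 1;                    // owning real pixel
        int   s = (v.y & 1) * 2 + (v.x & 1); // which of its samples
        return texelFetch(src, p, s);
    }

    void main()
    {
        vec2 vsize = vec2(textureSize(src) * 2);  // virtual 2x grid
        vec2 pos   = gl_FragCoord.xy / dst_size * vsize - 0.5;
        ivec2 p0   = ivec2(floor(pos));
        vec2 f     = pos - vec2(p0);

        // Manual bilinear filter over the 4 nearest virtual texels
        // (4 texelFetches per output pixel).
        vec4 t00 = fetch_virtual(p0);
        vec4 t10 = fetch_virtual(p0 + ivec2(1, 0));
        vec4 t01 = fetch_virtual(p0 + ivec2(0, 1));
        vec4 t11 = fetch_virtual(p0 + ivec2(1, 1));
        color = mix(mix(t00, t10, f.x), mix(t01, t11, f.x), f.y);
    }

The NICEST path is the same idea with a wider (anisotropic) filter
footprint in place of the plain bilinear one.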

> In terms of memory bandwidth per pixel, resolve+blit needs N reads
> and 1 write for the resolve step and 1 read for the blit step. If we
> assume a 100% hit rate for the texture cache, the scaled blit needs
> only N reads, and that's it. So in theory it may work. OTOH, the
> compressed colorbuffers and fast clears that r600g uses should
> reduce the actual bandwidth requirements of the resolve step a lot,
> and we cannot take advantage of the compression when we're sampling
> from colorbuffers. I probably just answered this myself: resolve+blit
> is easier and better, at least on Radeon hardware. :)
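
(Concretely, for 4x MSAA that's 4 + 1 + 1 = 6 memory accesses per
pixel for resolve+blit versus 4 reads for the direct scaled blit,
under the 100% cache-hit assumption.)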

It sounds like that's the key question for performance: can you take
advantage of compression when sampling from colorbuffers? On Intel we
have to do 4 texel fetches per output pixel whether or not the color
buffer is compressed, but since we're bandwidth-limited and texture
lookups are aggressively cached, we still get the full benefit of
color buffer compression anyhow.