[Mesa-dev] swizzling in llvmpipe [was: other stuff]

Wed Sep 1 10:16:37 PDT 2010

> Well, don't forget that you have to populate the tile from somewhere -
> so you'll hit all of the same cachelines that the non-swizzled version
> would have.
>
> We still get locality from binning, meaning that all accesses to a group
> of cachelines come in a single burst, after which they are done with and
> can migrate out of L1 cache according to the processor's own mechanisms.
> With swizzling, we need to write them out ourselves, and try and do so
> without blowing the caches (which is possible with non-temporal writes,
> but it's still an extra operation).

Couldn't you just keep resources permanently in swizzled format?
(Un)swizzling should be necessary only for transfers and for
displaying to screen, which should both be relatively rare.
Most hardware GPUs do this, although some can scan out directly from
swizzled buffers.
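
To make that concrete, here is a minimal sketch of what "swizzled" could
mean and what the unswizzle step for a transfer or for scanout would look
like. The 4x4 tile size, the row-major ordering of tiles, and the names
tiled_index() and unswizzle() are assumptions for illustration only, not
llvmpipe's actual layout or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tile size for illustration; not taken from llvmpipe. */
#define TILE_SIZE 4

/* Pixel index of (x, y) in a tiled buffer: tiles are stored row-major,
 * and the pixels inside each tile are stored row-major as well. */
static size_t
tiled_index(unsigned x, unsigned y, unsigned width)
{
   unsigned tiles_per_row = width / TILE_SIZE;
   size_t tile = (size_t)(y / TILE_SIZE) * tiles_per_row + x / TILE_SIZE;
   unsigned within = (y % TILE_SIZE) * TILE_SIZE + x % TILE_SIZE;

   return tile * (TILE_SIZE * TILE_SIZE) + within;
}

/* Copy a tiled 32-bit surface into a linear one.  This is the only step
 * that would be needed for transfers and for display if resources stayed
 * tiled the rest of the time. */
static void
unswizzle(uint32_t *dst, const uint32_t *src,
          unsigned width, unsigned height)
{
   /* Assumption: dimensions are whole multiples of the tile size. */
   assert(width % TILE_SIZE == 0 && height % TILE_SIZE == 0);

   for (unsigned y = 0; y < height; y++)
      for (unsigned x = 0; x < width; x++)
         dst[(size_t)y * width + x] = src[tiled_index(x, y, width)];
}
```

A real implementation would copy whole tile rows at a time rather than
computing an index per pixel, but the address mapping is the same.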

You could even use the hardware OpenGL implementation to perform
unswizzling to the display.

> A tile covers the same number of cachelines either way, and in normal
> rendering the entire tile gets written, even if it's just with the clear
> color.

That's right for render targets.
For textures, however, swizzling might reduce cache misses on some
access patterns.
That's also no longer the case if binning is removed and the whole
pipeline from vertex buffers to render targets is done in a single
call (which might be beneficial if there are a lot of pixel-sized
tessellated triangles, though I'm not sure).
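
As a rough illustration of the texture case, the sketch below counts how
many cache lines a vertical run of texels touches in a linear layout
versus a tiled one. The 4x4 tiles, 32-bit texels, 64-byte cache lines,
cache-line-aligned base address, and all function names are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TILE        4    /* hypothetical 4x4 tile */
#define TEXEL_BYTES 4    /* assumed 32-bit texels */
#define CACHELINE   64   /* assumed 64-byte cache lines */

/* Byte offset of texel (x, y) in a plain row-major layout. */
static size_t
linear_offset(unsigned x, unsigned y, unsigned width)
{
   return ((size_t)y * width + x) * TEXEL_BYTES;
}

/* Byte offset of texel (x, y) when texels are grouped into row-major
 * TILE x TILE blocks, which are themselves stored row-major. */
static size_t
tiled_offset(unsigned x, unsigned y, unsigned width)
{
   size_t tile = (size_t)(y / TILE) * (width / TILE) + x / TILE;
   size_t within = (size_t)(y % TILE) * TILE + x % TILE;

   return (tile * TILE * TILE + within) * TEXEL_BYTES;
}

/* Count the cache lines touched by reading `rows` texels down column x.
 * Offsets grow monotonically with y in both layouts, so counting changes
 * of cache line is the same as counting distinct cache lines. */
static unsigned
lines_touched_column(unsigned x, unsigned rows, unsigned width, int tiled)
{
   unsigned count = 0;
   size_t prev = (size_t)-1;

   assert(width % TILE == 0);
   for (unsigned y = 0; y < rows; y++) {
      size_t off = tiled ? tiled_offset(x, y, width)
                         : linear_offset(x, y, width);
      size_t line = off / CACHELINE;

      if (line != prev) {
         count++;
         prev = line;
      }
   }
   return count;
}
```

With a 64-texel-wide texture, 16 texels down a column touch 16 cache
lines in the linear layout but only 4 in the tiled one. A horizontal run
goes the other way (1 line linear, 4 tiled), so the benefit really does
depend on the access pattern.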

However, if it does not actually provide any clear benefits in
practice, then removing swizzling probably makes sense, since it would
significantly simplify the codebase.

> It's also worth noting that things like improving texture sampling are
> far higher on the list.  We do quite well in non-textured mesa demos
> relative to i965 (50% from a single core seems typical), but drop behind
> drastically in things like tunnel, etc.  Swizzling or not won't bridge
> that gap.

Unfortunately, special-purpose hardware helps a lot here, as the
Larrabee plans showed by keeping dedicated texture sampling hardware.

Are the issues due to cache misses, or simply to the number of
instructions needed for each texture sample?

Artificially forcing nearest filtering might help a lot, but at an
obvious drastic quality price.
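
For a sense of the difference in per-sample work, here is a minimal
scalar sketch on a single-channel float texture with clamp-to-edge
addressing. The texture struct and function names are hypothetical, and
this is not how llvmpipe generates its sampling code; it only shows that
nearest needs one fetch while bilinear needs four fetches plus three
interpolations.

```c
#include <assert.h>

/* Hypothetical single-channel texture, for illustration only. */
typedef struct {
   const float *texels;
   int width, height;
} texture;

static int
clampi(int v, int lo, int hi)
{
   return v < lo ? lo : (v > hi ? hi : v);
}

/* Floor for floats without pulling in libm. */
static int
ifloor(float f)
{
   int i = (int)f;
   return (f < (float)i) ? i - 1 : i;
}

/* Fetch one texel with clamp-to-edge addressing. */
static float
fetch(const texture *t, int x, int y)
{
   x = clampi(x, 0, t->width - 1);
   y = clampi(y, 0, t->height - 1);
   return t->texels[y * t->width + x];
}

/* Nearest filtering: a single fetch per sample. */
static float
sample_nearest(const texture *t, float u, float v)
{
   return fetch(t, ifloor(u * t->width), ifloor(v * t->height));
}

/* Bilinear filtering: four fetches and three linear interpolations. */
static float
sample_bilinear(const texture *t, float u, float v)
{
   float x = u * t->width - 0.5f;
   float y = v * t->height - 0.5f;
   int x0 = ifloor(x), y0 = ifloor(y);
   float fx = x - (float)x0, fy = y - (float)y0;

   float t00 = fetch(t, x0, y0),     t10 = fetch(t, x0 + 1, y0);
   float t01 = fetch(t, x0, y0 + 1), t11 = fetch(t, x0 + 1, y0 + 1);

   float top = t00 + fx * (t10 - t00);
   float bottom = t01 + fx * (t11 - t01);
   return top + fy * (bottom - top);
}
```

With four color channels the interpolation cost is multiplied
accordingly, while the nearest path stays at one fetch.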

Analyzing the shader and doing some special approximate projection
pass on the texture itself if the shader only projects a texture could
also perhaps be useful, but looks like a lot of work.