[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

Kenneth Graunke kenneth at whitecape.org
Thu Feb 12 11:34:46 PST 2015


On Thursday, February 12, 2015 04:13:06 PM Francisco Jerez wrote:
> Francisco Jerez <currojerez at riseup.net> writes:
> > Kenneth Graunke <kenneth at whitecape.org> writes:
> >> On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
> >>> This is the first part of a series meant to improve our usage of the L3 cache.
> >>> Currently it's far from ideal since the following objects aren't taking any
> >>> advantage of it:
> >>>  - Pull constants (i.e. UBOs and demoted uniforms)
> >>>  - Buffer textures
> >>>  - Shader scratch space (i.e. register spills and fills)
> >>>  - Atomic counters
> >>>  - (Soon) Images
> >>> 
> >>> This first series addresses the first two issues.  Fixing the last three is
> >>> going to be a bit more difficult because we need to modify the partitioning of
> >>> the L3 cache in order to increase the number of ways assigned to the DC, which
> >>> happens to be zero on boot until Gen8.  That's likely to require kernel
> >>> changes because we don't have any entirely satisfactory API to change that
> >>> from userspace right now.
> >>> 
> >>> The first patch in the series sets the MOCS L3 cacheability bit in the surface
> >>> state structure for buffers so the mentioned memory objects (except the shader
> >>> scratch space that gets its MOCS from elsewhere) have a chance of getting
> >>> cached in L3.
> >>> 
> >>> The fourth patch in the series switches to using the constant cache (which,
> >>> unlike the data cache that was used years ago before we started using the
> >>> sampler, is cached on L3 with the default partitioning on all gens) for
> >>> uniform pull constants loads.  The overall performance numbers I've collected
> >>> are included in the commit message of the same patch for future reference.
> >>> Most of it points to the constant cache being faster than the sampler in a
> >>> number of cases (assuming the L3 caching settings are correct).  It's also
> >>> likely to alleviate some cache thrashing caused by competition with
> >>> textures for the L1/L2 sampler caches, and it allows fetching up to eight
> >>> consecutive owords (128B) with just one message.
> >>> 
> >>> The sixth patch enables 4 oword loads because they're basically for free and
> >>> they avoid some of the shortcomings of the 1 and 2 oword messages (see the
> >>> commit message for more details).  I'll have a look into enabling 8 oword
> >>> loads, but it's going to require an analysis pass to avoid wasting bandwidth
> >>> and increasing the register pressure unnecessarily when the shader doesn't
> >>> actually need as many constants.
> >>> 
> >>> We could do something similar for non-uniform offset pull constant loads and
> >>> for both kinds of pull constant loads on the vec4 back-end, but I don't have
> >>> enough performance data to support that yet.
> >>
> >> Hi Curro!
> >>
> > Hi Ken,
> >
> >> Technically, I believe we /are/ taking advantage of the L3 today - the sampler
> >> should be part of the "All Clients" and "Read Only Client Pool" portions of the
> >> L3.  I believe the data port's "Constant Cache" is part of the same L3 region.
> >> However, the sampler has an additional L1/L2 cache.
> >>
> > If you're referring to pull constants, nope we aren't, because it's also
> > necessary to have set the MOCS bits to cacheable in L3, and that wasn't
> > the case for any of the memory objects I mentioned except shader scratch
> > space (the latter goes through the data cache so it's still not cached
> > until Gen8).
> >
> >> When you say you "don't have enough performance data" to support doing this in
> >> the vec4 backend, or for non-uniform offset pull loads, do you mean that you
> >> tried it and it wasn't useful, or you just haven't tried it yet?
> >>
> > I tried it on the VS and didn't see any significant change in the
> > benchmarks I had at hand.  For non-uniform pull constant loads it's a
> > bit trickier because performance may depend on how non-uniform the
> > offsets are.  I don't have any convincing benchmark data yet, but I'll
> > look into it.
> >
> >> In my experience, the VS matters a *lot* - skinning shaders tend to use large
> >> arrays of matrices, which get demoted to pull constants.  For example, I
> >> observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
> >> in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
> >> (commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache
> >> to the sampler.
> >>
> >> I'd love to see data for applying your new approach in the VS backend.
> >>
> > Sure, I'll try running those to see if it makes any difference.  If it
> > does, it can be fixed later as a follow-up in any case.
> >
> >> --Ken
> 
> Ken, I don't see any reason to put this series on hold until the changes
> for the other cases are ready instead of going through it incrementally.
> The VS changes themselves are trivial and completely orthogonal to this
> series, but the amount of testing and benchmarking needed to make sure
> they don't incur a performance penalty on any of the other platforms is
> overwhelming.  The expected benefit (based on my previous observations)
> will be considerably lower than what we get from the FS changes, if any,
> so it's not a high priority for me at this point.
> 
> I'll get to it, I promise ;), but can we land this before it starts
> bit-rotting?

If this is the faster method, then I really want to move all the
backends to use it at the same time.

When we switched from the data cache to the sampler, we did so
incrementally, and missed a few (the ARB backends), leaving 45%
performance improvements on the floor for about a year.  I'd like to
avoid that by switching to the best method in one go.  We may also learn
something in the process.

Virtually all of the performance gains I've seen have been from the VS:
GLBenchmark PRO, EgyptHD, and Trine all benefited hugely from the VS
changes.

It's really common to access uniform arrays via a variable index in the
vertex shader - skinning/character animation shaders frequently do that.
It's less common in the fragment shader.

So, your observation of "I tried it on the VS for the benchmarks I had
on hand" (which ones?) "and it didn't seem to matter" seems to conflict
with my observations that it's mattered very much in the past.

Maybe that's the actual result - in the FS, not thrashing the sampler
cache makes texture access more efficient, but in the VS, we can use the
sampler since vertex shaders almost never access textures.  Unclear.

But I really want hard data on how the constant cache performs in more
situations before making a decision.  I suppose I didn't make that
clear; I apologize for that.

--Ken

