[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

Tue Jan 27 21:09:13 PST 2015

On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
> This is the first part of a series meant to improve our usage of the L3 cache.
> Currently it's far from ideal since the following objects aren't taking any
> advantage of it:
>  - Pull constants (i.e. UBOs and demoted uniforms)
>  - Buffer textures
>  - Shader scratch space (i.e. register spills and fills)
>  - Atomic counters
>  - (Soon) Images
> 
> This first series addresses the first two issues.  Fixing the last three is
> going to be a bit more difficult because we need to modify the partitioning of
> the L3 cache in order to increase the number of ways assigned to the DC, which
> happens to be zero on boot until Gen8.  That's likely to require kernel
> changes because we don't have any extremely satisfactory API to change that
> from userspace right now.
> 
> The first patch in the series sets the MOCS L3 cacheability bit in the surface
> state structure for buffers so the mentioned memory objects (except the shader
> scratch space that gets its MOCS from elsewhere) have a chance of getting
> cached in L3.
> 
> The fourth patch in the series switches to using the constant cache (which,
> unlike the data cache that was used years ago before we started using the
> sampler, is cached on L3 with the default partitioning on all gens) for
> uniform pull constant loads.  The overall performance numbers I've collected
> are included in the commit message of the same patch for future reference.
> Most of them point at the constant cache being faster than the sampler in a
> number of cases (assuming the L3 caching settings are correct).  It's also
> likely to alleviate some cache thrashing caused by the competition with
> textures for the L1/L2 sampler caches, and it allows fetching up to eight
> consecutive owords (128B) with just one message.
> 
> The sixth patch enables 4 oword loads because they're basically for free and
> they avoid some of the shortcomings of the 1 and 2 oword messages (see the
> commit message for more details).  I'll have a look into enabling 8 oword
> loads but it's going to require an analysis pass to avoid wasting bandwidth
> and increasing the register pressure unnecessarily when the shader doesn't
> actually need as many constants.
> 
> We could do something similar for non-uniform offset pull constant loads and
> for both kinds of pull constant loads on the vec4 back-end, but I don't have
> enough performance data to support that yet.

Hi Curro!

Technically, I believe we /are/ taking advantage of the L3 today - the sampler
should be part of the "All Clients" and "Read Only Client Pool" portions of the
L3.  I believe the data port's "Constant Cache" is part of the same L3 region.
However, the sampler has an additional L1/L2 cache.

When you say you "don't have enough performance data" to support doing this in
the vec4 backend, or for non-uniform offset pull loads, do you mean that you
tried it and it wasn't useful, or you just haven't tried it yet?

In my experience, the VS matters a *lot* - skinning shaders tend to use large
arrays of matrices, which get demoted to pull constants.  For example, I
observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
(commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache
to the sampler.

I'd love to see data for applying your new approach in the VS backend.

--Ken
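
For reference, the size arithmetic behind the quoted "eight consecutive owords
(128B)" figure, and the bandwidth concern raised about 8 oword loads, can be
sketched roughly as below.  This is only an illustration: the helper names are
hypothetical and do not come from the i965 driver, and it assumes each message
fetches a block aligned to its own size, which may not match the actual
hardware rules.

```c
#include <stdint.h>

/* An oword is 16 bytes, so 8 owords = 128 bytes, as in the quoted text. */
#define OWORD_BYTES 16u

/* Bytes returned by one block read of `owords` owords. */
static uint32_t oword_block_bytes(uint32_t owords)
{
    return owords * OWORD_BYTES;
}

/* Hypothetical helper: number of block reads needed to cover the byte
 * range [first_byte, last_byte] of a constant buffer.
 * ASSUMPTION: every read fetches a block aligned to the block size. */
static uint32_t messages_needed(uint32_t first_byte, uint32_t last_byte,
                                uint32_t owords)
{
    const uint32_t block = oword_block_bytes(owords);
    return last_byte / block - first_byte / block + 1;
}

/* Bytes fetched that the shader never reads, for the same byte range.
 * Larger blocks need fewer messages but can fetch more unused data. */
static uint32_t wasted_bytes(uint32_t first_byte, uint32_t last_byte,
                             uint32_t owords)
{
    const uint32_t used = last_byte - first_byte + 1;
    const uint32_t fetched =
        messages_needed(first_byte, last_byte, owords) *
        oword_block_bytes(owords);
    return fetched - used;
}
```

Under these assumptions, a shader that reads a single vec4 (16 bytes) wastes
nothing with a 1 oword load, 48 bytes with a 4 oword load, and 112 bytes with
an 8 oword load.  That overhead is what an analysis pass would be weighing
against the reduced message count.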