[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

Francisco Jerez currojerez at riseup.net
Sat Jan 17 15:04:02 PST 2015


This is the first part of a series meant to improve our usage of the L3 cache.
Currently it's far from ideal since the following objects aren't taking any
advantage of it:
 - Pull constants (i.e. UBOs and demoted uniforms)
 - Buffer textures
 - Shader scratch space (i.e. register spills and fills)
 - Atomic counters
 - (Soon) Images

This first series addresses the first two issues.  Fixing the last three is
going to be a bit more difficult because we need to modify the partitioning of
the L3 cache in order to increase the number of ways assigned to the DC, which
happens to be zero on boot until Gen8.  That's likely to require kernel
changes because we don't have any extremely satisfactory API to change that
from userspace right now.

The first patch in the series sets the MOCS L3 cacheability bit in the surface
state structure for buffers so the mentioned memory objects (except the shader
scratch space that gets its MOCS from elsewhere) have a chance of getting
cached in L3.

The fourth patch in the series switches to using the constant cache (which,
unlike the data cache that was used years ago before we started using the
sampler, is cached on L3 with the default partitioning on all gens) for
uniform pull constant loads.  The overall performance numbers I've collected
are included in the commit message of the same patch for future reference.
Most of them point at the constant cache being faster than the sampler in a
number of cases (assuming the L3 caching settings are correct).  It's also
likely to alleviate some cache thrashing caused by the competition with
textures for the L1/L2 sampler caches, and it allows fetching up to eight
consecutive owords (128B) with just one message.
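
To make the oword arithmetic above concrete, here is a minimal sketch of how
many block-read messages a contiguous range of constants would take at each
block size.  The helper names are made up for illustration and are not taken
from the driver; the only facts assumed are that an oword is 16 bytes and that
block sizes of 1, 2, 4 and 8 owords exist, as mentioned in this series.

```c
#include <assert.h>

/* An oword is 16 bytes, so eight owords are 128 bytes. */
#define OWORD_SIZE 16u

/* Size in bytes of one block read of num_owords owords (hypothetical helper). */
static unsigned
oword_block_bytes(unsigned num_owords)
{
   /* Only the block sizes discussed in this series are accepted. */
   assert(num_owords == 1 || num_owords == 2 ||
          num_owords == 4 || num_owords == 8);
   return num_owords * OWORD_SIZE;
}

/* Number of messages needed to fetch `bytes` contiguous bytes of constants
 * with blocks of num_owords owords each, rounding up to whole blocks
 * (hypothetical helper).
 */
static unsigned
messages_needed(unsigned bytes, unsigned num_owords)
{
   const unsigned block = oword_block_bytes(num_owords);
   return (bytes + block - 1) / block;
}
```

For example, messages_needed(128, 8) gives one message, while the same 128
bytes take two messages with 4 oword blocks and eight with 1 oword blocks.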

The sixth patch enables 4 oword loads because they're basically free and
they avoid some of the shortcomings of the 1 and 2 oword messages (see the
commit message for more details).  I'll look into enabling 8 oword loads,
but that's going to require an analysis pass to avoid wasting bandwidth and
increasing the register pressure unnecessarily when the shader doesn't
actually need as many constants.
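
As a rough illustration of what fetching one cacheline at a time means for
addressing, the sketch below maps the byte offset of a pull constant to the
64B-aligned block that contains it and the dword position inside that block.
It assumes a 4 oword block is exactly one 64B cacheline (4 * 16B); the struct
and function names are hypothetical and this is not the code from the patch.

```c
#include <assert.h>

#define OWORD_SIZE        16u
#define PULL_BLOCK_OWORDS 4u
#define PULL_BLOCK_BYTES  (PULL_BLOCK_OWORDS * OWORD_SIZE) /* 64 bytes */

/* Where a constant lives relative to its 4 oword block (hypothetical type). */
struct pull_location {
   unsigned block_offset; /* byte offset of the 64B-aligned block */
   unsigned dword_index;  /* index of the dword within the block, 0..15 */
};

/* Split a dword-aligned byte offset into block base and dword index. */
static struct pull_location
locate_pull_constant(unsigned byte_offset)
{
   struct pull_location loc;

   /* This sketch only handles dword-sized, dword-aligned constants. */
   assert(byte_offset % 4 == 0);

   /* Round down to the start of the containing 64B block. */
   loc.block_offset = byte_offset & ~(PULL_BLOCK_BYTES - 1u);
   loc.dword_index = (byte_offset % PULL_BLOCK_BYTES) / 4;
   return loc;
}
```

Two constants whose offsets yield the same block_offset could be served by a
single 4 oword load, which is where the saving over 1 and 2 oword messages
would come from under these assumptions.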

We could do something similar for non-uniform offset pull constant loads and
for both kinds of pull constant loads on the vec4 back-end, but I don't have
enough performance data to support that yet.

[PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
[PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
[PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the target cache.
[PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
[PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in lower_load_payload().
[PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
[PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.

