[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

Thu Feb 5 01:26:37 PST 2015

Francisco Jerez <currojerez at riseup.net> writes:

> This is the first part of a series meant to improve our usage of the L3 cache.
> Currently it's far from ideal since the following objects aren't taking any
> advantage of it:
>  - Pull constants (i.e. UBOs and demoted uniforms)
>  - Buffer textures
>  - Shader scratch space (i.e. register spills and fills)
>  - Atomic counters
>  - (Soon) Images
>
> This first series addresses the first two issues.  Fixing the last three is
> going to be a bit more difficult because we need to modify the partitioning of
> the L3 cache in order to increase the number of ways assigned to the DC, which
> happens to be zero on boot until Gen8.  That's likely to require kernel
> changes because we don't currently have a satisfactory API to change that
> from userspace.
>
> The first patch in the series sets the MOCS L3 cacheability bit in the surface
> state structure for buffers so the mentioned memory objects (except the shader
> scratch space that gets its MOCS from elsewhere) have a chance of getting
> cached in L3.
>
> The fourth patch in the series switches to using the constant cache (which,
> unlike the data cache that was used years ago before we started using the
> sampler, is cached on L3 with the default partitioning on all gens) for
> uniform pull constant loads.  The overall performance numbers I've collected
> are included in the commit message of the same patch for future reference.
> Most of them point at the constant cache being faster than the sampler in a
> number of cases (assuming the L3 caching settings are correct).  It's also
> likely to alleviate some cache thrashing caused by the competition with
> textures for the L1/L2 sampler caches, and it allows fetching up to eight
> consecutive owords (128B) with just one message.
>
> The sixth patch enables 4 oword loads because they're essentially free and
> they avoid some of the shortcomings of the 1 and 2 oword messages (see the
> commit message for more details).  I'll look into enabling 8 oword
> loads but it's going to require an analysis pass to avoid wasting bandwidth
> and increasing the register pressure unnecessarily when the shader doesn't
> actually need as many constants.
>
> We could do something similar for non-uniform offset pull constant loads and
> for both kinds of pull constant loads on the vec4 back-end, but I don't have
> enough performance data to support that yet.
>
> [PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
> [PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
> [PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the target cache.
> [PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
> [PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in lower_load_payload().
> [PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
> [PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.

Any volunteer to review the rest of this performance-improving series
before the merge window closes?
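
For reviewers who want the block sizes from the cover letter spelled out, here
is a rough sketch of the arithmetic only.  It is not code from the series:
block_load_for() and struct block_load are hypothetical names made up for
illustration.  It assumes an oword is 16 bytes (consistent with "eight
consecutive owords (128B)" above) and that a block load starts at a multiple
of its own size.

```c
/* An oword is 16 bytes, so 4 owords = 64B and 8 owords = 128B. */
enum { OWORD_BYTES = 16 };

/* Hypothetical description of the aligned block covering one constant. */
struct block_load {
    unsigned base;   /* byte offset at which the fetched block starts */
    unsigned index;  /* byte position of the constant inside the block */
    unsigned size;   /* number of bytes fetched by the message */
};

/*
 * Illustration only: compute the aligned block of `owords` owords that
 * contains `byte_offset`.  Assumes blocks are aligned to their own size.
 */
static struct block_load
block_load_for(unsigned byte_offset, unsigned owords)
{
    const unsigned size = owords * OWORD_BYTES;
    struct block_load load;

    load.index = byte_offset % size;
    load.base = byte_offset - load.index;
    load.size = size;
    return load;
}
```

Under those assumptions, a constant at byte offset 100 falls in the 64B block
starting at offset 64 for a 4 oword load, and in the 128B block starting at
offset 0 for an 8 oword load.  The larger block fetches twice as many bytes,
which is the bandwidth and register pressure trade-off mentioned for 8 oword
loads.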