[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

Wed Jan 28 05:14:08 PST 2015

Hi Kenneth,

The constant cache could, and should, be allocated its own region in $L3. The main point of having a separate constant region is to keep pull constant loads from thrashing texture data. In the optimal solution the constant region is allocated only when a shader uses pull constants, but that is not so easy because the $L3 config reg is not part of per constant regs.
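
To make the idea concrete, here is a minimal sketch of the "allocate the constant region only when needed" policy. The struct, the function name and the way counts are all hypothetical and are not taken from the driver or the hardware documentation; this only illustrates the decision, not how the config reg would actually be programmed.

```c
#include <stdbool.h>

/* Hypothetical description of an $L3 partitioning, in cache ways. */
struct l3_config {
   unsigned texture_ways;   /* ways left for texture data */
   unsigned constant_ways;  /* ways reserved for the constant region */
};

/* Hypothetical helper: reserve a constant region only when the shader
 * uses pull constants, otherwise leave every way to texture data. */
static struct l3_config
choose_l3_config(unsigned total_ways, unsigned constant_ways,
                 bool uses_pull_constants)
{
   struct l3_config cfg = { total_ways, 0 };

   if (uses_pull_constants && constant_ways < total_ways) {
      cfg.constant_ways = constant_ways;
      cfg.texture_ways = total_ways - constant_ways;
   }

   return cfg;
}
```

With made-up numbers, choose_l3_config(32, 8, true) would leave 24 ways for textures and 8 for constants, while the same call with false keeps all 32 ways for textures.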

BR,
Harri

-----Original Message-----
From: Kenneth Graunke [mailto:kenneth at whitecape.org] 
Sent: Wednesday, January 28, 2015 7:09 AM
To: mesa-dev at lists.freedesktop.org; Francisco Jerez
Cc: Syrja, Harri
Subject: Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
> This is the first part of a series meant to improve our usage of the L3 cache.
> Currently it's far from ideal since the following objects aren't 
> taking any advantage of it:
>  - Pull constants (i.e. UBOs and demoted uniforms)
>  - Buffer textures
>  - Shader scratch space (i.e. register spills and fills)
>  - Atomic counters
>  - (Soon) Images
> 
> This first series addresses the first two issues.  Fixing the last 
> three is going to be a bit more difficult because we need to modify 
> the partitioning of the L3 cache in order to increase the number of 
> ways assigned to the DC, which happens to be zero on boot until Gen8.  
> That's likely to require kernel changes because we don't have any 
> extremely satisfactory API to change that from userspace right now.
> 
> The first patch in the series sets the MOCS L3 cacheability bit in the 
> surface state structure for buffers so the mentioned memory objects 
> (except the shader scratch space that gets its MOCS from elsewhere) 
> have a chance of getting cached in L3.
> 
> The fourth patch in the series switches to using the constant cache 
> (which, unlike the data cache that was used years ago before we 
> started using the sampler, is cached on L3 with the default 
> partitioning on all gens) for uniform pull constants loads.  The 
> overall performance numbers I've collected are included in the commit 
> message of the same patch for future reference.
> Most of it points at the constant cache being faster than the sampler 
> in a number of cases (assuming the L3 caching settings are correct), 
> it's also likely to alleviate some cache thrashing caused by the 
> competition with textures for the L1/L2 sampler caches, and it allows 
> fetching up to eight consecutive owords (128B) with just one message.
> 
> The sixth patch enables 4 oword loads because they're basically for 
> free and they avoid some of the shortcomings of the 1 and 2 oword 
> messages (see the commit message for more details).  I'll have a look 
> into enabling 8 oword loads but it's going to require an analysis pass 
> to avoid wasting bandwidth and increasing the register pressure 
> unnecessarily when the shader doesn't actually need as many constants.
> 
> We could do something similar for non-uniform offset pull constant 
> loads and for both kinds of pull constant loads on the vec4 back-end, 
> but I don't have enough performance data to support that yet.
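
As a rough illustration of the block sizes quoted above (an oword is 16 bytes, so eight owords are 128B), the following sketch picks the smallest 1, 2, 4 or 8 oword block that covers the constants a shader reads from the start of the block. The helper name and the policy are assumptions made for illustration only; they are not the analysis pass mentioned in the series.

```c
#define OWORD_BYTES 16u  /* one oword is 16 bytes */

/* Hypothetical helper: smallest power-of-two oword count (capped at 8)
 * whose block covers bytes_needed bytes from the block base. */
static unsigned
pick_oword_block(unsigned bytes_needed)
{
   unsigned owords = 1;

   while (owords < 8 && owords * OWORD_BYTES < bytes_needed)
      owords *= 2;

   return owords;
}
```

A shader that only reads the first 64 bytes would get a 4 oword load under this policy, and anything beyond 64 bytes would move to an 8 oword load, which is where the extra bandwidth and register pressure come from.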

Hi Curro!

Technically, I believe we /are/ taking advantage of the L3 today - the sampler should be part of the "All Clients" and "Read Only Client Pool" portions of the L3.  I believe the data port's "Constant Cache" is part of the same L3 region.  However, the sampler has an additional L1/L2 cache.

When you say you "don't have enough performance data" to support doing this in the vec4 backend, or for non-uniform offset pull loads, do you mean that you tried it and it wasn't useful, or you just haven't tried it yet?

In my experience, the VS matters a *lot* - skinning shaders tend to use large arrays of matrices, which get demoted to pull constants.  For example, I observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94, in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine (commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache to the sampler.

I'd love to see data for applying your new approach in the VS backend.

--Ken