[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

Francisco Jerez currojerez at riseup.net
Thu Feb 12 06:13:06 PST 2015


Francisco Jerez <currojerez at riseup.net> writes:

> Kenneth Graunke <kenneth at whitecape.org> writes:
>
>> On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
>>> This is the first part of a series meant to improve our usage of the L3 cache.
>>> Currently it's far from ideal since the following objects aren't taking any
>>> advantage of it:
>>>  - Pull constants (i.e. UBOs and demoted uniforms)
>>>  - Buffer textures
>>>  - Shader scratch space (i.e. register spills and fills)
>>>  - Atomic counters
>>>  - (Soon) Images
>>> 
>>> This first series addresses the first two issues.  Fixing the last three is
>>> going to be a bit more difficult because we need to modify the partitioning of
>>> the L3 cache in order to increase the number of ways assigned to the DC, which
>>> happens to be zero on boot until Gen8.  That's likely to require kernel
>>> changes because we don't have any really satisfactory API to change that
>>> from userspace right now.
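
For the record, an L3 partition boils down to a fixed number of ways split
between client pools, something like the made-up struct below.  This is
purely illustrative (it's not the actual register layout; the real
allocations are programmed through registers currently owned by the kernel),
it's just meant to show where the DC share fits in:

    /* Illustrative only: not the real L3 configuration registers. */
    struct l3_partition {
       unsigned slm;   /* shared local memory ways */
       unsigned urb;   /* URB ways */
       unsigned ro;    /* read-only client pool (sampler, constant cache, ...) */
       unsigned dc;    /* data cache ways (zero on boot until Gen8) */
       unsigned all;   /* "all clients" pool */
    };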
>>> 
>>> The first patch in the series sets the MOCS L3 cacheability bit in the surface
>>> state structure for buffers so that the mentioned memory objects (except the
>>> shader scratch space, which gets its MOCS from elsewhere) have a chance of
>>> getting cached in L3.
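
Roughly speaking, the point of that patch is this: the surface state for a
buffer carries a small MOCS field, and unless it selects an L3-cacheable
policy, reads of that buffer bypass L3 regardless of how the cache is
partitioned.  A minimal sketch, with a made-up dword index, shift and
encoding rather than the actual values from the hardware docs or
brw_defines.h:

    #include <stdint.h>

    /* Assumed encodings, purely illustrative. */
    #define MOCS_L3_CACHEABLE   1u   /* "cacheable in L3" policy */
    #define SURFACE_MOCS_DWORD  5    /* which SURFACE_STATE dword holds MOCS */
    #define SURFACE_MOCS_SHIFT  16   /* bit offset of the MOCS field */

    static void
    mark_buffer_surface_l3_cacheable(uint32_t *surf)
    {
       /* Flag the buffer as L3-cacheable in its memory object control state. */
       surf[SURFACE_MOCS_DWORD] |= MOCS_L3_CACHEABLE << SURFACE_MOCS_SHIFT;
    }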
>>> 
>>> The fourth patch in the series switches to using the constant cache (which,
>>> unlike the data cache that was used years ago before we started using the
>>> sampler, is cached in L3 with the default partitioning on all gens) for
>>> uniform pull constant loads.  The overall performance numbers I've collected
>>> are included in the commit message of the same patch for future reference.
>>> Most of them point at the constant cache being faster than the sampler in a
>>> number of cases (assuming the L3 caching settings are correct); it's also
>>> likely to alleviate some cache thrashing caused by the competition with
>>> textures for the L1/L2 sampler caches, and it allows fetching up to eight
>>> consecutive owords (128B) with just one message.
>>> 
>>> The sixth patch enables 4 oword loads because they're basically for free and
>>> they avoid some of the shortcomings of the 1 and 2 oword messages (see the
>>> commit message for more details).  I'll have a look into enabling 8 oword
>>> loads but it's going to require an analysis pass to avoid wasting bandwidth
>>> and increasing the register pressure unnecessarily when the shader doesn't
>>> actually need as many constants.
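
To give an idea of the sizing problem such a pass would have to solve: an
oword is 16 bytes and the block read message only comes in 1, 2, 4 and 8
oword flavours, so a load has to be rounded up to the next supported size,
and every extra oword costs bandwidth and destination registers.  A made-up
helper for illustration (not part of the series):

    /* Round a pull-constant load up to the smallest supported message size. */
    static unsigned
    owords_for_pull_load(unsigned bytes_needed)
    {
       unsigned owords = (bytes_needed + 15) / 16;  /* whole 16B owords */

       if (owords <= 1) return 1;
       if (owords <= 2) return 2;
       if (owords <= 4) return 4;
       return 8;   /* 8 owords = 128B, the largest single message */
    }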
>>> 
>>> We could do something similar for non-uniform offset pull constant loads and
>>> for both kinds of pull constant loads on the vec4 back-end, but I don't have
>>> enough performance data to support that yet.
>>
>> Hi Curro!
>>
> Hi Ken,
>
>> Technically, I believe we /are/ taking advantage of the L3 today - the sampler
>> should be part of the "All Clients" and "Read Only Client Pool" portions of the
>> L3.  I believe the data port's "Constant Cache" is part of the same L3 region.
>> However, the sampler has an additional L1/L2 cache.
>>
> If you're referring to pull constants, nope, we aren't, because it's also
> necessary to set the MOCS bits to L3-cacheable, and that wasn't the case
> for any of the memory objects I mentioned except shader scratch space
> (the latter goes through the data cache, so it's still not cached until
> Gen8).
>
>> When you say you "don't have enough performance data" to support doing this in
>> the vec4 backend, or for non-uniform offset pull loads, do you mean that you
>> tried it and it wasn't useful, or you just haven't tried it yet?
>>
> I tried it on the VS and didn't see any significant change in the
> benchmarks I had at hand.  For non-uniform pull constant loads it's a
> bit trickier because performance may depend on how non-uniform the
> offsets are; I don't have any convincing benchmark data yet, but I'll
> look into it.
>
>> In my experience, the VS matters a *lot* - skinning shaders tend to use large
>> arrays of matrices, which get demoted to pull constants.  For example, I
>> observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
>> in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
>> (commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache
>> to the sampler.
>>
>> I'd love to see data for applying your new approach in the VS backend.
>>
> Sure, I'll try running those to see if it makes any difference.  If it
> does it can be fixed later on as a follow-up in any case.
>
>> --Ken

Ken, I don't see any reason to put this series on hold until the changes
for the other cases are ready instead of going through it incrementally.
The VS changes themselves are trivial and completely orthogonal to this
series, but the amount of testing and benchmarking needed to make sure
they don't incur a performance penalty on any of the other platforms is
overwhelming, and the expected benefit (according to my previous
observations) will be considerably lower than what we get from the FS
changes, if any.  So it's not a high priority for me at this point.

I'll get to it, I promise ;), but can we land this before it starts
bit-rotting?