[Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

Francisco Jerez currojerez at riseup.net
Wed Dec 14 23:59:32 UTC 2016


Kenneth Graunke <kenneth at whitecape.org> writes:

> On Wednesday, December 14, 2016 2:18:16 PM PST Francisco Jerez wrote:
>> Francisco Jerez <currojerez at riseup.net> writes:
>> 
>> > Kenneth Graunke <kenneth at whitecape.org> writes:
>> >
>> >> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
>> >>> Asking the DC for less than one cacheline (4 owords) of data for
>> >>> uniform pull constants is suboptimal because the DC cannot request
>> >>> less than that from L3, resulting in wasted bandwidth and unnecessary
>> >>> message dispatch overhead, and exacerbating the IVB L3 serialization
>> >>> bug.  The following table summarizes the overall framerate improvement
>> >>> (with statistical significance of 5% and sample size ~10) from the
>> >>> whole series up to this patch for several benchmarks and hardware
>> >>> generations:
>> >>> 
>> >>>                          | SKL           | BDW          | HSW
>> >>> SynMark2 OglShMapPcf     | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
>> >>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
>> >>> GfxBench4 gl_4           |  2.52% ±0.44% | 1.23% ±0.10% |      N/A
>> >>> Unigine Valley           |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
>> >>
>> >> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
>> >> pressure, from the lower message lengths on pull loads.  (At least, it
>> >> did when I had a series to fix that.)  That's probably a large portion
>> >> of the performance improvement here, and why it's so much larger for
>> >> that workload on Skylake specifically.  It might be worth mentioning it
>> >> in your commit message here.
>> >>
>> >
>> > Yeah, that matches my understanding too.  I'll add some shader-db stats
>> > in order to illustrate the effect of this on register pressure, as you
>> > asked me to do in your previous reply.
>> >
>> 
>> FTR, here is a summary of the effect of this series on several shader-db
>> stats.  As you can see the register pressure benefit on SKL+ is
>> substantial:
>> 
>>      Lost->Gained Total instructions          Total cycles                    Total spills          Total fills
>> BWR:  5 ->   5    4571248 -> 4568342 (-0.06%) 123375740 -> 123373296 (-0.00%) 1488 -> 1488 (0.00%)  1957 -> 1957 (0.00%)
>> ELK:  5 ->   5    3989020 -> 3985402 (-0.09%)  98757068 -> 98754058 (-0.00%)  1489 -> 1489 (0.00%)  1958 -> 1958 (0.00%)
>> ILK:  1 ->   4    6383591 -> 6376787 (-0.11%) 143649910 -> 143648914 (-0.00%) 1449 -> 1449 (0.00%)  1921 -> 1921 (0.00%)
>> SNB:  0 ->   0    7528395 -> 7501446 (-0.36%) 103503796 -> 102460370 (-1.01%)  549 -> 549 (0.00%)     52 -> 52 (0.00%)
>> IVB: 13 ->   3    6949221 -> 6943317 (-0.08%)  60592262 -> 60584422 (-0.01%)  1271 -> 1271 (0.00%)  1162 -> 1162 (0.00%)
>> HSW: 11 ->   0    6409753 -> 6403702 (-0.09%)  60609070 -> 60604414 (-0.01%)  1271 -> 1271 (0.00%)  1162 -> 1162 (0.00%)
>> BDW: 12 ->   0    8043467 -> 7976364 (-0.83%)  68427730 -> 68483042 (0.08%)   1340 -> 1340 (0.00%)  1452 -> 1452 (0.00%)
>> CHV: 12 ->   0    8045019 -> 7977916 (-0.83%)  68297426 -> 68352756 (0.08%)   1340 -> 1340 (0.00%)  1452 -> 1452 (0.00%)
>> SKL:  0 -> 120    8204037 -> 7939086 (-3.23%)  66583900 -> 65624378 (-1.44%)  1269 -> 375 (-70.45%) 1563 -> 690 (-55.85%)
>
> I'm a bit surprised that Gen7-8 lost SIMD16 programs.  Presumably there
> are some cases where we don't need the whole cacheline worth of pulled
> data, and this increased register pressure.  I suppose that could be
> fixed by demoting pull message return length when the last channels
> aren't used.  We might want to do that later on.
>

Yeah, that's one of the two reasons I reworked things somewhat with
respect to the previous version: to give the optimizer control over the
number of OWORDs read by the pull constant load message (the other
reason is so we're able to be more aggressive in the future and request
8 OWORDs at a time per pull constant message).  Though as you can see,
asking for fewer OWORDs would only help ~10 shaders regain SIMD16 on
Gen7-8, which is tiny compared to the benefit on Gen9+.  On top of that,
the register pressure scheduling heuristic I'm working on regains most
of the SIMD16 shaders lost here (and a bunch more), so it didn't seem
terribly concerning.
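
To make the trade-off above concrete, here is a rough sketch of the
arithmetic involved.  It is not code from the patch: the function name
pull_block, the 32-byte register size and the set of allowed block
sizes are assumptions made for illustration only.  It shows how a
constant's byte offset maps to an aligned block, and how many registers
the returned data would occupy for a given OWORD count:

```python
OWORD_BYTES = 16       # one OWORD is 16 bytes
CACHELINE_OWORDS = 4   # from the commit message: one cacheline is 4 OWORDs
GRF_BYTES = 32         # assumption: each general register holds 32 bytes


def pull_block(byte_offset, owords=CACHELINE_OWORDS):
    """Hypothetical helper: locate the aligned block containing byte_offset.

    Returns (block_base, offset_within_block, registers_for_return_data).
    """
    # Assumption: block sizes are restricted to these power-of-two counts.
    if owords not in (1, 2, 4, 8):
        raise ValueError("unsupported OWORD count")
    block_bytes = owords * OWORD_BYTES
    # Round the offset down to the start of its block.
    block_base = byte_offset & ~(block_bytes - 1)
    offset_within_block = byte_offset - block_base
    # Ceiling division: registers needed to hold the returned data.
    registers = -(-block_bytes // GRF_BYTES)
    return block_base, offset_within_block, registers


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, pull_block(100, n))
```

Under these assumptions a 4-OWORD fetch ties up two registers and an
8-OWORD fetch four, which is why a smaller return length can matter for
a shader that only uses the first few channels.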

> --Ken