[Mesa-dev] [PATCH 6/9] i965/fs: Fetch one cacheline of pull constants at a time.

Wed Dec 14 22:56:33 UTC 2016

On Wednesday, December 14, 2016 2:18:16 PM PST Francisco Jerez wrote:
> Francisco Jerez <currojerez at riseup.net> writes:
> 
> > Kenneth Graunke <kenneth at whitecape.org> writes:
> >
> >> On Friday, December 9, 2016 11:03:29 AM PST Francisco Jerez wrote:
> >>> Asking the DC for less than one cacheline (4 owords) of data for
> >>> uniform pull constants is suboptimal because the DC cannot request
> >>> less than that from L3, resulting in wasted bandwidth and unnecessary
> >>> message dispatch overhead, and exacerbating the IVB L3 serialization
> >>> bug.  The following table summarizes the overall framerate improvement
> >>> (with statistical significance of 5% and sample size ~10) from the
> >>> whole series up to this patch for several benchmarks and hardware
> >>> generations:
> >>> 
> >>>                          | SKL           | BDW          | HSW
> >>> SynMark2 OglShMapPcf     | 24.63% ±0.45% | 4.01% ±0.70% | 10.31% ±0.38%
> >>> GfxBench4 gl_manhattan31 |  5.93% ±0.35% | 3.92% ±0.31% |  6.62% ±0.22%
> >>> GfxBench4 gl_4           |  2.52% ±0.44% | 1.23% ±0.10% |      N/A
> >>> Unigine Valley           |  0.83% ±0.17% | 0.23% ±0.05% |  0.74% ±0.45%
> >>
> >> I suspect OglShMapPcf gained SIMD16 on Skylake due to reduced register
> >> pressure, from the lower message lengths on pull loads.  (At least, it
> >> did when I had a series to fix that.)  That's probably a large portion
> >> of the performance improvement here, and why it's so much larger for
> >> that workload on Skylake specifically.  It might be worth mentioning it
> >> in your commit message here.
> >>
> >
> > Yeah, that matches my understanding too.  I'll add some shader-db stats
> > in order to illustrate the effect of this on register pressure, as you
> > asked me to do in your previous reply.
> >
> 
> FTR, here is a summary of the effect of this series on several shader-db
> stats.  As you can see the register pressure benefit on SKL+ is
> substantial:
> 
>      Lost->Gained Total instructions          Total cycles                    Total spills           Total fills
> BWR:  5 ->   5    4571248 -> 4568342 (-0.06%) 123375740 -> 123373296 (-0.00%) 1488 -> 1488 (0.00%)   1957 -> 1957 (0.00%)
> ELK:  5 ->   5    3989020 -> 3985402 (-0.09%)  98757068 ->  98754058 (-0.00%) 1489 -> 1489 (0.00%)   1958 -> 1958 (0.00%)
> ILK:  1 ->   4    6383591 -> 6376787 (-0.11%) 143649910 -> 143648914 (-0.00%) 1449 -> 1449 (0.00%)   1921 -> 1921 (0.00%)
> SNB:  0 ->   0    7528395 -> 7501446 (-0.36%) 103503796 -> 102460370 (-1.01%)  549 ->  549 (0.00%)     52 ->   52 (0.00%)
> IVB: 13 ->   3    6949221 -> 6943317 (-0.08%)  60592262 ->  60584422 (-0.01%) 1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
> HSW: 11 ->   0    6409753 -> 6403702 (-0.09%)  60609070 ->  60604414 (-0.01%) 1271 -> 1271 (0.00%)   1162 -> 1162 (0.00%)
> BDW: 12 ->   0    8043467 -> 7976364 (-0.83%)  68427730 ->  68483042 (0.08%)  1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
> CHV: 12 ->   0    8045019 -> 7977916 (-0.83%)  68297426 ->  68352756 (0.08%)  1340 -> 1340 (0.00%)   1452 -> 1452 (0.00%)
> SKL:  0 -> 120    8204037 -> 7939086 (-3.23%)  66583900 ->  65624378 (-1.44%) 1269 ->  375 (-70.45%) 1563 ->  690 (-55.85%)

I'm a bit surprised that Gen7-8 lost SIMD16 programs.  Presumably there
are some cases where we don't need the whole cacheline's worth of pulled
data, and fetching it anyway increased register pressure.  I suppose that
could be fixed by reducing the pull message's return length when the
last channels of the returned data aren't used.  We might want to do
that later on.
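
A rough sketch of what trimming the return length could look like, for
illustration only.  The helper names are hypothetical and do not exist in
Mesa or in this series.  It assumes the block read may return 1, 2 or 4
owords starting at the cacheline base, and that one register holds 2 owords
(32 bytes); both are assumptions to check against the hardware docs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of Mesa.  One cacheline is 4 owords (as
 * stated in the commit message), i.e. 16 dwords.
 */
#define DWORDS_PER_OWORD      4
#define OWORDS_PER_CACHELINE  4

/* dwords_used: bit i is set if dword i of the cacheline is read by the
 * shader.  Returns the smallest block size in owords, starting at the
 * cacheline base, that covers every used dword.  Assumes only block sizes
 * of 1, 2 and 4 owords are valid, so 3 is rounded up to 4.
 */
static unsigned
min_pull_owords(uint16_t dwords_used)
{
   if (dwords_used == 0)
      return 0;

   /* Find the index of the highest dword that is actually read. */
   unsigned last = 0;
   for (unsigned i = 0; i < DWORDS_PER_OWORD * OWORDS_PER_CACHELINE; i++) {
      if (dwords_used & (1u << i))
         last = i;
   }

   const unsigned owords = last / DWORDS_PER_OWORD + 1;
   return owords == 3 ? 4 : owords;
}

/* Number of registers the returned data occupies, assuming a 32-byte
 * register (2 owords per register).
 */
static unsigned
pull_return_regs(unsigned owords)
{
   assert(owords <= OWORDS_PER_CACHELINE);
   return (owords + 1) / 2;
}
```

Under these assumptions only loads whose used dwords all fall within the
first two owords would drop from two registers to one, so the benefit would
depend on how the constants are laid out within each cacheline.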

--Ken