[Mesa-dev] [PATCH 00/25] i965: Scalar back-end support for SIMD32, part 4.

Sun May 29 19:01:26 UTC 2016

Francisco Jerez <currojerez at riseup.net> writes:

> Jason Ekstrand <jason at jlekstrand.net> writes:
>
>> On Fri, May 27, 2016 at 7:05 PM, Francisco Jerez <currojerez at riseup.net>
>> wrote:
>>
>>> This fixes the few code quality regressions from the previous series
>>> enabling SIMD32 CS codegen in the back-end -- AFAICT by the end of the
>>> series we can finally enable GL 4.3 on all Gen8+ hardware.
>>>
>>> Patches 1-8 delay the SIMD lowering pass after the bulk of
>>> optimization passes have been run, which should decrease the
>>> compilation time of mainly SIMD32 shaders and improve the code quality
>>> of SIMD32 shaders on all generations and shaders of any dispatch width
>>> on older generations (up to and including IVB) that use SIMD lowering
>>> more intensively to implement various workarounds.
>>>
>>> Patches 9-14 rework the SIMD lowering pass to avoid emitting the copy
>>> instructions used to zip and unzip register regions where possible,
>>> since the register coalesce and copy propagation passes seem to
>>> perform rather poorly at getting rid of them in some cases.  In the
>>> long term we'll likely want to improve the register coalesce pass
>>> irrespective of these changes.
>>>
>>> Patches 15-20 improve the compute-to-mrf pass used on Gen4-6 to handle
>>> cases where the source of a VGRF-to-MRF copy is initialized by the
>>> shader using multiple single-GRF writes, which becomes far more common
>>> with the additional SIMD lowering going on after this series.
>>>
>>> Patches 21-24 are some other assorted changes improving code quality
>>> on older gens.
>>>
>>> I wanted to provide more detailed (e.g. per commit) shader-db stats
>>> with this series, but kind of ran out of time.  Let me know if you
>>> would like to see more evidence that any of the changes below is
>>> improving code quality in case it's not clear from the commit alone.
>>>
>>> [PATCH 01/25] i965/fs: Let CSE handle logical sampler sends as expressions.
>>> [PATCH 02/25] i965/fs: Allow constant propagation into logical send
>>> sources.
>>> [PATCH 03/25] i965/fs: Add FS_OPCODE_FB_WRITE_LOGICAL to
>>> has_side_effects().
>>> [PATCH 04/25] i965/fs: Run SIMD and logical send lowering after the
>>> optimization loop.
>>> [PATCH 05/25] i965/fs: Take opt_redundant_discard_jumps out of the
>>> optimization loop.
>>> [PATCH 06/25] i965/fs: Fix UB list sentinel dereference in
>>> opt_sampler_eot().
>>> [PATCH 07/25] i965/fs: Implement opt_sampler_eot() in terms of logical
>>> sends.
>>
>> [PATCH 08/25] SQUASH: i965/fs: Add basic dataflow check to
>>> opt_sampler_eot().
>>>
>>
>>
>>> [PATCH 09/25] i965/fs: Refactor offset() into a separate function taking
>>> the width as argument.
>>> [PATCH 10/25] i965/fs: Generalize regions_overlap() from copy propagation
>>> to handle non-VGRF files.
>>> [PATCH 11/25] i965/fs: Factor out region zipping and unzipping from the
>>> SIMD lowering pass.
>>> [PATCH 12/25] i965/fs: Skip SIMD lowering source unzipping for regular
>>> scalar regions.
>>> [PATCH 13/25] i965/fs: Skip SIMD lowering destination zipping if possible.
>>> [PATCH 14/25] i965/fs: Reindent emit_zip().
>>>
>>
>> 9-14 Reviewed-by: Jason Ekstrand <jason at jlekstrand.net>
>>
>>
>>> [PATCH 15/25] i965/fs: Teach regions_overlap() about COMPR4 MRF regions.
>>> [PATCH 16/25] i965/fs: Simplify and improve accuracy of compute_to_mrf()
>>> by using regions_overlap().
>>> [PATCH 17/25] i965/fs: Fix compute-to-mrf VGRF region coverage condition.
>>> [PATCH 18/25] i965/fs: Refactor compute_to_mrf() to split search and
>>> rewrite into separate loops.
>>> [PATCH 19/25] i965/fs: Teach compute_to_mrf about the COMPR4 address
>>> transformation.
>>> [PATCH 20/25] i965/fs: Extend compute_to_mrf() to coalesce VGRFs
>>> initialized by multiple single-GRF writes.
>>> [PATCH 21/25] i965/fs: Extend remove_duplicate_mrf_writes() to handle
>>> non-VGRF to MRF copies.
>>>
>>
>> 15-21 scare me.  A lot.  They even make me think that forking the compiler
>> between SNB and IVB may be a good idea. :-/  MRFs are annoying, but COMPR4
>> is such a gross hack that I really want to teach as little of the compiler
>> about it as possible.
>>
> Heh, my impression (from writing the patches) is that COMPR4 was the
> easiest part of it, it's not entirely unlike the usual Align1 source
> regions -- with hard-wired width and vstride.  I agree it's kind of a
> gross hardware feature but all it took to handle it in compute-to-mrf
> was some simple arithmetic [it would get slightly more complex if
> compute-to-mrf attempted to handle COMPR4 LOAD_PAYLOAD instructions, but
> I don't think it would necessarily be an insane idea].
>
> Honestly, what made patches 16-20 really hard to write (and maybe also
> hard to review, I don't know), was disentangling the highly
> special-cased and dubious dataflow logic of the compute-to-mrf pass
> (patches 16 and 17 could qualify as bugfixes...), and that had little to

Uhm, I just noticed that patch 20 is effectively a bugfix too, because
right now the compute-to-mrf pass only looks at and rewrites the *first*
def it encounters of the VGRF it's coalescing away, so it will end up
corrupting the program if there was more than one def of the VGRF region
read by the copy instruction, because the copy instruction is removed.

I don't see any reason why this bug wouldn't be possible on master
already, but it definitely causes piglit regressions on Gen4-6 in
combination with patch 11 just because the ordering in which the zipping
instructions are emitted happens to change.  I'm afraid we need to land
patches 16-20 before patch 11...

> do with COMPR4 or even with MRFs.  Compute-to-mrf is a crappy register
> coalescing pass that can only handle a small subset of the cases where
> one could possibly elide VGRF-to-MRF copies.
>
> My suspicion is that the situation would improve (both in terms of code
> generation quality on MRF platforms and codebase maintainability) if
> instead of forking the driver we got rid of the compute-to-mrf pass
> altogether and used the normal register coalesce pass to coalesce VGRFs
> into MRFs (since the one difficult part about coalescing registers is
> working out the dataflow and variable interference, COMPR4 is just the
> surface).
>
> BTW, do you have any objection against PATCH 21?  It should be fully
> independent from the other MRF-related ones.
>
>> So here's the million dollar question: Do we need them? and, more
>> importantly, do we need them now?  I didn't see anything wrong in my brief
>> skimming but don't call that a review.
>>
> *Shrug*, nothing in this series is strictly necessary [heh, except of
> course PATCH 25 ;)], but if we don't include them the overall shader-db
> balance from the whole series (including parts 1-4 except for patches
> 16-20 from this series) on SNB would be:
>
>    total instructions in shared programs: 5721434 -> 5755717 (0.60%)
>    instructions in affected programs: 2402892 -> 2437175 (1.43%)
>    helped: 1284
>    HURT: 12148
>
> instead of the following (not excluding any changes):
>
>    total instructions in shared programs: 5721284 -> 5708369 (-0.23%)
>    instructions in affected programs: 489340 -> 476425 (-2.64%)
>    helped: 1398
>    HURT: 21
>
>>
>>> [PATCH 22/25] i965/fs: Fix constant combining for instructions that cannot
>>> accept source mods.
>>> [PATCH 23/25] i965/fs: Allow scalar source regions on SNB math
>>> instructions.
>>> [PATCH 24/25] i965/fs: Skip gen4 pre/post-send dependency workaronds for
>>> the first/last block.
>>> [PATCH 25/25] i965: Expose GL 4.3 on Gen8+.
>>>
>>
>> 22-25 are Reviewed-by: Jason Ekstrand <jason at jlekstrand.net>
>>
>
> Thanks.
>
>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160529/85f10239/attachment.sig>