[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

Fri Apr 3 10:49:37 PDT 2015

On Fri, Apr 3, 2015 at 8:37 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Jason Ekstrand <jason at jlekstrand.net> writes:
>
>> On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>
>>>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>>>
>>>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>>>> before, we now have a straightforward lowering:
>>>>>>
>>>>>>  1) All header sources are copied directly using force_writemask_all and,
>>>>>>     since they are guaranteed to be a single register, there are no
>>>>>>     force_sechalf issues.
>>>>>>
>>>>>>  2) All non-header sources are copied using the exact same force_sechalf
>>>>>>     and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>>>
>>>>>>  3) In order to accommodate older gens that need interleaved colors,
>>>>>>     lower_load_payload detects when the destination is a COMPR4 register
>>>>>>     and automatically interleaves the non-header sources.  The
>>>>>>     lower_load_payload pass does the right thing here regardless of whether
>>>>>>     or not the hardware actually supports COMPR4.
>>>>>
>>>>> I had a quick glance at the series and it seems to be going in the right
>>>>> direction.
>>>>>
>>>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>>>> instruction has more complex semantics than necessary and the benefit is
>>>>> unclear to me.
>>>>>
>>>>> I suppose that your motivation was to avoid setting force_writemask_all
>>>>> in LOAD_PAYLOAD instructions with header.  The optimizer should be able
>>>>> to cope with those flags and get rid of them from the resulting moves
>>>>> where they are redundant, and if it's not able to it looks like
>>>>> something that should be fixed anyway.  The explicit handling of headers
>>>>> is responsible for much of the churn in this series and is likely to
>>>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>>>> manipulate them.
>>>>
>>>> Avoiding force_writemask_all is only half of the motivation and the
>>>> small half at that.  A header source, more properly defined, is a
>>>> single physical register that, conceptually, applies to all channels.
>>>> Effectively, a header source (I should have stated this clearly) has
>>>> two properties:
>>>>
>>>> 1) It has force_writemask_all set
>>>> 2) It is exactly one physical hardware register.
>>>>
>>>> This second property is the more important of the two.  Most of the
>>>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>>>> pile of guesswork and had a ill-conceved "effective width" think in
>>>> order to figure out how big the register actually was.  Making the
>>>> user specify which sources are header sources eliminates that
>>>> guesswork.  It also has the nice side-effect that we can do the right
>>>> force_writemask_all and we can properly handle COMPR4 for the the
>>>> user.

Ok, Allow me to be a bit more explicit as to what all we need to keep track of:

 1) How big is the source register for real.  Even immediates can end
up being two registers in the copy.
 2) Do we want force_writemask_all?
 3) If not, do we want force_sechalf?
 4) On g45 and gen5, we want to use COMPR4 for interlaced movs
 5) When lowering, we want to use 16-wide moves when possible in SIMD16

With the patch series I sent, all of this is explicit except for
COMPR4 which is, admittedly, kind of magic.  "Which of these sources
are headers?" is a reasonable question for the caller to answer.  It
knows explicitly and it would take the LOAD_PAYLOAD helper some work
to guess it correctly.  Another option would be to guess that based on
exec sizes but then the caller has to know not to pass in the wrong
register type or the guess will be wrong.  I like explicit.

>>> Yeah, true, but this seems like the least orthogonal and most annoying
>>> to use solution for this problem, it forces the caller to provide
>>> redundant information, it takes into account the saturate flag on some
>>> arguments and not others, it shuffles sources with respect to the
>>> specified order when COMPR4 is set, but only for the first four
>>> non-header sources.  I think any of the following solutions would be
>>> better-behaved than the current approach:
>>
>> I don't know that saying which sources are headers is really
>> redundant.  It's explicit which is what we want.  Yes, the COMPR4
>> thing is a bit magical but we have to do COMPR4 in lower_load_payload
>> so we have to have some way of doing it and this method puts the
>> interleving code in one place instead of two.
>>
> Well, at least with the previous approach LOAD_PAYLOAD had consistent
> (if broken) semantics across its arguments, and regardless of COMPR4
> being used or not, which IMHO is preferable to the modest code saving.

To be clear, I don't really like the way I did COMPR4 either.  I just
couldn't come up with anything better.

>>> 1/ Use the source width to determine the size of each copy.  This would
>>>    imply that the source width carries semantic information and hence
>>>    would have to be left alone by e.g. copy propagation.
>>
>> That's what do now and it's terrible.  The "effective width" field was
>> basically a width that gets kept.
>>
> Couldn't you just prevent copy propagation from propagating a source of
> different width into a LOAD_PAYLOAD?  You would still be able to get rid
> of the metadata guess and effective_width, but you may have to re-run
> copy propagate after lower_load_payload() for the case you missed any
> optimization oportunities.

Yes, we could prevent copy propagation from messing up exec sizes.
That solves 1 but 2-5 still require terrible heuristics.  Setting
force_writemask_all sort-of solves 2 and 3.  COMPR4 would have to have
a hueristic again which is too complicated.  The 16-wide stuff is
still a problem.

>>> 2/ Use the instruction exec size and flags to determine the properties
>>>    of *all* copies.  This means that if a header is present the exec
>>>    size would necessarily have to be 8 and the halves of a 16-wide
>>>    register would have to be specified separately, which sounds annoying
>>>    at first but in practice wouldn't necessarily be because it could be
>>>    handled by the LOAD_PAYLOAD() helper based on the argument widths
>>>    without running into problems with optimization passes changing the
>>>    meaning of the instruction.  The semantics of the instruction itself
>>>    would be as stupid as possible, but the implementation could still
>>>    trivially recognise 16-wide and COMPR4 copies using the exact same
>>>    mechanism you are using now.
>>
>> Yes, that might work.  I'll try and take a swing at it today.  It will
>> *hopefully* have less code churn than the solution in this series
>> because the magic will still happen, just in a different place.  Of
>> course, this solution also requires that we lower everything with
>> force_writemask_all and *hopefully* we can get rid of those and set
>> force_sechalf appropriately in optimization passes.

Thinking about this some more, I'm not really a fan.    As with your
first suggestion, we can force_writemask_all and the heuristic for
COMPR4 is pretty simple.  However, it still has the problem of
properly handling SIMD16.

>>> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
>>>    simple, say COLLECT and LOAD_HEADER.  "COLLECT dst, src0, ..., srcn"
>>>    would be equivalent to the LOAD_PAYLOAD instruction described in 2,
>>>    but it would only be used to load full-width non-header sources of
>>>    the payload, so you would avoid the need to split 16-wide registers
>>>    in half.  "LOAD_HEADER dst, header, payload" would handle the
>>>    asymmetric requirements of prepending a header, like using 8-wide
>>>    instead of exec_size-wide copies and setting force_writemask_all.
>>>    You could use mlen to specify the size of payload as is usual for
>>>    instructions taking a payload source.
>>
>> I thought about doing something like that but it seems more convoluted
>> than would be at all worth it.
>>
> It's not so obvious to me that 3 would necessarily be more difficult
> than getting 2 right, but yeah it would definitely be more effort than 1.

The problem with this solution is that headers can be different sizes.
Texturing instructions use 1-reg headers but fb writes a header that's
between 0 and 4 registers.  Also, we don't want to have to look at the
size of a vgrf to figure out what the instruction does so we have to
have some way of encoding the number of registers read in both
arguments into the instruction.  And why is overloading mlen better
than renaming and overloading has_header?

Swinging back around to a more general discussion...

1) Any of the solutions currently on the table will take care of this.
I'm not a big fan of making copy-prop be LOAD_PAYLOAD-aware, but my
solutions (which I'll call 0) and 2 and 3 handle it ok.

2+3) My solution gives us this explicitly for free.  The other option,
which you keep suggesting, is to force_writemask_all the universe.
This works, but uses more than needed which, at the moment, is ok.
However, if we want to clean it up, optimization passes will need to
be aware of it and it's not clear to me how we would handle it.

4) My solution is, admittedly, a hack but it's a predictable hack.
Also, my hack does work for all of the messages we care about and I
don't think we'll be adding new messages for gen <= 5 so I'm not
terribly concerned about it.  Having a heuristic at lowering time
would be nicer because the user can simply say what registers they
want where and lowering does the right thing.  However, if we have to
pick between COMPR4 being nice and other things being nice, I'd rather
leave COMPR4 a hack.

5) This one is tricky.  CSE can generate payload copies that don't get
eaten up by register coalesce.  I don't remember exactly how this
happens but I do know that I've seen it in the wild.  When this
happens, you can end up with a copy payload that copies an odd number
of registers.  If you use the simple heuristic of just use SIMD16
moves until you can't and then use a SIMD8, you end up moving things
that aren't pairs as pairs.  Sure, if you set force_writemask_all, the
code is correct.  However, it's possible that register coalesce could
have cleaned up part of the payload but now it can't because the moves
are off-center.  This problem is *why* I wrote the nasty pile of
heuristics that we have now.  None of your suggestions offer a
solution to this where as having the user specify the header size
gives it to you for free.

All in all, I'm now even more convinced that we want the user of
LOAD_PAYLOAD to just tell us what sources are headers and what aren't.
I think it's also conceptually cleaner because the header sources are
different from the rest of the payload.
--Jason

P.S.  For whatever it's worth, it would also allow us to split
LOAD_PAYLOAD instructions from SIMD16 to SIMD8 correctly by copying
the headers directly and splitting the non-headers in half.