[Mesa-dev] [PATCH] nir: add helper to get # of src/dest components

Rob Clark robdclark at gmail.com
Thu Jun 18 12:04:14 PDT 2015


On Thu, Jun 18, 2015 at 2:34 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
> On Thu, Jun 18, 2015 at 11:19 AM, Rob Clark <robdclark at gmail.com> wrote:
>> On Thu, Jun 18, 2015 at 1:27 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>>> On Thu, Jun 18, 2015 at 9:42 AM, Rob Clark <robdclark at gmail.com> wrote:
>>>> On Thu, Jun 18, 2015 at 11:01 AM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>>>>>>>> (really I want phi's for variables too..  the way I turn arrays into
>>>>>>>> fanin/collect fanout/split works on the backend for dealing with
>>>>>>>> arrays in ssa form (other than making instruction graph large) but the
>>>>>>>> way I go from nir to ir3 currently assumes I get nir phi's everywhere
>>>>>>>> I need an ir3 phi)
>>>>>>>
>>>>>>> Right... we explicitly decided not to support SSA form for arrays in
>>>>>>> NIR, since it seemed like a pretty bad idea. SSA form assumes that
>>>>>>> inserting copies for the things you're SSA-ifying is relatively
>>>>>>> inexpensive, and both SSA-based register allocation and algorithms for
>>>>>>> converting out of SSA make no guarantees about not inserting
>>>>>>> potentially unnecessary copies. This is a good compromise for smaller
>>>>>>> things, but not for your array of (say) 64 things where inserting an
>>>>>>> extra copy is rather disastrous. You can do it if you want and shoot
>>>>>>> yourself in the foot, but NIR isn't going to help you there -- we
>>>>>>> won't write a to-SSA pass for something which doesn't make much sense
>>>>>>> in the first place.
>>>>>>>
>>>>>>
>>>>>> in ir3 my solution is to add sufficient dependencies between
>>>>>> instructions so the array accesses don't get re-ordered and they all
>>>>>> collapse down to a single name per array element/slot
>>>>>
>>>>> It's not about getting reordered, it's about interference. The problem
>>>>> is that as soon as you do basically any optimization at all (even copy
>>>>> propagation), you can wind up with a situation where the sources and
>>>>> destination of a phi node interfere with each other and you have to
>>>>> insert extra mov's. And even if you keep everything exactly the same,
>>>>> with an SSA-based register allocator, there's always the chance that
>>>>> it'll screw up anyways and allocate something over your array and
>>>>> force you to insert a mov. You could force the array to be allocated
>>>>> to a single hardware register, but then it's not really an SSA value
>>>>> anymore -- it's a hardware register, and you can't treat it like an
>>>>> SSA value anymore in your allocator, and so adding phi nodes and
>>>>> whatnot for it in your IR doesn't make much sense.
>>>>
>>>>
>>>> But the point I'm trying to make is that I need the links from src to
>>>> dest that I get in SSA form for *scheduling* (for example, to know how
>>>> many delay slots are needed between two instructions).  For things
>>>> like if/else, I would need to consider number of cycles since either
>>>> possible assigner.  For everything else, the phi nodes (in ir3, not in
>>>> nir) give me this.  Arrays are not special (since they are allocated
>>>> in registers) when it comes to cycle counts.
>>>>
>>>> BR,
>>>> -R
>>>
>>> Except that they still are special, and you need to account for that
>>> when you set up scheduling dependencies for them. For example, imagine
>>> that you have an array A accessed in a loop:
>>>
>>> while (...) {
>>>     ... = A[i];
>>>     A[i] = ...;
>>> }
>>>
>>> if you lower the array to SSA, this will give you something like:
>>>
>>> while (...) {
>>>     A_1 = phi(A_0, A_2);
>>>     ... = A_1[i];
>>>     A_2 = Update(A_1, i, ...); /* makes a copy with the i'th element changed */
>>> }
>>>
>>> and when you set up scheduling dependencies, you'll miss the false
>>> write-after-read dependency between the access and the update, meaning
>>> you could end up with:
>>>
>>> while (...) {
>>>     A_1 = phi(A_0, A_2);
>>>     A_2 = Update(A_1, i, ...);
>>>     ... = A_1[i];
>>> }
>>>
>>> and now the number of instructions in your shader has exploded since
>>> you have to insert a copy somewhere. You could add all the false
>>> dependencies by yourself, and force it into the same register, but by
>>> that point you've already given up all the advantages that SSA has to
>>> offer and inserting phi nodes is a rather pointless exercise.
>>
>> except, like I said, for the purpose of scheduling realizing that
>> there are two dependency paths to consider for figuring out the
>> required number of delay slots..
>
> No, there aren't. There's the dependency between the read and the
> write (in my example), which serializes things and makes it one path.
> In other words, the stuff before the write must happen before the
> write, even if in SSA those are two different values.

that wasn't really the case I was talking about..  but rather:

 if (...) {
     ...
     arr[i] = a + b;
 } else {
     arr[i] = c + d;
 }
 ... = arr[i];

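(to make that concrete, here is a rough standalone sketch in C of the
delay calculation I mean.. made-up names and latencies, not actual ir3
code; the consumer of arr[i] has to assume the worst case over both
possible assigners)

/* rough sketch, not actual ir3 code: how many delay slots a consumer
 * needs before reading a value that may have been written by either
 * branch of an if/else, taking the worst case over both assigners */
#include <stdio.h>

struct instr {
   int issue_cycle;   /* cycle the instruction was issued */
   int latency;       /* cycles until its result is readable */
};

static int delay_needed(const struct instr *def, int now)
{
   int ready = def->issue_cycle + def->latency;
   return (ready > now) ? (ready - now) : 0;
}

int main(void)
{
   /* the two possible assigners of arr[i] above (made-up numbers) */
   struct instr then_write = { .issue_cycle = 10, .latency = 3 };
   struct instr else_write = { .issue_cycle = 12, .latency = 3 };
   int now = 13;

   int d0 = delay_needed(&then_write, now);
   int d1 = delay_needed(&else_write, now);

   printf("need %d delay slot(s) before '... = arr[i]'\n",
          (d0 > d1) ? d0 : d1);
   return 0;
}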

>>
>> That said, having to re-invent the to-ssa pass in my backend is
>> something I was hoping to avoid.. so I'm thinking of other
>> alternatives.  But currently the depth calculations (used for
>> scheduling), dead code elimination (used if nothing else to clean
>> things up when generating binning-pass shader), scheduling, and even
>> to some degree RA, depend on this use-def graph between instructions.
>
> What I'm saying is that for arrays, the use-def graph isn't enough. At
> least the scheduler has to consider additional dependencies when doing
> the depth calculations and when actually scheduling,

essentially I set up dependencies so that array accesses depend
(indirectly, via some meta-instruction nodes) on the instructions
which previously wrote each array element, which maintains the
ordering..

I do need to make writes depend on the instructions which have read
the element too, as you pointed out.. but that fits in fine with the
way the rest of the backend works.

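(a rough self-contained sketch of what I mean.. hypothetical names,
not the actual ir3 meta-instruction machinery: each read depends on
the last write, and each write depends on the previous write plus
every read issued since that write, which covers the write-after-read
case)

/* rough sketch, not actual ir3 code: ordering dependencies for one
 * array, including the false write-after-read deps */
#include <stdio.h>

#define MAX_READS 16

struct instr {
   const char *name;
};

struct array_state {
   struct instr *last_write;
   struct instr *reads_since_write[MAX_READS];
   int num_reads;
};

static void add_dep(struct instr *later, struct instr *earlier)
{
   if (earlier)
      printf("%s depends on %s\n", later->name, earlier->name);
}

static void array_read(struct array_state *a, struct instr *rd)
{
   add_dep(rd, a->last_write);               /* read-after-write */
   if (a->num_reads < MAX_READS)
      a->reads_since_write[a->num_reads++] = rd;
}

static void array_write(struct array_state *a, struct instr *wr)
{
   add_dep(wr, a->last_write);               /* write-after-write */
   for (int i = 0; i < a->num_reads; i++)
      add_dep(wr, a->reads_since_write[i]);  /* write-after-read */
   a->last_write = wr;
   a->num_reads = 0;
}

int main(void)
{
   struct array_state arr = { 0 };
   struct instr w0 = { "A[i] = ... (iter n)" };
   struct instr r0 = { "... = A[i] (iter n+1)" };
   struct instr w1 = { "A[i] = ... (iter n+1)" };

   array_write(&arr, &w0);
   array_read(&arr, &r0);
   array_write(&arr, &w1);   /* picks up the WAR dep on r0 */
   return 0;
}
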
> and the RA has to
> split live-ranges of other things and deal with arrays specially too
> in order to not introduce extra array copies.

If I did spilling/rematerialization..  but I don't.  You forget that
by the time I get to RA I can't easily just insert/remove
instructions.  I haven't quite decided *how* I'll eventually deal with
that (but it would likely involve jumping back and re-running the
scheduling pass).  I have a large enough register file that this is
pretty low priority.  It is only vaguely an issue at the moment
because the priority-queue scheduler that replaced what is on master
does very badly with wide/shallow shaders, i.e. shaders with a lot of
instructions at minimum depth, like glsl-fs-convolution-1 (but I have
some ideas to fix that)

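(for reference, the "depth" here is just the usual critical-path
priority computed over the use-def graph; this is a generic sketch,
not the actual ir3 pass.. when nearly every instruction sits at the
minimum depth, that priority stops telling the scheduler anything
useful)

/* generic sketch, not the actual ir3 pass: an instruction's depth is
 * its latency plus the deepest of its users, i.e. the length of the
 * critical path from it to the end of the shader */
#include <stdio.h>

#define MAX_USERS 4

struct instr {
   const char *name;
   int latency;
   int depth;                       /* -1 = not yet computed */
   struct instr *users[MAX_USERS];  /* consumers of this value */
   int num_users;
};

static int instr_depth(struct instr *in)
{
   if (in->depth >= 0)
      return in->depth;

   int max_user = 0;
   for (int i = 0; i < in->num_users; i++) {
      int d = instr_depth(in->users[i]);
      if (d > max_user)
         max_user = d;
   }
   in->depth = in->latency + max_user;
   return in->depth;
}

int main(void)
{
   struct instr store = { "store", 1, -1, { 0 }, 0 };
   struct instr add   = { "add",   1, -1, { &store }, 1 };
   struct instr tex   = { "tex",   4, -1, { &add }, 1 };

   printf("depth(tex) = %d\n", instr_depth(&tex));  /* 4 + 1 + 1 = 6 */
   return 0;
}
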
> SSA does help a little
> with dead code elimination, but even there having arrays in SSA
> doesn't win you much compared to the non-SSA approach (when there are
> no reads left for the array, delete the array). The point is that SSA
> doesn't gain you nearly as much as you think, and it would actually be
> easier to rewrite things so that arrays aren't in SSA form than to
> support this awkward half-SSA thing.

well, I'm considering it.. but it is really useful in scheduling for
me to be able to easily know which instruction(s) potentially produced
the value being consumed.

Not to mention that internally we have a lot of other "arrays", such
as the arguments to texture sample instructions.  I've managed to
solve those cases so far without resorting to LOADPAYLOAD-style hacks
(which would be awkward for me thanks to scheduling).

BR,
-R

