[Mesa-dev] [PATCH] nir: add helper to get # of src/dest components

Rob Clark robdclark at gmail.com
Fri Jun 19 16:04:43 PDT 2015


On Fri, Jun 19, 2015 at 6:21 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
> On Thu, Jun 18, 2015 at 12:04 PM, Rob Clark <robdclark at gmail.com> wrote:
>> On Thu, Jun 18, 2015 at 2:34 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>>> On Thu, Jun 18, 2015 at 11:19 AM, Rob Clark <robdclark at gmail.com> wrote:
>>>> On Thu, Jun 18, 2015 at 1:27 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>>>>> On Thu, Jun 18, 2015 at 9:42 AM, Rob Clark <robdclark at gmail.com> wrote:
>>>>>> On Thu, Jun 18, 2015 at 11:01 AM, Connor Abbott <cwabbott0 at gmail.com> wrote:
>>>>>>>>>> (really I want phi's for variables too..  the way I turn arrays into
>>>>>>>>>> fanin/collect fanout/split works on the backend for dealing with
>>>>>>>>>> arrays in ssa form (other than making instruction graph large) but the
>>>>>>>>>> way I go from nir to ir3 currently assumes I get nir phi's everywhere
>>>>>>>>>> I need an ir3 phi)
>>>>>>>>>
>>>>>>>>> Right... we explicitly decided not to support SSA form for arrays in
>>>>>>>>> NIR, since it seemed like a pretty bad idea. SSA form assumes that
>>>>>>>>> inserting copies for the things you're SSA-ifying is relatively
>>>>>>>>> inexpensive, and both SSA-based register allocation and algorithms for
>>>>>>>>> converting out of SSA make no guarantees about not inserting
>>>>>>>>> potentially unnecessary copies. This is a good compromise for smaller
>>>>>>>>> things, but not for your array of (say) 64 things where inserting an
>>>>>>>>> extra copy is rather disastrous. You can do it if you want and shoot
>>>>>>>>> yourself in the foot, but NIR isn't going to help you there -- we
>>>>>>>>> won't write a to-SSA pass for something which doesn't make much sense
>>>>>>>>> in the first place.
>>>>>>>>>
>>>>>>>>
>>>>>>>> in ir3 my solution is to add sufficient dependencies between
>>>>>>>> instructions so the array accesses don't get re-ordered and they all
>>>>>>>> collapse down to a single name per array element/slot
>>>>>>>
>>>>>>> It's not about getting reordered, it's about interference. The problem
>>>>>>> is that as soon as you do basically any optimization at all (even copy
>>>>>>> propagation), you can wind up with a situation where the sources and
>>>>>>> destination of a phi node interfere with each other and you have to
>>>>>>> insert extra mov's. And even if you keep everything exactly the same,
>>>>>>> with an SSA-based register allocator, there's always the chance that
>>>>>>> it'll screw up anyways and allocate something over your array and
>>>>>>> force you to insert a mov. You could force the array to be allocated
>>>>>>> to a single hardware register, but then it's not really an SSA value
>>>>>>> anymore -- it's a hardware register, and you can't treat it like an
>>>>>>> SSA value anymore in your allocator, and so adding phi nodes and
>>>>>>> whatnot for it in your IR doesn't make much sense.
>>>>>>
>>>>>>
>>>>>> But the point I'm trying to make is that I need the links from src to
>>>>>> dest that I get in SSA form for *scheduling* (for example, to know how
>>>>>> many delay slots are needed between two instructions).  For things
>>>>>> like if/else, I would need to consider number of cycles since either
>>>>>> possible assigner.  For everything else, the phi nodes (in ir3, not in
>>>>>> nir) give me this.  Arrays are not special (since they are allocated
>>>>>> in registers) when it comes to cycle counts.
>>>>>>
>>>>>> BR,
>>>>>> -R
>>>>>
>>>>> Except that they still are special, and you need to account for that
>>>>> when you set up scheduling dependencies for them. For example, imagine
>>>>> that you have an array A accessed in a loop:
>>>>>
>>>>> while (...) {
>>>>>     ... = A[i];
>>>>>     A[i] = ...;
>>>>> }
>>>>>
>>>>> if you lower the array to SSA, this will give you something like:
>>>>>
>>>>> while (...) {
>>>>>     A_1 = phi(A_0, A_2);
>>>>>     ... = A_1[i];
>>>>>     A_2 = Update(A_1, i, ...); /* makes a copy with the i'th element changed */
>>>>> }
>>>>>
>>>>> and when you set up scheduling dependencies, you'll miss the false
>>>>> write-after-read dependency between the access and the update, meaning
>>>>> you could end up with:
>>>>>
>>>>> while (...) {
>>>>>     A_1 = phi(A_0, A_2);
>>>>>     A_2 = Update(A_1, i, ...);
>>>>>     ... = A_1[i];
>>>>> }
>>>>>
>>>>> and now the number of instructions in your shader has exploded since
>>>>> you have to insert a copy somewhere. You could add all the false
>>>>> dependencies by yourself, and force it into the same register, but by
>>>>> that point you've already given up all the advantages that SSA has to
>>>>> offer and inserting phi nodes is a rather pointless exercise.
>>>>
>>>> except, like I said, for the purpose of scheduling I need to realize
>>>> that there are two dependency paths to consider when figuring out
>>>> the required number of delay slots..
>>>
>>> No, there aren't. There's the dependency between the read and the
>>> write (in my example), which serializes things and makes it one path.
>>> In other words, the stuff before the write must happen before the
>>> write, even if in SSA those are two different values.
>>
>> that wasn't really the case I was talking about..  but rather:
>>
>>  if (...) {
>>    ...
>>    arr[i] = a + b;
>>   } else {
>>    arr[i] = c + d;
>>   }
>>   ... = arr[i];
>
> In that case, everything is in a separate basic block and the
> scheduler still doesn't need to have the phi nodes. This is already a
> well-known problem that i965, as well as basically every other backend
> ever, has solved without doing what you want to do.

The hardware doesn't care that they are in different basic blocks.
The difference between adreno and most other hw is that if I consume
the result of an earlier instruction too soon, I don't get sub-optimal
performance, I get undefined results.  Doesn't matter that the
producer and consumer are in different blocks.

(ok, to be fair, I'm not considering the inter-block scheduling in the
scheduler yet, but I will want to once I start optimizing..)

In the case of a direct array read, I need to know all the potential
instructions that previously wrote that array element, and the number
of instruction slots in between.  (Note that the # of slots needed
depends on both the producing and the consuming instruction.)

Indirect array read/write is basically the same, I just consider the
worst case of all the array elements.

>>
>>
>>>>
>>>> That said, having to re-invent the to-ssa pass in my backend is
>>>> something I was hoping to avoid.. so I'm thinking of other
>>>> alternatives.  But currently the depth calculations (used for
>>>> scheduling), dead code elimination (used if nothing else to clean
>>>> things up when generating binning-pass shader), scheduling, and even
>>>> to some degree RA, depend on this use-def graph between instructions.
>>>
>>> What I'm saying is that for arrays, the use-def graph isn't enough. At
>>> least the scheduler has to consider additional dependencies when doing
>>> the depth calculations and when actually scheduling,
>>
>> essentially I setup dependencies so that array accesses depend
>> (indirectly, via some meta-instruction nodes) on the instructions
>> which wrote each array element previously, which maintains the
>> ordering..
>>
>> I do need to make writes depend on instructions which have read too,
>> as you pointed out.. but that fits in fine with the way the rest of
>> the backend works
>>
>>> and the RA has to
>>> split live-ranges of other things and deal with arrays specially too
>>> in order to not introduce extra array copies.
>>
>> If I did spilling/rematerialization..  but I don't.
>
> If you force every array into a specific register range, and assume
> that every array is live since the beginning of the program, then you
> may not have to do that. But then, again, this means that inserting
> phi nodes doesn't help you at all in the register allocator. If you're
> doing anything else, then you certainly will have to split live
> ranges, even without spilling, since you may run into a situation
> where the array becomes live, but there isn't a big enough slot to fit
> it in even if the register pressure is low enough for everything to
> fit. In any case, you'll have to know the original array where each
> value came from in order to not accidentally generate copies, so why
> bother inserting phi nodes at all?
>
>> You forget that
>> by the time I get to RA I can't easily just insert/remove
>> instructions.  I haven't quite decided *how* I'll eventually deal with
>> that (but it would likely involve jumping back and re-running the
>> scheduling pass).  I have a large enough register file that this is
>> pretty low priority.  It is only vaguely an issue at the moment
>> because the priority-queue scheduler that replaced what is on master
>> does very badly with wide/shallow shaders, ie. like
>> glsl-fs-convolution-1... ie. shaders with a lot of instructions at
>> minimum depth.. (but I have some ideas to fix that)
>>
>>> SSA does help a little
>>> with dead code elimination, but even there having arrays in SSA
>>> doesn't win you much compared to the non-SSA approach (when there are
>>> no reads left for the array, delete the array). The point is that SSA
>>> doesn't gain you nearly as much as you think, and it would actually be
>>> easier to rewrite things so that arrays aren't in SSA form than to
>>> support this awkward half-SSA thing.
>>
>> well, I'm considering it.. but it is really useful in scheduling for
>> me to be able to easily know which instruction(s) potentially produced
>> the value being consumed.
>
> But for arrays, that question doesn't even make sense. Every
> dependency between reads and writes in an array is a "maybe these
> things overlap, maybe they don't" dependency, and if you knew for
> certain which instruction produced the value being consumed, then you
> wouldn't be reading it out of an array. Again, having things in SSA is
> going to help you less than you think.

note the "(s)" on "instruction(s)"

for *in*direct array access, I don't know the specific array element,
but I know it is *one* of the array elements.
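
To illustrate, a rough sketch (again hypothetical names, not ir3 code)
of the conservative dependency edges: an indirect read depends on the
last writer of every element, and an indirect write additionally
depends on every read since those writes -- the false write-after-read
ordering mentioned earlier.

/* Hypothetical sketch -- not ir3 code.  Conservative scheduling
 * dependencies for indirect array access. */

struct sched_node;

/* Stub for illustration: in a real backend this would record an edge
 * ('after' must be scheduled after 'before') in the DAG. */
static void
sched_add_dep(struct sched_node *after, struct sched_node *before)
{
   (void)after; (void)before;
}

struct elem_deps {
   struct sched_node *last_write;       /* most recent writer */
   struct sched_node *reads_since[16];  /* reads since that write */
   unsigned num_reads;
};

struct array_deps {
   struct elem_deps *elems;
   unsigned num_elems;
};

/* An indirect read may touch any element, so it depends on the last
 * writer of every element. */
static void
add_indirect_read_deps(struct array_deps *arr, struct sched_node *read)
{
   for (unsigned i = 0; i < arr->num_elems; i++) {
      struct elem_deps *e = &arr->elems[i];
      if (e->last_write)
         sched_add_dep(read, e->last_write);
      if (e->num_reads < 16)
         e->reads_since[e->num_reads++] = read;
   }
}

/* An indirect write may clobber any element, so besides the previous
 * writer it must also come after every read since that write -- the
 * false write-after-read dependency. */
static void
add_indirect_write_deps(struct array_deps *arr, struct sched_node *write)
{
   for (unsigned i = 0; i < arr->num_elems; i++) {
      struct elem_deps *e = &arr->elems[i];
      if (e->last_write)
         sched_add_dep(write, e->last_write);
      for (unsigned j = 0; j < e->num_reads; j++)
         sched_add_dep(write, e->reads_since[j]);
      e->last_write = write;
      e->num_reads = 0;
   }
}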

>>
>> Not to mention that internally we have a lot of other "arrays" such as
>> arguments to texture sample instructions.  I've managed to solve those
>> cases so far (without resorting to LOAD_PAYLOAD-style hacks, which
>> are awkward for me thanks to scheduling).
>
> Sure, you have fan in instructions (which, btw, is equivalent to
> LOAD_PAYLOAD, so I don't know why you "haven't resorted to it") to
> load things into textures and fanout instructions to get each
> component of the result, but that's a very different problem from
> arrays with indirect accesses, so the best solution isn't necessarily
> the same.

iirc you replace LOAD_PAYLOAD w/ a sequence of mov's after RA.. but
maybe I misunderstood.  I didn't notice anywhere in your RA code where
you walk backwards to find the first instruction that populated a tex
arg "array" to consider it live from that point..

BR,
-R

>>
>> BR,
>> -R

