[Mesa-dev] [RFC] i965: alternative to memctx for cleaning up nir variants

Tue Dec 29 11:15:28 PST 2015

On Tue, Dec 29, 2015 at 12:36 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>
>
> On Tue, Dec 29, 2015 at 7:32 AM, Rob Clark <robdclark at gmail.com> wrote:
>>
>> On Mon, Dec 28, 2015 at 4:23 PM, Connor Abbott <cwabbott0 at gmail.com>
>> wrote:
>> > On Mon, Dec 28, 2015 at 3:25 PM, Rob Clark <robdclark at gmail.com> wrote:
>> >>>
>> >>>> It is a mix.. I do texcoord saturate, clip-plane, and 2-sided color
>> >>>> lowering in NIR.  But flat-shading, binning-pass, and half vs full
>> >>>> precision color output in ir3.
>> >>>>
>> >>>> I do as much lowering in NIR as I can, in an effort to do as much as
>> >>>> possible at compile time, vs draw time.  I do the first round of
>> >>>> lowering/opt w/ null shader key, which is enough for the common
>> >>>> cases.
>> >>>>
>> >>>> Pretty much independent, I suppose, of whether I came out of SSA or
>> >>>> not first.  Although binning-pass variant and the instruction
>> >>>> scheduling I do are easier in SSA.
>> >>>>
>> >>>> Somewhat unrelated, but I may end up converting array access to
>> >>>> registers, but leave everything else in SSA, so I can benefit from
>> >>>> converting multi-dimensional offsets into a single offset..  this is
>> >>>> still one open issue w/ gallium glsl_to_nir.. right now I have a
>> >>>> hacked up version of nir_lower_io that converts global/local
>> >>>> load/store_var's into load/store_var2 which take an offset as a src
>> >>>> (like load_input/store_output) instead of deref chain.. not sure yet
>> >>>> whether this will be the permanent solution, but at least it fixes a
>> >>>> huge heap of variable-indexing piglits and lets me continue w/
>> >>>> implementing lowering passes for everything else that was previously
>> >>>> done in glsl->tgsi or tgsi->tgsi passes.
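The heart of that kind of lowering is collapsing a chain of array derefs (e.g. `a[i][j]`) into a single offset. The sketch below uses a hypothetical `struct deref_link` in place of one array level of a NIR deref chain; it is not NIR's data structure. It also folds constant indices, whereas a real pass would emit the same multiply-add sequence as ALU instructions on the SSA index sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one array level of a deref chain such as
 * a[i][j]: the index used at this level and the array length there. */
struct deref_link {
   unsigned index;
   unsigned length;
};

/* Collapse a chain of array derefs (outermost first) into one flat
 * offset.  elem_size is the size of the innermost element in the
 * backend's units, as a type_size-style callback would report it. */
static unsigned
flatten_deref_offset(const struct deref_link *chain, size_t n,
                     unsigned elem_size)
{
   unsigned offset = 0;

   for (size_t i = 0; i < n; i++) {
      assert(chain[i].index < chain[i].length);
      /* Row-major (Horner) accumulation: scale what we have so far by
       * this level's length, then add this level's index. */
      offset = offset * chain[i].length + chain[i].index;
   }
   return offset * elem_size;
}
```

For `vec4 a[3][4]`, the access `a[2][3]` lands at offset 2*4+3 = 11 when the units are vec4 slots, or 44 when they are scalar components.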
>> >>>
>> >>>
>> >>> If you do this, you'll be back to always needing a mutable copy.  Most
>> >>> lowering and optimization passes die the moment they see a register.
>> >>> You'll either have to go fix a bunch of stuff up to no-op properly or
>> >>> run vars_to_regs after doing your NIR lowering but before going into
>> >>> your backend IR.  This means that your "gold copy" still has variables
>> >>> and you always need to lower them to registers before you go into the
>> >>> backend.
>> >>
>> >> ugg.. but good point, thanks for pointing that out before I wasted
>> >> another afternoon on yet another dead-end for handling deref's..
>> >>
>> >> Ok, I guess I need to think of a better name than load/store_var2 for
>> >> the new intrinsics ;-)
>> >
>> > I don't think that "you should throw away registers and use your own
>> > thing" is what Jason wanted you to get out of that.
>
>
> Correct.  Registers are designed explicitly to do exactly what you want:
> Provide an easy-to-work-with linear view of complex variables.  I still
> don't understand why you're trying so hard not to use them.  The code is
> already written for you, you just have to turn it on.  The only thing you
> might have to do is make it take a type_size function like nir_lower_io does
> so that you can configure offset units to your backend's liking.
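To make "configure offset units" concrete: the same array access produces different offsets depending on which size callback is plugged in. This is a sketch with a hypothetical `struct mini_type` and hypothetical callbacks `type_size_scalar()` and `type_size_vec4()`. It mirrors the idea of the callback that nir_lower_io takes, not that function's actual signature.

```c
#include <assert.h>

/* Hypothetical stand-in for a GLSL type: a vector of 1..4 float
 * components, optionally an array (array_len == 0 means "not an array"). */
struct mini_type {
   unsigned components;
   unsigned array_len;
};

/* A type_size-style callback: how many offset units a type occupies. */
typedef unsigned (*type_size_fn)(const struct mini_type *t);

/* Units are scalar components (tightly packed), e.g. for a scalar backend. */
static unsigned
type_size_scalar(const struct mini_type *t)
{
   unsigned elems = t->array_len ? t->array_len : 1;
   return elems * t->components;
}

/* Units are vec4 slots: each vector takes a whole slot, however narrow. */
static unsigned
type_size_vec4(const struct mini_type *t)
{
   return t->array_len ? t->array_len : 1;
}

/* Offset of arr[index], expressed in whatever units type_size uses. */
static unsigned
array_elem_offset(const struct mini_type *arr, unsigned index,
                  type_size_fn type_size)
{
   const struct mini_type elem = { arr->components, 0 };

   assert(index < arr->array_len);
   return index * type_size(&elem);
}
```

For `vec3 x[4]`, `x[2]` is at offset 6 with the scalar callback and at offset 2 with the vec4 one; the lowering code itself does not change, only the callback does.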
>
> What I was trying to get across is that the situation in which you can avoid
> cloning by clever use of reference counting is specific to your driver and
> the exact way that you have your lowering passes set up.  If you perturb it even
> a little, you have to do a copy all the time and reference counting isn't
> helping anymore.  You're free to design your entire compiler stack around
> avoiding that one copy if you wish, but I wouldn't recommend it.
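The "avoid cloning via reference counting" trick discussed here amounts to copy-on-write: a pass that wants to mutate the shader clones it only if someone else still holds a reference. A minimal sketch follows, assuming a hypothetical `struct rc_shader` whose "IR" is just an int array; none of these names come from NIR or ir3.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted shader; the "IR" is just an array of ints. */
struct rc_shader {
   unsigned refcnt;
   size_t len;
   int *instrs;
};

static struct rc_shader *
rc_shader_create(const int *instrs, size_t len)
{
   struct rc_shader *s;

   assert(len > 0);
   s = malloc(sizeof(*s));
   if (!s)
      return NULL;
   s->instrs = malloc(len * sizeof(*instrs));
   if (!s->instrs) {
      free(s);
      return NULL;
   }
   memcpy(s->instrs, instrs, len * sizeof(*instrs));
   s->refcnt = 1;
   s->len = len;
   return s;
}

static struct rc_shader *
rc_shader_ref(struct rc_shader *s)
{
   s->refcnt++;
   return s;
}

static void
rc_shader_unref(struct rc_shader *s)
{
   if (--s->refcnt == 0) {
      free(s->instrs);
      free(s);
   }
}

/* Copy-on-write: consumes one reference to s and returns a shader the
 * caller may modify.  Only a sole owner gets to skip the clone.  On
 * allocation failure, returns NULL and leaves s (and its refcnt) as-is. */
static struct rc_shader *
rc_shader_make_mutable(struct rc_shader *s)
{
   struct rc_shader *copy;

   if (s->refcnt == 1)
      return s;
   copy = rc_shader_create(s->instrs, s->len);
   if (copy)
      rc_shader_unref(s);
   return copy;
}
```

The clone is skipped only while nothing else holds a reference. As soon as a "gold copy" is kept alive for later variants, the count is at least 2 whenever a variant is being lowered, so every variant pays for the clone anyway, which is the fragility described above.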
>
>> perhaps..  I was considering switching to registers for arrays.
>> Although it would end up forcing an extra clone in the common case
>> where there would otherwise not be one... a bit of a tough pill to
>> swallow..
>
>
> I don't see why that is such a tough pill.  Copying is cheap.  When we were
> writing the cloning code, we both ran shader-db runs where we were cloning
> after *every* optimization or lowering pass and it still only hurt runtime
> by something like 10 or 20%.  A single clone won't even get noticed.

well, that is encouraging..  although I probably tend a little bit
more to the cpu limited side of things..

> Also, you've already said that you pre-compile for the "common case" of a
> zero shader key so that extra clone gets eaten at compile time where you're
> already doing piles of optimization and lowering.  The case you really care
> about is when that key is non-zero and you have to stop everything and
> recompile in the middle of a draw.  In that case, you have a non-zero key so
> you have to do a clone anyway.
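The flow described here (zero-key variant built up front, everything else built on demand at draw time) can be sketched as a small variant cache. The key fields, `MAX_VARIANTS` and the function names are hypothetical, and the clone/lower/compile step is only counted rather than performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_VARIANTS 8

/* Hypothetical shader key; the all-zero key is the common case. */
struct shader_key {
   bool color_two_side;
   bool flat_shade;
   unsigned ucp_enables;   /* user clip plane bitmask */
};

static bool
key_equal(const struct shader_key *a, const struct shader_key *b)
{
   /* Compare field by field; memcmp() could trip over padding bytes. */
   return a->color_two_side == b->color_two_side &&
          a->flat_shade == b->flat_shade &&
          a->ucp_enables == b->ucp_enables;
}

struct variant_cache {
   struct shader_key keys[MAX_VARIANTS];
   unsigned count;
   unsigned draw_time_clones;  /* clone + lower + compile events at draw time */
};

/* At compile/link time: the zero-key variant is built up front. */
static void
variant_cache_init(struct variant_cache *c)
{
   memset(c, 0, sizeof(*c));   /* keys[0] is now the zero key */
   c->count = 1;
}

/* At draw time: returns the variant index for key, or -1 if full. */
static int
variant_cache_get(struct variant_cache *c, const struct shader_key *key)
{
   for (unsigned i = 0; i < c->count; i++)
      if (key_equal(&c->keys[i], key))
         return (int)i;

   if (c->count == MAX_VARIANTS)
      return -1;

   /* Miss with a non-zero key: this is where the gold NIR would be
    * cloned, lowered for the key and compiled.  Only counted here. */
   c->draw_time_clones++;
   c->keys[c->count] = *key;
   return (int)c->count++;
}
```

Only a cache miss with a non-zero key reaches the clone, and that path already has to copy before lowering, so one more clone there is not an extra cost.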

I guess I could do a clone before the to_regs pass..  but that seems a bit
sub-optimal, and w/ a lower_deref pass (taking a type_size fxn
ptr[1]) I could get basically the one part of registers that I want..

[1] Note that part of my gallium glsl_to_nir branch has started to
de-duplicate all the common type_size implementations.  Mesa st uses
one that is basically the same as the one in i965 (vec4), with the
addition of double support..
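The double support in a vec4-style type_size mostly comes down to one rule: a 64-bit vector with more than two components does not fit in a single 16-byte vec4 slot. The sketch below shows that usual rule for vectors and matrices only (no arrays or structs), with a hypothetical `struct slot_type`; it is not the actual Mesa st or i965 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vector or matrix type: vector_elems
 * components per column (1..4), columns (1 for non-matrices), and
 * whether the components are 64-bit. */
struct slot_type {
   unsigned vector_elems;
   unsigned columns;
   bool is_double;
};

/* vec4-slot size with double support: a vec4 slot is 16 bytes, so a
 * double or dvec2 (8 or 16 bytes) fits in one slot, while a dvec3 or
 * dvec4 (24 or 32 bytes) needs two.  Matrices repeat this per column. */
static unsigned
type_size_vec4_dbl(const struct slot_type *t)
{
   unsigned slots_per_column = 1;

   assert(t->vector_elems >= 1 && t->vector_elems <= 4);
   if (t->is_double && t->vector_elems > 2)
      slots_per_column = 2;
   return t->columns * slots_per_column;
}
```

So a `dmat3` takes 3 columns x 2 slots = 6 slots, while a `mat4` takes 4.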

> At the end of the day, I think we're getting nowhere here.  We have two
> different memory management models that are in conflict.  The ralloc model
> saves us typing and provides some nice safety; refcounting saves you some
> typing and provides you some nice safety.  It's becoming fairly obvious that
> neither side is going to convince the other that their model is better any
> time soon.  I'm open to suggestions on how to proceed.  One option would be
> to have Anholt come in and break the tie.  I'd be ok with that.  In any
> case, we need to solve this one way or another and either commit the patch
> or not.
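For readers weighing the two models: with a ralloc-style hierarchy, everything a variant allocates hangs off one context, and a single free releases all of it without per-object ownership counts. The sketch below shows that mechanism with hypothetical `halloc()`/`hfree()` functions. It is not Mesa's ralloc (whose entry points include `ralloc_context()` and `ralloc_free()`), and it omits destructors, reparenting and strict alignment guarantees.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, minimal hierarchical allocator in the spirit of ralloc
 * (not Mesa's implementation): every block may have a parent, and
 * freeing a block frees everything allocated under it. */
struct hblock {
   struct hblock *parent;
   struct hblock *child;   /* first child */
   struct hblock *next;    /* next sibling */
   struct hblock *prev;    /* previous sibling */
};

static unsigned live_blocks;   /* for demonstration only */

/* Allocate size bytes owned by parent (NULL for a top-level context).
 * The header is four pointers, so the payload keeps pointer alignment;
 * a real allocator would also guarantee max_align_t alignment. */
static void *
halloc(void *parent, size_t size)
{
   struct hblock *b = calloc(1, sizeof(*b) + size);

   if (!b)
      return NULL;
   live_blocks++;
   if (parent) {
      struct hblock *p = (struct hblock *)parent - 1;

      b->parent = p;
      b->next = p->child;
      if (p->child)
         p->child->prev = b;
      p->child = b;
   }
   return b + 1;
}

/* Free a block's whole subtree, children first. */
static void
hfree_tree(struct hblock *b)
{
   while (b->child) {
      struct hblock *c = b->child;

      b->child = c->next;
      hfree_tree(c);
   }
   live_blocks--;
   free(b);
}

/* Free ptr and all of its descendants, unlinking it from its parent. */
static void
hfree(void *ptr)
{
   struct hblock *b;

   if (!ptr)
      return;
   b = (struct hblock *)ptr - 1;
   if (b->prev)
      b->prev->next = b->next;
   else if (b->parent)
      b->parent->child = b->next;
   if (b->next)
      b->next->prev = b->prev;
   hfree_tree(b);
}
```

Freeing a variant allocated under a shader leaves the shader and its other children intact; freeing the shader releases the whole tree in one call, which is the "saves typing" side of the ralloc argument.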

Well, as it stands, on the refcnt'ing side of things, so far I am
outnumbered.  I was planning to re-work my ir3 changes without
refcnt'ing, and then the rest of the gallium glsl_to_nir support,
hopefully sometime in the next few days..

BR,
-R

> --Jason
>
>> > Most of the
>> > existing optimization passes barf on registers for a reason: registers
>> > imply that you've gone from "consumer-agnostic NIR," i.e. what's
>> > produced by gtn (glsl_to_nir) and operated on by generic optimizations, to your own
>> > driver-specific thing, and any optimizations you're going to run are
>> > only to clean up the result of the lowering passes, so you won't need
>> > to run most of them. In the few cases where we do need an optimization
>> > after lowering to registers, we've gone and fixed it up to no-op
>> > things properly, but in general it's a lot easier and less confusing
>> > to say "new optimization passes don't have to deal with registers"
>> > than to make everyone go and add support for registers to their
>> > passes. I'm not saying that adding a "here's my driver-specific
>> > offset" thing to load/store_var would necessarily be a bad idea, but
>> > don't just dismiss registers out-of-hand.
>>
>> Yeah, I'm not a big fan of making lowering/etc passes deal w/
>> registers unnecessarily.  Seems like coming up w/ some way to lower
>> load/store_var deref chains would be easier.
>>
>> BR,
>> -R