[Mesa-dev] [PATCH v3 002/104] nir: Add a deref instruction type

Rob Clark robdclark at gmail.com
Mon Apr 9 14:53:46 UTC 2018


On Mon, Apr 9, 2018 at 10:25 AM, Jason Ekstrand <jason at jlekstrand.net> wrote:
> On Mon, Apr 9, 2018 at 5:35 AM, Rob Clark <robdclark at gmail.com> wrote:
>>
>> On Mon, Apr 9, 2018 at 1:35 AM, Jason Ekstrand <jason at jlekstrand.net>
>> wrote:
>> > Rather lively discussion we've got going here...
>> >
>> > On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <robdclark at gmail.com> wrote:
>> >>
>> >> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen
>> >> <bas at basnieuwenhuizen.nl> wrote:
>> >> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <robdclark at gmail.com>
>> >> > wrote:
>> >> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen
>> >> >> <bas at basnieuwenhuizen.nl> wrote:
>> >> >>>>>>>>>>> +
>> >> >>>>>>>>>>> +   /** The mode of the underlying variable */
>> >> >>>>>>>>>>> +   nir_variable_mode mode;
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of
>> >> >>>>>>>>>> nir_print and
>> >> >>>>>>>>>> nir_validate.. for logical addressing we can get the mode
>> >> >>>>>>>>>> from
>> >> >>>>>>>>>> the
>> >> >>>>>>>>>> deref_var->var at the start of the chain, and deref->mode
>> >> >>>>>>>>>> has
>> >> >>>>>>>>>> no
>> >> >>>>>>>>>> meaning for physical addressing (where the mode comes from
>> >> >>>>>>>>>> the
>> >> >>>>>>>>>> pointer).
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> So maybe just drop deref->mode?
>> >> >>>>>>>>>
>> >> >>>>>>>>> Isn't it still useful with logical addressing in case a var
>> >> >>>>>>>>> is
>> >> >>>>>>>>> not
>> >> >>>>>>>>> immediately available? (think VK_KHR_variable_pointers)
>> >
>> >
>> > Yes, the idea here is to basically be the NIR equivalent of storage
>> > classes.
>> > For cases where you can chase it all the way back to the variable, this
>> > is
>> > pointless.  For cases where you can't, this can be very useful.
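>> >
>> > To make that concrete, here's a rough sketch (made-up names, not the
>> > actual helpers in this series) of recovering the mode by chasing a
>> > chain back to its variable under logical addressing.  The moment the
>> > chain bottoms out in a cast instead of a deref_var (variable pointers,
>> > physical addressing), there's no variable to ask, and deref->mode is
>> > the only place the storage class survives:
>> >
>> > struct variable;
>> >
>> > typedef enum { DEREF_VAR, DEREF_ARRAY, DEREF_STRUCT, DEREF_CAST } deref_type;
>> >
>> > struct deref {
>> >    deref_type type;
>> >    struct deref *parent;   /* NULL for deref_var and deref_cast */
>> >    struct variable *var;   /* only valid when type == DEREF_VAR */
>> >    unsigned mode;          /* nir_variable_mode bitfield */
>> > };
>> >
>> > /* Returns NULL when the chain starts at a cast, in which case
>> >  * deref->mode is all the information we have. */
>> > static struct variable *
>> > chase_to_var(struct deref *d)
>> > {
>> >    while (d->parent)
>> >       d = d->parent;
>> >    return d->type == DEREF_VAR ? d->var : NULL;
>> > }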
>> >
>> >>
>> >> >>>>>>>> not sure, maybe this should just also use fat-pointers like
>> >> >>>>>>>> physical
>> >> >>>>>>>> addressing does??
>> >
>> >
>> > I *think* (without knowing the future with perfect accuracy) that I'd
>> > like
>> > us to start moving away from fat pointers in SPIR-V -> NIR.  They've
>> > been
>> > working ok thus far but we already have had complaints from the radv
>> > guys
>> > that they don't want fat pointers for SLM and I'm not sure they're that
>> > great
>> > of a fit for other things.  The big reason is that it's actually fairly
>> > painful to change from one fat pointer scheme to another if whatever the
>> > spirv_to_nir authors picked doesn't really fit your hardware.
>> >
>> > I'm not 100% sure what to do but the idea I've got at the moment is
>> > something like this:
>> >
>> >  1) For every mode (or storage class), the driver provides to
>> > spirv_to_nir a
>> > glsl_type which is the type it wants the pointer represented as.
>> >  2) Add a deref_to_pointer intrinsic to convert the deref to that type
>> > and
>> > use casts to convert the fat pointer to a deref again
>> >
>> > Why the deref_to_pointer intrinsic?  Because right now some annoying
>> > details
>> > about nir_intrinsic_info force us to have all derefs have a single
>> > component.
>> > If we didn't, then nir_intrinsic_store_var would have two variable-size
>> > sources which aren't the same size.  We could modify the rules, add a
>> > concept of a "deref source" to nir_intrinsic_info, and say that a
>> > deref
>> > source can be any size.  That would also work and may actually be a
>> > better
>> > plan.  I'm open to other suggestions as well.
>> >
>> > The key point, however, is (1) where we just let the driver choose the
>> > storage type of pointers and then they can have whatever lowering pass
>> > they
>> > want.  If that's a "standard" fat-pointers pass in nir_lower_io, so be
>> > it.
>> > If they want to translate directly into LLVM derefs, they can do that
>> > too, I
>> > suppose.
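>> >
>> > As a sketch of the shape I have in mind (nothing here is committed API,
>> > and the struct/field names are invented for illustration): the driver
>> > hands spirv_to_nir a storage type per storage class, derefs get
>> > flattened into that type whenever they need to live in an SSA value or
>> > in memory, and a cast turns the value back into a deref so the driver's
>> > own lowering pass decides what the bits actually mean:
>> >
>> > struct glsl_type;
>> >
>> > struct spirv_to_nir_pointer_options {
>> >    const struct glsl_type *ssbo_ptr_type;    /* e.g. uvec2 (index, offset) */
>> >    const struct glsl_type *shared_ptr_type;  /* e.g. uint */
>> >    const struct glsl_type *global_ptr_type;  /* e.g. uint64_t */
>> > };
>> >
>> > /* storing a pointer:  ssa   = deref_to_pointer(deref)     -> driver's type
>> >  * loading it back:    deref = deref_cast(ssa, mode, pointee_type)        */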
>> >
>> >> >>>>>>>>> Also I could see this being useful in physical addressing too
>> >> >>>>>>>>> to
>> >> >>>>>>>>> avoid
>> >> >>>>>>>>> all passes working with derefs needing to do the constant
>> >> >>>>>>>>> folding?
>> >
>> >
>> > Constant folding is cheap, so I'm not too worried about that.  The
>> > bigger
>> > thing that actual derefs gain us is the ability of the compiler to
>> > reason
>> > about derefs.  Suppose, for instance, that you had the following:
>> >
>> > struct S {
>> >    float a;
>> >    float b;
>> > };
>> >
>> > layout(set = 0, binding = 0) buffer Block {
>> >    S arr[10];
>> > } foo;
>> >
>> > void
>> > my_func(int i)
>> > {
>> >    foo.arr[i].a = make_a_value();
>> >    foo.arr[i].b = make_another_value();
>> >    use_value(foo.arr[i].a);
>> > }
>> >
>> > If everything is lowered to fat pointers, the compiler can't very easily
>> > tell that the second line of my_func() didn't stomp foo.arr[i].a and so
>> > it
>> > has to re-load the value in the third line.  With derefs, we can easily
>> > see
>> > that the second line didn't stomp it and just re-use the
>> > result
>> > of the make_a_value() call.  Yes, you may be able to tell the difference
>> > between foo.arr[i].a and foo.arr[i].b with some fancy analysis and
>> > arithmetic tracking but it'd be very painful to do.  I've given this
>> > substantial thought for almost two years now and haven't actually come
>> > up
>> > with a good way to do it without the information provided by derefs.
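>> >
>> > To spell out why derefs make this easy (sketch only, hypothetical field
>> > names rather than a real pass): the last links of the two chains are
>> > struct derefs with the same parent (foo.arr[i]) but different members,
>> > so they can never overlap, no matter what 'i' is:
>> >
>> > #include <stdbool.h>
>> >
>> > struct deref;
>> >
>> > struct deref_struct {
>> >    const struct deref *parent;   /* the foo.arr[i] part of the chain */
>> >    unsigned member;              /* .a is 0, .b is 1 */
>> > };
>> >
>> > static bool
>> > must_not_alias(const struct deref_struct *a, const struct deref_struct *b)
>> > {
>> >    /* same parent deref, different struct member -> disjoint memory */
>> >    return a->parent == b->parent && a->member != b->member;
>> > }
>> >
>> > With fat pointers the same question becomes "can base + i*8 + 0 ever
>> > equal base + i*8 + 4?", which drags in arithmetic reasoning.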
>> >
>> > The biggest thing that I think we will gain from deref instructions
>> > isn't
>> > fancy syntax sugar and better nir_printability of pointers.  It also
>> > isn't
>> > the removal of explicit fat pointers in spirv_to_nir (though that is
>> > also nice).  The biggest thing I'm going for is improving NIR's ability
>> > to
>> > optimize UBO, SSBO, and SLM access.  We're pretty good today for local
>> > variables (though there are a couple of known places for improvements)
>> > but
>> > we're utterly rubbish for anything that leaves the current shader
>> > invocation.  If we want competent compute performance (which we do!) we
>> > need
>> > to solve that problem.
>> >
>> >>
>> >> >>>>>>>> The problem is that you don't necessarily know the type at
>> >> >>>>>>>> compile
>> >> >>>>>>>> time (and in the case where you do, you need to do constant
>> >> >>>>>>>> folding to
>> >> >>>>>>>> figure it out)
>> >> >>>>>>>
>> >> >>>>>>> So I have two considerations here
>> >> >>>>>>>
>> >> >>>>>>> 1) for vulkan you always know the mode, even when you don't
>> >> >>>>>>> know
>> >> >>>>>>> the var.
>> >> >>>>>>> 2)  In CL the mode can still get annotated in the source
>> >> >>>>>>> program
>> >> >>>>>>> (CL C
>> >> >>>>>>> non-generic pointers) in cases in which we cannot reasonably
>> >> >>>>>>> figure it
>> >> >>>>>>> out with just constant folding. In those cases the mode is
>> >> >>>>>>> extra
>> >> >>>>>>> information that you really lose.
>> >> >>>>>>
>> >> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ?
>> >> >>>>>> global_ptr
>> >> >>>>>> : local_ptr)'.. depending on how much we inline all the things,
>> >> >>>>>> that
>> >> >>>>>> might not get CF'd away.
>> >> >>>
>> >> >>> How does this even work btw? somefxn has a definition, and the
>> >> >>> definition specifies a mode for the argument right? (which is
>> >> >>> implicitly __private if the app does not specify anything?)
>> >> >>
>> >> >> iirc, the cl spec has an example something along these lines..
>> >> >>
>> >> >> it doesn't require *physical* storage for anything where you don't
>> >> >> know what the ptr type is, however.. so fat ptrs in ssa space works
>> >> >> out
>> >> >>
>> >> >>>>>
>> >> >>>>> But something like
>> >> >>>>> __constant int *ptr_value = ...;
>> >> >>>>> store ptr in complex data structure.
>> >> >>>>> __constant int* ptr2 = load from complex data structure.
>> >> >>>>>
>> >> >>>>> Without explicitly annotating ptr2 it is unlikely that constant
>> >> >>>>> folding would find that ptr2 is pointing to __constant address
>> >> >>>>> space.
>> >> >>>>> Hence removing the modes loses valuable information that you
>> >> >>>>> cannot
>> >> >>>>> get back by constant folding. However, if you have a pointer with
>> >> >>>>> unknown mode, we could have a special mode (or mode_all?) and you
>> >> >>>>> can
>> >> >>>>> use the uvec2 representation in that case?
>> >
>> >
>> > I've thought about using a mode of 0 for this or maybe letting you OR
>> > all
>> > the possible modes together since nir_variable_mode is, after all, a
>> > bitfield.  I don't have any huge preference at the moment.
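>> >
>> > Something like this (sketch only), given that nir_variable_mode is a
>> > bitfield:
>> >
>> >    /* could point at either an SSBO or workgroup memory */
>> >    deref->mode = nir_var_shader_storage | nir_var_shared;
>> >    /* and 0 (or the OR of every mode) for "no idea at all" */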
>> >
>> >>
>> >> >>>> hmm, I'm not really getting how deref->mode could magically have
>> >> >>>> information that fatptr.y doesn't have.. if the mode is known, vtn
>> >> >>>> could stash it in fatptr.y and everyone is happy?  If vtn doesn't
>> >> >>>> know
>> >> >>>> this, then I don't see how deref->mode helps..
>> >> >>>
>> >> >>> You mean insert it into the fatptr every time deref_cast is called?
>> >> >>>
>> >> >>> Wouldn't that blow up the IR size significantly for very little
>> >> >>> benefit?
>> >> >>
>> >> >> in an easy to clean up way, so meh?
>> >> >
>> >> > We can't clean it up if we want to keep the information. Also nir is
>> >> > pretty slow to compile already, so I'd like not to add a significant
>> >> > number of instructions for very little benefit.
>> >
>> >
>> > Really?  I mean, I can believe it, but do you have any actual numbers to
>> > back this up?  It's considerably faster than some other IRs we have in
>> > mesa
>> > though they are known to be pretty big pigs if we're honest.  I'm very
>> > open
>> > to any suggestions on how to improve compile times.  If NIR is
>> > fundamentally
>> > a pig, we should fix that.
>> >
>> > I don't think compile time should be the deciding factor in whether or
>> > not
>> > we use fat pointers as I doubt it will have a huge effect.  However, how
>> > much work it takes to get at information is a reasonable argument.
>> >
>> >>
>> >> I guess I'm failing to see what information you'd lose.. or how there
>> >> would be a problem..
>> >>
>> >> I'm not strictly against having nir_var_unknown as a possible value
>> >> for deref->mode, but I'm not really seeing how that helps anything,
>> >> other than adding extra complexity to the IR..
>> >>
>> >>
>> >> >>>>>>
>> >> >>>>>> I think I'm leaning towards using fat ptrs for the vk case,
>> >> >>>>>> since I
>> >> >>>>>> guess that is a case where you could always expect
>> >> >>>>>> nir_src_as_const_value() to work, to get the variable mode.  If
>> >> >>>>>> for
>> >> >>>>>> no
>> >> >>>>>> other reason than I guess these deref's, if the var is not
>> >> >>>>>> known,
>> >> >>>>>> start w/ deref_cast, and it would be ugly for deref_cast to have
>> >> >>>>>> to
>> >> >>>>>> work differently for compute vs vk.  But maybe Jason already has
>> >> >>>>>> some
>> >> >>>>>> thoughts about it?
>> >
>> >
>> > I don't see why they would be different.  Vulkan and CL are starting to
>> > converge.  In vulkan, you already can't always chase the pointer back to
>> > the
>> > variable when VK_KHR_variable_pointers is used.  I don't think that CL is as
>> > special as you think it is.  The only thing special about CL is the fact
>> > that global and local are in the same address space.  Beyond that, all
>> > the
>> > problems we have to solve are fundamentally the same.
>> >
>> >>
>> >> >>>>> I'd like to avoid fat pointers altogether on AMD since we would
>> >> >>>>> not
>> >> >>>>> use them even for CL.  A generic pointer is just a uint64_t for us,
>> >> >>>>> with
>> >> >>>>> no bitfield in there for the address space.
>> >
>> >
>> > I tend to agree with this.  (See above discussion).
>> >
>> >>
>> >> >>>>> I think we may need to think a bit more about representation
>> >> >>>>> however,
>> >> >>>>> as e.g. for AMD a pointer is typically 64-bits (but we can do
>> >> >>>>> e.g.
>> >> >>>>> 32-bits for known workgroup pointers), the current deref
>> >> >>>>> instructions
>> >> >>>>> return 32-bit, and you want something like a uvec2 as pointer
>> >> >>>>> representation?
>> >> >>>>
>> >> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a
>> >> >>>> single
>> >> >>>> global address space..  But I guess that is an easy subset of the
>> >> >>>> harder case where drivers need to use different instructions.. so
>> >> >>>> a
>> >> >>>> pretty simple lowering pass run before lower_io could remap things
>> >> >>>> that use fatptrs into something that ignores fatptr.y.  Then opt
>> >> >>>> passes make fatptr.y go away.  So both AMD and hw that doesn't
>> >> >>>> have a
>> >> >>>> flat address space are happy.
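>> >> >>>>
>> >> >>>> rough sketch of what I mean (constants and names are made up, just
>> >> >>>> the shape of it): the pass folds the mode into the address once,
>> >> >>>> here via a per-mode aperture base, so every later load/store only
>> >> >>>> needs fatptr.x and fatptr.y becomes dead:
>> >> >>>>
>> >> >>>> #include <stdint.h>
>> >> >>>>
>> >> >>>> enum mode { MODE_GLOBAL, MODE_SHARED, MODE_PRIVATE };
>> >> >>>>
>> >> >>>> static uint64_t
>> >> >>>> flatten(uint64_t addr /* fatptr.x */, enum mode m /* fatptr.y */)
>> >> >>>> {
>> >> >>>>    static const uint64_t aperture_base[] = {
>> >> >>>>       [MODE_GLOBAL]  = 0,
>> >> >>>>       [MODE_SHARED]  = 0xffff800000000000ull,   /* made up */
>> >> >>>>       [MODE_PRIVATE] = 0xffffc00000000000ull,   /* made up */
>> >> >>>>    };
>> >> >>>>    return aperture_base[m] + addr;
>> >> >>>> }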
>> >> >>>
>> >> >>> But then you run into other issues, like how are you going to stuff
>> >> >>> a
>> >> >>> 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for
>> >> >>> Physical64
>> >> >>> addressing? Also this means we have to track the sources back to
>> >> >>> the cast/var any time we want to do anything at all with any deref
>> >> >>> which seems less efficient to me than just stuffing the deref in
>> >> >>> there.
>> >> >>
>> >> >> so fat ptrs only have to exist in ssa space, not be stored to
>> >> >> something with a physically defined size..
>> >> >
>> >> > how does storing __generic pointers work then? those still need the
>> >> > fat bit for your hw right?
>> >>
>> >> for hw that can't map everything into a single flat address space, you
>> >> *can't* store a fat pointer.. oh well, it doesn't mean that that hw
>> >> can't implement lesser ocl versions (since this is definitely not required
>> >> for ocl 1.x and I don't *think* it is required for 2.x, but I haven't
>> >> checked the spec.. I guess if intel supports 2.x then it isn't
>> >> required..)
>> >
>> >
>> > You can still carve up the address space and put local memory at some
>> > absurdly high address and do
>> >
>> > if (ptr > 0xffffffffffff0000)
>> >    store_local(ptr & 0xffff)
>> > else
>> >    store_global(ptr)
>> >
>>
>> That might be an option if we need to physically store a fat pointer
>> (but I'm not convinced we need to).  But as a general solution it is a
>> horrible idea.  You either lose information (since when was TMI a
>> problem for compilers?) or you get to teach constant folding that even
>> though it doesn't know the value of 'a' it does know the value of 'a &
>> 0xffffffffffff0000', which is insanity.
>>
>> And simply not optimizing away the global/local/private turns every
>> load/store into something that looks like:
>>
>>    cmps.f.lt p0.x, c4.x, r0.z
>>    ; delay 6 instructions so branch can see the value in p0.x
>>    (rpt5)nop
>>    br !p0.x, #3
>>    ldg.u32 r0.w, r0.y, 1
>>    jump #11
>>    (jp)cmps.f.lt p0.x, c4.y, r0.z
>>    (rpt5)nop   ; delay 6 instruction slots
>>    br !p0.x, #8
>>    ldl.u32 r0.w, r0.y, 1
>>    jump #3
>>    ; private is really just global with a driver provided buffer
>>    ; and compiler computed offset
>>    (jp)add.s r0.y, c10.x   ; add base address of buffer
>>    (rpt3)nop               ; alu results not immediately avail and no
>>                            ; other instr to fill the empty slots
>>    add.s r0.y, ... offset calculated from local_id
>>    (rpt5)nop               ; 6 slots for result from alu avail to mem
>>                            ; instructions
>>    ldg.u32 r0.w, r0.y, 1
>>    (jp)(sy)... and now we have a result..
>>
>>
>> which is simply not an option!  And the result is actually worse since
>> we end up with 64b pointer math lowered to 32b instructions!
>>
>> tbh, I *really* don't see the problem with fat pointers, but if
>> someone else doesn't want deref_cast to use fat pointers, then I must
>> insist on a different intrinsic which does, and a compiler option for
>> vtn to choose which to emit.
>
>
> I'm sorry.  That comment was NOT intended as a suggestion for universal
> alternative to fat pointers.  It was merely to demonstrate how an
> architecture like Intel's, which uses different messages for local vs. global,
> could implement physical 64-bit (uint64_t) generic pointers.  Someone (looks
> like Rob from the indentation) made a comment about difficulties
> implementing OpenCL 2.x and I was just trying to show that it could be done.
> Of course, stuffing the extra information in a vector component is easier.
>
> This was a side comment to what I thought was a side comment in the
> discussion.
>

oh, my bad, I misinterpreted ;-)

yeah, this makes sense for the special case where you need storage, at
least for the 64b memory model.

BR,
-R

