[Mesa-dev] [PATCH v3 002/104] nir: Add a deref instruction type

Mon Apr 9 05:35:12 UTC 2018

Rather lively discussion we've got going here...

On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <robdclark at gmail.com> wrote:

> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen
> <bas at basnieuwenhuizen.nl> wrote:
> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <robdclark at gmail.com> wrote:
> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen
> >> <bas at basnieuwenhuizen.nl> wrote:
> >>>>>>>>>>> +
> >>>>>>>>>>> +   /** The mode of the underlying variable */
> >>>>>>>>>>> +   nir_variable_mode mode;
> >>>>>>>>>>
> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of
> nir_print and
> >>>>>>>>>> nir_validate.. for logical addressing we can get the mode from
> the
> >>>>>>>>>> deref_var->var at the start of the chain, and deref->mode has no
> >>>>>>>>>> meaning for physical addressing (where the mode comes from the
> >>>>>>>>>> pointer).
> >>>>>>>>>>
> >>>>>>>>>> So maybe just drop deref->mode?
> >>>>>>>>>
> >>>>>>>>> Isn't it still useful with logical addressing in case a var is
> not
> >>>>>>>>> immediately available? (think VK_KHR_variable_pointers)
>

Yes, the idea here is to basically be the NIR equivalent of storage
classes.  For cases where you can chase it all the way back to the
variable, this is pointless.  For cases where you can't, this can be very
useful.

> >>>>>>>> not sure, maybe this should just also use fat-pointers like
> physical
> >>>>>>>> addressing does??
>

I *think* (without knowing the future with perfect accuracy) that I'd like
us to start moving away from fat pointers in SPIR-V -> NIR.  They've been
working ok thus far but we already have had complaints from the radv guys
that they don't want fat pointers for SLM and I'm not sure there that great
of a fit for other things.  The big reason is that it's actually fairly
painful to change from one fat pointer scheme to another if whatever the
spirv_to_nir authors picked doesn't really fit your hardware.

I'm not 100% sure what to do but the idea I've got at the moment is
something like this:

 1) For every mode (or storage class), the driver provides to spirv_to_nir
a glsl_type which is the type it wants the pointer represented as.
 2) Add a deref_to_pointer intrinsic to convert the deref to that type and
use casts to convert the fat pointer to a deref again

Why the deref_to_pointer intrinsic?  Because right now some annoying
details about nir_intrinsic_info force us to have all drefs have a single
component.  If we didn't, then nir_intrinsic_store_var would have two
variable-size sources which aren't the same size.  We could modify the
rules and have a concept of a "deref source" and to nir_intrinsic_info and
say that a deref source can be any size.  That would also work and may
actually be a better plan.  I'm open to other suggestions as well.

The key point, however, is (1) where we just let the driver choose the
storage type of pointers and then they can have whatever lowering pass they
want.  If that's a "standard" fat-pointers pass in nir_lower_io, so be it.
If they want to translate directly into LLVM derefs, they can do that too,
I suppose.

>>>>>>>>> Also I could see this being useful in physical addressing too to
> avoid
> >>>>>>>>> all passes working with derefs needing to do the constant
> folding?
>

Constant folding is cheap, so I'm not too worried about that.  The bigger
thing that actual derefs gain us is the ability of the compiler to reason
about derefs.  Suppose, for instance, that you had the following:

struct S {
   float a;
   float b;
};

location(set = 0, binding = 0) Block {
   S arr[10];
} foo;

void
my_func(int i)
{
   foo.arr[i].a = make_a_value();
   foo.arr[i].b = make_another_value();
   use_value(foo.arr[i].a);
}

If everything is lowered to fat pointers, the compiler can't very easily
tell that the second line of my_func() didn't stomp foo.arr[i].a and so it
has to re-load the value in the third line.  With derefs, we can easily see
that the second line didn't stomp it and just re-use the re-use the result
of the make_a_value() call.  Yes, you may be able to tell the difference
between foo.arr[i].a and foo.arr[i].b with some fancy analysis and
arithmetic tracking but it'd be very painful to do.  I've given this
substantial thought for almost two years now and haven't actually come up
with a good way to do it without the information provided by derefs.

The biggest thing that I think we will gain from deref instructions isn't
fancy syntax sugar and better nir_printability of pointers.  It also isn't
the removal of explicit of fat pointers in spirv_to_nir (thought that is
also nice).  The biggest thing I'm going for is improving NIR's ability to
optimize UBO, SSBO, and SLM access.  We're pretty good today for local
variables (though there are a couple of known places for improvements) but
we're utterly rubbish for anything that leaves the current shader
invocation.  If we want competent compute performance (which we do!) we
need to solve that problem.

> >>>>>>>> The problem is that you don't necessarily know the type at compile
> >>>>>>>> time (and in the case where you do, you need to do constant
> folding to
> >>>>>>>> figure it out)
> >>>>>>>
> >>>>>>> So I have two considerations here
> >>>>>>>
> >>>>>>> 1) for vulkan you always know the mode, even when you don't know
> the var.
> >>>>>>> 2)  In CL the mode can still get annotated in the source program
> (CL C
> >>>>>>> non-generic pointers) in cases in which we cannot reasonably
> figure it
> >>>>>>> out with just constant folding. In those cases the mode is extra
> >>>>>>> information that you really lose.
> >>>>>>
> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ?
> global_ptr
> >>>>>> : local_ptr)'.. depending on how much we inline all the things, that
> >>>>>> might not get CF'd away.
> >>>
> >>> How does this even work btw? somefxn has a definition, and the
> >>> definition specifies a mode for the argument right? (which is
> >>> implicitly __private if the app does not specify anything?)
> >>
> >> iirc, the cl spec has an example something along these lines..
> >>
> >> it doesn't require *physical* storage for anything where you don't
> >> know what the ptr type is, however.. so fat ptrs in ssa space works
> >> out
> >>
> >>>>>
> >>>>> But something like
> >>>>> __constant int *ptr_value = ...;
> >>>>> store ptr in complex data structure.
> >>>>> __constant int* ptr2 = load from complex data structure.
> >>>>>
> >>>>> Without explicitly annotating ptr2 it is unlikely that constant
> >>>>> folding would find that ptr2 is pointing to __constant address space.
> >>>>> Hence removing the modes loses valuable information that you cannot
> >>>>> get back by constant folding. However, if you have a pointer with
> >>>>> unknown mode, we could have a special mode (or mode_all?) and you can
> >>>>> use the uvec2 representation in that case?
>

I've thought about using a mode of 0 for this or maybe letting you OR all
the possible modes together since nir_variable_mode is, after all, a
bitfield.  I don't have any huge preference at the moment.

> >>>> hmm, I'm not really getting how deref->mode could magically have
> >>>> information that fatptr.y doesn't have.. if the mode is known, vtn
> >>>> could stash it in fatptr.y and everyone is happy?  If vtn doesn't know
> >>>> this, then I don't see how deref->mode helps..
> >>>
> >>> You mean insert it into the fatptr every time deref_cast is called?
> >>>
> >>> Wouldn't that blow up the IR size significantly for very little
> benefit?
> >>
> >> in an easy to clean up way, so meh?
> >
> > We can't clean it up if we want to keep the information. Also nir is
> > pretty slow to compile already, so I'd like not to add a significant
> > number of instruction for very little benefit.
>

Really?  I mean, I can believe it, but do you have any actual numbers to
back this up?  It's considerably faster than some other IRs we have in mesa
though they are known to be pretty big pigs if we're honest.  I'm very open
to any suggestions on how to improve compile times.  If NIR is
fundamentally a pig, we should fix that.

I don't think compile time should be the deciding factor in whether or not
we use fat pointers as I doubt it will have a huge effect.  However, how
much work it takes to get at information is a reasonable argument.

> I guess I'm failing to see what information you'd loose.. or how there
> would be a problem..
>
> I'm not strictly against having nir_var_unknown as a possible value
> for deref->mode, but I'm not really seeing how that helps anything,
> other than adding extra complexity to the IR..
>
>
> >>>>>>
> >>>>>> I think I'm leaning towards using fat ptrs for the vk case, since I
> >>>>>> guess that is a case where you could always expect
> >>>>>> nir_src_as_const_value() to work, to get the variable mode.  If for
> no
> >>>>>> other reason than I guess these deref's, if the var is not known,
> >>>>>> start w/ deref_cast, and it would be ugly for deref_cast to have to
> >>>>>> work differently for compute vs vk.  But maybe Jason already has
> some
> >>>>>> thoughts about it?
>

I don't see why they would be different.  Vulkan and CL are starting to
converge.  In vulkan, you already can't always chase the pointer back to
the variable VK_KHR_variable_pointers is used.  I don't think that CL is as
special as you think it is.  The only thing special about CL is the fact
that global and local are in the same address space.  Beyond that, all the
problems we have to solve are fundamentally the same.

> >>>>> I'd like to avoid fat pointers alltogether on AMD since we would not
> >>>>> use it even for CL. a generic pointer is just a uint64_t for us, with
> >>>>> no bitfield in there for the address space.
>

I tend to agree with this.  (See above discussion).

> >>>>> I think we may need to think a bit more about representation however,
> >>>>> as e.g. for AMD a pointer is typically 64-bits (but we can do e.g.
> >>>>> 32-bits for known workgroup pointers), the current deref instructions
> >>>>> return 32-bit, and you want something like a uvec2 as pointer
> >>>>> representation?
> >>>>
> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a single
> >>>> global address space..  But I guess that is an easy subset of the
> >>>> harder case where drivers need to use different instructions.. so a
> >>>> pretty simple lowering pass run before lower_io could remap things
> >>>> that use fatptrs into something that ignores fatptr.y.  Then opt
> >>>> passes make fatptr.y go away.  So both AMD and hw that doesn't have a
> >>>> flat address space are happy.
> >>>
> >>> But then you run into other issues, like how are you going to stuff a
> >>> 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for Physical64
> >>> addressing? Also this means we have to track to the sources back to
> >>> the cast/var any time we want to do anything at all with any deref
> >>> which seems less efficient to me than just stuffing the deref in
> >>> there.
> >>
> >> so fat ptrs only have to exist in ssa space, not be stored to
> >> something with a physically defined size..
> >
> > how does storing __generic pointers work then? those still need the
> > fat bit for your hw right?
>
> for hw that can't map everything into a single flat address space, you
> *can't* store a fat pointer.. oh well, it doesn't mean that that hw
> can't implement lesser ocl versions (since this is defn not required
> for ocl 1.x and I don't *think* it is required for 2.x, but I haven't
> checked the spec.. I guess if intel supports 2.x then it isn't
> required..)
>

You can still carve up the address space and put local memory at some
absurdly high address and do

if (ptr > 0xffffffffffff0000)
   store_local(ptr & 0xffff)
else
   store_global(ptr)

> >>
> >> As far as tracking things to the head of the chain of deref
> >> instructions, that is a matter of a simple helper or two.  Not like
> >> the chain of deref's is going to be 1000's of instructions..
>

Yes and no (mostly no) once you start throwing in phis, selects, and
loading pointers from variables.  The first two you can explain away by
just saying, "follow the phi/select sources" but the moment you load a
pointer out of a variable, you have to do something.  Sure, you could just
drop .y and stomp it to the known storage class, but that hardly seems
ideal.

> >>> Also, what would the something which ignores fatptr.y be? I'd assume
> >>> that would be the normal deref based stuff, but requiring fatptr
> >>> contradicts that?
> >>
> >> if you have a flat address space, maybe a pass (or option for
> >> lower_io) to just convert everything to load/store_global (since
> >> essentially what these GPUs are doing is remapping shared/private into
> >> the global address space)
> >
> > but I'd like to only do that if we really don't know the mode
> > statically since it is somewhat slower/less flexible. (also radv is
> > unlikely to use nir_lower_io for a lot of this stuff since we can
> > lower derefs into LLVM GEPs directly)
>
> I mean, we control all the src code, we can add more options to lower_io.
>
> >
> > Hence if we want the cases where we know the mode statically we need
> > to not lower the fatptr, but then we have the whole fatptr mess.
> >
>
> well, fatptr mess is unavoidable.. I mostly just fail to see how a
> more general solution (ie. storing the variable mode in ssa) is worse
> than having to deal with both cases (in ssa or in instr).  The in ssa
> case is something that is easy to recover since anything that does
> pointer math has to split out the .y component and fuse it back in
> after the pointer math.  (And if you have a flat address space, I
> guess I'm failing to see why you care about the mode in the first
> place.)
>

Yes, fat pointers are inevitable for some hardware at some stage in the
compile.  That does not, however, mean that we want fat pointers straight
out of SPIR-V.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180408/7ae28fb0/attachment-0001.html>