[Mesa-dev] [PATCH v3 06/19] RFC: nir/vtn: "raw" pointer support

Sun Mar 25 12:18:39 UTC 2018

On Sun, Mar 25, 2018 at 6:35 AM, Karol Herbst <kherbst at redhat.com> wrote:
> On Sun, Mar 25, 2018 at 12:18 AM, Rob Clark <robdclark at gmail.com> wrote:
>> On Fri, Mar 23, 2018 at 5:18 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>>> On Fri, Mar 23, 2018 at 2:15 PM, Karol Herbst <kherbst at redhat.com> wrote:
>>>>
>>>> On Fri, Mar 23, 2018 at 10:07 PM, Jason Ekstrand <jason at jlekstrand.net>
>>>> wrote:
>>>> > +list
>>>> >
>>>> > On Fri, Mar 23, 2018 at 1:45 PM, Karol Herbst <kherbst at redhat.com>
>>>> > wrote:
>>>> >>
>>>> >> On Fri, Mar 23, 2018 at 9:30 PM, Jason Ekstrand <jason at jlekstrand.net>
>>>> >> wrote:
>>>> >> > As I've been rewriting core NIR deref handling, I've been thinking
>>>> >> > about this problem quite a bit.  One objective I have is to actually
>>>> >> > make UBO and SSBO access go through derefs instead of just being an
>>>> >> > offset and index so that the compiler can better reason about them.
>>>> >> > In particular, I want to be able to start doing load/store
>>>> >> > elimination on SSBOs, SLM, and whatever CL has which would be great
>>>> >> > for everyone's compute performance (GL, Vulkan, CL, etc.).
>>>> >> >
>>>> >> > I would be lying if I said I had a full plan but I do have part of a
>>>> >> > plan.  In my patch which adds the deref instructions, I add a new
>>>> >> > "cast" deref type which takes an arbitrary value as its source and
>>>> >> > kicks out a deref with a type.  Whenever we discover that the source
>>>> >> > of the cast is actually another deref which is compatible (same type
>>>> >> > etc.), copy propagation gets rid of the cast for you.  The idea is
>>>> >> > that, instead of doing a load_raw(raw_ptr), you would do a
>>>> >> > load((type *)raw_ptr).
>>>> >> >
>>>> >> > Right now, most of the core NIR optimizations will throw a fit if
>>>> >> > they ever see a cast.  This is intentional because it requires us to
>>>> >> > manually go through and handle casts.  This would mean that, at the
>>>> >> > moment, you would have to lower to load_raw intrinsics almost
>>>> >> > immediately after coming out of SPIR-V.
>>>> >> >
>>>> >>
>>>> >> Well it gets more fun with OpenCL 2.0 where you can have generic
>>>> >> pointers where you only know the type at creation time. You can also
>>>> >> declare generic pointers as function inputs in a way that you never
>>>> >> actually know from where you have to load if you only have that one
>>>> >> function. So the actual load operation depends on when you create the
>>>> >> initial pointer variable (you can cast from X to generic, but not the
>>>> >> other way around).
>>>> >>
>>>> >> Which in the end means you can end up with load(generic_ptr) and only
>>>> >> by following the chain up to its creation (with function inlining in
>>>> >> mind) do you know the actual memory target.
>>>> >
>>>> >
>>>> > Yup.  And there will always be crazy cases where you can't actually
>>>> > follow it and you have to emit a pile of code to load different ways
>>>> > depending on some bits somewhere that tell you how to load it.  I'm
>>>> > well aware of the insanity. :-)  This is part of the reason why I'm
>>>> > glad I'm not trying to write an OpenCL 2.0 driver.
>>>> >
>>>> > This insanity is exactly why I'm suggesting the pointer casting.  Sure,
>>>> > you may not know the data type until the actual load.  In that case,
>>>> > you end up with the cast being right before the load.  If you don't
>>>> > know the storage class, maybe you have to switch and do multiple casts
>>>> > based on some bits.  Alternatively, if you don't know the storage
>>>> > class, we can just let the deref mode be 0 for "I don't know", or
>>>> > maybe multiple bits for "these are the things it might be".  In any
>>>> > case, I think we can handle it.
>>>> >
>>>>
>>>> there shouldn't be a situation where we don't know, except when you
>>>> don't inline all functions. I think Rob had the idea of fat pointers
>>>> where a pointer is a vec2 and the 2nd component contains the actual
>>>> pointer type and you end up with a switch over the type to get the
>>>> correct storage class. And if the compiler inlines all functions, it
>>>> should be able to optimize that switch away.
>>>
>>>
>>> Right.  Today, we live in a world where all functions are inlined.  Sadly, I
>>> fear that world may come to an end one of these days. :(
>>>
>>
>> fwiw, so far I'm mostly caring about the inline-all-the-fxns case..
>>
>> for the cases where we don't know what sort of pointer we have, Karol
>> (iirc?) suggested name-mangling functions, which seems semi-sane.. but
>> I've mostly tried to ignore that for now until we have more basic
>> things working.
>>
>> Possibly we need a compiler option to lower everything to
>> load/store_global (or maybe "raw" is a better name?) for hw that can
>> remap local memory into a single address space and use the same
>> load/store instructions.  I think that should be at least enough to
>> move forward with nv hw + fxn calls.  Less so for intel/adreno but
>> from my PoV I'm willing to solve that problem later.
>>
>
> I don't think this works out, because it isn't only about local vs
> global. We also have private memory pointers you can assign to generic
> pointers. And I am sure most compilers will use registers for private
> memory if they can.
>
> Private memory pointers are used if you, for example, take the address
> of a stack variable.
>

I don't necessarily see why it wouldn't work if you could lower the
called fxn into a different version for a private pointer.

I'm a bit undecided as to how to do private pointers in general, at
least in cases where it can't all be converted to SSA, since I can't
really do 8/16b values packed in registers (without extra
shifts/masks).  And fxn call ABI might be something we want to be
driver specific in this case (ie. I think nvir wouldn't use registers
but would use private mem, where ir3 might want to use registers plus
indirect register access to have more flexibility with register
allocation, etc).  But however pointers to private mem are
implemented, having a version of the called fxn which is compiled for
pointer to private mem is not a problem.
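
To make the "different version per pointer kind" idea a bit more
concrete, here is a rough sketch of how variant names could be mangled
from the address space of each generic pointer param.  Nothing here is
existing NIR API; the enum, the suffix letters and mangle_variant() are
all made up for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical address spaces a generic pointer param could resolve to. */
enum addr_space { AS_GLOBAL, AS_LOCAL, AS_PRIVATE, AS_COUNT };

/* Hypothetical one-letter suffix per address space. */
static const char as_suffix[AS_COUNT] = { 'g', 'l', 'p' };

/* Write "<name>__<one letter per generic pointer param>" into out.
 * Returns 0 on success, -1 if out is too small.
 */
static int
mangle_variant(const char *name, const enum addr_space *params,
               unsigned num_params, char *out, size_t out_size)
{
   size_t len = strlen(name);

   /* name + "__" + one char per param + NUL terminator */
   if (len + 2 + num_params + 1 > out_size)
      return -1;

   memcpy(out, name, len);
   out[len++] = '_';
   out[len++] = '_';
   for (unsigned i = 0; i < num_params; i++)
      out[len++] = as_suffix[params[i]];
   out[len] = '\0';
   return 0;
}
```

So a call to foo(global ptr, private ptr) would resolve to a variant
named "foo__gp", compiled with private-mem loads/stores for the 2nd
param, however the driver ends up implementing those.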

The bigger problem seems to me to be combinatorial explosion of fxn
variants for N different generic pointer params.  But for hw that
can't map global/local/private into a single address space, I don't
see any real alternative (other than just inlining everything).  We
might want to give the driver more control over the decision about
which functions to inline... maybe some sort of callback fxn that took
the nir_function plus # of call-sites and returned true/false?
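
Something like the following is what I have in mind, purely as a
sketch.  struct fxn_info is a stand-in for whatever the driver would
actually be handed (it is not the real nir_function), and the callback
type, the heuristic and the 256 threshold are invented for illustration.
It also shows the variant blow-up: with 3 address spaces and N generic
pointer params you can need up to 3^N variants:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for what the driver would see; NOT the real nir_function. */
struct fxn_info {
   unsigned num_instrs;          /* rough size of the function body */
   unsigned num_generic_params;  /* N generic pointer params */
};

#define NUM_ADDR_SPACES 3  /* global, local, private */

/* Worst-case number of variants: NUM_ADDR_SPACES^N, saturating on overflow. */
static uint64_t
max_variants(unsigned num_generic_params)
{
   uint64_t n = 1;
   for (unsigned i = 0; i < num_generic_params; i++) {
      if (n > UINT64_MAX / NUM_ADDR_SPACES)
         return UINT64_MAX;
      n *= NUM_ADDR_SPACES;
   }
   return n;
}

/* Hypothetical driver callback: return true to inline the function. */
typedef bool (*should_inline_cb)(const struct fxn_info *fxn,
                                 unsigned num_call_sites);

/* Example policy; the threshold is arbitrary. */
static bool
example_should_inline(const struct fxn_info *fxn, unsigned num_call_sites)
{
   /* A single call site means inlining duplicates nothing. */
   if (num_call_sites <= 1)
      return true;

   /* If we could need more variants than there are call sites, keeping
    * the call can't win on code size, so just inline.
    */
   if (max_variants(fxn->num_generic_params) > num_call_sites)
      return true;

   /* Otherwise only inline when the total duplicated size stays small. */
   return (uint64_t)fxn->num_instrs * num_call_sites <= 256;
}
```

A driver for hw with a single address space could just ignore
num_generic_params and decide purely on size.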

BR,
-R