[Mesa-dev] [PATCH v3 06/19] RFC: nir/vtn: "raw" pointer support

Sun Mar 25 13:31:18 UTC 2018

On Sun, Mar 25, 2018 at 2:18 PM, Rob Clark <robdclark at gmail.com> wrote:
> On Sun, Mar 25, 2018 at 6:35 AM, Karol Herbst <kherbst at redhat.com> wrote:
>> On Sun, Mar 25, 2018 at 12:18 AM, Rob Clark <robdclark at gmail.com> wrote:
>>> On Fri, Mar 23, 2018 at 5:18 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>>>> On Fri, Mar 23, 2018 at 2:15 PM, Karol Herbst <kherbst at redhat.com> wrote:
>>>>>
>>>>> On Fri, Mar 23, 2018 at 10:07 PM, Jason Ekstrand <jason at jlekstrand.net>
>>>>> wrote:
>>>>> > +list
>>>>> >
>>>>> > On Fri, Mar 23, 2018 at 1:45 PM, Karol Herbst <kherbst at redhat.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> On Fri, Mar 23, 2018 at 9:30 PM, Jason Ekstrand <jason at jlekstrand.net>
>>>>> >> wrote:
>>>>> >> > As I've been rewriting core NIR deref handling, I've been thinking about
>>>>> >> > this problem quite a bit.  One objective I have is to actually make UBO and
>>>>> >> > SSBO access go through derefs instead of just being an offset and index so
>>>>> >> > that the compiler can better reason about them.  In particular, I want to be
>>>>> >> > able to start doing load/store elimination on SSBOs, SLM, and whatever CL
>>>>> >> > has, which would be great for everyone's compute performance (GL, Vulkan, CL,
>>>>> >> > etc.).
>>>>> >> >
>>>>> >> > I would be lying if I said I had a full plan but I do have part of a plan.
>>>>> >> > In my patch which adds the deref instructions, I add a new "cast" deref type
>>>>> >> > which takes an arbitrary value as its source and kicks out a deref with a
>>>>> >> > type.  Whenever we discover that the source of the cast is actually another
>>>>> >> > deref which is compatible (same type etc.), copy propagation gets rid of the
>>>>> >> > cast for you.  The idea is that, instead of doing a load_raw(raw_ptr), you
>>>>> >> > would do a load((type *)raw_ptr).
>>>>> >> >
>>>>> >> > Right now, most of the core NIR optimizations will throw a fit if they ever
>>>>> >> > see a cast.  This is intentional because it requires us to manually go
>>>>> >> > through and handle casts.  This would mean that, at the moment, you would
>>>>> >> > have to lower to load_raw intrinsics almost immediately after coming out of
>>>>> >> > SPIR-V.
>>>>> >> >
>>>>> >>
>>>>> >> Well it gets more fun with OpenCL 2.0 where you can have generic
>>>>> >> pointers where you only know the type at creation time. You can also
>>>>> >> declare generic pointers as function inputs in a way, that you never
>>>>> >> actually know from where you have to load if you only have that one
>>>>> >> function. So the actual load operation depends on when you create the
>>>>> >> initial pointer variable (you can cast from X to generic, but not the
>>>>> >> other way around).
>>>>> >>
>>>>> >> Which in the end means you can end up with load(generic_ptr) and only
>>>>> >> following the chain up to its creation (with function inlining in
>>>>> >> mind) you know the actual memory target.
>>>>> >
>>>>> >
>>>>> > Yup.  And there will always be crazy cases where you can't actually follow
>>>>> > it and you have to emit a pile of code to load different ways depending on
>>>>> > some bits somewhere that tell you how to load it.  I'm well aware of the
>>>>> > insanity. :-)  This is part of the reason why I'm glad I'm not trying to
>>>>> > write an OpenCL 2.0 driver.
>>>>> >
>>>>> > This insanity is exactly why I'm suggesting the pointer casting.  Sure, you
>>>>> > may not know the data type until the actual load.  In that case, you end up
>>>>> > with the cast being right before the load.  If you don't know the storage
>>>>> > class, maybe you have to switch and do multiple casts based on some bits.
>>>>> > Alternatively, if you don't know the storage class, we can just let the
>>>>> > deref mode be 0 for "I don't know", or maybe multiple bits for "these are
>>>>> > the things it might be".  In any case, I think we can handle it.
>>>>> >
>>>>>
>>>>> there shouldn't be a situation where we don't know, except when you
>>>>> don't inline all functions. I think Rob had the idea of fat pointers
>>>>> where a pointer is a vec2 and the 2nd component contains the actual
>>>>> pointer type and you end up with a switch over the type to get the
>>>>> correct storage class. And if the compiler inlines all functions, it
>>>>> should be able to optimize that switch away.
>>>>
>>>>
>>>> Right.  Today, we live in a world where all functions are inlined.  Sadly, I
>>>> fear that world may come to an end one of these days. :(
>>>>
>>>
>>> fwiw, so far I'm mostly caring about the inline-all-the-fxns case..
>>>
>>> for the cases where we don't know what sort of pointer we have, Karol
>>> (iirc?) suggested name-mangling functions, which seems semi-sane.. but
>>> I've mostly tried to ignore that for now until we have more basic
>>> things working.
>>>
>>> Possibly we need a compiler option to lower everything to
>>> load/store_global (or maybe "raw" is a better name?) for hw that can
>>> remap local memory into a single address space and use the same
>>> load/store instructions.  I think that should be at least enough to
>>> move forward with nv hw + fxn calls.  Less so for intel/adreno but
>>> from my PoV I'm willing to solve that problem later.
>>>
>>
>> I don't think this works out, because it isn't only about local vs
>> global. We also have private memory pointers you can assign to generic
>> pointers. And I am sure most compilers will use registers for private
>> memory if they can.
>>
>> private memory pointers are used if you for example get the pointer of
>> a stack variable.
>>
>
> I don't necessarily see why it wouldn't work if you could lower the
> called fxn into a different version for a private pointer
>

I meant the part about lowering everything to global memory.

> I'm a bit undecided as to how to do private pointers in general, at
> least in cases where it can't all be converted to SSA, since I can't
> really do 8/16b values packed in registers (without extra
> shifts/masks).  And fxn call ABI might be something we want to be
> driver specific in this case (ie. I think nvir wouldn't use registers
> but would use private mem, where ir3 might want to use registers plus
> indirect register access to have more flexibility with register
> allocation, etc).  But however pointers to private mem are
> implemented, having a version of the called fxn which is compiled for
> pointer to private mem is not a problem.
>

We kind of have the same problem with arrays, no? I am sure we could
just rely on what we have for those. And requiring a specialized
function for each pointer type it is called with sounds reasonable anyway.
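
To make the "one function per pointer type" idea a bit more concrete,
here is a rough sketch of how variants could be named and counted. The
enum, the one-letter suffixes and the helper names are all made up for
illustration; nothing here is existing NIR or clover code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical address-space tags, not taken from NIR or SPIR-V. */
enum addr_space { AS_GLOBAL, AS_LOCAL, AS_PRIVATE, AS_COUNT };

/* Hypothetical one-letter mangling suffix per address space. */
static char space_suffix(enum addr_space s)
{
    switch (s) {
    case AS_GLOBAL:  return 'g';
    case AS_LOCAL:   return 'l';
    case AS_PRIVATE: return 'p';
    default:         return '?';
    }
}

/* Writes "<base>_<one letter per generic pointer param>" into out.
 * Returns 0 on success, -1 if out is too small. */
static int variant_name(const char *base, const enum addr_space *params,
                        size_t num_params, char *out, size_t out_size)
{
    assert(base != NULL && out != NULL);

    int n = snprintf(out, out_size, "%s_", base);
    if (n < 0 || (size_t)n + num_params + 1 > out_size)
        return -1;

    for (size_t i = 0; i < num_params; i++)
        out[(size_t)n + i] = space_suffix(params[i]);
    out[(size_t)n + num_params] = '\0';
    return 0;
}

/* Worst-case number of variants: AS_COUNT to the power of num_params. */
static unsigned long max_variants(unsigned num_params)
{
    unsigned long v = 1;
    for (unsigned i = 0; i < num_params; i++)
        v *= AS_COUNT;
    return v;
}
```

With three address spaces the worst case grows as 3^N in the number of
generic pointer params, although only the combinations that actually
show up at call sites would need to be emitted.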

> The bigger problem seems to me to be combinatorial explosion of fxn
> variants for N different generic pointer params.  But for hw that
> can't map global/local/private into a single address space, I don't
> see any real alternative (other than just inlining everything).  We
> might want to give the driver more control over the decision about
> which functions to inline... maybe some sort of callback fxn that took
> the nir_function plus # of call-sites and returned true/false?
>

I don't think that inlining those is a big problem. Are there any
drivers where we don't inline functions currently? But yeah in the
case of not inlining we have to come up with a different solution and
I think there are basically two things we can do: add a second value
for generic pointers with its target address space, or use function
variants.
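
For the first option, a minimal sketch of what I have in mind, written
as plain C rather than NIR. The struct, the enum and the three arrays
standing in for separate memories are assumptions for illustration
only, not an existing interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical address-space tags, not taken from NIR or SPIR-V. */
enum addr_space { AS_GLOBAL, AS_LOCAL, AS_PRIVATE };

/* A "fat" generic pointer: the address plus a tag for its address space. */
struct generic_ptr {
    uint32_t offset;        /* offset within the tagged address space */
    enum addr_space space;  /* fixed when a typed pointer is cast to generic */
};

/* Stand-ins for three memories that cannot share one address space. */
#define MEM_WORDS 16
struct memories {
    uint32_t global_mem[MEM_WORDS];
    uint32_t local_mem[MEM_WORDS];
    uint32_t private_mem[MEM_WORDS];
};

/* Casting X -> generic records where the pointer came from. */
static struct generic_ptr to_generic(enum addr_space space, uint32_t offset)
{
    struct generic_ptr p = { offset, space };
    return p;
}

/* A generic load becomes a switch over the tag. */
static uint32_t load_generic(const struct memories *m, struct generic_ptr p)
{
    assert(p.offset < MEM_WORDS);
    switch (p.space) {
    case AS_GLOBAL:  return m->global_mem[p.offset];
    case AS_LOCAL:   return m->local_mem[p.offset];
    case AS_PRIVATE: return m->private_mem[p.offset];
    }
    assert(!"unknown address space");
    return 0;
}
```

If everything is inlined, the tag is a constant at each load, so the
switch should fold down to a single case.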

> BR,
> -R