[Mesa-dev] [PATCH v3 06/19] RFC: nir/vtn: "raw" pointer support

Sun Mar 25 10:35:39 UTC 2018

On Sun, Mar 25, 2018 at 12:18 AM, Rob Clark <robdclark at gmail.com> wrote:
> On Fri, Mar 23, 2018 at 5:18 PM, Jason Ekstrand <jason at jlekstrand.net> wrote:
>> On Fri, Mar 23, 2018 at 2:15 PM, Karol Herbst <kherbst at redhat.com> wrote:
>>>
>>> On Fri, Mar 23, 2018 at 10:07 PM, Jason Ekstrand <jason at jlekstrand.net>
>>> wrote:
>>> > +list
>>> >
>>> > On Fri, Mar 23, 2018 at 1:45 PM, Karol Herbst <kherbst at redhat.com>
>>> > wrote:
>>> >>
>>> >> On Fri, Mar 23, 2018 at 9:30 PM, Jason Ekstrand <jason at jlekstrand.net>
>>> >> wrote:
>>> >> > As I've been rewriting core NIR deref handling, I've been thinking
>>> >> > about this problem quite a bit.  One objective I have is to actually
>>> >> > make UBO and SSBO access go through derefs instead of just being an
>>> >> > offset and index so that the compiler can better reason about them.
>>> >> > In particular, I want to be able to start doing load/store
>>> >> > elimination on SSBOs, SLM, and whatever CL has which would be great
>>> >> > for everyone's compute performance (GL, Vulkan, CL, etc.).
>>> >> >
>>> >> > I would be lying if I said I had a full plan but I do have part of a
>>> >> > plan.  In my patch which adds the deref instructions, I add a new
>>> >> > "cast" deref type which takes an arbitrary value as its source and
>>> >> > kicks out a deref with a type.  Whenever we discover that the source
>>> >> > of the cast is actually another deref which is compatible (same type
>>> >> > etc.), copy propagation gets rid of the cast for you.  The idea is
>>> >> > that, instead of doing a load_raw(raw_ptr), you would do a
>>> >> > load((type *)raw_ptr).
>>> >> >
>>> >> > Right now, most of the core NIR optimizations will throw a fit if
>>> >> > they ever see a cast.  This is intentional because it requires us to
>>> >> > manually go through and handle casts.  This would mean that, at the
>>> >> > moment, you would have to lower to load_raw intrinsics almost
>>> >> > immediately after coming out of SPIR-V.
>>> >> >
>>> >>
>>> >> Well it gets more fun with OpenCL 2.0 where you can have generic
>>> >> pointers where you only know the type at creation time. You can also
>>> >> declare generic pointers as function inputs, so that you never
>>> >> actually know where you have to load from if you only have that one
>>> >> function. So the actual load operation depends on when you create the
>>> >> initial pointer variable (you can cast from X to generic, but not the
>>> >> other way around).
>>> >>
>>> >> Which in the end means you can end up with load(generic_ptr) and only
>>> >> by following the chain up to its creation (with function inlining in
>>> >> mind) do you know the actual memory target.
>>> >
>>> >
>>> > Yup.  And there will always be crazy cases where you can't actually
>>> > follow it and you have to emit a pile of code to load different ways
>>> > depending on some bits somewhere that tell you how to load it.  I'm
>>> > well aware of the insanity. :-)  This is part of the reason why I'm
>>> > glad I'm not trying to write an OpenCL 2.0 driver.
>>> >
>>> > This insanity is exactly why I'm suggesting the pointer casting.  Sure,
>>> > you may not know the data type until the actual load.  In that case,
>>> > you end up with the cast being right before the load.  If you don't
>>> > know the storage class, maybe you have to switch and do multiple casts
>>> > based on some bits.  Alternatively, if you don't know the storage
>>> > class, we can just let the deref mode be 0 for "I don't know", or maybe
>>> > multiple bits for "these are the things it might be".  In any case, I
>>> > think we can handle it.
>>> >
>>>
>>> there shouldn't be a situation where we don't know, except when you
>>> don't inline all functions. I think Rob had the idea of fat pointers
>>> where a pointer is a vec2 and the 2nd component contains the actual
>>> pointer type and you end up with a switch over the type to get the
>>> correct storage class. And if the compiler inlines all functions, it
>>> should be able to optimize that switch away.
>>
>>
>> Right.  Today, we live in a world where all functions are inlined.  Sadly, I
>> fear that world may come to an end one of these days. :(
>>
>
> fwiw, so far I'm mostly caring about the inline-all-the-fxns case..
>
> for the cases where we don't know what sort of pointer we have, Karol
> (iirc?) suggested name-mangling functions, which seems semi-sane.. but
> I've mostly tried to ignore that for now until we have more basic
> things working.
>
> Possibly we need a compiler option to lower everything to
> load/store_global (or maybe "raw" is a better name?) for hw that can
> remap local memory into a single address space and use the same
> load/store instructions.  I think that should be at least enough to
> move forward with nv hw + fxn calls.  Less so for intel/adreno but
> from my PoV I'm willing to solve that problem later.
>

I don't think this works out, because it isn't only about local vs
global. We also have private memory pointers you can assign to generic
pointers. And I am sure most compilers will use registers for private
memory if they can.

Private memory pointers come up, for example, when you take the address
of a stack variable.
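
To make the problem concrete, here is a rough sketch in plain C (not NIR
and not OpenCL C) of a generic pointer modelled as a fat pointer, along
the lines of the vec2 idea mentioned earlier in the thread. All names
here (addr_space, generic_ptr, to_generic, load_generic,
load_global_only) are made up for illustration, and the three arrays
only stand in for the real memories. It is a toy model under those
assumptions, not a proposal for the actual implementation.

```c
#include <stdint.h>

/* Hypothetical address-space tags; not NIR or OpenCL identifiers. */
enum addr_space { SPACE_GLOBAL, SPACE_LOCAL, SPACE_PRIVATE };

/* Fat "generic" pointer: an offset plus a tag recording its origin. */
struct generic_ptr {
    uint32_t offset;
    enum addr_space space;
};

/* Toy backing stores standing in for the three memories. */
static uint32_t global_mem[4]  = { 10, 11, 12, 13 };
static uint32_t local_mem[4]   = { 20, 21, 22, 23 };
static uint32_t private_mem[4] = { 30, 31, 32, 33 };

/* Casting X -> generic just records where the pointer came from. */
static struct generic_ptr to_generic(enum addr_space space, uint32_t offset)
{
    struct generic_ptr p = { offset, space };
    return p;
}

/* A load through a generic pointer has to dispatch on the tag. */
static uint32_t load_generic(struct generic_ptr p)
{
    switch (p.space) {
    case SPACE_GLOBAL:  return global_mem[p.offset];
    case SPACE_LOCAL:   return local_mem[p.offset];
    case SPACE_PRIVATE: return private_mem[p.offset];
    }
    return 0; /* unknown tag; not reachable with the tags above */
}

/* What "lower everything to load_global" would do instead: the tag is
 * ignored, so a private pointer reads from the wrong memory. */
static uint32_t load_global_only(struct generic_ptr p)
{
    return global_mem[p.offset];
}
```

With a pointer created from private memory, load_generic() reads
private_mem while load_global_only() reads global_mem at the same
offset, which is the mismatch I mean. If everything is inlined, the
switch has a constant tag and should fold away.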

> BR,
> -R