[Nouveau] Dealing with opencl kernel parameters in nouveau now that RES support is gone

Mon Feb 22 12:41:15 UTC 2016

Hi there,

On 02/22/2016 12:26 PM, Hans de Goede wrote:
> Hi,
>
> On 19-02-16 20:43, Ilia Mirkin wrote:
>> On Fri, Feb 19, 2016 at 5:36 AM, Hans de Goede <hdegoede at redhat.com>
>> wrote:
>>> Hi,
>>>
>>> On 18-02-16 17:39, Ilia Mirkin wrote:
>>>>
>>>> On Thu, Feb 18, 2016 at 9:45 AM, Hans de Goede <hdegoede at redhat.com>
>>>> wrote:
>>>>>
>>>>> But this does not seem to be hooked up yet for nouveau.
>>>>
>>>>
>>>> Samuel has patches. See
>>>> https://cgit.freedesktop.org/~hakzsam/mesa/log/?h=arb_compute_shader_v3
>>>
>>>
>>> Cool, I will take a look at those.
>>>
>>>>> So some questions:
>>>>> -The commit by Samual says:
>>>>>    This introduces TGSI_FILE_MEMORY for shared, global and local
>>>>> memory.
>>>>> Only
>>>>> shared memory is currently supported.
>>>>>
>>>>>    The commit introduces MEMORY[x] and MEMORY[x],SHARED so in
>>>>> reality it
>>>>> also
>>>>> introduces
>>>>>    a second option next to shared, so what are we going to use plain
>>>>> MEMORY[x]
>>>>> for?
>>>>>    I suggest using it for global memory but we need to be in
>>>>> agreement on
>>>>> this.
>>>>
>>>>
>>>> That sounds fine to me. However what I had in mind was switching the
>>>> SHARED field into a 2-bit field and making it
>>>>
>>>> 1 = SHARED
>>>> 2 = GLOBAL
>>>> 3 = LOCAL
>>>>
>>>> (since for OpenCL you also need to be able to address local or private
>>>> memory). I sorta wanted Samuel to do it, but since I had no idea where
>>>> you were at, or if you were even still working on this, I figured it
>>>> should be fixed up by the first person who needed it.
>>>
>>>
>>> Sounds good, only the naming is somewhat unfortunate since opencl uses
>>> different
>>> naming. I.e. it has no "shared"
>>
>> Sad. Well, "shared" is what OpenGL compute shaders use, which is why I
>> proposed it.
>>
>>>
>>> OpenCL has:
>>> -global:  accessible by all worker-groups as well as by the CPU
>>> -const:   read-only global
>>> -local:   shared by worker-items in the same worker-group, not shared
>>> between worker-groups
>>> -private: accessible only to a single worker-item
>>>
>>> So how do these map to the TGSI:
>>>
>>>> 1 = SHARED
>>>> 2 = GLOBAL
>>>> 3 = LOCAL
>>
>> OpenCL global = TGSI global
>> OpenCL const = TGSI global
>> OpenCL local = TGSI shared
>> OpenCL private = TGSI local
>>
>> Not sure what the distinction is between OpenCL const and global is.
>> If the const stuff is actually just user-supplied uniforms (and
>> doesn't need to be in a particular place in memory), then those should
>> go into TGSI CONST somehow.
>
> AFAIK OpenCL const is really read-only global, so the data is filled
> in by the CPU, then passed to the opencl-kernel running on the GPU
> where all worker-items have access to it. I think that TGSI CONST might
> indeed be usable for this, but it is probably easiest to treat
> it as GLOBAL for now.
>
>>>>> -What about kernel input parameters, so far these have been using
>>>>> RES[32764]
>>>>>    I must admit that I do not understand where the file_index of 32764
>>>>> comes
>>>>>    from (or where any of the file indexes come from for
>>>>> src/gallium/tests/trivial/compute.c ?
>>>>>    I have the feeling that these are not used at all, and everything
>>>>> simply
>>>>> goes
>>>>>    to a flat (virtual) memory space, with the params at address 0,
>>>>> correct
>>>>> ?
>>>>
>>>>
>>>> It was never particularly well-specified, which was one of the reasons
>>>> it went away. It also didn't map nicely onto the OpenGL model. There
>>>> is a remaining question of how to do addressing in memory... there's
>>>> 40 bits of address space. Should these implicitly be U64
>>>> (dual-component in TGSI) addresses that are passed around? Not sure
>>>> what the OpenCL position on all this is.
>>>
>>>
>>> So far I've been using U32 for addresses as that is what Francisco's
>>> original
>>> code was using. And this also is what things like the tgsi LOAD
>>> instruction
>>> take. If you're doing a LOAD on a 1d buffer then you will use
>>> TEMP[#].x to
>>> specify the index, and the way how this currently works with OpenCL
>>> is that
>>> clCreateBuffer() will return a cl_mem type which then gets passed into
>>> the kernel as input parameter and gets treated as a pointer by the
>>> compiler,
>>> so e.g. global mem gets treated as a single address space even if there
>>> are multiple global buffers and TEMP[#].x contains the value passed in
>>> via cl_mem as start offset for the buffer + the index into the buffer.
>>>
>>> So this means that currently we are limited to U32 since TEMP[#].x is
>>> only 32 bits wide. Internally 40 bits addresses can and should probably
>>> be used so that at least the different memory spaces each have the full
>>> 32 bits available.
>>>
>>> Note that we could fix this by adding some sort of LOAD64 opcode, which
>>> uses TEMP[#].x and TEMP[#].y as address for 1d buffers, I'm not sure
>>> how this would work for 3d buffers though. I foresee the llvm backend
>>> eventually getting a 64 bit mode where it will use 64 bits for all
>>> pointers and use something like a LOAD64 opcode to indicate that
>>> the indexes (which it effectively uses as addresses / pointers)
>>> are 64 bit wide.
>>
>> Well, this LOAD64/STORE64 would just only be defined for MEMORY[]
>> src/dest, so you don't need to worry about 3d or anything like that. I
>> believe this is a good solution to the problem.
>
> Right for MEMORY this will work fine. I've no clue yet how images will
> work with OpenCL though, hopefully we can avoid the one flat address
> space thing there, but I simply don't know yet.
>
>>>> Also it would be highly preferable to avoid using GLOBAL at all in the
>>>> first place, and just use BUFFER[] or IMAGE[] or whatever. However
>>>> after having a look at how OpenCL works, I don't think that's
>>>> possible. But perhaps I missed something?
>>>
>>>
>>> OpenCL is very C-like and as such uses a flat address space per memory
>>> space, getting around that is going to be very tricky, if possible at
>>> all.
>>
>> That was my assessment as well.
>>
>>>
>>>> Perhaps I didn't make it clear enough though -- sorry. The reality is
>>>> that TGSI is a living thing, and things change every so often. I don't
>>>> think people would be comfortable with locking down TGSI in such a
>>>> way.
>>>
>>>
>>> Hmm, I do think that if we end up using a llvm-ir -> tgsi step in some
>>> cases that we do need to lock TGSI down. Taken the RES thing for
>>> example,
>>> it would have been pretty trivial actually to not nuke it, and
>>> instead just
>>> add the image support next to it. Note I'm fine with making changes
>>> while
>>> we are figuring this out, but once I start pushing the llvm bits to llvm
>>> upstream (which is still months away) we are going to need some sort of
>>> TGSI stability.
>>
>> I can't unilaterally make such promises. This would have to be
>> discussed and approved by the other gallium/tgsi stakeholders.
>
> Understood.
>
> So back to the problem of getting OpenCL(ish) code to work again with
> the recent mesa changes. For starters I would like to get:
>
> src/gallium/tests/trivial/compute.c and then the test with mask 8,
> test_input_global() to work again, when that is working I should be
> able to adjust my llvm work (and if necessary clover) to start to
> work again.
>
> Currently the test_input_global() test uses the following bit of
> TGSI code:
>
> COMP
> DCL SV[0], THREAD_ID[0]
> DCL TEMP[0], LOCAL
> DCL TEMP[1], LOCAL
> IMM UINT32 { 8, 0, 0, 0 }
>
>     BGNSUB\n"
>         UMUL TEMP[0], SV[0], IMM[0]
>         LOAD TEMP[1].xy, RINPUT, TEMP[0]
>         LOAD TEMP[0].x, RGLOBAL, TEMP[1].yyyy
>         UADD TEMP[1].x, TEMP[0], -TEMP[1]
>         STORE RGLOBAL.x, TEMP[1].yyyy, TEMP[1]
>         RET
>     ENDSUB
>
>
> Where by RINPUT and RGLOBAL get replaces by processing the
> code with cpp and the following defines:
>
> #define RGLOBAL        RES[32767]
> #define RLOCAL         RES[32766]
> #define RPRIVATE       RES[32765]
> #define RINPUT         RES[32764]
>
> If I understand how memory is supposed to work, then I would need to
> change the TGSI as follows:
>
> COMP
> DCL SV[0], THREAD_ID[0]
> DCL MEMORY[0]
> DCL TEMP[0], LOCAL
> DCL TEMP[1], LOCAL
> IMM UINT32 { 8, 0, 0, 0 }
>
> BGNSUB\n"
> UMUL TEMP[0], SV[0], IMM[0]
> LOAD TEMP[1].xy, RINPUT, TEMP[0]
> LOAD TEMP[0].x, MEMORY[0], TEMP[1].yyyy
> UADD TEMP[1].x, TEMP[0], -TEMP[1]
> STORE MEMORY[0].x, TEMP[1].yyyy, TEMP[1]
> RET
> ENDSUB

Nope, this won't work because RINPUT is RES[32764]. And you have to 
remove all occurrences to RES because it's not longer supported. In my 
opinion, using BUFFER[0] in a first time should work. Currently, only 
SHARED with MEMORY is supported.

>
> This assumes, that as discussed declaring memory without a , SHARED or
> other
> flag means the memory is global.
>
> So 2 questions:
>
> 1) Do the above changes for using the new MEMORY keyword look as intended
> to you?
>
> 2) This only solves the accessing of the global memory, it does not solve
> getting to the kernel input kernel parameters, how would I deal with
> those ?

The input kernel parameters are directly passed through a call to 
pipe_context::launch_grid. You just have to fill the 
pipe_grid_info::input array with your parameters and they will be 
uploaded by nvXX_compute_upload_input().

I will have a look at the test_input_global().

Thanks!

>
> Currently the kernel input parameters are uploaded by
> src/gallium/drivers/nouveau/nv50/nv50_compute.c:
> nv50_compute_upload_input()
>
> Are we going to change how these parameters get uploaded ? And if not
> how do we
> get to these parameters from TGSI now that we no longer have RES[#] to
> do so ?
>
> I think we should also keep in mind that what works for nouveau may not
> work
> for other GPU-s, so we should either introduce a MEMORY[x], INPUT or
> a new DCL INPUT[0] for this, so that the code for other GPU-s can have its
> own handling for this.
>
> Regards,
>
> Hans

-- 
-Samuel