[Mesa-dev] RFC: buffer support in TGSI for SSBO/atomic

Mon Nov 2 11:55:13 PST 2015

FTR, these are the various cache operators on nvidia hw:

http://docs.nvidia.com/cuda/parallel-thread-execution/#cache-operators

Most of these map directly to flags on the instructions (ca/cg/cs/cv sound
familiar; dunno about lu, which could just be an assembler helper).
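
For illustration, here is a minimal C sketch of how a driver might pick one
of those load cache operators from a GLSL memory qualifier. The enum, the
function name and the coherent->cg / volatile->cv choices are assumptions
made up for this example, not taken from any driver; only the operator names
(ca/cg/cv) come from the page above.

```c
/* Hypothetical qualifier set; names are illustrative, not Mesa's. */
enum mem_qualifier {
    MEM_DEFAULT = 0,
    MEM_COHERENT,
    MEM_VOLATILE
};

/* Returns the PTX-style cache operator suffix for a load.
 * The mapping itself is an assumption for illustration. */
static const char *ptx_load_cache_op(enum mem_qualifier q)
{
    switch (q) {
    case MEM_COHERENT:
        return "cg"; /* assumed: cache at the global (L2) level only */
    case MEM_VOLATILE:
        return "cv"; /* assumed: treat cached data as stale, fetch again */
    case MEM_DEFAULT:
    default:
        return "ca"; /* cache at all levels */
    }
}
```

With this sketch, ptx_load_cache_op(MEM_COHERENT) yields "cg", so the flag
ends up on the individual load rather than on a declaration.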

How backwards-compatible is TGSI supposed to be? Can we change the
encoding willy-nilly, or are there separate systems that talk to each
other using TGSI that would need coordination?
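
To make the encoding question concrete, here is a minimal sketch of the
"two free bits" option from the quoted thread below: a 2-bit cache mode
stored in a 32-bit instruction token. The bit position (top two bits), the
enum and the helper names are hypothetical; the real tgsi_instruction layout
is not reproduced here.

```c
#include <stdint.h>

/* The four caching values named in the thread, as a hypothetical enum. */
enum cache_mode {
    CACHE_NONE     = 0,
    CACHE_COHERENT = 1,
    CACHE_VOLATILE = 2,
    CACHE_RESTRICT = 3
};

/* Assumed location of the two spare bits: the top of a 32-bit token. */
#define CACHE_SHIFT 30u
#define CACHE_MASK  (UINT32_C(3) << CACHE_SHIFT)

/* Store the cache mode without disturbing the other 30 bits. */
static uint32_t token_set_cache(uint32_t token, enum cache_mode mode)
{
    return (token & ~CACHE_MASK) |
           (((uint32_t)mode << CACHE_SHIFT) & CACHE_MASK);
}

/* Read the cache mode back out of the token. */
static enum cache_mode token_get_cache(uint32_t token)
{
    return (enum cache_mode)((token & CACHE_MASK) >> CACHE_SHIFT);
}
```

Any consumer that still decodes tokens with the old layout would ignore or
misread these bits, which is why the compatibility question matters.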

  -ilia

On Mon, Nov 2, 2015 at 2:49 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Ok, I guess if it's really flagged on the instructions in hw, it seems
> reasonable to do it on the instructions in tgsi as well.
> Using the last two bits there doesn't sound nice indeed (in particular
> if maybe you'd wanted to encode the read/write bits as well at some
> point too), but it's not THAT bad I think. We can scrape some bits out of
> it later if needed (token type is 4 bits but never larger than 3, NumSrcs
> could easily do with 3 instead of 4 bits too and at some point the
> predicate bit can go too). Albeit an extra token might be a good option
> too (if you decided to add those r/w bits...)
>
> Though I still don't quite understand how gpus can do that efficiently
> if you can do different flags with data which might be in the same cache
> line. But maybe it's less of a problem than I thought...
>
> Roland
>
>
> On 02.11.2015 at 20:07, Ilia Mirkin wrote:
>> I haven't the faintest idea about efficiently, but these things are flags
>> on the ld/st instructions in the nvidia ISA for SM20+ (and I just
>> plain don't know about SM10). I'm moderately sure that's the case for
>> GCN as well.
>>
>> The difficulty with TGSI is that you might have something like
>>
>> layout (std430) buffer foo {
>>   coherent int a;
>>   int b;
>> };
>>
>> Now I don't remember if they get baked into the same vec4, but I think
>> they do. If they don't, then ARB_enhanced_layouts will fix that right
>> up. Since TGSI is vec4-oriented, it's really awkward to specify that
>> sort of thing... how would you do it?
>>
>> DECL BUFFER[0][0].x COHERENT
>> DECL BUFFER[0][0].y
>>
>> And then totally unrelated to the separate bits, you can end up with
>>
>> layout (std430) buffer foo {
>>   int foo[5];
>> };
>>
>> and I have no idea how to even express that in TGSI -- it'd want
>> things to be aligned to 16 bytes, but it'll be packed tightly here.
>> This worked OK for layout (std140), but won't work with more advanced
>> layouts. This will be a problem for UBOs too -- perhaps we need to
>> allow something like
>>
>> LOAD dst, CONST[1][0], offset
>>
>> to account for that. And lastly, ssbo allows for something like
>>
>> layout (std430) buffer foo {
>>   int foo[];
>> };
>>
>> And you can access foo[anything-you-want] -- difficult to declare that
>> in TGSI. I could invent stuff for all of these situations, but it
>> seems to be a lot easier to just feed the data to load and forget
>> about it. That's how it's all encoded in the GLSL IR as well.
>>
>>   -ilia
>>
>>
>> On Mon, Nov 2, 2015 at 1:56 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> I don't know much about ssbo, but since it looks like in glsl the
>>> coherent etc. bits are on the variables, not the ops, it seems unnatural
>>> to mark the op bits instead. So I'd guess it would be better if the
>>> variables could be marked instead. If this isn't expressible in tgsi
>>> maybe this needs to be fixed. Albeit I have to say it sounds odd to me
>>> from a hw perspective if variables with different bits can be
>>> stuffed together and then the hw is expected to handle that efficiently...
>>>
>>> Roland
>>>
>>> On 01.11.2015 at 23:45, Ilia Mirkin wrote:
>>>> Just wanted to note down some thoughts and get some feedback before
>>>> going forward. I've already sent out a series which covered a lot of
>>>> this, but in the end I realized it came up a bit short (available at
>>>> https://github.com/imirkin/mesa/commits/fd2 ).
>>>>
>>>> There are two separate buffer-related features --
>>>> ARB_shader_atomic_counters (and ARB_shader_atomic_counter_ops) and
>>>> ARB_shader_storage_buffer_object. The former are implementable more
>>>> efficiently on EG/NI hardware by performing the atomic ops on
>>>> not-main-memory (GDS? LDS?). However I think that the gallium-side
>>>> interface can be mostly identical for both cases, perhaps we can mark
>>>> the buffer as atomic-only in the TGSI.
>>>>
>>>> Just like there is a CONST tgsi file, I want to add a BUFFER file,
>>>> which will map to ->set_shader_buffers() indices. The tricky bit comes
>>>> in from the fact that individual variables inside of a buffer may have
>>>> different access/store properties. I see two ways to resolve this:
>>>>
>>>> 1. Declare each variable explicitly, much like UBO's still get
>>>> individual decls per slot. These decls could contain the relevant
>>>> caching property.
>>>>
>>>> 2. Make each LOAD/STORE op declare what caching it wants explicitly.
>>>>
>>>> The first option would work well for images, but for ssbo, it feels
>>>> problematic, as with all the various packing options that exist, you
>>>> could still specify odd per-variable cache rules, which would be
>>>> difficult to express in the TGSI DECL. However I'm not sure how to
>>>> implement the second option.
>>>>
>>>> There is a precedent of a saturate flag, but looking at
>>>> tgsi_instruction, there are only 2 free bits. Since there are only 4
>>>> different caching values (none, coherent, volatile, restrict; I'm not
>>>> counting readonly/writeonly), this fits. However that would leave no
>>>> more bits in tgsi_instruction. I could add a texture-style bit, saying
>>>> to expect an additional tgsi_instruction_buffer packet with more info
>>>> but that seems wasteful.
>>>>
>>>> Another option is to just pass an immediate directly to the LOAD/STORE
>>>> ops which would specify this caching spec as an extra source. This
>>>> seems much simpler, but a little dirtier. Opinions much appreciated.
>>>>
>>>> I think that once this is worked out, I'll be able to resend my series
>>>> adding SSBO/atomic support to freedreno, and partial SSBO (without
>>>> atomic*) support for nvc0.
>>>>
>>>> Cheers,
>>>>
>>>>   -ilia
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>
>>>
>