[Mesa-dev] RFC: buffer support in TGSI for SSBO/atomic

Mon Nov 2 13:31:30 PST 2015

On 02.11.2015 at 20:55, Ilia Mirkin wrote:
> FTR these are the various operators on nvidia hw:
> 
> http://docs.nvidia.com/cuda/parallel-thread-execution/#cache-operators
> 
> Most of these map directly to instruction things (ca/cg/cs/cv sound
> familiar, dunno about lu, could just be an assembler helper).
Ah I see, that's how it works. Makes sense I guess, though there
could be some slight inefficiencies if data is packed "strangely"
(a globally coherent write will evict L1 cache lines, so for
instance a non-coherent access to a different variable in the same
cache line would have to re-fetch it from L2), but that's probably
not too bad...
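
To make the packing concern concrete, here is a minimal C sketch that
checks whether two byte offsets in a buffer share a cache line. The
128-byte line size and the helper name are assumptions for illustration
only; the real line size depends on the hardware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed L1 cache line size in bytes; the real value is hardware-specific. */
#define ASSUMED_LINE_SIZE 128u

/* Hypothetical helper: true when both byte offsets fall in the same line. */
static bool same_cache_line(size_t offset_a, size_t offset_b)
{
    return (offset_a / ASSUMED_LINE_SIZE) == (offset_b / ASSUMED_LINE_SIZE);
}
```

With "coherent int a; int b;" packed at offsets 0 and 4, both land in
the same line, so a coherent write to a could force a re-fetch for b.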


> 
> How backwards-compatible is TGSI supposed to be? Can we change the
> encoding willy-nilly, or are there separate systems that talk to each
> other using TGSI that would need coordination?
I think it's not really different than the rest of gallium, so not
actually considered stable. Obviously it was written to be quite
extensible, but I don't think there's really anything preventing us from
changing it in binary-incompatible ways, IFF there's a really good
reason for it. If some driver relies on binary compatibility for tgsi
shaders (let's say for recognizing specific shaders to be able to do
shader replacements) it will need to be adapted to such changes (and I
know of some code which does that...). So reducing field width just
because you could do with less is not really a good idea, but doing it
because you actually need the now free bits should be ok.
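
As a rough sketch of what spending the two spare bits would look like,
the following C packs one of the four caching values mentioned in the
thread into the top two bits of a 32-bit token. The bit position, the
enum names and the helper names are all hypothetical; this is not the
actual tgsi_instruction layout.

```c
#include <stdint.h>

/* Hypothetical cache qualifiers; names are illustrative, not Mesa's. */
enum cache_mode {
    CACHE_NONE     = 0,
    CACHE_COHERENT = 1,
    CACHE_VOLATILE = 2,
    CACHE_RESTRICT = 3
};

/* Assumed position of the two spare bits: the top of a 32-bit token. */
#define CACHE_SHIFT 30u
#define CACHE_MASK  (UINT32_C(3) << CACHE_SHIFT)

/* Store the qualifier in the spare bits, leaving the other 30 bits alone. */
static uint32_t token_set_cache(uint32_t token, enum cache_mode mode)
{
    return (token & ~CACHE_MASK) | ((uint32_t)mode << CACHE_SHIFT);
}

/* Read the qualifier back out of the spare bits. */
static enum cache_mode token_get_cache(uint32_t token)
{
    return (enum cache_mode)((token & CACHE_MASK) >> CACHE_SHIFT);
}
```

Four values fit exactly in two bits, which is why this leaves nothing
free afterwards unless other fields are narrowed.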

Roland


> 
>   -ilia
> 
> On Mon, Nov 2, 2015 at 2:49 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>> Ok, I guess if it's really flagged on the instructions in hw, it seems
>> reasonable to do it on the instructions in tgsi as well.
>> Using the last two bits there doesn't sound nice indeed (in particular
>> if maybe you'd wanted to encode the read/write bits as well at some
>> point too), but it's not THAT bad I think. We can scrape some bits from
>> it later if needed (token type is 4 bits but never larger than 3, NumSrcs
>> could easily do with 3 instead of 4 bits too and at some point the
>> predicate bit can go too). Albeit an extra token might be a good option
>> too (if you decided to add those r/w bits...)
>>
>> Though I still don't quite understand how gpus can do that efficiently
>> if you can do different flags with data which might be in the same cache
>> line. But maybe it's less of a problem than I thought...
>>
>> Roland
>>
>>
>> On 02.11.2015 at 20:07, Ilia Mirkin wrote:
>>> I haven't the faintest idea about efficiently, but these things are flags
>>> on the ld/st instructions in the nvidia ISA for SM20+ (and I just
>>> plain don't know about SM10). I'm moderately sure that's the case for
>>> GCN as well.
>>>
>>> The difficulty with TGSI is that you might have something like
>>>
>>> layout (std430) buffer foo {
>>>   coherent int a;
>>>   int b;
>>> }
>>>
>>> Now I don't remember if they get baked into the same vec4, but I think
>>> they do. If they don't, then ARB_enhanced_layouts will fix that right
>>> up. Since TGSI is vec4-oriented, it's really awkward to specify that
>>> sort of thing... how would you do it?
>>>
>>> DECL BUFFER[0][0].x COHERENT
>>> DECL BUFFER[0][0].y
>>>
>>> And then totally unrelated to the separate bits, you can end up with
>>>
>>> layout (std430) buffer foo {
>>>   int foo[5];
>>> }
>>>
>>> and I have no idea how to even express that in TGSI -- it'd want
>>> things to be aligned to 16 bytes, but it'll be packed tightly here.
>>> This worked OK for layout (std140), but won't work with more advanced
>>> layouts. This will be a problem for UBOs too -- perhaps we need to
>>> allow something like
>>>
>>> LOAD dst, CONST[1][0], offset
>>>
>>> to account for that. And lastly, ssbo allows for something like
>>>
>>> layout (std430) buffer foo {
>>>   int foo[];
>>> }
>>>
>>> And you can access foo[anything-you-want] -- difficult to declare that
>>> in TGSI. I could invent stuff for all of these situations, but it
>>> seems to be a lot easier to just feed the data to load and forget
>>> about it. That's how it's all encoded in the GLSL IR as well.
>>>
>>>   -ilia
>>>
>>>
>>> On Mon, Nov 2, 2015 at 1:56 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>> I don't know much about ssbo, but since it looks like in glsl the
>>>> coherent etc. bits are on the variables, not the ops, it seems unnatural
>>>> to mark the op bits instead. So I'd guess it would be better if the
>>>> variables could be marked instead. If this isn't expressible in tgsi
>>>> maybe this needs to be fixed. Albeit I have to say it sounds odd to me
>>>> from a hw perspective if variables with different bits can be
>>>> stuffed together and then the hw is expected to handle that efficiently...
>>>>
>>>> Roland
>>>>
>>>> On 01.11.2015 at 23:45, Ilia Mirkin wrote:
>>>>> Just wanted to note down some thoughts and get some feedback before
>>>>> going forward. I've already sent out a series which covered a lot of
>>>>> this, but in the end I realized it came up a bit short (available at
>>>>> https://github.com/imirkin/mesa/commits/fd2 ).
>>>>>
>>>>> There are two separate buffer-related features --
>>>>> ARB_shader_atomic_counters (plus ARB_shader_atomic_counter_ops) and
>>>>> ARB_shader_storage_buffer_object. The former are implementable more
>>>>> efficiently on EG/NI hardware by performing the atomic ops on
>>>>> not-main-memory (GDS? LDS?). However I think that the gallium-side
>>>>> interface can be mostly identical for both cases, perhaps we can mark
>>>>> the buffer as atomic-only in the TGSI.
>>>>>
>>>>> Just like there is a CONST tgsi file, I want to add a BUFFER file,
>>>>> which will map to ->set_shader_buffers() indices. The tricky bit comes
>>>>> in from the fact that individual variables inside of a buffer may have
>>>>> different access/store properties. I see two ways to resolve this:
>>>>>
>>>>> 1. Declare each variable explicitly, much like UBOs still get
>>>>> individual decls per slot. These decls could contain the relevant
>>>>> caching property.
>>>>>
>>>>> 2. Make each LOAD/STORE op declare what caching it wants explicitly.
>>>>>
>>>>> The first option would work well for images, but for ssbo, it feels
>>>>> problematic, as with all the various packing options that exist, you
>>>>> could still specify odd per-variable cache rules, which would be
>>>>> difficult to express in the TGSI DECL. However I'm not sure how to
>>>>> implement the second option.
>>>>>
>>>>> There is a precedent of a saturate flag, but looking at
>>>>> tgsi_instruction, there are only 2 free bits. Since there are only 4
>>>>> different caching values (none, coherent, volatile, restrict; I'm not
>>>>> counting readonly/writeonly), this fits. However that would leave no
>>>>> more bits in tgsi_instruction. I could add a texture-style bit, saying
>>>>> to expect an additional tgsi_instruction_buffer packet with more info
>>>>> but that seems wasteful.
>>>>>
>>>>> Another option is to just pass an immediate directly to the LOAD/STORE
>>>>> ops which would specify this caching spec as an extra source. This
>>>>> seems much simpler, but a little dirtier. Opinions much appreciated.
>>>>>
>>>>> I think that once this is worked out, I'll be able to resend my series
>>>>> adding SSBO/atomic support to freedreno, and partial SSBO (without
>>>>> atomic*) support for nvc0.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>   -ilia
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>
>>>>
>>
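
For the "int foo[5]" case quoted above, the mismatch between the two
layouts can be sketched in C. An int array has a 16-byte element stride
under std140 and a 4-byte stride under std430, so only the former puts
one element per vec4 slot. The helper names are hypothetical, and the
slot/component split just assumes 16-byte vec4 slots with 4-byte
components.

```c
#include <stddef.h>

/* Byte offset of foo[index] for an int array under std140 (16-byte stride). */
static size_t int_array_offset_std140(size_t index)
{
    return index * 16u;
}

/* Byte offset of foo[index] for an int array under std430 (4-byte stride). */
static size_t int_array_offset_std430(size_t index)
{
    return index * 4u;
}

/* Which 16-byte vec4 slot a byte offset falls in. */
static size_t vec4_slot(size_t byte_offset)
{
    return byte_offset / 16u;
}

/* Which component of that slot: 0 = x, 1 = y, 2 = z, 3 = w. */
static size_t vec4_component(size_t byte_offset)
{
    return (byte_offset % 16u) / 4u;
}
```

Under std430, foo[3] ends up in slot 0 component w and foo[4] in slot 1
component x, so a per-slot DECL no longer lines up with array elements,
whereas a LOAD taking a plain byte offset does not care.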