[Mesa-dev] [PATCH 12/15] ac: add support for SPV_AMD_shader_ballot

Connor Abbott cwabbott0 at gmail.com
Thu Nov 2 17:24:10 UTC 2017


On Thu, Nov 2, 2017 at 12:10 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 31.10.2017 16:36, Connor Abbott wrote:
>>
>> On Tue, Oct 31, 2017 at 2:08 AM, Dave Airlie <airlied at gmail.com> wrote:
>>>>
>>>> +LLVMValueRef
>>>> +ac_build_subgroup_inclusive_scan(struct ac_llvm_context *ctx,
>>>> +                                LLVMValueRef src,
>>>> +                                ac_reduce_op reduce,
>>>> +                                LLVMValueRef identity)
>>>> +{
>>>> +       /* See http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
>>>> +        *
>>>> +        * Note that each dpp/reduce pair is supposed to be compiled down to
>>>> +        * one instruction by LLVM, at least for 32-bit values.
>>>> +        *
>>>> +        * TODO: use @llvm.amdgcn.ds.swizzle on SI and CI
>>>> +        */
>>>> +       LLVMValueRef value = src;
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, src,
>>>> +                                   dpp_row_sr(1), 0xf, 0xf, false));
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, src,
>>>> +                                   dpp_row_sr(2), 0xf, 0xf, false));
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, src,
>>>> +                                   dpp_row_sr(3), 0xf, 0xf, false));
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, value,
>>>> +                                   dpp_row_sr(4), 0xf, 0xe, false));
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, value,
>>>> +                                   dpp_row_sr(8), 0xf, 0xc, false));
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, value,
>>>> +                                   dpp_row_bcast15, 0xa, 0xf, false));
>>>> +       value = reduce(ctx, value,
>>>> +                      ac_build_dpp(ctx, identity, value,
>>>> +                                   dpp_row_bcast31, 0xc, 0xf, false));
>>>
>>>
>>> btw I dumped some shaders from Doom running on -pro,
>>>
>>> and it looked like it ended up with
>>>
>>> 1, 0xf, 0xf,
>>> 2, 0xf, 0xf,
>>> 4, 0xf, 0xf
>>> 8, 0xf, 0xf
>>> bcast15 0xa, 0xf
>>> bcast31 0xc, 0xf
>>>
>>> It also seems to apply these directly to the instructions, like
>>> /*000000002b80*/ s_nop           0x0
>>> /*000000002b84*/ v_min_u32       v83, v83, v83 row_shr:1 bank_mask:15 row_mask:15
>>> /*000000002b8c*/ s_nop           0x1
>>> /*000000002b90*/ v_min_u32       v83, v83, v83 row_shr:2 bank_mask:15 row_mask:15
>>> /*000000002b98*/ s_nop           0x1
>>> /*000000002b9c*/ v_min_u32       v83, v83, v83 row_shr:4 bank_mask:15 row_mask:15
>>> /*000000002ba4*/ s_nop           0x1
>>> /*000000002ba8*/ v_min_u32       v83, v83, v83 row_shr:8 bank_mask:15 row_mask:15
>>> /*000000002bb0*/ s_nop           0x1
>>> /*000000002bb4*/ v_min_u32       v83, v83, v83 row_bcast15 bank_mask:15 row_mask:10
>>> /*000000002bbc*/ s_nop           0x1
>>> /*000000002bc0*/ v_min_u32       v83, v83, v83 row_bcast31 bank_mask:15 row_mask:12
>>>
>>> I think the instruction combining is probably an llvm job, but I
>>> wonder if the different row_shr
>>> etc is what we should use as well.
>>
>>
>> Yeah, LLVM should be combining the move and min (hence the comment
>> here), but it doesn't yet. That shouldn't be too hard to do once we get
>> it working. Also, I've seen that way of doing it before, and IIRC it's
>> one instruction slower than the sequence in the blog post I cited:
>> although it has one fewer instruction, it incurs an extra two-cycle
>> stall between the first two instructions, because v83 is both the
>> destination of the first instruction and the DPP source of the second
>> (hence the s_nop 0x1). So once we combine instructions, this should be
>> better than what -pro does :)
>
>
> Agreed, though even more ideally, LLVM would be able to fill those gaps with
> other instructions ;)

Well, that isn't really possible when the sequence is in WWM (whole
wavefront mode) and everything else isn't. We could fill the slot with
a scalar instruction, but I think LLVM is currently overly conservative
and treats instructions that write EXEC as barriers even though it
doesn't need to.

>
> Anyway, the combining of instructions is really the important task.

Agreed. Although I think getting it working first is even more important :)
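
For anyone who wants to check the two sequences against each other, here
is a rough CPU-side model of them. It is only a sketch under assumptions:
a 64-lane wavefront split into rows of 16 lanes and banks of 4, addition
standing in for the reduce op, and a lane that is masked off or has no
valid source simply keeping its value (which matches combining with the
identity). The helper names (lane_enabled, add_row_shr, add_row_bcast,
scan_blog, scan_pro) are made up for this illustration and are not part
of the patch; timing and the s_nop stalls are not modelled at all.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE 64 /* lanes per wavefront (assumed) */
#define ROW  16 /* lanes per DPP row */
#define BANK 4  /* lanes per DPP bank */

/* Does row_mask/bank_mask allow this lane to be written? */
static int lane_enabled(int lane, unsigned row_mask, unsigned bank_mask)
{
    int row = lane / ROW;
    int bank = (lane % ROW) / BANK;
    return ((row_mask >> row) & 1) && ((bank_mask >> bank) & 1);
}

/* Model of "dst = dst + src row_shr:shift". Lanes that are masked off,
 * or whose source would fall outside the row, keep their old value. */
static void add_row_shr(uint32_t *dst, const uint32_t *src, int shift,
                        unsigned row_mask, unsigned bank_mask)
{
    uint32_t tmp[WAVE];
    assert(shift > 0 && shift < ROW);
    /* Snapshot the source: every lane reads before any lane writes. */
    for (int i = 0; i < WAVE; i++)
        tmp[i] = src[i];
    for (int i = 0; i < WAVE; i++)
        if (lane_enabled(i, row_mask, bank_mask) && i % ROW >= shift)
            dst[i] += tmp[i - shift];
}

/* Model of "v = v + v row_bcast15" (last lane of the previous row) or
 * "row_bcast31" (lane 31), restricted to the rows set in row_mask. */
static void add_row_bcast(uint32_t *v, int bcast_lane, unsigned row_mask)
{
    uint32_t tmp[WAVE];
    assert(bcast_lane == 15 || bcast_lane == 31);
    for (int i = 0; i < WAVE; i++)
        tmp[i] = v[i];
    for (int i = 0; i < WAVE; i++) {
        int src = (bcast_lane == 15) ? (i / ROW) * ROW - 1 : 31;
        if (lane_enabled(i, row_mask, 0xf) && src >= 0 && src < i)
            v[i] += tmp[src];
    }
}

/* Sequence from the patch / blog post: shifts 1, 2, 3 read the original
 * src, then shifts 4 and 8 read the partial result with bank masks. */
void scan_blog(const uint32_t *src, uint32_t *v)
{
    for (int i = 0; i < WAVE; i++)
        v[i] = src[i];
    add_row_shr(v, src, 1, 0xf, 0xf);
    add_row_shr(v, src, 2, 0xf, 0xf);
    add_row_shr(v, src, 3, 0xf, 0xf);
    add_row_shr(v, v, 4, 0xf, 0xe);
    add_row_shr(v, v, 8, 0xf, 0xc);
    add_row_bcast(v, 15, 0xa);
    add_row_bcast(v, 31, 0xc);
}

/* Sequence seen in the -pro dump: shifts 1, 2, 4, 8, all in place,
 * each step depending on the result of the previous one. */
void scan_pro(const uint32_t *src, uint32_t *v)
{
    for (int i = 0; i < WAVE; i++)
        v[i] = src[i];
    add_row_shr(v, v, 1, 0xf, 0xf);
    add_row_shr(v, v, 2, 0xf, 0xf);
    add_row_shr(v, v, 4, 0xf, 0xf);
    add_row_shr(v, v, 8, 0xf, 0xf);
    add_row_bcast(v, 15, 0xa);
    add_row_bcast(v, 31, 0xc);
}
```

Calling scan_blog() and scan_pro() on the same 64-element input (with
separate input and output arrays) should give the same inclusive prefix
sum in this model. The difference between them is only in the dependency
chain: the first three steps of scan_blog() all read the unmodified src,
whereas every step of scan_pro() reads the previous step's output.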

>
> Cheers,
> Nicolai
>
>
>>
>>>
>>> Dave.
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>
>
> --
> Learn how the world really is,
> but never forget how it ought to be.

