[Mesa-dev] [PATCH 2/2] radeonsi: rewrite late alloc VS limit computation

Mon Aug 28 19:43:38 UTC 2017

On Mon, Aug 28, 2017 at 12:50 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 25.08.2017 02:57, Marek Olšák wrote:
>>
>> From: Marek Olšák <marek.olsak at amd.com>
>>
>> This is still very simple, but it's better than before.
>>
>> Loosely ported from Vulkan.
>> ---
>>   src/gallium/drivers/radeonsi/si_state.c | 38
>> ++++++++++++++++++++++-----------
>>   1 file changed, 26 insertions(+), 12 deletions(-)
>>
>> diff --git a/src/gallium/drivers/radeonsi/si_state.c
>> b/src/gallium/drivers/radeonsi/si_state.c
>> index 24e509c..393f960 100644
>> --- a/src/gallium/drivers/radeonsi/si_state.c
>> +++ b/src/gallium/drivers/radeonsi/si_state.c
>> @@ -4649,40 +4649,54 @@ static void si_init_config(struct si_context
>> *sctx)
>>                         /* If this is 0, Bonaire can hang even if GS isn't
>> being used.
>>                          * Other chips are unaffected. These are
>> suboptimal values,
>>                          * but we don't use on-chip GS.
>>                          */
>>                         si_pm4_set_reg(pm4, R_028A44_VGT_GS_ONCHIP_CNTL,
>>                                        S_028A44_ES_VERTS_PER_SUBGRP(64) |
>>                                        S_028A44_GS_PRIMS_PER_SUBGRP(4));
>>                 }
>>                 si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS,
>> S_00B21C_CU_EN(0xffff));
>>   -             if (sscreen->b.info.num_good_compute_units /
>> -                   (sscreen->b.info.max_se *
>> sscreen->b.info.max_sh_per_se) <= 4) {
>> +               /* Compute LATE_ALLOC_VS.LIMIT. */
>> +               unsigned num_cu_per_sh =
>> sscreen->b.info.num_good_compute_units /
>> +                                        (sscreen->b.info.max_se *
>> +                                         sscreen->b.info.max_sh_per_se);
>> +               unsigned late_alloc_limit; /* The limit is per SH. */
>> +
>> +               if (sctx->b.family == CHIP_KABINI) {
>> +                       late_alloc_limit = 0; /* Potential hang on Kabini.
>> */
>> +               } else if (num_cu_per_sh <= 4) {
>>                         /* Too few available compute units per SH.
>> Disallowing
>> -                        * VS to run on CU0 could hurt us more than late
>> VS
>> +                        * VS to run on one CU could hurt us more than
>> late VS
>>                          * allocation would help.
>>                          *
>> -                        * LATE_ALLOC_VS = 2 is the highest safe number.
>> +                        * 2 is the highest safe number that allows us to
>> keep
>> +                        * all CUs enabled.
>
>
> Just to clarify: is the logic here that LATE_ALLOC_VS = 2 means up to 3 late
> alloc waves, which means at least one SIMD is always free, even when there's
> only a single CU?

Yes according to the Vulkan driver.

>
>
>>                          */
>> -                       si_pm4_set_reg(pm4,
>> R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xffff));
>> -                       si_pm4_set_reg(pm4,
>> R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(2));
>> +                       late_alloc_limit = 2;
>>                 } else {
>> -                       /* Set LATE_ALLOC_VS == 31. It should be less than
>> -                        * the number of scratch waves. Limitations:
>> -                        * - VS can't execute on CU0.
>> -                        * - If HS writes outputs to LDS, LS can't execute
>> on CU0.
>> +                       /* This is a good initial value.
>> +                        * We shouldn't run into a VS-PS deadlock, because
>> it
>> +                        * only allows 1 late_alloc wave per SIMD on
>> num_cu - 2.
>
>
> Disabling VS on CU0 should already guarantee that there's no VS-PS deadlock.
> So maybe with this change, we can allow VS to run on all CUs? At least the
> comment should reflect that.
>
> On the other hand, Vulkan actually looks at the "always-on CUs", which I
> guess is related to power-gating turning off CUs. So as long as we don't
> look at that -- and I kind of think we shouldn't, because we care about the
> high performance case where all CUs are turned on anyway -- we do need to
> keep CU0 disabled for VS to avoid deadlocks when CUs happen to be
> power-gated.

Well, I can change the comment, but I don't see anything else here
that should be changed.

>
> BTW, I recently realized that "late alloc" simply means that the VS waves
> get their export allocation as soon as possible, i.e. as soon as there's
> free space in the PC / POS export. So even late alloc waves might end up
> never being stalled on exports, depending on the overall latency of the VS
> relative to PS. Not important for the driver, but I found it interesting to
> know.

OK makes sense.

Marek