[Mesa-dev] [PATCH 2/2] radeonsi: rewrite late alloc VS limit computation

Mon Aug 28 20:25:42 UTC 2017

On 28.08.2017 21:43, Marek Olšák wrote:
> On Mon, Aug 28, 2017 at 12:50 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> On 25.08.2017 02:57, Marek Olšák wrote:
>>>
>>> From: Marek Olšák <marek.olsak at amd.com>
>>>
>>> This is still very simple, but it's better than before.
>>>
>>> Loosely ported from Vulkan.
>>> ---
>>>    src/gallium/drivers/radeonsi/si_state.c | 38
>>> ++++++++++++++++++++++-----------
>>>    1 file changed, 26 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/src/gallium/drivers/radeonsi/si_state.c
>>> b/src/gallium/drivers/radeonsi/si_state.c
>>> index 24e509c..393f960 100644
>>> --- a/src/gallium/drivers/radeonsi/si_state.c
>>> +++ b/src/gallium/drivers/radeonsi/si_state.c
>>> @@ -4649,40 +4649,54 @@ static void si_init_config(struct si_context
>>> *sctx)
>>>                          /* If this is 0, Bonaire can hang even if GS isn't
>>> being used.
>>>                           * Other chips are unaffected. These are
>>> suboptimal values,
>>>                           * but we don't use on-chip GS.
>>>                           */
>>>                          si_pm4_set_reg(pm4, R_028A44_VGT_GS_ONCHIP_CNTL,
>>>                                         S_028A44_ES_VERTS_PER_SUBGRP(64) |
>>>                                         S_028A44_GS_PRIMS_PER_SUBGRP(4));
>>>                  }
>>>                  si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS,
>>> S_00B21C_CU_EN(0xffff));
>>>    -             if (sscreen->b.info.num_good_compute_units /
>>> -                   (sscreen->b.info.max_se *
>>> sscreen->b.info.max_sh_per_se) <= 4) {
>>> +               /* Compute LATE_ALLOC_VS.LIMIT. */
>>> +               unsigned num_cu_per_sh =
>>> sscreen->b.info.num_good_compute_units /
>>> +                                        (sscreen->b.info.max_se *
>>> +                                         sscreen->b.info.max_sh_per_se);
>>> +               unsigned late_alloc_limit; /* The limit is per SH. */
>>> +
>>> +               if (sctx->b.family == CHIP_KABINI) {
>>> +                       late_alloc_limit = 0; /* Potential hang on Kabini.
>>> */
>>> +               } else if (num_cu_per_sh <= 4) {
>>>                          /* Too few available compute units per SH.
>>> Disallowing
>>> -                        * VS to run on CU0 could hurt us more than late
>>> VS
>>> +                        * VS to run on one CU could hurt us more than
>>> late VS
>>>                           * allocation would help.
>>>                           *
>>> -                        * LATE_ALLOC_VS = 2 is the highest safe number.
>>> +                        * 2 is the highest safe number that allows us to
>>> keep
>>> +                        * all CUs enabled.
>>
>>
>> Just to clarify: is the logic here that LATE_ALLOC_VS = 2 means up to 3 late
>> alloc waves, which means at least one SIMD is always free, even when there's
>> only a single CU?
> 
> Yes, according to the Vulkan driver.
> 
>>
>>
>>>                           */
>>> -                       si_pm4_set_reg(pm4,
>>> R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xffff));
>>> -                       si_pm4_set_reg(pm4,
>>> R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(2));
>>> +                       late_alloc_limit = 2;
>>>                  } else {
>>> -                       /* Set LATE_ALLOC_VS == 31. It should be less than
>>> -                        * the number of scratch waves. Limitations:
>>> -                        * - VS can't execute on CU0.
>>> -                        * - If HS writes outputs to LDS, LS can't execute
>>> on CU0.
>>> +                       /* This is a good initial value.
>>> +                        * We shouldn't run into a VS-PS deadlock, because
>>> it
>>> +                        * only allows 1 late_alloc wave per SIMD on
>>> num_cu - 2.
>>
>>
>> Disabling VS on CU0 should already guarantee that there's no VS-PS deadlock.
>> So maybe with this change, we can allow VS to run on all CUs? At least the
>> comment should reflect that.
>>
>> On the other hand, Vulkan actually looks at the "always-on CUs", which I
>> guess is related to power-gating turning off CUs. So as long as we don't
>> look at that -- and I kind of think we shouldn't, because we care about the
>> high performance case where all CUs are turned on anyway -- we do need to
>> keep CU0 disabled for VS to avoid deadlocks when CUs happen to be
>> power-gated.
> 
> Well, I can change the comment, but I don't see anything else here
> that should be changed.

Sure, with some comment change you can add my R-b.

Cheers,
Nicolai

> 
>>
>> BTW, I recently realized that "late alloc" simply means that the VS waves
>> get their export allocation as soon as possible, i.e. as soon as there's
>> free space in the PC / POS export. So even late alloc waves might end up
>> never being stalled on exports, depending on the overall latency of the VS
>> relative to PS. Not important for the driver, but I found it interesting to
>> know.
> 
> OK makes sense.
> 
> Marek
> 
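
For reference, the limit selection discussed above can be sketched as a standalone function. This is only an illustration, not the patch itself: the helper names are hypothetical, the 4-SIMDs-per-CU constant is an assumption about GCN, and the value in the final branch is inferred from the patch comment ("1 late_alloc wave per SIMD on num_cu - 2"), since that line is not quoted in this reply.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: a GCN compute unit has 4 SIMDs. */
#define SIMDS_PER_CU 4

/* Hypothetical helper; in the patch this logic is inline in si_init_config().
 * The returned limit is per SH. */
static unsigned late_alloc_vs_limit(bool is_kabini, unsigned num_good_cus,
                                    unsigned max_se, unsigned max_sh_per_se)
{
    assert(max_se > 0 && max_sh_per_se > 0); /* avoid division by zero */

    unsigned num_cu_per_sh = num_good_cus / (max_se * max_sh_per_se);

    if (is_kabini)
        return 0; /* potential hang on Kabini */

    if (num_cu_per_sh <= 4)
        return 2; /* highest value that keeps all CUs enabled for VS */

    /* Inferred from the patch comment, not quoted in this thread:
     * one late-alloc wave per SIMD on (num_cu - 2) CUs. */
    return (num_cu_per_sh - 2) * SIMDS_PER_CU;
}

/* Following the reading in this thread that LIMIT = 2 means up to 3
 * late-alloc waves; generalizing that to "limit + 1" is an assumption.
 * Returns true if at least one SIMD of a single CU stays free. */
static bool one_simd_stays_free(unsigned limit)
{
    return limit + 1 < SIMDS_PER_CU;
}
```

With these assumptions, a part with 4 CUs per SH gets a limit of 2, which leaves one SIMD free even on a single CU, while a part with 9 CUs per SH gets (9 - 2) * 4 = 28.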

-- 
Learn how the world really is,
but never forget how it should be.