[Mesa-dev] [PATCH 2/2] radeonsi: rewrite late alloc VS limit computation

Nicolai Hähnle nhaehnle at gmail.com
Mon Aug 28 10:50:57 UTC 2017


On 25.08.2017 02:57, Marek Olšák wrote:
> From: Marek Olšák <marek.olsak at amd.com>
> 
> This is still very simple, but it's better than before.
> 
> Loosely ported from Vulkan.
> ---
>   src/gallium/drivers/radeonsi/si_state.c | 38 ++++++++++++++++++++++-----------
>   1 file changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/src/gallium/drivers/radeonsi/si_state.c b/src/gallium/drivers/radeonsi/si_state.c
> index 24e509c..393f960 100644
> --- a/src/gallium/drivers/radeonsi/si_state.c
> +++ b/src/gallium/drivers/radeonsi/si_state.c
> @@ -4649,40 +4649,54 @@ static void si_init_config(struct si_context *sctx)
>   			/* If this is 0, Bonaire can hang even if GS isn't being used.
>   			 * Other chips are unaffected. These are suboptimal values,
>   			 * but we don't use on-chip GS.
>   			 */
>   			si_pm4_set_reg(pm4, R_028A44_VGT_GS_ONCHIP_CNTL,
>   				       S_028A44_ES_VERTS_PER_SUBGRP(64) |
>   				       S_028A44_GS_PRIMS_PER_SUBGRP(4));
>   		}
>   		si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS, S_00B21C_CU_EN(0xffff));
>   
> -		if (sscreen->b.info.num_good_compute_units /
> -		    (sscreen->b.info.max_se * sscreen->b.info.max_sh_per_se) <= 4) {
> +		/* Compute LATE_ALLOC_VS.LIMIT. */
> +		unsigned num_cu_per_sh = sscreen->b.info.num_good_compute_units /
> +					 (sscreen->b.info.max_se *
> +					  sscreen->b.info.max_sh_per_se);
> +		unsigned late_alloc_limit; /* The limit is per SH. */
> +
> +		if (sctx->b.family == CHIP_KABINI) {
> +			late_alloc_limit = 0; /* Potential hang on Kabini. */
> +		} else if (num_cu_per_sh <= 4) {
>   			/* Too few available compute units per SH. Disallowing
> -			 * VS to run on CU0 could hurt us more than late VS
> +			 * VS to run on one CU could hurt us more than late VS
>   			 * allocation would help.
>   			 *
> -			 * LATE_ALLOC_VS = 2 is the highest safe number.
> +			 * 2 is the highest safe number that allows us to keep
> +			 * all CUs enabled.

Just to clarify: is the logic here that LATE_ALLOC_VS = 2 means up to 3 
late alloc waves, which means at least one SIMD is always free, even 
when there's only a single CU?
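
To spell out the arithmetic behind that question, here is a minimal sketch. It assumes 4 SIMDs per CU, takes the patch's own comment that the limit is 0-based at face value, and assumes the worst case where every late alloc wave lands on a different SIMD of the same CU. The helper names are hypothetical and do not exist in Mesa.

```c
#include <assert.h>

/* Assumption: a GCN compute unit has 4 SIMDs. */
#define SIMDS_PER_CU 4

/* Per the patch comment, LIMIT is 0-based: a field value of N allows
 * N + 1 late alloc waves. The patch's assert implies a maximum of 63. */
static unsigned late_alloc_waves(unsigned limit_field)
{
	assert(limit_field <= 63);
	return limit_field + 1;
}

/* Hypothetical helper: number of SIMDs on a single CU that are guaranteed
 * not to hold a late alloc wave, assuming each such wave occupies a
 * different SIMD of that CU (worst case). */
static unsigned free_simds_on_one_cu(unsigned limit_field)
{
	unsigned waves = late_alloc_waves(limit_field);

	return waves >= SIMDS_PER_CU ? 0 : SIMDS_PER_CU - waves;
}
```

Under these assumptions, a field value of 2 leaves one SIMD free even with a single CU, while a value of 3 would leave none.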


>   			 */
> -			si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xffff));
> -			si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(2));
> +			late_alloc_limit = 2;
>   		} else {
> -			/* Set LATE_ALLOC_VS == 31. It should be less than
> -			 * the number of scratch waves. Limitations:
> -			 * - VS can't execute on CU0.
> -			 * - If HS writes outputs to LDS, LS can't execute on CU0.
> +			/* This is a good initial value.
> +			 * We shouldn't run into a VS-PS deadlock, because it
> +			 * only allows 1 late_alloc wave per SIMD on num_cu - 2.

Disabling VS on CU0 should already guarantee that there's no VS-PS 
deadlock. So maybe with this change, we can allow VS to run on all CUs? 
At least the comment should reflect that.

On the other hand, the Vulkan code actually looks at the "always-on CUs",
which I guess is related to power-gating turning off CUs. I kind of think
we shouldn't look at that, because we care about the high-performance
case where all CUs are turned on anyway. But as long as we don't, we do
need to keep CU0 disabled for VS to avoid deadlocks when CUs happen to
be power-gated.
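
For reference, the selection logic in the patch as posted can be pulled out into a standalone sketch. The struct and function names are hypothetical (the patch does this inline in si_init_config), and the sketch only mirrors the quoted diff. It does not include the always-on-CU handling discussed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the two values the patch programs. */
struct late_alloc_config {
	unsigned limit;    /* SPI_SHADER_LATE_ALLOC_VS.LIMIT (per SH) */
	unsigned vs_cu_en; /* SPI_SHADER_PGM_RSRC3_VS.CU_EN mask */
};

/* Mirrors the quoted diff. The caller must pass non-zero max_se and
 * max_sh_per_se, otherwise the division below is undefined. */
static struct late_alloc_config
compute_late_alloc(unsigned num_good_cu, unsigned max_se,
		   unsigned max_sh_per_se, bool is_kabini)
{
	struct late_alloc_config cfg;
	unsigned num_cu_per_sh = num_good_cu / (max_se * max_sh_per_se);

	if (is_kabini) {
		cfg.limit = 0; /* Potential hang on Kabini. */
	} else if (num_cu_per_sh <= 4) {
		cfg.limit = 2; /* Keeps all CUs enabled for VS. */
	} else {
		/* 1 late alloc wave per SIMD on num_cu_per_sh - 2 CUs. */
		unsigned waves = (num_cu_per_sh - 2) * 4;

		/* The limit is 0-based, so 0 means 1. */
		assert(waves > 0 && waves <= 64);
		cfg.limit = waves - 1;
	}

	/* VS is masked off one CU only if the limit is > 2. */
	cfg.vs_cu_en = cfg.limit > 2 ? 0xfffe : 0xffff;
	return cfg;
}
```

As an illustration, 36 good CUs spread over 4 SEs with one SH each gives 9 CUs per SH, so the limit becomes (9 - 2) * 4 - 1 = 27 and the VS CU mask becomes 0xfffe.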

BTW, I recently realized that "late alloc" simply means that the VS 
waves get their export allocation as soon as possible, i.e. as soon as 
there's free space in the PC / POS export. So even late alloc waves 
might end up never being stalled on exports, depending on the overall 
latency of the VS relative to PS. Not important for the driver, but I 
found it interesting to know.

Cheers,
Nicolai

>   			 */
> -			si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xfffe));
> -			si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(31));
> +			late_alloc_limit = (num_cu_per_sh - 2) * 4;
> +
> +			/* The limit is 0-based, so 0 means 1. */
> +			assert(late_alloc_limit > 0 && late_alloc_limit <= 64);
> +			late_alloc_limit -= 1;
>   		}
>   
> +		/* VS can't execute on one CU if the limit is > 2. */
> +		si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS,
> +			       S_00B118_CU_EN(late_alloc_limit > 2 ? 0xfffe : 0xffff));
> +		si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS,
> +			       S_00B11C_LIMIT(late_alloc_limit));
>   		si_pm4_set_reg(pm4, R_00B01C_SPI_SHADER_PGM_RSRC3_PS, S_00B01C_CU_EN(0xffff));
>   	}
>   
>   	if (sctx->b.chip_class >= VI) {
>   		unsigned vgt_tess_distribution;
>   
>   		si_pm4_set_reg(pm4, R_028424_CB_DCC_CONTROL,
>   			       S_028424_OVERWRITE_COMBINER_MRT_SHARING_DISABLE(1) |
>   			       S_028424_OVERWRITE_COMBINER_WATERMARK(4));
>   
> 


-- 
Learn how the world really is,
but never forget how it should be.

