[Mesa-dev] [PATCH 13/14] radeonsi: Process multiple patches per threadgroup.
Nicolai Hähnle
nhaehnle at gmail.com
Tue May 10 16:55:42 UTC 2016
On 10.05.2016 05:53, Bas Nieuwenhuizen wrote:
> Using more than 1 wave per threadgroup does increase performance
> generally. Not using too many patches per threadgroup also
> increases performance. Both catalyst and amdgpu-pro seem to
> use 40 patches as their maximum, but I haven't really seen
> any performance increase from limiting the number of patches
> to 40 instead of 64.
>
> Note that the trick where we overlap the input and output LDS
> does not work anymore as the insertion of the tess factors
> changes the patch stride.
So this still doesn't take into account potential LDS use by the shaders
themselves, e.g. because of alloca lowering. Perhaps this should be
noted in a comment somewhere?
>
> Signed-off-by: Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl>
> ---
> src/gallium/drivers/radeonsi/si_state_draw.c | 42 ++++++++++++++++++----------
> 1 file changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/src/gallium/drivers/radeonsi/si_state_draw.c b/src/gallium/drivers/radeonsi/si_state_draw.c
> index 7aa886a..3150489 100644
> --- a/src/gallium/drivers/radeonsi/si_state_draw.c
> +++ b/src/gallium/drivers/radeonsi/si_state_draw.c
> @@ -108,20 +108,7 @@ static void si_emit_derived_tess_state(struct si_context *sctx,
> unsigned input_patch_size, output_patch_size, output_patch0_offset;
> unsigned perpatch_output_offset, lds_size, ls_rsrc2;
> unsigned tcs_in_layout, tcs_out_layout, tcs_out_offsets;
> - unsigned offchip_layout;
> -
> - *num_patches = 1; /* TODO: calculate this */
> -
> - if (sctx->last_ls == ls->current &&
> - sctx->last_tcs == tcs &&
> - sctx->last_tes_sh_base == tes_sh_base &&
> - sctx->last_num_tcs_input_cp == num_tcs_input_cp)
> - return;
> -
> - sctx->last_ls = ls->current;
> - sctx->last_tcs = tcs;
> - sctx->last_tes_sh_base = tes_sh_base;
> - sctx->last_num_tcs_input_cp = num_tcs_input_cp;
> + unsigned offchip_layout, hardware_lds_size;
>
> /* This calculates how shader inputs and outputs among VS, TCS, and TES
> * are laid out in LDS. */
> @@ -146,9 +133,23 @@ static void si_emit_derived_tess_state(struct si_context *sctx,
> pervertex_output_patch_size = num_tcs_output_cp * output_vertex_size;
> output_patch_size = pervertex_output_patch_size + num_tcs_patch_outputs * 16;
>
> - output_patch0_offset = sctx->tcs_shader.cso ? input_patch_size * *num_patches : 0;
> + *num_patches = MIN2(256 / num_tcs_input_cp, 256 / num_tcs_output_cp);
> +
> + /* Make sure that the data fits in LDS. */
> + hardware_lds_size = sctx->b.chip_class >= CIK ? 65536 : 32768;
> + *num_patches = MIN2(*num_patches, hardware_lds_size / (input_patch_size +
> + output_patch_size));
> +
> + /* Make sure the output data fits in the offchip buffer */
> + *num_patches = MIN2(*num_patches, 32768 / output_patch_size);
That's a bit too much magic constants for my taste. Can you define a
named constant somewhere that both this calculation and the offchip
buffer allocation are based on?
> + /* Not necessary for correctness, but improves performance */
> + *num_patches = MIN2(*num_patches, 64);
> +
> + output_patch0_offset = input_patch_size * *num_patches;
> perpatch_output_offset = output_patch0_offset + pervertex_output_patch_size;
>
> +
Extraneous blank line.
Cheers,
Nicolai
> lds_size = output_patch0_offset + output_patch_size * *num_patches;
> ls_rsrc2 = ls->current->config.rsrc2;
>
> @@ -160,6 +161,17 @@ static void si_emit_derived_tess_state(struct si_context *sctx,
> ls_rsrc2 |= S_00B52C_LDS_SIZE(align(lds_size, 256) / 256);
> }
>
> + if (sctx->last_ls == ls->current &&
> + sctx->last_tcs == tcs &&
> + sctx->last_tes_sh_base == tes_sh_base &&
> + sctx->last_num_tcs_input_cp == num_tcs_input_cp)
> + return;
> +
> + sctx->last_ls = ls->current;
> + sctx->last_tcs = tcs;
> + sctx->last_tes_sh_base = tes_sh_base;
> + sctx->last_num_tcs_input_cp = num_tcs_input_cp;
> +
> /* Due to a hw bug, RSRC2_LS must be written twice with another
> * LS register written in between. */
> if (sctx->b.chip_class == CIK && sctx->b.family != CHIP_HAWAII)
>
More information about the mesa-dev
mailing list