Mesa (master): radeonsi: limit HS LDS usage per workgroup to 16K to allow at least 2 WGs/CU

Mon Nov 23 03:15:54 UTC 2020

Module: Mesa
Branch: master
Commit: 5df5ee2722f44782d8bb6562d0e11ffff813ed46
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=5df5ee2722f44782d8bb6562d0e11ffff813ed46

Author: Marek Olšák <marek.olsak at amd.com>
Date:   Fri Nov 13 00:38:06 2020 -0500

radeonsi: limit HS LDS usage per workgroup to 16K to allow at least 2 WGs/CU

This increases occupancy when a workgroup would otherwise need, e.g., 20K of
LDS for 3 waves. With the size limited to 16K, 2 workgroups of 2 waves each
fit on a CU, for 4 waves in total.
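
The arithmetic above can be checked with a small sketch. It assumes a 32K LDS
budget per CU for HS workgroups, which is what the numbers in this message
imply rather than a documented hardware figure, and it treats LDS as the only
limiting resource. `waves_per_cu` and its parameters are hypothetical names,
not radeonsi code.

```c
/* Hypothetical helper, not part of radeonsi: number of waves resident on one
 * CU when LDS is the only limiting resource. lds_per_wg must be nonzero. */
static unsigned waves_per_cu(unsigned lds_budget, unsigned lds_per_wg,
                             unsigned waves_per_wg)
{
   /* Whole workgroups that fit in the LDS budget (integer division). */
   unsigned workgroups = lds_budget / lds_per_wg;

   return workgroups * waves_per_wg;
}
```

Under the assumed 32K budget, `waves_per_cu(32 * 1024, 20 * 1024, 3)` gives 3
(one workgroup), while `waves_per_cu(32 * 1024, 16 * 1024, 2)` gives 4 (two
workgroups).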

Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer at amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/7623>

---
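
The clamp introduced by the diff below can be exercised in isolation with the
following sketch. `clamp_num_patches` is a hypothetical standalone function,
and `MIN2`/`MAX2` are redefined locally as an assumption instead of being
taken from Mesa's headers; the two size constants are the ones in the patch.

```c
#include <assert.h>

/* Local stand-ins for Mesa's macros (assumption: same min/max semantics). */
#define MIN2(a, b) ((a) < (b) ? (a) : (b))
#define MAX2(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical wrapper around the three new statements in the patch.
 * lds_per_patch must be nonzero. */
static unsigned clamp_num_patches(unsigned num_patches, unsigned lds_per_patch)
{
   const unsigned max_lds_size = 32 * 1024;    /* hw limit, from the patch */
   const unsigned target_lds_size = 16 * 1024; /* 2 workgroups per CU */

   /* Prefer staying within the 16K target... */
   num_patches = MIN2(num_patches, target_lds_size / lds_per_patch);
   /* ...but always run at least one patch, even if it exceeds the target. */
   num_patches = MAX2(num_patches, 1);
   /* The result must still respect the 32K limit. */
   assert(num_patches * lds_per_patch <= max_lds_size);
   return num_patches;
}
```

With 1K per patch, a request for 40 patches is clamped to 16. With 20K per
patch, the division yields 0, `MAX2` raises it back to 1, and the single patch
still fits under the 32K limit.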

 src/gallium/drivers/radeonsi/si_state_draw.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/gallium/drivers/radeonsi/si_state_draw.c b/src/gallium/drivers/radeonsi/si_state_draw.c
index 2179f1f2488..beff65f3786 100644
--- a/src/gallium/drivers/radeonsi/si_state_draw.c
+++ b/src/gallium/drivers/radeonsi/si_state_draw.c
@@ -85,7 +85,7 @@ static void si_emit_derived_tess_state(struct si_context *sctx, const struct pip
    unsigned input_patch_size, output_patch_size, output_patch0_offset;
    unsigned perpatch_output_offset, lds_per_patch, lds_size;
    unsigned tcs_in_layout, tcs_out_layout, tcs_out_offsets;
-   unsigned offchip_layout, hardware_lds_size, ls_hs_config;
+   unsigned offchip_layout, max_lds_size, target_lds_size, ls_hs_config;
 
    /* Since GFX9 has merged LS-HS in the TCS state, set LS = TCS. */
    if (sctx->chip_class >= GFX9) {
@@ -163,9 +163,14 @@ static void si_emit_derived_tess_state(struct si_context *sctx, const struct pip
     * While GFX7 can use 64K per threadgroup, there is a hang on Stoney
     * with 2 CUs if we use more than 32K. The closed Vulkan driver also
     * uses 32K at most on all GCN chips.
+    *
+    * Use 16K so that we can fit 2 workgroups on the same CU.
     */
-   hardware_lds_size = 32768;
-   *num_patches = MIN2(*num_patches, hardware_lds_size / lds_per_patch);
+   max_lds_size = 32 * 1024; /* hw limit */
+   target_lds_size = 16 * 1024; /* target at least 2 workgroups per CU, 16K each */
+   *num_patches = MIN2(*num_patches, target_lds_size / lds_per_patch);
+   *num_patches = MAX2(*num_patches, 1);
+   assert(*num_patches * lds_per_patch <= max_lds_size);
 
    /* Make sure the output data fits in the offchip buffer */
    *num_patches =