[Mesa-dev] [PATCH 2/7] radeonsi: compute shaders w/out LDS/barriers don't have lower register limit
Nicolai Hähnle
nhaehnle at gmail.com
Fri May 26 09:49:15 UTC 2017
On 25.05.2017 19:04, Marek Olšák wrote:
> From: Marek Olšák <marek.olsak at amd.com>
>
> Or do they? This doesn't hang, so it seems right, but I'm not 100% sure.
> Setting VGPRS=256 (i.e. above the limit) with big threadgroups works fine.
>
> shader-db: Spilled VGPRs: 107 -> 50 (-53.27 %)
>
> DiRT Showdown and GRID Autosport have 100% reduction in VGPR spilling.
> There are no other changes for shader-db.
Hmm. I'm very confused that this supposedly works. Which hardware did
you test this with?
I see no mention or logic anywhere that will split threadgroups across
CUs. I also don't see how that's possible, since the resources to manage
barriers are allocated per-CU, and the hardware always allocates those.
Okay... digging deeper, it looks like this will probably just launch the
waves in sequence: with VGPRS=256, you'll get up to 4 waves at a time
(one per SIMD), and the rest will be stalled. Since there are no
barriers, that's okay, but it's not great: more stalling / less
parallelism. Also, judging from a skim of the RTL, some of the logic at
least on gfx6 that tries to be smart about selecting which CU to launch
the threadgroups on will get confused, because the total number of VGPRs
required for the threadgroup will overflow.
Unless you get measurable performance benefits somewhere (which it seems
you don't), I'd rather drop this. Even if you do get performance
benefits, it would probably be even better to use smaller
workgroups/threadgroups at the hardware level, and use a bit of prolog
code to fixup the application-visible local/global invocation IDs.
Cheers,
Nicolai
> ---
> src/gallium/drivers/radeonsi/si_shader.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/src/gallium/drivers/radeonsi/si_shader.c b/src/gallium/drivers/radeonsi/si_shader.c
> index 61f1384..0ffe402 100644
> --- a/src/gallium/drivers/radeonsi/si_shader.c
> +++ b/src/gallium/drivers/radeonsi/si_shader.c
> @@ -4040,20 +4040,29 @@ static unsigned si_get_max_workgroup_size(const struct si_shader *shader)
> properties[TGSI_PROPERTY_CS_FIXED_BLOCK_WIDTH] *
> properties[TGSI_PROPERTY_CS_FIXED_BLOCK_HEIGHT] *
> properties[TGSI_PROPERTY_CS_FIXED_BLOCK_DEPTH];
>
> if (!max_work_group_size) {
> /* This is a variable group size compute shader,
> * compile it for the maximum possible group size.
> */
> max_work_group_size = SI_MAX_VARIABLE_THREADS_PER_BLOCK;
> }
> +
> + /* Compute shader threadgroups without LDS usage and barriers don't
> + * have to be stuck on the same compute unit, and so register usage
> + * doesn't have to be limited.
> + */
> + if (!shader->selector->local_size &&
> + !shader->selector->info.uses_barrier)
> + return MIN2(64, max_work_group_size);
> +
> return max_work_group_size;
> }
>
> static void declare_per_stage_desc_pointers(struct si_shader_context *ctx,
> LLVMTypeRef *params,
> unsigned *num_params,
> bool assign_params)
> {
> params[(*num_params)++] = si_const_array(ctx->v4i32,
> SI_NUM_SHADER_BUFFERS + SI_NUM_CONST_BUFFERS);
>
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
More information about the mesa-dev
mailing list