[Mesa-dev] [PATCH 2/7] radeonsi: compute shaders w/out LDS/barriers don't have lower register limit

Fri May 26 09:49:15 UTC 2017

On 25.05.2017 19:04, Marek Olšák wrote:
> From: Marek Olšák <marek.olsak at amd.com>
> 
> Or do they? This doesn't hang, so it seems right, but I'm not 100% sure.
> Setting VGPRS=256 (i.e. above the limit) with big threadgroups works fine.
> 
> shader-db: Spilled VGPRs: 107 -> 50 (-53.27 %)
> 
> DiRT Showdown and GRID Autosport have 100% reduction in VGPR spilling.
> There are no other changes for shader-db.

Hmm. I'm very confused that this supposedly works. Which hardware did 
you test this with?

I see no mention or logic anywhere that will split threadgroups across 
CUs. I also don't see how that's possible, since the resources to manage 
barriers are allocated per-CU, and the hardware always allocates those.

Okay... digging deeper, it looks like this will probably just launch the 
waves in sequence: with VGPRS=256, you'll get up to 4 waves at a time 
(one per SIMD), and the rest will be stalled. Since there are no 
barriers, that's okay, but it's not great: more stalling / less 
parallelism. Also, judging from a skim of the RTL, some of the logic at 
least on gfx6 that tries to be smart about selecting which CU to launch 
the threadgroups on will get confused, because the total number of VGPRs 
required for the threadgroup will overflow.

Unless you get measurable performance benefits somewhere (which it seems 
you don't), I'd rather drop this. Even if you do get performance 
benefits, it would probably be even better to use smaller 
workgroups/threadgroups at the hardware level, and use a bit of prolog 
code to fixup the application-visible local/global invocation IDs.

Cheers,
Nicolai

> ---
>   src/gallium/drivers/radeonsi/si_shader.c | 9 +++++++++
>   1 file changed, 9 insertions(+)
> 
> diff --git a/src/gallium/drivers/radeonsi/si_shader.c b/src/gallium/drivers/radeonsi/si_shader.c
> index 61f1384..0ffe402 100644
> --- a/src/gallium/drivers/radeonsi/si_shader.c
> +++ b/src/gallium/drivers/radeonsi/si_shader.c
> @@ -4040,20 +4040,29 @@ static unsigned si_get_max_workgroup_size(const struct si_shader *shader)
>   	               properties[TGSI_PROPERTY_CS_FIXED_BLOCK_WIDTH] *
>   	               properties[TGSI_PROPERTY_CS_FIXED_BLOCK_HEIGHT] *
>   	               properties[TGSI_PROPERTY_CS_FIXED_BLOCK_DEPTH];
>   
>   	if (!max_work_group_size) {
>   		/* This is a variable group size compute shader,
>   		 * compile it for the maximum possible group size.
>   		 */
>   		max_work_group_size = SI_MAX_VARIABLE_THREADS_PER_BLOCK;
>   	}
> +
> +	/* Compute shader threadgroups without LDS usage and barriers don't
> +	 * have to be stuck on the same compute unit, and so register usage
> +	 * doesn't have to be limited.
> +	 */
> +	if (!shader->selector->local_size &&
> +	    !shader->selector->info.uses_barrier)
> +		return MIN2(64, max_work_group_size);
> +
>   	return max_work_group_size;
>   }
>   
>   static void declare_per_stage_desc_pointers(struct si_shader_context *ctx,
>   					    LLVMTypeRef *params,
>   					    unsigned *num_params,
>   					    bool assign_params)
>   {
>   	params[(*num_params)++] = si_const_array(ctx->v4i32,
>   						 SI_NUM_SHADER_BUFFERS + SI_NUM_CONST_BUFFERS);
> 

-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.