[Mesa-dev] [PATCH v2 13/14] radeonsi: Process multiple patches per threadgroup.

Mon May 16 23:52:05 UTC 2016

On Mon, May 16, 2016 at 10:15 PM, Marek Olšák <maraeo at gmail.com> wrote:
> On Fri, May 13, 2016 at 3:37 AM, Bas Nieuwenhuizen
> <bas at basnieuwenhuizen.nl> wrote:
>> Using more than 1 wave per threadgroup does increase performance
>> generally.  Not using too many patches per threadgroup also
>> increases performance. Both catalyst and amdgpu-pro seem to
>> use 40 patches as their maximum, but I haven't really seen
>> any performance increase from limiting the number of patches
>> to 40 instead of 64.
>
> 40 may be optimal for existing OpenGL apps on some chips.
>
> Vulkan doesn't set more than 16.
>
> Let's set either 40 or 16 with a comment where the value comes from.

IIRC Heaven performed better with multiple waves per threadgroup,
which means >16 patches, as it uses 3 CPs (control points) per patch.
I'm not sure about 40, and I'm away from my dev machine at the moment.
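
As a rough illustration of the wave arithmetic, here is a minimal
sketch. It assumes 64 threads per wave (GCN) and one TCS thread per
control point; the helper name is made up for this example and is not
radeonsi code.

```c
#include <stdint.h>

#define WAVE_SIZE 64u /* threads per wave on GCN */

/* Hypothetical helper: number of waves needed for a threadgroup that
 * processes num_patches patches, assuming one thread per control point. */
static uint32_t waves_per_threadgroup(uint32_t num_patches,
                                      uint32_t cps_per_patch)
{
    uint32_t threads = num_patches * cps_per_patch;

    /* Round up to whole waves. */
    return (threads + WAVE_SIZE - 1) / WAVE_SIZE;
}
```

Under those assumptions, 3 CPs per patch gives 1 wave for 16 patches
(48 threads), 2 waves for 40 patches (120 threads) and 3 waves for
64 patches (192 threads).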

>
>>
>> Note that the trick where we overlap the input and output LDS
>> does not work anymore as the insertion of the tess factors
>> changes the patch stride.
>
> I don't understand this. Can you explain it more?

When we didn't have a TCS, we would just use the TCS input as the TCS
output and let the fixed-function TCS add the per-patch outputs (tess
factors) at the end.

This works fine with a single patch, but not with multiple patches.
To see why, we have to look at the input/output layout in LDS, which
is:
Attributes for patch 0 vertex 0.
Attributes for patch 0 vertex 1.
...
Per patch attributes for patch 0.
Attributes for patch 1 vertex 0.
...

So the number of per-patch attributes changes the stride between
patches.  The LS output has 0 per-patch attributes, while the TCS
output has at least the tess factors, so the two strides differ.
Therefore the second and later patches start at different offsets in
the TCS input and the TCS output, so we need to copy or move them.
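
The offset mismatch can be sketched with a bit of arithmetic. This is
only an illustration of the layout above: the helper names, the dword
units and the example sizes (one vec4 per vertex, 6 dwords of tess
factors) are assumptions, not values taken from the driver.

```c
#include <stdint.h>

/* Hypothetical helper: size in dwords of one patch in LDS, i.e. all
 * per-vertex attributes followed by the per-patch attributes. */
static uint32_t patch_stride(uint32_t verts_per_patch,
                             uint32_t dwords_per_vertex,
                             uint32_t per_patch_dwords)
{
    return verts_per_patch * dwords_per_vertex + per_patch_dwords;
}

/* Hypothetical helper: dword offset of the start of a given patch. */
static uint32_t patch_offset(uint32_t patch, uint32_t stride)
{
    return patch * stride;
}
```

With 3 vertices per patch and 4 dwords per vertex, the LS output
stride is patch_stride(3, 4, 0) = 12, while the TCS output stride with
6 dwords of tess factors is patch_stride(3, 4, 6) = 18. Patch 0 starts
at offset 0 in both layouts, but patch 1 starts at 12 in the input and
at 18 in the output.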

I hope this makes things a bit clearer.

- Bas

>
> Marek