[Mesa-dev] [PATCH 1/2] gallium: add PIPE_COMPUTE_CAP_SUBGROUP_SIZE

Fri Jun 5 06:18:27 PDT 2015

Hello,

On Fri, Jun 5, 2015 at 2:22 PM, Francisco Jerez <currojerez at riseup.net> wrote:
> Giuseppe Bilotta <giuseppe.bilotta at gmail.com> writes:
>>
>> Ok, scratch that. I was confused by the fact that Beignet reports a
>> preferred work-group size multiple of 16. Intel IGPs support _logical_
>> SIMD width of up to 32, but the _hardware_ SIMD width is just 4. So
>> the question is if here we should report the _hardware_ width, or the
>> maximum _logical_ width.
>>
> The physical SIMD width of any Intel GPU that as far as I'm aware ILO
> supports is 8, however, the hardware can execute 16- and in some cases
> 32-wide instructions by splitting them internally into instructions of
> the native SIMD width.

Well, according to the Gen7.5 and 8 manuals I found on Intel's site,
it's actually 4, although with 2 FPUs. If the FPUs can execute
different (and independent) instructions, then the “lower SIMD limit”
would be 4, not 8, although in practice each execution unit has 8 PEs
available.

[snip]

> As this cap is just a performance hint, I think it makes sense to assume
> the best-case scenario as Grigori has done.  If the driver later on
> decides it doesn't pay off to use the maximum SIMD width it can always
> use less, but using more may be difficult if the application didn't keep
> it in mind while choosing the workgroup layout.

OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as a
performance hint, since the actual value returned (the
PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
device property, so it may be tuned at kernel compilation time,
according to effective work-item SIMD usage. In this sense I think the
cap itself should be a 'lower limit', i.e. the value under which the
kernel simply cannot fully utilize the hardware.

IOW, I believe that if a larger group size than the physical SIMD
width is needed for a specific kernel to fully utilize the hardware,
this should be handled higher up in the stack, not at the level of
this cap, since the value here is going to be manipulated _anyway_
(e.g. a kernel written for float16 might even end up recommending a
work-group size multiple of 1, as an extreme example).

-- 
Giuseppe "Oblomov" Bilotta