[Mesa-dev] [PATCH 1/2] gallium: add PIPE_COMPUTE_CAP_SUBGROUP_SIZE

Fri Jun 5 22:41:12 PDT 2015

On Fri, Jun 5, 2015 at 10:35 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>> OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as
>> performance hint, since the actual value returned (the
>> PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
>> device property, so it may be tuned at kernel compilation time,
>> according to effective work-item SIMD usage.
>
> At least the way it's implemented in this series, it's a per-device
> property, and even though I see your point that it might be useful to
> have a finer-grained value in some cases, I don't think it's worth doing
> unless there is any evidence that the unlikely over-alignment of the
> work-group size will actually hurt performance for some application --
> And there isn't at this point because ILO doesn't currently support
> OpenCL AFAIK.

What I was trying to say is that while the cap itself is per-device,
the OpenCL property that relies on this cap isn't.
In this sense, I would expect the cap to report the actual _hardware_
property, and the higher level stack (OpenCL or whatever, if and when
it is supported) to massage the value as appropriate (e.g. by
multiplying by 4, the overcommit needed to keep the device pipes
full, and then dividing by the vector width of the kernel).
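
As a rough sketch of the kind of massaging I mean (the function name
and its parameters are hypothetical, made up purely for illustration,
and the overcommit factor is an assumption the caller would have to
supply):

```c
/*
 * Hypothetical helper, for illustration only: derive a per-kernel
 * preferred work-group size multiple from a raw per-device SIMD width.
 *
 * simd_width          - value the device cap would report (assumed raw)
 * overcommit          - assumed factor needed to keep the pipes full
 * kernel_vector_width - effective vector width used by the kernel
 */
static unsigned
preferred_wg_size_multiple(unsigned simd_width, unsigned overcommit,
                           unsigned kernel_vector_width)
{
   unsigned n;

   /* Treat a scalar or unknown kernel as vector width 1. */
   if (kernel_vector_width == 0)
      kernel_vector_width = 1;

   n = simd_width * overcommit / kernel_vector_width;

   /* Never report a multiple of zero. */
   return n ? n : 1;
}
```

With purely illustrative numbers, a raw width of 16 and an overcommit
of 4 would give 64 for a scalar kernel and 16 for a kernel that works
on 4-wide vectors.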

Ultimately the question is whether the device property (i.e. not the
OpenCL kernel property) should expose just the actual physical SIMD
width, or a less raw value that takes other aspects of the device into
consideration too.

In some sense the situation is similar to the one with the older
NVIDIA and AMD architectures, where the processing elements were
clustered in smaller blocks (e.g. 8 or 16 wide on NVIDIA, even though
the warp size was 32), which meant you _could_ efficiently use half-
or quarter-warps under specific conditions, but in most cases you
wanted to use multiples of full warps anyway.

However, on that hardware the instruction dispatch was actually at the
warp level. This has significant implications when implementing
lockless algorithms, for example: the warp or wavefront size on NVIDIA
and AMD becomes the largest number of work-items that can exchange
data without barriers. With the “dynamic SIMD” thing Intel has, would
we have any guarantee of synchronized forward progress? (That said,
people relying on the PREFERRED_WORK_GROUP_SIZE_MULTIPLE for lockless
algorithms are abusing the value for something it wasn't intended for.)

Ok, I'm convinced that 16 is a good choice for this cap on Intel, at
least for the current generation.

-- 
Giuseppe "Oblomov" Bilotta