[Mesa-dev] [PATCH 1/2] gallium: add PIPE_COMPUTE_CAP_SUBGROUP_SIZE
currojerez at riseup.net
Sat Jun 6 06:21:06 PDT 2015
Giuseppe Bilotta <giuseppe.bilotta at gmail.com> writes:
> On Fri, Jun 5, 2015 at 10:35 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as
>>> performance hint, since the actual value returned (the
>>> PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
>>> device property, so it may be tuned at kernel compilation time,
>>> according to effective work-item SIMD usage.
>> At least the way it's implemented in this series, it's a per-device
>> property, and even though I see your point that it might be useful to
>> have a finer-grained value in some cases, I don't think it's worth doing
>> unless there is evidence that the unlikely over-alignment of the
>> work-group size will actually hurt performance for some application --
>> and there isn't at this point because ILO doesn't currently support
>> OpenCL AFAIK.
> What I was trying to say is that while the cap itself is per-device,
> the OpenCL property that relies on this cap isn't.
> In this sense, I would expect the cap to report the actual _hardware_
> property, and the higher level stack (OpenCL or whatever, if and when
> it will be supported) to massage the value as appropriate (e.g. by
> multiplying by 4x —the overcommit needed to keep the device pipes
> full— and then dividing by the vector width of the kernel).
The problem is that it requires a lot of hardware-specific knowledge to
find the right over-commit factor (instruction latencies, issue
overhead, the fact that in some cases the pipeline is twice as wide),
and whether and to what extent the kernel needs to be scalarized -- and
the OpenCL state tracker is hardware-independent.
> Ultimately the question is if the device property (i.e. not the OpenCL
> kernel property) should expose just the actual physical SIMD width, or
> a less raw value that takes into consideration other aspects of the
> device too.
> In some sense the situation is similar to the one with the older
> NVIDIA and AMD architectures, where the processing elements were
> clustered in smaller blocks (e.g. 8 or 16 lanes on NVIDIA, even though
> the warp size was 32), which meant you _could_ efficiently use half-
> or quarter-warps under specific conditions, but in most cases you
> wanted to use multiples of full warps anyway.
> However, on that hardware the instruction dispatch was actually at the
> warp level. This has significant implications when implementing
> lockless algorithms, for example: the warp or wavefront size on NVIDIA
> and AMD becomes the largest number of work-items that can exchange
> data without barriers. With the “dynamic SIMD” thing Intel has, would
> we have any guarantee of synchronized forward progress?
Yes, you do. The fact that the FPUs are 4-wide is completely
transparent for the application (and even for the driver), it's just an
implementation detail: The EUs behave pretty much as if they really had
the logical SIMD width, executing instructions in order (e.g. a 4-wide
chunk of an instruction will never start execution before all 4-wide
chunks of the previous instruction have) and atomically (e.g. an FPU
instruction won't be able to see the effects of some 4-wide chunks of
another instruction but not others -- not even its own effects).
> (Yet, people relying on the PREFERRED_WORK_GROUP_SIZE_MULTIPLE for
> lockless algorithms are abusing a value for something it wasn't
> intended to be.)
> Ok, I'm convinced that 16 is a good choice for this cap on Intel, at
> least for the current generation.
> Giuseppe "Oblomov" Bilotta