[Mesa-dev] [PATCH 1/2] gallium: add PIPE_COMPUTE_CAP_SUBGROUP_SIZE
currojerez at riseup.net
Fri Jun 5 13:35:53 PDT 2015
Giuseppe Bilotta <giuseppe.bilotta at gmail.com> writes:
> On Fri, Jun 5, 2015 at 2:22 PM, Francisco Jerez <currojerez at riseup.net> wrote:
>> Giuseppe Bilotta <giuseppe.bilotta at gmail.com> writes:
>>> Ok, scratch that. I was confused by the fact that Beignet reports a
>>> preferred work-group size multiple of 16. Intel IGPs support _logical_
>>> SIMD width of up to 32, but the _hardware_ SIMD width is just 4. So
>>> the question is if here we should report the _hardware_ width, or the
>>> maximum _logical_ width.
>> The physical SIMD width of any Intel GPU that ILO supports is, as far
>> as I'm aware, 8. However, the hardware can execute 16- and in some
>> cases 32-wide instructions by splitting them internally into
>> instructions of the native SIMD width.
> Well, according to the Gen7.5 and 8 manuals I found on Intel's site,
> it's actually 4, although with 2 FPUs. If the FPUs can execute
> different (and independent) instructions, then the “lower SIMD limit”
> would be 4, not 8, although in practice each execution unit has 8 PEs.
That sounds roughly correct, but AFAIK before Gen8 there was only one
real FPU per EU; the other "pipe" was the special function unit, which
could also process some normal arithmetic instructions. In some (fairly
restricted) cases that allowed 8-wide execution of a single
instruction, or partial 4-wide execution of two instructions from
different threads at the same time. In Gen8, what used to be the math
pipe can in addition process general FPU instructions, allowing 8-wide
execution in more situations. In any case you are unlikely to get close
to full utilization of the EU by doing 4-wide only, not just because of
the cases you miss in which you could issue a single instruction
8-wide, but because of the fixed per-instruction overhead, which is (at
least) 2 cycles regardless of whether you are doing 4- or 8-wide. We
definitely don't want to encourage applications to use a work-group
size of four, because it's inefficient.
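The issue-overhead argument can be made concrete with a little arithmetic. This is a simplified sketch (the 2-cycle issue cost is the figure quoted above; real EU scheduling is more complicated, and the helper name is mine):

```c
#include <assert.h>

/* Channels retired per cycle, assuming a fixed per-instruction issue
 * cost of 2 cycles regardless of execution width (the figure quoted
 * above).  8-wide execution retires 4 channels/cycle, 4-wide only 2,
 * so a 4-wide-only workload wastes half the issue bandwidth. */
static double
channels_per_cycle(unsigned simd_width)
{
   const unsigned issue_cycles = 2;
   return (double)simd_width / issue_cycles;
}
```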
>> As this cap is just a performance hint, I think it makes sense to assume
>> the best-case scenario as Grigori has done. If the driver later on
>> decides it doesn't pay off to use the maximum SIMD width it can always
>> use less, but using more may be difficult if the application didn't keep
>> it in mind while choosing the workgroup layout.
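For illustration, a driver's answer to this cap might look roughly like the sketch below. It follows the gallium get_compute_param convention (value written through the pointer, size returned), but the enum and function names here are hypothetical stand-ins, not the actual patch:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the relevant PIPE_COMPUTE_CAP_* value. */
enum pipe_compute_cap_sketch {
   SKETCH_COMPUTE_CAP_SUBGROUP_SIZE
};

/* Sketch of a get_compute_param hook: report the best-case logical
 * SIMD width (16, as argued in this thread for Intel hardware), not
 * the narrower physical FPU width. */
static int
sketch_get_compute_param(enum pipe_compute_cap_sketch param, void *ret)
{
   switch (param) {
   case SKETCH_COMPUTE_CAP_SUBGROUP_SIZE: {
      const uint32_t subgroup_size = 16;
      if (ret)
         memcpy(ret, &subgroup_size, sizeof(subgroup_size));
      return sizeof(subgroup_size);
   }
   default:
      return 0;
   }
}
```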
> OTOH, at least in OpenCL, this cap wouldn't be used 'raw' as
> performance hint, since the actual value returned (the
> PREFERRED_WORK_GROUP_SIZE_MULTIPLE) is a kernel property rather than a
> device property, so it may be tuned at kernel compilation time,
> according to effective work-item SIMD usage.
At least the way it's implemented in this series, it's a per-device
property, and even though I see your point that it might be useful to
have a finer-grained value in some cases, I don't think it's worth
doing unless there is evidence that the unlikely over-alignment of the
work-group size will actually hurt performance for some application --
and there isn't at this point, because ILO doesn't currently support
compute.
> In this sense I think the cap itself should be a 'lower limit',
> i.e. the value under which the kernel simply cannot fully utilize the
> hardware.
Yeah, and a kernel using less than SIMD16 will most likely be unable to
fully utilize the hardware due to the pipeline stalls and issue
overhead, so I think it's the lower limit you're looking for.
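On the application side, treating the cap as a lower limit amounts to rounding the chosen work-group size up to a multiple of it. A minimal sketch (the helper name is mine, not from the patch or any API):

```c
/* Round a requested work-group size up to the next multiple of the
 * subgroup size reported by the driver, so no SIMD lanes are left
 * idle in the last subgroup-sized chunk. */
static unsigned
round_up_to_multiple(unsigned work_group_size, unsigned subgroup_size)
{
   return (work_group_size + subgroup_size - 1) / subgroup_size
          * subgroup_size;
}
```

For example, with a reported subgroup size of 16, a requested size of 100 would be rounded up to 112, while 32 is already aligned and stays 32.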
> IOW, I believe that if a larger group size than the physical SIMD
> width is needed for a specific kernel to fully utilize the hardware,
> this should be handled higher up in the stack, not at the level of
> this cap,
I don't think it can be handled higher up than in the pipe driver; only
the pipe driver has the hardware-specific knowledge required to answer
the question of which is the best work-group multiple for some specific
kernel.
> since the value here is going to be manipulated _anyway_
> (e.g. a kernel written for float16 might even end up recommending a
> work-group size multiple of 1, as an extreme example).
On Intel hardware a kernel using 16-component vectors would typically
be run scalarized with one SIMD channel per logical thread, so the
driver would still want the work-group size to be a multiple of 16.
> Giuseppe "Oblomov" Bilotta