[Beignet] [PATCH] Add memory fence before barrier to support global memory barrier.

Mon Jun 17 23:38:54 PDT 2013

Zhigang Gong <zhigang.gong at linux.intel.com> writes:

> On Mon, Jun 17, 2013 at 11:41:17PM +0200, Dag Lem wrote:
>> Zhigang Gong <zhigang.gong at linux.intel.com> writes:
>> 
>> > This patch looks good to me, and it passes the global memory barrier
>> > case. Thanks for the patch; I will push it later.
>> > As to the local memory fence, according to the bspec, we don't need to
>> > issue a fence to ensure the memory access ordering. I heard from you
>> > that you tried modifying the test case to use local memory and a local
>> > fence, and that it also had problems. Could you submit your modification
>> > as a new test case for the local memory barrier? Then we can all take a
>> > look at that case and investigate the underlying problem.
>> >
>> 
>> I suspect that any problems with local memory may have a different
>> cause.
>> 
>> As far as I understand, SLM (shared local memory) can only be allocated
>> per half-slice (i.e. 8 EUs for IVB GT2).
>> 
>> In OpenCL, on the other hand, local memory is allocated per work group.
>> 
>> This implies that Beignet can either
>> 
>> a) Always make a work group correspond to a half-slice (inflexible).
>> b) Never run more than one work group (<= half-slice) at once (slow).
>> c) Subdivide local memory per work group (<= half-slice) (good).
>> 
>> However, I suspect that Beignet does none of these, but rather lets all
>> work groups share the same local memory. This will lead to different
>> work groups stomping on each other's supposedly local memory.
>> 
>> My apologies if I am talking nonsense here; my understanding
>> of these issues is quite limited :-)
>
> Thanks for your comments here. I dug into the spec, and I think there
> should be no SLM allocation issue, as we are using GPGPU_WALKER and the
> automatic SLM allocation mechanism provided by the Gen7 hardware.
>
> We already set the group size and the SLM size in the interface descriptor,
> and we use the GPGPU walker to dispatch the threads automatically. The
> hardware should make sure that one thread group is dispatched to the same
> half-slice.

But, does a thread group really have a one-to-one correspondence to a
work group in Beignet?

If a thread group encompasses more than one work group, there will be
problems with local memory.

On the other hand, if a thread group is equal to one work group, Beignet
will be unable to run more than one work group at once, which will
severely limit the performance of runs with small local sizes.
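
As a rough illustration of option (c) quoted above, here is a minimal C
sketch of subdividing one half-slice's SLM among co-resident work groups.
All names are hypothetical and nothing here is Beignet code; the 64 KB SLM
size is an assumed, illustrative value, not a figure taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: SLM bytes available to one half-slice (illustrative only). */
#define HALF_SLICE_SLM_BYTES ((size_t)64 * 1024)

/* Number of work groups whose local memory fits side by side in the SLM. */
static size_t groups_per_half_slice(size_t slm_bytes, size_t local_bytes)
{
    assert(local_bytes > 0 && local_bytes <= slm_bytes);
    return slm_bytes / local_bytes;
}

/* Base offset of the SLM region given to the work group in a given slot. */
static size_t slm_offset(size_t slot, size_t local_bytes)
{
    return slot * local_bytes;
}

/* True if two regions of equal length starting at a_off and b_off overlap. */
static int regions_overlap(size_t a_off, size_t b_off, size_t len)
{
    return a_off < b_off + len && b_off < a_off + len;
}
```

With 4 KB of local memory per work group, groups_per_half_slice() allows 16
work groups to be resident at once, each with a disjoint region. If every
work group instead used base offset 0, regions_overlap() would report the
stomping described above.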

>
> But we do hit an issue here, and a memory fence does fix it, which does
> not match the spec, which says that SLM doesn't need a memory fence.
>
> This is really strange to me. Any other comments?
>
>> 
>> -- 
>> Dag
>

-- 
Dag