[Beignet] [PATCH] Add memory fence before barrier to support global memory barrier.

Zhigang Gong zhigang.gong at linux.intel.com
Mon Jun 17 23:57:27 PDT 2013


On Tue, Jun 18, 2013 at 08:38:54AM +0200, Dag Lem wrote:
> Zhigang Gong <zhigang.gong at linux.intel.com> writes:
> 
> > On Mon, Jun 17, 2013 at 11:41:17PM +0200, Dag Lem wrote:
> >> Zhigang Gong <zhigang.gong at linux.intel.com> writes:
> >> 
> >> > This patch looks good to me. And it can pass the global memory barrier
> >> > case. Thanks for the patch, I will push it later.
> >> > As to the local memory fence, according to the bspec, we don't need to
> >> > issue a fence to ensure the memory access ordering. I heard from you
> >> > that you tried to modify the test case to use local memory and local fence,
> >> > it also had problem. Could you submit your modification as a new test
> >> > case for local memory barrier? Then we can all take a look at that case
> >> > and investigate the underlying problem.
> >> >
> >> 
> >> I suspect that any problems with local memory may have a different
> >> cause.
> >> 
> >> As far as I understand, SLM (shared local memory) can only be allocated
> >> per half-slice (i.e. 8 EUs for IVB GT2).
> >> 
> >> In OpenCL, on the other hand, local memory is allocated per work group.
> >> 
> >> This implies that Beignet can either
> >> 
> >> a) Always make a work group correspond to a half-slice (inflexible).
> >> b) Never run more than one work group (<= half-slice) at once (slow).
> >> c) Subdivide local memory per work group (<= half-slice) (good).
> >> 
> >> However I suspect that Beignet does none of these, but rather lets all
> >> work groups share the same local memory. This will lead to different
> >> work groups stomping on each others' supposedly local memory.
> >> 
> >> Please forgive me if I'm talking nonsense here; my understanding
> >> of these issues is quite limited :-)
> >
> > Thanks for your comments here. I dug into the spec, and I think there
> > should be no SLM allocation issue, as we are using GPGPU_WALKER and the
> > automatic SLM allocation mechanism provided by the Gen7 hardware.
> >
> > We already set the group size and the SLM size at the interface descriptor,
> > and we use the GPGPU walker to dispatch the threads automatically. The
> > hardware should make sure to dispatch each thread group to a single half-slice.
> 
> But, does a thread group really have a one-to-one correspondence to a
> work group in Beignet?
Right, it is designed to work that way; if it doesn't, then we have a serious
bug here. To support massive thread dispatching, we must use GPGPU_WALKER rather
than GPGPU_OBJECT. And with GPGPU_WALKER, the hardware doesn't even give us
a chance to set the half-slice id.
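
As an aside, the local-memory barrier case discussed above could be sketched
roughly like this (a hypothetical kernel for illustration, not the actual test
case from the tests tree; the local size cap of 256 is an assumption):

```c
/* Hypothetical local-memory barrier test kernel.  Each work item writes
 * its id into local memory, then reads its neighbour's slot; without a
 * working barrier (and, apparently, a fence on our hardware) the read
 * may observe a stale value. */
__kernel void test_local_barrier(__global int *dst)
{
    __local int tmp[256];              /* assumes local size <= 256 */
    int lid = get_local_id(0);
    int sz  = get_local_size(0);

    tmp[lid] = lid;
    barrier(CLK_LOCAL_MEM_FENCE);      /* local fence + barrier */
    dst[get_global_id(0)] = tmp[(lid + 1) % sz];
}
```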

> 
> If a thread group encompasses more than one work group, there will be
> problems with local memory.
> 
> On the other hand, if a thread group is equal to one work group, Beignet
> will be unable to run more than one work group at once, which will
> severely limit the performance of runs with small local sizes.
IMO, it will not dispatch only one thread group at a time. The GPGPU walker
will dispatch as many thread groups at a time as possible, based on the
currently available EUs. The hardware just needs to ensure that one
work/thread group does not cross a half-slice boundary.

> 
> >
> > But we do hit an issue here, and a memory fence does fix it, which really
> > does not comply with the spec, which says that SLM doesn't need a memory fence.
> >
> > Really strange for me. Any other comments?
> >
> >> 
> >> -- 
> >> Dag
> >> _______________________________________________
> >> Beignet mailing list
> >> Beignet at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/beignet
> >
> 
> -- 
> Dag

