[Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

Francisco Jerez currojerez at riseup.net
Sun Aug 13 03:14:50 UTC 2017


Jan Vesely <jan.vesely at rutgers.edu> writes:

> On Sat, 2017-08-05 at 12:34 -0700, Francisco Jerez wrote:
>> Francisco Jerez <currojerez at riseup.net> writes:
>> 
>> > Jan Vesely <jan.vesely at rutgers.edu> writes:
>> > 
>> > > Hi,
>> > > 
>> > > thanks for detailed explanation. I indeed missed the writeBuffer part
>> > > in specs.
>> > > 
>> > > On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>> > > > These changes are somewhat redundant and potentially
>> > > > performance-impacting, the reason is that in the OpenCL API,
>> > > > clEnqueueWrite* commands are specified to block until the memory
>> > > > provided by the application as origin can be reused safely (i.e. until
>> > > > soft_copy_op()() runs), not necessarily until the transfer to graphics
>> > > > memory has completed (which is what hard_event::wait() will wait for).
>> > > > OTOH reads and maps as implemented by soft_copy_op and friends are
>> > > > essentially blocking, so the wait() call is redundant in most cases.
>> > > 
>> > > That explains a noticeable slowdown running piglit with these changes.
>> > > I'm not sure about the read part though. I expected it to be basically
>> > > a noop, but it changes behaviour.
>> > 
>> > I think this change would have slowed you down the most whenever the
>> > mapping operation performed by soft_copy_op() is able to proceed
>> > immediately, either because the buffer is idle (so the driver doesn't
>> > stall on transfer_map()) *or* because the driver is trying to be smart
>> > and creates a bounce buffer into which the data can be copied
>> > immediately without stalling, then inserts a pipelined GPU copy from
>> > the bounce buffer into the real buffer.  With this patch you will
>> > stall on the GPU copy regardless (and whatever other work was already
>> > on the command stream, which might be substantial), even though it
>> > wouldn't have been necessary in either of these cases.
>> > 
>> > > Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> > > blocking read in one of the piglit tests surprisingly returns
>> > > CL_QUEUED.
>> > > 
>> > 
>> > Hmm, yeah, that seems kind of debatable behaviour, although it's
>> > definitely legit for writes, not quite sure for reads...  I believe the
>> > reason that happens is that the CPU copy proceeds very quickly
>> > (due to the reasons mentioned in the last paragraph), but the hard_event
>> > is still associated with a pipe_fence synchronous with the GPU's command
>> > stream, so it won't get signalled until the GPU catches up.
>
> Hi,
> sorry for the delay, last week was submission week...
>

No worries.

> The part that I'm still missing is what kind of GPU work needs to be
> done after clEnqueueRead*(). I assume all necessary work is completed
> before I can access the data.
> Also, the CL_QUEUED status was surprising; since we performed at least
> some of the work (we got the data), I'd expect at least CL_RUNNING or
> CL_SUBMITTED.
>

The lag is not due to GPU work that needs to be performed after the
clEnqueueRead call, but due to GPU work that may precede it in the
command stream: because clover doesn't know the last time the buffer
was referenced by GPU work, the event is associated with a fence
synchronous with the current batch (which obviously won't signal before
any of the GPU work that actually referenced the buffer has completed).
However, the pipe driver has a more accurate idea of when the buffer
was last used, so the transfer_map() operation is likely to complete
sooner than the CL event status changes to CL_COMPLETE.  The reason
you're seeing CL_QUEUED rather than CL_SUBMITTED is that the read
operation didn't even need to flush the current batch, so there's no
fence associated with the CL event object yet.
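
To make the scenario concrete, here's a minimal standalone sketch of
the pattern under discussion (hypothetical example, not taken from the
piglit test; error checking omitted and a GPU OpenCL device is assumed
to be available):

   #include <CL/cl.h>
   #include <cstdio>
   #include <vector>

   int main() {
      cl_platform_id platform;
      cl_device_id dev;
      cl_int err;

      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

      cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
      cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

      const size_t size = 1 << 20;
      std::vector<char> host(size);
      cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

      // Blocking read: returns once 'host' holds the data...
      cl_event ev;
      clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host.data(),
                          0, NULL, &ev);

      // ...but on clover the event may still report CL_QUEUED, since the
      // fence backing it is tied to the current batch rather than to the
      // CPU-side copy that actually produced the data.
      cl_int status;
      clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                     sizeof(status), &status, NULL);
      std::printf("status after blocking read: %d (CL_COMPLETE = %d)\n",
                  (int)status, (int)CL_COMPLETE);

      clReleaseEvent(ev);
      clReleaseMemObject(buf);
      clReleaseCommandQueue(queue);
      clReleaseContext(ctx);
      return 0;
   }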

>> > 
>> > > The spec doesn't mention the use of events with blocking reads, but it
>> > > does say that "When the read command has completed, the contents of the
>> > > buffer that ptr points to can be used by the application." in the
>> > > non-blocking section. I'd say that the expectation is for the event to
>> > > be CL_COMPLETE after a blocking read (at least beignet/pocl/intel-cpu-sdk
>> > > follow this).
>> > > 
>> > > > 
>> > > > The only reason why it might be useful to behave differently on blocking
>> > > > transfers is that the application may have specified a user event in the
>> > > > event dependency list, which may cause the soft_copy_op() call to be
>> > > > delayed until the application signals the user event.  In order to fix
>> > > > that it should really only be necessary to wait for the event action to
>> > > > be executed, not necessarily its associated GPU work.
>> > > 
>> > > I think that another use is that non-blocking writes do not create an
>> > > extra copy of the buffer. Thus
>> > > clEnqueueWriteBuffer(..., CL_FALSE, ..., &ev)
>> > > clWaitForEvents(1, &ev)
>> > > is more memory efficient.
>> > > 
>> > > > 
>> > > > Last time this issue came up (yeah, it's not the first time) I proposed
>> > > > the patches below to add a mechanism to wait for the event action only;
>> > > > feel free to include them as PATCH 0.1 and 0.2 of this series (it's been
>> > > > a while, so they may no longer apply cleanly).
>> > > 
>> > > I think we can just add comments explaining why the blocking argument
>> > > is ignored, until someone chooses to fix this problem.
>> > 
>> > I think the problem is definitely worth fixing, and it shouldn't really
>> > take more effort than adding comments explaining the current behaviour
>> > ;).  Basically, just add a bunch of 'if (blocking)
>> > hev().wait_signalled();' where the spec requires it, roughly as you had
>> > been doing in this patch, but wait_signalled() should only stall on the
>> > CPU work associated with the event, which should give you the same
>> > performance as the current approach.
>
> I can send a patch that replaces wait() -> wait_signalled()
>

Thanks :)
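
For reference, the resulting change would presumably look like the
hunks in this patch with the wait swapped out, e.g. in
clEnqueueReadBuffer (sketch only; wait_signalled() here refers to the
event-action-only wait proposed earlier in the thread, not something
present in clover as of this writing):

   // Stall only until the CPU-side copy backing the event has executed
   // (i.e. the host pointer is safe to reuse / the data has landed),
   // rather than until the GPU fence associated with the event signals.
   if (blocking)
      hev().wait_signalled();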

>> > 
>> > > and/or to
>> > > implement proper non-blocking variants (would std::async work for
>> > > trivial cases like ReadBuffer?)
>> > > 
>> 
>> Hm, and to answer this question: yeah, std::async would probably work,
>> but I'm not certain whether it would actually perform better than the
>> current approach.  On the one hand, the actual DMA-ing of the buffer is
>> likely to happen quasi-asynchronously already, assuming the driver is
>> competent; on the other hand, spawning a new thread for the copy would
>> introduce additional overhead that might defeat your purpose unless the
>> copy is very large.  Only experimentation will tell whether it pays
>> off.
>
> It was just speculation.  It looks like Vedran is interested in
> implementing non-blocking reads/writes [0], so I'll leave it to him;
> r600 has bigger problems elsewhere at the moment.
>

Yeah, I'm aware of his work; I suspect the above are the reasons why he
got rather mixed performance results from his changes.
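
For what it's worth, the kind of std::async experiment being speculated
about might look roughly like the following (a standalone sketch, not
clover code; copy_region() is a hypothetical stand-in for the CPU-side
copy that soft_copy_op() performs):

   #include <future>
   #include <vector>
   #include <cstring>

   // Hypothetical stand-in for the CPU-side copy done by soft_copy_op().
   static void copy_region(void *dst, const void *src, size_t size) {
      std::memcpy(dst, src, size);
   }

   int main() {
      std::vector<char> staging(1 << 20), host_dst(1 << 20);

      // Run the copy on another thread; the returned future plays the
      // role of the CL event the application would wait on.
      auto done = std::async(std::launch::async, copy_region,
                             host_dst.data(), staging.data(),
                             host_dst.size());

      // ... the application could do other work here ...

      done.wait();   // analogous to clWaitForEvents() on the read event
      return 0;
   }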

> thanks,
> Jan
>
> [0]https://bugs.freedesktop.org/show_bug.cgi?id=100199
>
>> 
>> > > thanks,
>> > > Jan
>> > > 
>> > > > 
>> > > > Thank you.
>> > > > 
>> > > > Jan Vesely <jan.vesely at rutgers.edu> writes:
>> > > > 
>> > > > > v2: wait in map_buffer and map_image as well
>> > > > > 
>> > > > > Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>> > > > > ---
>> > > > > Hi Aaron,
>> > > > > 
>> > > > > yes, I think you're right, we should wait in Map* as well.
>> > > > > If nothing else it's consistent, even if passing the flag to
>> > > > > add_map might make it unnecessary (haven't actually checked).
>> > > > > 
>> > > > > thanks,
>> > > > > Jan
>> > > > > 
>> > > > >  src/gallium/state_trackers/clover/api/transfer.cpp | 30 ++++++++++++++++++++--
>> > > > >  1 file changed, 28 insertions(+), 2 deletions(-)
>> > > > > 
>> > > > > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp b/src/gallium/state_trackers/clover/api/transfer.cpp
>> > > > > index f7046253be..729a34590e 100644
>> > > > > --- a/src/gallium/state_trackers/clover/api/transfer.cpp
>> > > > > +++ b/src/gallium/state_trackers/clover/api/transfer.cpp
>> > > > > @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                     &mem, obj_origin, obj_pitch,
>> > > > >                     region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                     ptr, {}, obj_pitch,
>> > > > >                     region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                     &mem, obj_origin, obj_pitch,
>> > > > >                     region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                     ptr, host_origin, host_pitch,
>> > > > >                     region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                     &img, src_origin, src_pitch,
>> > > > >                     region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >                     ptr, {}, src_pitch,
>> > > > >                     region));
>> > > > >  
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > >     ret_object(rd_ev, hev);
>> > > > >     return CL_SUCCESS;
>> > > > >  
>> > > > > @@ -667,7 +685,11 @@ clEnqueueMapBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >  
>> > > > >     void *map = mem.resource(q).add_map(q, flags, blocking, obj_origin, region);
>> > > > >  
>> > > > > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps));
>> > > > > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps);
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > > +   ret_object(rd_ev, hev);
>> > > > >     ret_error(r_errcode, CL_SUCCESS);
>> > > > >     return map;
>> > > > >  
>> > > > > @@ -695,7 +717,11 @@ clEnqueueMapImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>> > > > >  
>> > > > >     void *map = img.resource(q).add_map(q, flags, blocking, origin, region);
>> > > > >  
>> > > > > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps));
>> > > > > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps);
>> > > > > +   if (blocking)
>> > > > > +       hev().wait();
>> > > > > +
>> > > > > +   ret_object(rd_ev, hev);
>> > > > >     ret_error(r_errcode, CL_SUCCESS);
>> > > > >     return map;
>> > > > >  
>> > > > > -- 
>> > > > > 2.13.3
>> > > > 
>> > > 
>> > > -- 
>> > > Jan Vesely <jan.vesely at rutgers.edu>
>
> -- 
> Jan Vesely <jan.vesely at rutgers.edu>