[Mesa-dev] [PATCH v2 1/1] clover: Wait for requested operation if blocking flag is set

Francisco Jerez currojerez at riseup.net
Sat Aug 5 19:34:55 UTC 2017


Francisco Jerez <currojerez at riseup.net> writes:

> Jan Vesely <jan.vesely at rutgers.edu> writes:
>
>> Hi,
>>
>> thanks for the detailed explanation. I indeed missed the writeBuffer
>> part in the specs.
>>
>> On Wed, 2017-08-02 at 15:05 -0700, Francisco Jerez wrote:
>>> These changes are somewhat redundant and potentially
>>> performance-impacting.  The reason is that in the OpenCL API,
>>> clEnqueueWrite* commands are specified to block only until the memory
>>> provided by the application as origin can be reused safely (i.e. until
>>> the copy functor returned by soft_copy_op() runs), not necessarily
>>> until the transfer to graphics memory has completed (which is what
>>> hard_event::wait() will wait for).  OTOH, reads and maps as
>>> implemented by soft_copy_op() and friends are essentially blocking,
>>> so the wait() call is redundant in most cases.
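>>> 
>>> To illustrate the spec semantics (a minimal sketch using the standard
>>> CL host API, not code from this patch; q, buf, host_ptr and size are
>>> assumed to be set up elsewhere):
>>> 
>>>    cl_int err = clEnqueueWriteBuffer(q, buf, CL_TRUE /* blocking */,
>>>                                      0, size, host_ptr,
>>>                                      0, NULL, NULL);
>>>    /* Once this returns, host_ptr may be reused or freed by the
>>>     * application, but the transfer into graphics memory may still
>>>     * be in flight on the GPU. */
>>>    free(host_ptr);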
>>
>> That explains the noticeable slowdown when running piglit with these
>> changes.  I'm not sure about the read part, though; I expected it to
>> be basically a no-op, but it changes behaviour.
>
> I think this change would have slowed you down the most whenever the
> mapping operation performed by soft_copy_op() is able to proceed
> immediately, either because the buffer is idle (so the driver doesn't
> stall on transfer_map()) *or* because the driver is trying to be smart
> and creates a bounce buffer into which data can be copied immediately
> without stalling, then inserts a pipelined GPU copy from the bounce
> buffer into the real buffer.  With this patch you will stall on the
> GPU copy regardless (along with whatever other work was already on the
> command stream, which might be substantial), even though it wouldn't
> have been necessary in either of these cases.
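>
> Roughly, the bounce-buffer strategy above looks like this
> (hypothetical driver-side pseudocode; all helper names are made up for
> illustration):
>
>    /* At transfer_map time: */
>    if (resource_is_busy(res)) {
>       staging = alloc_staging(size);  /* CPU can fill this right away */
>       return cpu_ptr(staging);
>    }
>    return cpu_ptr(res);               /* idle buffer: map it directly */
>
>    /* At transfer_unmap time, in the staging case, queue a pipelined
>     * GPU copy staging -> res; no CPU stall in either path. */
>    schedule_pipelined_gpu_copy(res, staging);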
>
>> Adding clGetEventInfo(CL_EVENT_COMMAND_EXECUTION_STATUS) after a
>> blocking read in one of the piglit tests surprisingly returns
>> CL_QUEUED.
>>
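>> Something like this (sketch; ev is the event returned by the blocking
>> clEnqueueReadBuffer call):
>>
>>    cl_int status;
>>    clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
>>                   sizeof(status), &status, NULL);
>>    /* status comes back as CL_QUEUED rather than CL_COMPLETE. */
>>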
>
> Hmm, yeah, that seems like debatable behaviour; it's definitely legit
> for writes, though I'm not quite sure about reads...  I believe the
> reason it happens is that the CPU copy proceeds very quickly (due to
> the reasons mentioned in the last paragraph), but the hard_event is
> still associated with a pipe_fence synchronous with the GPU's command
> stream, so it won't get signalled until the GPU catches up.
>
>> The specs don't mention use of events with blocking reads, but they
>> do say that "When the read command has completed, the contents of the
>> buffer that ptr points to can be used by the application." in the
>> non-blocking section.  I'd say the expectation is for the event to be
>> CL_COMPLETE after a blocking read (at least beignet/pocl/intel-cpu-sdk
>> follow this).
>>
>>> 
>>> The only reason why it might be useful to behave differently on blocking
>>> transfers is that the application may have specified a user event in the
>>> event dependency list, which may cause the soft_copy_op() call to be
>>> delayed until the application signals the user event.  In order to fix
>>> that it should really only be necessary to wait for the event action to
>>> be executed, not necessarily its associated GPU work.
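>>> 
>>> E.g. (illustrative sketch; ctx, q, buf, ptr and size are assumed to
>>> exist):
>>> 
>>>    cl_int err;
>>>    cl_event uev = clCreateUserEvent(ctx, &err);
>>>    cl_event ev;
>>>    /* The copy action can't run until uev is signalled by the
>>>     * application: */
>>>    clEnqueueReadBuffer(q, buf, CL_FALSE, 0, size, ptr, 1, &uev, &ev);
>>>    clSetUserEventStatus(uev, CL_COMPLETE);
>>>    clWaitForEvents(1, &ev);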
>>
>> I think another use is that non-blocking writes do not create an
>> extra copy of the buffer.  Thus
>> clEnqueueWriteBuffer(..., CL_FALSE, ev, ...)
>> followed by
>> clWaitForEvents(ev)
>> is more memory-efficient than a blocking write, as spelled out below.
>>
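>> Spelled out (a sketch with placeholder variables q, buf, size and
>> host_ptr):
>>
>>    cl_event ev;
>>    clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, size, host_ptr,
>>                         0, NULL, &ev);
>>    /* ... other work ... */
>>    clWaitForEvents(1, &ev);  /* host_ptr safe to reuse from here */
>>    clReleaseEvent(ev);
>>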
>>> 
>>> Last time this issue came up (yeah it's not the first time) I proposed
>>> the patches below to add a mechanism to wait for the event action only,
>>> feel free to include it as PATCH 0.1 and 0.2 of this series (it's been a
>>> while so they may no longer apply cleanly).
>>
>> I think we can just add comments explaining why the blocking argument
>> is ignored, until someone chooses to fix this problem
>
> I think the problem is definitely worth fixing, and it shouldn't
> really take more effort than adding comments explaining the current
> behaviour ;).  Basically, just add a bunch of 'if (blocking)
> hev().wait_signalled();' where the spec requires it, roughly as you
> had been doing in this patch, but wait_signalled() should only stall
> on the CPU work associated with the event, which should give you the
> same performance as the current approach.
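>
> I.e. something along these lines in clEnqueueReadBuffer and friends (a
> sketch; wait_signalled() is the helper added by the patches mentioned
> above, assuming they still apply):
>
>    if (blocking)
>       hev().wait_signalled();   // stall on the CPU-side event action
>                                 // only, not on the GPU fence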
>
>> and/or to
>> implement proper non-blocking variants (would std::async work for
>> trivial cases like ReadBuffer?)
>>

Hm, and to answer this question -- yeah, std::async would probably
work, but I'm not certain whether it would actually perform better than
the current approach: on the one hand, the actual DMA-ing of the buffer
is likely to happen quasi-asynchronously already, assuming the driver
is competent; on the other, spawning a new thread for the copy would
introduce additional overhead that might defeat your purpose unless the
copy is very large.  Only experimentation will tell whether it pays
off.
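
A minimal sketch of the std::async idea (hypothetical helper, not
actual clover code):

   #include <cstring>
   #include <future>

   // Run the host-side copy on a worker thread; the returned future
   // would be tied to the event's completion.  The thread-spawn
   // overhead may outweigh the benefit unless the copy is large.
   std::future<void> async_copy(void *dst, const void *src,
                                std::size_t size) {
      return std::async(std::launch::async,
                        [=]{ std::memcpy(dst, src, size); });
   }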

>> thanks,
>> Jan
>>
>>> 
>>> Thank you.
>>> 
>>> Jan Vesely <jan.vesely at rutgers.edu> writes:
>>> 
>>> > v2: wait in map_buffer and map_image as well
>>> > 
>>> > Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>>> > ---
>>> > Hi Aaron,
>>> > 
>>> > yes, I think you're right, we should wait in Map* as well.
>>> > If nothing else it's consistent, even if passing the flag to add_map
>>> > might make it unnecessary (haven't actually checked).
>>> > 
>>> > thanks,
>>> > Jan
>>> > 
>>> >  src/gallium/state_trackers/clover/api/transfer.cpp | 30 ++++++++++++++++++++--
>>> >  1 file changed, 28 insertions(+), 2 deletions(-)
>>> > 
>>> > diff --git a/src/gallium/state_trackers/clover/api/transfer.cpp b/src/gallium/state_trackers/clover/api/transfer.cpp
>>> > index f7046253be..729a34590e 100644
>>> > --- a/src/gallium/state_trackers/clover/api/transfer.cpp
>>> > +++ b/src/gallium/state_trackers/clover/api/transfer.cpp
>>> > @@ -295,6 +295,9 @@ clEnqueueReadBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                     &mem, obj_origin, obj_pitch,
>>> >                     region));
>>> >  
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >  
>>> > @@ -325,6 +328,9 @@ clEnqueueWriteBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                     ptr, {}, obj_pitch,
>>> >                     region));
>>> >  
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >  
>>> > @@ -362,6 +368,9 @@ clEnqueueReadBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                     &mem, obj_origin, obj_pitch,
>>> >                     region));
>>> >  
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >  
>>> > @@ -399,6 +408,9 @@ clEnqueueWriteBufferRect(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                     ptr, host_origin, host_pitch,
>>> >                     region));
>>> >  
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >  
>>> > @@ -504,6 +516,9 @@ clEnqueueReadImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                     &img, src_origin, src_pitch,
>>> >                     region));
>>> >  
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >  
>>> > @@ -538,6 +553,9 @@ clEnqueueWriteImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >                     ptr, {}, src_pitch,
>>> >                     region));
>>> >  
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> >     ret_object(rd_ev, hev);
>>> >     return CL_SUCCESS;
>>> >  
>>> > @@ -667,7 +685,11 @@ clEnqueueMapBuffer(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >  
>>> >     void *map = mem.resource(q).add_map(q, flags, blocking, obj_origin, region);
>>> >  
>>> > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps));
>>> > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_BUFFER, deps);
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> > +   ret_object(rd_ev, hev);
>>> >     ret_error(r_errcode, CL_SUCCESS);
>>> >     return map;
>>> >  
>>> > @@ -695,7 +717,11 @@ clEnqueueMapImage(cl_command_queue d_q, cl_mem d_mem, cl_bool blocking,
>>> >  
>>> >     void *map = img.resource(q).add_map(q, flags, blocking, origin, region);
>>> >  
>>> > -   ret_object(rd_ev, create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps));
>>> > +   auto hev = create<hard_event>(q, CL_COMMAND_MAP_IMAGE, deps);
>>> > +   if (blocking)
>>> > +       hev().wait();
>>> > +
>>> > +   ret_object(rd_ev, hev);
>>> >     ret_error(r_errcode, CL_SUCCESS);
>>> >     return map;
>>> >  
>>> > -- 
>>> > 2.13.3
>>> 
>>
>> -- 
>> Jan Vesely <jan.vesely at rutgers.edu>