[Mesa-dev] EXTERNAL: Re: Clover clEnqueue* function don't implement blocking?

Tue Apr 15 04:34:56 PDT 2014

> -----Original Message-----
> From: Francisco Jerez [mailto:currojerez at riseup.net]
> "Dorrington, Albert" <albert.dorrington at lmco.com> writes:
> >
> > From reading the OpenCL spec (and perhaps I'm misinterpreting something
> > again), section 5.10 Flush and Finish says:
> >
> > 	Any blocking commands queued in a command-queue such as
> > 	clEnqueueRead{Image|Buffer} with blocking_read set to CL_TRUE,
> > 	clEnqueueWrite{Image|Buffer} with blocking_write set to CL_TRUE,
> > 	clEnqueueMap{Buffer|Image} with blocking_map set to CL_TRUE or
> > 	clWaitForEvents perform an implicit flush of the command-queue.
> >
> > From this statement, I would expect that the command-queue would be
> > flushed when the blocking flag is set.
> 
> clEnqueueRead*, clEnqueueMap* and clWaitForEvents already flush the
> command queue (the first two are flushing indirectly as we try to map a
> buffer referenced by the GPU).  clEnqueueWrite* doesn't flush, but it's not
> clear to me that not doing it can be considered a violation of the spec.  The
> guarantees given by clFlush() are rather vague (to some extent an empty
> function could be a valid implementation) and it seems to me that a
> compliant implementation might, for instance, choose to batch up
> commands across flushes if that's the most efficient thing to do, as long as
> the user has no way to tell the difference.
> 
> I'd like to see some real-world example where clover's behavior represents a
> problem before we change it to flush more frequently, because I'm worried
> that changing this will actually worsen performance rather than improving it.

I have been working with a modified version of the Mesa code that accepts kernels compiled with AMD's compiler.
(Our project's goal is to host Mesa in an environment which does not currently support LLVM/Clang or C++11.)
While testing 2D image read capabilities, I have been encountering an issue where the command queue's 'queued_events' list keeps growing, with none of the events being removed until the clFinish call. At that point, I have 23,328 events in the queue and encounter a segmentation fault during the command_queue flush.

After seeing the statement in the OpenCL spec about the implicit flush during the blocking clEnqueue calls, I added the previously mentioned conditional hev().wait() calls to initiate a flush.
This seems to have resolved the segmentation faults during the clFinish call, although I'll admit it likely isn't the most efficient method.
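
To illustrate the difference in queue depth, here is a minimal toy model in C++. It is not Clover's code: ToyQueue, enqueue(), flush() and peak_depth() are hypothetical names made up for this sketch, and commands simply run synchronously when flushed. It only shows how waiting on each blocking enqueue keeps the pending list short, whereas deferring everything to the final flush (standing in for clFinish) lets it grow with the number of enqueued commands.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for a command queue. NOT Clover's actual class; all names
// here are hypothetical and exist only for illustration.
class ToyQueue {
public:
    // Queue a command. If `blocking` is true, flush right away so the
    // command has completed by the time the call returns.
    void enqueue(std::function<void()> cmd, bool blocking) {
        pending_.push_back(std::move(cmd));
        if (pending_.size() > high_water_)
            high_water_ = pending_.size();
        if (blocking)
            flush();
    }

    // Run and retire every pending command in submission order.
    void flush() {
        while (!pending_.empty()) {
            std::function<void()> cmd = std::move(pending_.front());
            pending_.pop_front();
            cmd();
        }
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t high_water() const { return high_water_; }

private:
    std::deque<std::function<void()>> pending_;
    std::size_t high_water_ = 0;  // largest pending count seen so far
};

// Enqueue `n` no-op commands, then do a final flush (models clFinish).
// Returns the peak number of commands that were pending at once.
std::size_t peak_depth(std::size_t n, bool blocking) {
    ToyQueue q;
    for (std::size_t i = 0; i < n; ++i)
        q.enqueue([] {}, blocking);
    q.flush();
    assert(q.pending() == 0);
    return q.high_water();
}
```

In this model, peak_depth(23328, false) reports a peak of 23328 pending commands, while peak_depth(23328, true) never has more than one pending. Whether the real driver should flush on every blocking write is a separate performance question that this sketch does not address.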

While I have not benchmarked the run time precisely, it did not seem to be significantly impacted. The test previously ran for ~20 minutes before crashing, and now runs for ~20 minutes before completing successfully.