[Mesa-dev] Leaked hardware event if kernel launch fails?

Francisco Jerez currojerez at riseup.net
Sun Dec 25 22:33:29 UTC 2016


Pierre Moreau <pierre.morrow at free.fr> writes:

> Hello,
>
Hi Pierre!

> I noticed that, if trying to enqueue a kernel which had no
> `module::section::text_executable` attached to its clover module, I would get a
> `std::out_of_range` exception, instead of the expected
> CL_INVALID_PROGRAM_EXECUTABLE (see [0]; I tried enqueueing using
> `clEnqueueNDRangeKernel). I modified the `kernel::exec_context::bind()` method
> to catch the out-of-range exceptions, and throw the proper clover exception
> instead:
>
> ```
> diff --git a/src/gallium/state_trackers/clover/core/kernel.cpp b/src/gallium/state_trackers/clover/core/kernel.cpp
> index 328323b6b0..1bb4f612cb 100644
> --- a/src/gallium/state_trackers/clover/core/kernel.cpp
> +++ b/src/gallium/state_trackers/clover/core/kernel.cpp
> @@ -161,8 +161,18 @@ kernel::exec_context::bind(intrusive_ptr<command_queue> _q,
>  
>     // Bind kernel arguments.
>     auto &m = kern.program().build(q->device()).binary;
> -   auto margs = find(name_equals(kern.name()), m.syms).args;
> -   auto msec = find(type_equals(module::section::text_executable), m.secs);
> +   std::vector<module::argument> margs;
> +   try {
> +      margs = find(name_equals(kern.name()), m.syms).args;
> +   } catch (const std::out_of_range &e) {
> +      throw error(CL_INVALID_KERNEL);
> +   }
> +   module::section msec;
> +   try {
> +      msec = find(type_equals(module::section::text_executable), m.secs);
> +   } catch (const std::out_of_range &e) {
> +      throw error(CL_INVALID_PROGRAM_EXECUTABLE);
> +   }

I think we should be validating this condition beforehand in the
clEnqueueNDRangeKernel() entry point.  The reason is that
clover::kernel::launch() will in general be executed asynchronously, so
the exceptions you're throwing above won't necessarily cause
clEnqueueNDRangeKernel() to return a failure status code as you'd expect
[Consider e.g. what would happen if the clEnqueueNDRangeKernel() call
was made explicitly dependent on a user event which wasn't initially
signalled, what would delay the execution of clover::kernel::launch()
until *after* clEnqueueNDRangeKernel() has returned control to the
user].

>     auto explicit_arg = kern._args.begin();
>  
>     for (auto &marg : margs) {
> ```
>
> But now, when my OpenCL program exists after the error, the destruction of the
> `cl::CommandQueue` object doesn’t happen in a peaceful manner:
>
> ```
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000000652e40 in ?? ()
> (gdb) bt
> #0  0x0000000000652e40 in ?? ()
> #1  0x00007ffff7b2d29a in clover::command_queue::flush (this=this at entry=0x653310) at ../../../../../mesa_spirv/src/gallium/state_trackers/clover/core/queue.cpp:77
> #2  0x00007ffff7b0ebc0 in clReleaseCommandQueue (d_q=0x653318) at ../../../../../mesa_spirv/src/gallium/state_trackers/clover/api/queue.cpp:63
> #3  0x0000000000405125 in cl::detail::ReferenceHandler<_cl_command_queue*>::release (queue=0x653318) at /usr/include/CL/cl.hpp:1686
> #4  0x0000000000405108 in cl::detail::Wrapper<_cl_command_queue*>::release (this=0x7fffffffda98) at /usr/include/CL/cl.hpp:1863
> #5  0x00000000004050c7 in cl::detail::Wrapper<_cl_command_queue*>::~Wrapper (this=0x7fffffffda98) at /usr/include/CL/cl.hpp:1802
> #6  0x00000000004047e5 in cl::CommandQueue::~CommandQueue (this=0x7fffffffda98) at /usr/include/CL/cl.hpp:5482
> #7  0x00000000004046f8 in main () at instruction-set_OpenCL-std.cpp:58
> (gdb) up
> #1  0x00007ffff7b2d29a in clover::command_queue::flush (this=this at entry=0x653310) at ../../../../../mesa_spirv/src/gallium/state_trackers/clover/core/queue.cpp:77
> 77               queued_events.front()().fence(fence);
> ```
>
> (I am using the OpenCL-C++ binding, and Mesa is quite recent (88b5acfa09) with
> custom patches to get some OpenCL support for Nouveau.)
>
> Looking around a bit with the debugger, it seems like the event created by the
> `clEnqueueNDRangeKernel()` function still exists within the command queue:
> shouldn’t it have been automatically removed from the queue as the enqueue
> function failed?
>
>
> Thank you for your help!
> Pierre
>
>
> [0]: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20161225/81cfc5bd/attachment.sig>


More information about the mesa-dev mailing list