[Mesa-dev] Leaked hardware event if kernel launch fails?
Pierre Moreau
pierre.morrow at free.fr
Sun Dec 25 23:59:17 UTC 2016
Hello Francisco!
Thank you for your quick reply!
On 02:33 pm - Dec 25 2016, Francisco Jerez wrote:
> Pierre Moreau <pierre.morrow at free.fr> writes:
>
> > Hello,
> >
> Hi Pierre!
>
> > I noticed that, if trying to enqueue a kernel which had no
> > `module::section::text_executable` attached to its clover module, I would get a
> > `std::out_of_range` exception, instead of the expected
> > CL_INVALID_PROGRAM_EXECUTABLE (see [0]; I tried enqueueing using
> > `clEnqueueNDRangeKernel). I modified the `kernel::exec_context::bind()` method
> > to catch the out-of-range exceptions, and throw the proper clover exception
> > instead:
> >
> > ```
> > diff --git a/src/gallium/state_trackers/clover/core/kernel.cpp b/src/gallium/state_trackers/clover/core/kernel.cpp
> > index 328323b6b0..1bb4f612cb 100644
> > --- a/src/gallium/state_trackers/clover/core/kernel.cpp
> > +++ b/src/gallium/state_trackers/clover/core/kernel.cpp
> > @@ -161,8 +161,18 @@ kernel::exec_context::bind(intrusive_ptr<command_queue> _q,
> >
> > // Bind kernel arguments.
> > auto &m = kern.program().build(q->device()).binary;
> > - auto margs = find(name_equals(kern.name()), m.syms).args;
> > - auto msec = find(type_equals(module::section::text_executable), m.secs);
> > + std::vector<module::argument> margs;
> > + try {
> > + margs = find(name_equals(kern.name()), m.syms).args;
> > + } catch (const std::out_of_range &e) {
> > + throw error(CL_INVALID_KERNEL);
> > + }
> > + module::section msec;
> > + try {
> > + msec = find(type_equals(module::section::text_executable), m.secs);
> > + } catch (const std::out_of_range &e) {
> > + throw error(CL_INVALID_PROGRAM_EXECUTABLE);
> > + }
>
> I think we should be validating this condition beforehand in the
> clEnqueueNDRangeKernel() entry point. The reason is that
> clover::kernel::launch() will in general be executed asynchronously, so
> the exceptions you're throwing above won't necessarily cause
> clEnqueueNDRangeKernel() to return a failure status code as you'd expect
> [Consider e.g. what would happen if the clEnqueueNDRangeKernel() call
> was made explicitly dependent on a user event which wasn't initially
> signalled, what would delay the execution of clover::kernel::launch()
> until *after* clEnqueueNDRangeKernel() has returned control to the
> user].
I hadn’t really thought of what would happen if the launch happened after
clEnqueueNDRangeKernel() had returned…
I had looked at validate_common() too quickly, and missed that it was already
returning CL_INVALID_PROGRAM_EXECUTABLE, but only when the program was not set
up for the current device. This new patch should be less intrusive than the
previous one, and it does work better. :-) I’ll send a proper patch tomorrow.
Pierre
diff --git a/src/gallium/state_trackers/clover/api/kernel.cpp b/src/gallium/state_trackers/clover/api/kernel.cpp
index 73ba34abe8..61737ede5e 100644
--- a/src/gallium/state_trackers/clover/api/kernel.cpp
+++ b/src/gallium/state_trackers/clover/api/kernel.cpp
@@ -215,7 +215,8 @@ namespace {
}, kern.args()))
throw error(CL_INVALID_KERNEL_ARGS);
- if (!count(q.device(), kern.program().devices()))
+ auto &m = kern.program().build(q.device()).binary;
+ if (!any_of(type_equals(module::section::text_executable), m.secs))
throw error(CL_INVALID_PROGRAM_EXECUTABLE);
}
>
> > auto explicit_arg = kern._args.begin();
> >
> > for (auto &marg : margs) {
> > ```
> >
> > But now, when my OpenCL program exists after the error, the destruction of the
> > `cl::CommandQueue` object doesn’t happen in a peaceful manner:
> >
> > ```
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x0000000000652e40 in ?? ()
> > (gdb) bt
> > #0 0x0000000000652e40 in ?? ()
> > #1 0x00007ffff7b2d29a in clover::command_queue::flush (this=this at entry=0x653310) at ../../../../../mesa_spirv/src/gallium/state_trackers/clover/core/queue.cpp:77
> > #2 0x00007ffff7b0ebc0 in clReleaseCommandQueue (d_q=0x653318) at ../../../../../mesa_spirv/src/gallium/state_trackers/clover/api/queue.cpp:63
> > #3 0x0000000000405125 in cl::detail::ReferenceHandler<_cl_command_queue*>::release (queue=0x653318) at /usr/include/CL/cl.hpp:1686
> > #4 0x0000000000405108 in cl::detail::Wrapper<_cl_command_queue*>::release (this=0x7fffffffda98) at /usr/include/CL/cl.hpp:1863
> > #5 0x00000000004050c7 in cl::detail::Wrapper<_cl_command_queue*>::~Wrapper (this=0x7fffffffda98) at /usr/include/CL/cl.hpp:1802
> > #6 0x00000000004047e5 in cl::CommandQueue::~CommandQueue (this=0x7fffffffda98) at /usr/include/CL/cl.hpp:5482
> > #7 0x00000000004046f8 in main () at instruction-set_OpenCL-std.cpp:58
> > (gdb) up
> > #1 0x00007ffff7b2d29a in clover::command_queue::flush (this=this at entry=0x653310) at ../../../../../mesa_spirv/src/gallium/state_trackers/clover/core/queue.cpp:77
> > 77 queued_events.front()().fence(fence);
> > ```
> >
> > (I am using the OpenCL-C++ binding, and Mesa is quite recent (88b5acfa09) with
> > custom patches to get some OpenCL support for Nouveau.)
> >
> > Looking around a bit with the debugger, it seems like the event created by the
> > `clEnqueueNDRangeKernel()` function still exists within the command queue:
> > shouldn’t it have been automatically removed from the queue as the enqueue
> > function failed?
> >
> >
> > Thank you for your help!
> > Pierre
> >
> >
> > [0]: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 801 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20161226/2c01a283/attachment.sig>
More information about the mesa-dev
mailing list