[Intel-gfx] [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
Daniel Vetter
daniel at ffwll.ch
Fri Oct 9 00:48:01 PDT 2015
On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
>
> On 28/09/15 15:14, Daniel Vetter wrote:
> >On Mon, Sep 28, 2015 at 02:52:30PM +0100, Chris Wilson wrote:
> >>On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
> >>>On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
> >>>>If the client revokes the virtual address it asked to be mapped into GPU
> >>>>space via userptr (by using anything along the lines of mmap, mprotect,
> >>>>madvise, munmap, ftruncate etc) the mmu notifier sends a range
> >>>>invalidate command to userptr. Upon receiving the invalidation signal
> >>>>for the revoked range, we try to release the struct pages we pinned into
> >>>>the GTT. However, this can fail if any of the GPU's VMA are pinned for
> >>>>use by the hardware (i.e. despite the user's intention we cannot
> >>>>relinquish the client's address range and keep uptodate with whatever is
> >>>>placed in there). Currently we emit a few WARN so that we would notice
> >>>>if this every occurred in the wild; it has. Sadly this means we need to
> >>>>replace those WARNs with the proper SIGBUS to the offending clients
> >>>>instead.
> >>>>
> >>>>Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>>>Cc: MichaĆ Winiarski <michal.winiarski at intel.com>
> >>>>---
> >>>> drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
> >>>> 1 file changed, 37 insertions(+), 4 deletions(-)
> >>>>
> >>>>diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> >>>>index f75d90118888..efb404b9fe69 100644
> >>>>--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> >>>>+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> >>>>@@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
> >>>> was_interruptible = dev_priv->mm.interruptible;
> >>>> dev_priv->mm.interruptible = false;
> >>>>
> >>>>- list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> >>>>- int ret = i915_vma_unbind(vma);
> >>>>- WARN_ON(ret && ret != -EIO);
> >>>>+ list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> >>>>+ i915_vma_unbind(vma);
> >>>>+ if (i915_gem_object_put_pages(obj)) {
> >>>>+ struct task_struct *p;
> >>>>+
> >>>>+ DRM_ERROR("Unable to revoke ownership by userptr of"
> >>>>+ " invalidated address range, sending SIGBUS"
> >>>>+ " to attached clients.\n");
> >>>>+
> >>>>+ rcu_read_lock();
> >>>>+ for_each_process(p) {
> >>>>+ siginfo_t info;
> >>>>+
> >>>>+ /* This doesn't capture everyone who has
> >>>>+ * the pages pinned behind a VMA as we
> >>>>+ * do not have that tracking information
> >>>>+ * available, it does however kill the
> >>>>+ * original process (and siblings) who
> >>>>+ * created the userptr and presumably tried
> >>>>+ * to reuse the address space despite having
> >>>>+ * pinned it (possibly indirectly) to the hw.
> >>>>+ * Arguably, we don't even want to kill the
> >>>>+ * other processes as they are not at fault,
> >>>>+ * likely to be a display server, and hopefully
> >>>>+ * will release the pages in due course once
> >>>>+ * the client is dead.
> >>>>+ */
> >>>>+ if (p->mm != obj->userptr.mm->mm)
> >>>>+ continue;
> >>>>+
> >>>>+ info.si_signo = SIGBUS;
> >>>>+ info.si_errno = 0;
> >>>>+ info.si_code = BUS_ADRERR;
> >>>>+ info.si_addr = (void __user *)obj->userptr.ptr;
> >>>>+ force_sig_info(SIGBUS, &info, p);
> >>>>+ }
> >>>>+ rcu_read_unlock();
> >>>
> >>>Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
> >>>any new users will hopefully get it, and abusing SIGBUS without the thread
> >>>actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
> >>>the most we can do here I think - I think userspace being able to do this
> >>>is just a fundamental property of userptr.
> >>
> >>It is not the bo that is at fault but the *client's* *address* *space*
> >>that is being changed. It is equivalent to mmap on a truncated file i.e.
> >>if the client tries to use its mmapping after it has truncated the file
> >>it is scolded via SIGBUS.
> >
> >But existing SIGBUS is thread-bound and comes with the fault address
> >attached. This is just the gpu being a bit unhappy, so the SIGBUS comes
> >out of complete nowhere to smack the userspace thread. Any kind of SIGBUS
> >catcher userspace has for other reasons might be supremely surprised by
> >this and do stupid things. Hence I don't think throwing SIGBUS here is
> >correct behaviour. And there doesn't seem to be anything else suitable
> >really.
>
> Te offending address is provided with the signal as far as I can see.
>
> I think it is fine to do this, even required since the alternative is for
> GPU to keep using random memory indefinitely and userspace never gets to
> know.
>
> And I don't see any reason to keep the process running who did such an
> elementary and serious mistake.
>
> Is the only concern that the process can catch it and not exit?
The concern is that this isn't how SIG_SEGV works, it's a signal the
thread who made the invalid access gets directly. You never get a SIG_SEGV
for bad access someone else has made. So essentially it's new ABI.
And the other bit is that doing new ABI with signals is considered bad
taste - if you want to have something that gets to userspace directly then
create an fd which is polleable. And then, if userspace chooses so, it can
kill itself with a SIG_IO using fcntl.
But the goal here seems to just to be telling userspace that something bad
has happened with its gpu context, and we already have gem_reset_stats for
that. That would imo be the correct interface to provide this information
to userspace. After all after a gpu reset it's also all garbage and we
just let everything continue merilly without killing the process that
submitted the work. I don't see how a fault is any different.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
More information about the Intel-gfx
mailing list