[Intel-gfx] [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS

Thu Oct 8 02:45:47 PDT 2015

On 28/09/15 15:14, Daniel Vetter wrote:
> On Mon, Sep 28, 2015 at 02:52:30PM +0100, Chris Wilson wrote:
>> On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
>>> On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
>>>> If the client revokes the virtual address it asked to be mapped into GPU
>>>> space via userptr (by using anything along the lines of mmap, mprotect,
>>>> madvise, munmap, ftruncate etc) the mmu notifier sends a range
>>>> invalidate command to userptr. Upon receiving the invalidation signal
>>>> for the revoked range, we try to release the struct pages we pinned into
>>>> the GTT. However, this can fail if any of the GPU's VMA are pinned for
>>>> use by the hardware (i.e. despite the user's intention we cannot
>>>> relinquish the client's address range and keep uptodate with whatever is
>>>> placed in there). Currently we emit a few WARN so that we would notice
>>>> if this every occurred in the wild; it has. Sadly this means we need to
>>>> replace those WARNs with the proper SIGBUS to the offending clients
>>>> instead.
>>>>
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Cc: Michał Winiarski <michal.winiarski at intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
>>>>   1 file changed, 37 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
>>>> index f75d90118888..efb404b9fe69 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
>>>> @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
>>>>   		was_interruptible = dev_priv->mm.interruptible;
>>>>   		dev_priv->mm.interruptible = false;
>>>>
>>>> -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
>>>> -			int ret = i915_vma_unbind(vma);
>>>> -			WARN_ON(ret && ret != -EIO);
>>>> +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
>>>> +			i915_vma_unbind(vma);
>>>> +		if (i915_gem_object_put_pages(obj)) {
>>>> +			struct task_struct *p;
>>>> +
>>>> +			DRM_ERROR("Unable to revoke ownership by userptr of"
>>>> +				  " invalidated address range, sending SIGBUS"
>>>> +				  " to attached clients.\n");
>>>> +
>>>> +			rcu_read_lock();
>>>> +			for_each_process(p) {
>>>> +				siginfo_t info;
>>>> +
>>>> +				/* This doesn't capture everyone who has
>>>> +				 * the pages pinned behind a VMA as we
>>>> +				 * do not have that tracking information
>>>> +				 * available, it does however kill the
>>>> +				 * original process (and siblings) who
>>>> +				 * created the userptr and presumably tried
>>>> +				 * to reuse the address space despite having
>>>> +				 * pinned it (possibly indirectly) to the hw.
>>>> +				 * Arguably, we don't even want to kill the
>>>> +				 * other processes as they are not at fault,
>>>> +				 * likely to be a display server, and hopefully
>>>> +				 * will release the pages in due course once
>>>> +				 * the client is dead.
>>>> +				 */
>>>> +				if (p->mm != obj->userptr.mm->mm)
>>>> +					continue;
>>>> +
>>>> +				info.si_signo = SIGBUS;
>>>> +				info.si_errno = 0;
>>>> +				info.si_code = BUS_ADRERR;
>>>> +				info.si_addr = (void __user *)obj->userptr.ptr;
>>>> +				force_sig_info(SIGBUS, &info, p);
>>>> +			}
>>>> +			rcu_read_unlock();
>>>
>>> Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
>>> any new users will hopefully get it, and abusing SIGBUS without the thread
>>> actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
>>> the most we can do here I think - I think userspace being able to do this
>>> is just a fundamental property of userptr.
>>
>> It is not the bo that is at fault but the *client's* *address* *space*
>> that is being changed. It is equivalent to mmap on a truncated file i.e.
>> if the client tries to use its mmapping after it has truncated the file
>> it is scolded via SIGBUS.
>
> But existing SIGBUS is thread-bound and comes with the fault address
> attached. This is just the gpu being a bit unhappy, so the SIGBUS comes
> out of complete nowhere to smack the userspace thread. Any kind of SIGBUS
> catcher userspace has for other reasons might be supremely surprised by
> this and do stupid things. Hence I don't think throwing SIGBUS here is
> correct behaviour. And there doesn't seem to be anything else suitable
> really.

Te offending address is provided with the signal as far as I can see.

I think it is fine to do this, even required since the alternative is 
for GPU to keep using random memory indefinitely and userspace never 
gets to know.

And I don't see any reason to keep the process running who did such an 
elementary and serious mistake.

Is the only concern that the process can catch it and not exit?

I am just not sure about the locking requirement for for_each_process 
since existing call sites give conflicting examples. I don't see how 
turning the preemption off can be safe without the tasklist lock but 
perhaps I am wrong, don't know.

Regards,

Tvrtko