[Intel-gfx] [PATCH] drm/i915: WARN if we hit a signal from kernel context

Wed Apr 4 09:24:43 UTC 2018

On Tue, Apr 03, 2018 at 07:58:55PM +0200, Daniel Vetter wrote:
> After a discussion with Wily I got the nagging feeling we might have
> some cases of nasty busy loops. The window is fairly small since we
> always have a non-faulting fastpath (using page_fault_dis|enable())
> first, usually followed by a pile of pending signal checks, before we
> go into the slowpath copy_to|from_user that might blow up for real.
> 
> Test patch to check what CI thinks of this theory.
> 
> v2: Don't WARN on success (Ville).
> 
> Cc: Matthew Wilcox <willy at infradead.org>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>

Ok, results from CI are in, we're hitting this path with our test-suite,
and nothing bad happens.

>From my very layman reading of the x86 uaccess code what happens is that
we take a fault, don't insert the pte because there's a signal pending
(VM_FAULT_NOPAGE case). But since it's a kernel fault we don't just
restart the same instruction, but go do the fixup code.

That calls copy_user_handle_tail, which does a dead-slow bytewise copy. If
that fails again, then there's a different fixup handler which just gives
up and results in a short read.

In eb_relocate_slow we retry the entire thing upon short reads, and the
first thing we do is check for signals. If that's the case we bail out,
and change the errno from -EFAULT to -ERESTARTSYS.

For most other parts (and everything else in the kernel I suspect) we get
an -EFAULT or similar indicator of short reads/writes.

Kernel at large seems indeed not prepared to deal with page faults not
going through because of signals, but then I think we get away with that
because you should pass graphics buffer addresses to random things anyway.
Both GL and vk (and everything else) have explicit upload/download
functions, and you can't just write() into gpu memory. I think.

tldr; I agree with Wily that we're probably breaking kernel fault api
expectations here, but I think it's ok ...

Only way I can see us fix this reliably is if we have a
pending_signal_disable/enable() pair, which makes all interruptible sleeps
non-interruptible. I don't core think kernel devs would like that :-)

Cheers, Daniel
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 9650a7b10c5f..9766fa152e05 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1902,6 +1902,7 @@ int i915_gem_fault(struct vm_fault *vmf)
>  	struct i915_vma *vma;
>  	pgoff_t page_offset;
>  	unsigned int flags;
> +	bool user_fault = vmf->flags & FAULT_FLAG_USER;
>  	int ret;
>  
>  	/* We don't use vmf->pgoff since that has the fake offset */
> @@ -2017,9 +2018,10 @@ int i915_gem_fault(struct vm_fault *vmf)
>  		 * handler to reset everything when re-faulting in
>  		 * i915_mutex_lock_interruptible.
>  		 */
> -	case 0:
>  	case -ERESTARTSYS:
>  	case -EINTR:
> +		WARN_ON(!user_fault);
> +	case 0:
>  	case -EBUSY:
>  		/*
>  		 * EBUSY is ok: this just means that another thread
> -- 
> 2.16.2
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch