[Nouveau] [PATCH] drm/ttm/nouveau: add DRM_NOUVEAU_GEM_CPU_PREP_TIMEOUT

Sun Sep 18 06:59:50 PDT 2011

On Sun, Sep 18, 2011 at 03:18:57PM +0200, Marcin Slusarz wrote:
> Currently DRM_NOUVEAU_GEM_CPU_PREP ioctl is broken WRT handling of signals.
> 
> nouveau_gem_ioctl_cpu_prep calls ttm_bo_wait, which waits for the fence to
> "signal" or for a 3 second timeout to pass.
> But if it detects a pending signal, it returns ERESTARTSYS and goes back
> to userspace. After the signal handler runs, userspace repeats the same
> ioctl, which starts a _new 3 second loop_.
> So when the application relies on signals, some ioctls may never finish
> from the application's POV.
> 
> There is one important application which does this - Xorg. It uses SIGIO
> (for input handling) and SIGALRM.
> 
> GPU lockups lead to an endless ioctl loop, which eventually manifests as a
> crash with the "[mi] EQ overflowing. The server is probably stuck in an
> infinite loop." message instead of being propagated to the DDX.
> 
> The solution is to add a new ioctl NOUVEAU_GEM_CPU_PREP_TIMEOUT with a
> timeout parameter and decrease it on every signal.
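
For illustration, the "decrease the timeout on every signal" idea can be
sketched roughly as below. This is a simplified userspace model, not the
patch itself: wait_fn, struct prep_timeout, cpu_prep_with_budget and
fake_interrupted_wait are hypothetical names, -EINTR stands in for the
kernel's ERESTARTSYS, and returning -EBUSY on expiry is an assumption.

```c
#include <errno.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Hypothetical wait primitive: waits at most timeout_ns, stores the time
 * actually spent in *elapsed_ns, and returns 0 (fence signalled), -EINTR
 * (interrupted by a signal) or -EBUSY (timed out). */
typedef int (*wait_fn)(uint64_t timeout_ns, uint64_t *elapsed_ns, void *ctx);

/* Hypothetical per-call state: the budget survives restarts. */
struct prep_timeout {
	uint64_t remaining_ns;
};

/* Restart the wait after each signal, but charge the time already spent
 * against the budget so the total wait stays bounded. */
static int cpu_prep_with_budget(struct prep_timeout *t, wait_fn wait, void *ctx)
{
	for (;;) {
		uint64_t elapsed_ns = 0;
		int ret = wait(t->remaining_ns, &elapsed_ns, ctx);

		if (elapsed_ns >= t->remaining_ns)
			t->remaining_ns = 0;
		else
			t->remaining_ns -= elapsed_ns;

		if (ret != -EINTR)
			return ret;		/* signalled, timed out, or error */
		if (t->remaining_ns == 0)
			return -EBUSY;		/* budget used up across restarts */
	}
}

/* Stand-in for a hung GPU under a steady stream of signals: every wait is
 * interrupted after at most one second. ctx points to an int call counter. */
static int fake_interrupted_wait(uint64_t timeout_ns, uint64_t *elapsed_ns,
				 void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	*elapsed_ns = timeout_ns < NSEC_PER_SEC ? timeout_ns : NSEC_PER_SEC;
	return -EINTR;
}
```

With a plain fixed timeout, the same signal stream would restart the full
wait each time and the loop would not terminate on its own.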

Just fyi: We handle that issue in i915 by returning -EIO when the kernel
decides that the gpu has died for good and that resetting doesn't help.
Until then we rely on the ioctl restarting to kick everyone out of kernel
mode so the reset handler can do its business. If the reset is
successful, userspace continues (due to the ioctl being restarted)
hopefully mostly undisturbed. While the gpu is hung, but not yet reset, we
stall all ioctls before taking the struct_mutex (see i915_gem_wait_error
in i915_mutex_lock_interruptible).
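
Seen from userspace, that scheme amounts to roughly the following. This is
only a sketch under assumptions: ioctl_fn, restartable_ioctl, struct
fake_gpu and fake_gem_ioctl are made-up names, the callback follows the
ioctl(2) convention of returning -1 with errno set, and an interrupted call
is assumed to surface as EINTR.

```c
#include <errno.h>

/* Hypothetical callback wrapping one ioctl call: returns 0 on success or
 * -1 with errno set, like ioctl(2). */
typedef int (*ioctl_fn)(void *ctx);

/* Keep restarting while the call is merely interrupted. Any other error,
 * including EIO once the kernel has declared the GPU dead, is handed back
 * to the caller as a negative errno value. */
static int restartable_ioctl(ioctl_fn do_ioctl, void *ctx)
{
	int ret;

	do {
		ret = do_ioctl(ctx);
	} while (ret == -1 && errno == EINTR);

	return ret == -1 ? -errno : 0;
}

/* Stand-in for the kernel side, for illustration only. */
struct fake_gpu {
	int interruptions_left;	/* calls that will be interrupted first */
	int wedged;		/* non-zero: the reset did not help */
	int calls;		/* number of ioctl attempts seen */
};

static int fake_gem_ioctl(void *ctx)
{
	struct fake_gpu *gpu = ctx;

	gpu->calls++;
	if (gpu->interruptions_left > 0) {
		gpu->interruptions_left--;
		errno = EINTR;	/* kicked out of the kernel, please retry */
		return -1;
	}
	if (gpu->wedged) {
		errno = EIO;	/* kernel decided the GPU is gone */
		return -1;
	}
	return 0;
}
```

The caller never needs a timeout of its own here: it either gets restarted
until the reset is done or sees -EIO.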

Imo the advantage of that approach is that the kernel ultimately decides
when the gpu is gone, and userspace (lacking much of the required
information) must not engage in such guessing games either.
-Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48