[Nouveau] [PATCH] drm/ttm/nouveau: add DRM_NOUVEAU_GEM_CPU_PREP_TIMEOUT

Sun Sep 18 07:30:04 PDT 2011

On Sun, Sep 18, 2011 at 03:59:50PM +0200, Daniel Vetter wrote:
> On Sun, Sep 18, 2011 at 03:18:57PM +0200, Marcin Slusarz wrote:
> > Currently DRM_NOUVEAU_GEM_CPU_PREP ioctl is broken WRT handling of signals.
> > 
> > nouveau_gem_ioctl_cpu_prep calls ttm_bo_wait which waits for fence to
> > "signal" or 3 seconds timeout pass.
> > But if it detects pending signal, it returns ERESTARTSYS and goes back
> > to userspace. After signal handler, userspace repeats the same ioctl which
> > starts _new 3 seconds loop_.
> > So when the application relies on signals, some ioctls may never finish
> > from application POV.
> > 
> > There is one important application which does this - Xorg. It uses SIGIO
> > (for input handling) and SIGALARM.
> > 
> > GPU lockups lead to endless ioctl loop which eventually manifests in crash
> > with "[mi] EQ overflowing. The server is probably stuck in an infinite loop."
> > message instead of being propagated to DDX.
> > 
> > The solutions is to add new ioctl NOUVEAU_GEM_CPU_PREP_TIMEOUT with
> > timeout parameter and decrease it on every signal.
> 
> Just fyi: We handle that issue in i915 by returning -EIO when the kernel
> decides that the gpu has died for good and that resetting doesn't help.
> Until then we rely on the ioctl restarting to kick everyone out of kernel
> mode so the reset handler can do its business. If the reset is
> successfull, userspace continues (due to the ioctl being restarted)
> hopefully mostly undisturbed. While the gpu is hung, but not yet reset, we
> stall all ioctls before taking the struct_mutex (see i915_gem_wait_error
> in i915_mutex_lock_interruptible).
> 
> Imo the advantage of that approach is that the kernel utlimately decides
> when the gpu is gone, and userspace (lacking much of the required
> information) must not engage in such guessing-games, too.

This approach would be preferrable, but we don't know yet how to reset
nvidia's gpu. Fixing this API bug could at least let us degrade to noaccel.
And I believe there are cases where ttm_bo_wait can fail with EBUSY and it
doesn't mean GPU locked up...

Marcin