[PATCH 13/13] drm/radeon: rework recursive gpu reset handling

Fri Apr 20 02:38:23 PDT 2012

On 20.04.2012 09:50, Daniel Vetter wrote:
> On Fri, Apr 20, 2012 at 07:57:09AM +0100, Dave Airlie wrote:
>> 2012/4/19 Christian König<deathsimple at vodafone.de>:
>>> Instead of all this humpy pumpy with recursive
>>> mutex (which also fixes only half of the problem)
>>> move the actual gpu reset out of the fence code,
>>> return -EDEADLK and then reset the gpu in the
>>> calling ioctl function.
>> I'm trying to figure out if this has any disadvantages over doing what
>> I proposed before and just kicking a thread to reset the gpu.
>>
>> It seems like this should also avoid the locking problems, I'd like to
>> make sure we don't return -EDEADLK to userspace by accident anywhere,
>> since I don't think it is prepared for it and it would be an ABI change.
> Fyi, the trick i915 uses to solve the reset problem is to bail out with
> -EAGAIN and rely on drmIoctl restarting the ioctl. This way we use the
> same codepaths we use to bail out when getting a signal, and thanks to X
> these are rather well-tested. The hangcheck code also fires off a work item to
> do all the reset magic. In all the ioctls that might wait for the gpu we
> have a fancy piece of code which checks whether a gpu reset is pending,
> and if so waits for that to complete. It also checks whether the reset
> succeeded and if not bails out with -EIO.
> -Daniel
Well, I also considered using an asynchronous work item, but I didn't know
how to properly prevent multiple GPU resets at the same time, signal
the result back to the ioctls, etc. It just seemed more
complicated without any real benefit (except perhaps that you don't have
to check every ioctl result separately, but there are not that many).
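
To make the open questions concrete, here is a rough userspace model of how a
single pending reset and its result could be tracked. It is only a sketch: the
struct and function names are hypothetical and not taken from radeon or i915,
it uses C11 atomics instead of kernel primitives, and it returns -EAGAIN
rather than waiting for the reset to complete.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state; not an actual radeon or i915 structure. */
struct gpu_state {
    atomic_bool reset_pending; /* a reset has been requested and is not done */
    atomic_bool wedged;        /* the last reset attempt failed */
};

/* Returns true only for the one caller that should queue the reset work. */
static bool request_reset(struct gpu_state *gpu)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&gpu->reset_pending, &expected, true);
}

/* Called by the (hypothetical) reset worker once it is finished. */
static void finish_reset(struct gpu_state *gpu, bool success)
{
    atomic_store(&gpu->wedged, !success);
    atomic_store(&gpu->reset_pending, false);
}

/* Check at ioctl entry: retry while a reset is pending, fail if it did not work. */
static int check_reset(struct gpu_state *gpu)
{
    if (atomic_load(&gpu->reset_pending))
        return -EAGAIN;
    if (atomic_load(&gpu->wedged))
        return -EIO;
    return 0;
}
```

The compare-and-exchange lets only the first detector of a lockup start a
reset; later callers see the pending flag and back off.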

Also, I didn't know how to tell userspace to retry the current
operation, but if it's already prepared for -EAGAIN then this sounds
like the proper solution here.
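
For reference, the userspace side of that restart is, as far as I understand
it, just a loop around ioctl() that repeats the call on EINTR or EAGAIN. The
sketch below models that loop; retry_ioctl, ioctl_fn and fake_ioctl are
hypothetical names used for illustration, and the fake callback only stands in
for the real system call so the retry behaviour can be seen in isolation.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type standing in for ioctl(2). */
typedef int (*ioctl_fn)(int fd, unsigned long request, void *arg);

/* Repeat the call while it was interrupted (EINTR) or asked to retry (EAGAIN). */
static int retry_ioctl(ioctl_fn do_ioctl, int fd, unsigned long request, void *arg)
{
    int ret;

    do {
        ret = do_ioctl(fd, request, arg);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

    return ret;
}

/* Fake ioctl for illustration: fails twice with EAGAIN, then succeeds. */
static int fake_calls;

static int fake_ioctl(int fd, unsigned long request, void *arg)
{
    (void)fd;
    (void)request;
    (void)arg;

    if (++fake_calls < 3) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

With a loop like this in the library, the application never sees the -EAGAIN
from a pending reset, only the final result.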

And regarding returning -EDEADLK to userspace: I think I handle every 
ioctl that could cause the lockup detection to run, but checking that 
again won't hurt.

Christian.