[PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Christian König deathsimple at vodafone.de
Tue Oct 8 18:21:37 CEST 2013


Hi Marek,

please try the attached patch as a replacement for your signaling all 
fences patch. I'm not 100% sure if it fixes all issues, but it's at 
least a start.

Thanks,
Christian.

Am 07.10.2013 13:08, schrieb Christian König:
>> First of all, I can't complain about the reliability of the hardware
>> GPU reset. It's mostly the kernel driver that happens to run into a
>> deadlock at the same time.
>
> Alex and I spent quite some time on making this reliable again after 
> activating more rings and adding VM support. The main problem is that 
> I couldn't figure out where the CPU deadlock comes from, because I 
> couldn't reliably reproduce the issue.
>
> What is the content of /proc/<pid of X server>/task/*/stack and 
> /sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck 
> in the deadlock situation?
>
> I'm pretty sure that we nearly always have a problem when two threads 
> are waiting for fences and one of them detects that we have a lockup 
> while the other one keeps holding the exclusive lock. Signaling all 
> fences might work around that problem, but it would probably be better 
> to fix the underlying issue.
>
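> Roughly the situation I mean, as a minimal sketch (the calls below just 
> stand in for the real CS and reset paths, details from memory):
>
>     /* Thread A: e.g. the CS ioctl, waits for a fence with the lock held */
>     down_read(&rdev->exclusive_lock);
>     r = radeon_fence_wait(fence, false); /* stuck here if the fence never signals */
>     up_read(&rdev->exclusive_lock);
>
>     /* Thread B: detects the lockup and tries to reset */
>     radeon_gpu_reset(rdev); /* takes down_write(&rdev->exclusive_lock),
>                                blocks forever behind thread A */
>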
> Going to take a deeper look into it.
>
> Christian.
>
> Am 03.10.2013 02:45, schrieb Marek Olšák:
>> First of all, I can't complain about the reliability of the hardware
>> GPU reset. It's mostly the kernel driver that happens to run into a
>> deadlock at the same time.
>>
>> Regarding the issue with fences, the problem is that the GPU reset
>> completes successfully according to dmesg, but X doesn't respond. I
>> can move the cursor on the screen, but I can't do anything else and
>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
>> easily reproduce this, because it's the most common reason why a GPU
>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I
>> can't tell whether the fences are just not signalled or whether there
>> is actually a real CPU deadlock I can't see.
>>
>> This patch makes the problem go away and GPU resets are successful
>> (except for extreme cases, see below). With a small enough lockup
>> timeout, the lockups are just a minor annoyance and I thought I could
>> get through a piglit run just with a few tens or hundreds of GPU
>> resets...
>>
>> A different type of deadlock showed up, though it needs a lot of
>> concurrently-running apps (like a piglit run) to trigger. What happened
>> is that the kernel driver got stuck/deadlocked in radeon_cs_ioctl,
>> presumably due to a GPU hang, while holding the exclusive lock, and
>> another thread wanting to do the GPU reset was unable to acquire it.
>>
>> That said, I will use the patch locally, because it helps a lot. I got
>> a few lockups while writing this email and I'm glad I didn't have to
>> reboot.
>>
>> Marek
>>
>> On Wed, Oct 2, 2013 at 4:50 PM, Christian König 
>> <deathsimple at vodafone.de> wrote:
>>> Possible, but I would rather guess that this doesn't work because 
>>> the IB test runs into a deadlock situation and so the GPU reset 
>>> never fully completes.
>>>
>>> Can you reproduce the problem?
>>>
>>> If you want to make GPU resets more reliable I would rather suggest 
>>> removing the ring lock dependency.
>>> Then we should try to give all the fence wait functions a (reliable) 
>>> timeout and move reset handling a layer up into the ioctl functions. 
>>> But for this you need to rip out the old PM code first.
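>>>
>>> Very rough sketch of what I mean; the two helpers below don't exist 
>>> yet and the names are made up:
>>>
>>>     /* in the ioctl, not buried in the fence code */
>>>     r = radeon_fence_wait_timeout(fence, false,
>>>                                   RADEON_FENCE_JIFFIES_TIMEOUT); /* hypothetical: always returns */
>>>     if (r == -ETIMEDOUT &&
>>>         radeon_ring_seems_locked_up(rdev, fence->ring)) {        /* hypothetical lockup check */
>>>             r = radeon_gpu_reset(rdev);                          /* no ring or fence lock held here */
>>>     }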
>>>
>>> Christian.
>>>
>>> Marek Olšák <maraeo at gmail.com> schrieb:
>>>
>>>> I'm afraid signalling the fences with an IB test is not reliable.
>>>>
>>>> Marek
>>>>
>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König 
>>>> <deathsimple at vodafone.de> wrote:
>>>>> NAK, after recovering from a lockup the first thing we do is signal 
>>>>> all remaining fences with an IB test.
>>>>>
>>>>> If we don't recover we indeed signal all fences manually.
>>>>>
>>>>> Signalling all fences regardless of the outcome of the reset 
>>>>> creates problems with both types of partial resets.
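>>>>>
>>>>> The flow after a reset is currently something like this (simplified 
>>>>> from memory, not the exact code):
>>>>>
>>>>>     r = radeon_asic_reset(rdev);
>>>>>     if (!r)
>>>>>             r = radeon_ib_ring_tests(rdev); /* the test fence signals
>>>>>                                                everything queued before it */
>>>>>     if (r)
>>>>>             radeon_fence_driver_force_completion(rdev); /* give up, mark
>>>>>                                                            all fences signalled */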
>>>>>
>>>>> Christian.
>>>>>
>>>>> Marek Olšák <maraeo at gmail.com> schrieb:
>>>>>
>>>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>>>
>>>>>> After a lockup, fences are sometimes not signalled, causing
>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>>> in an X server freeze.
>>>>>>
>>>>>> This fixes only one of many deadlocks which can occur during a 
>>>>>> lockup.
>>>>>>
>>>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>>>> 1 file changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> index 841d0e0..7b97baa 100644
>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device 
>>>>>> *rdev)
>>>>>>        radeon_save_bios_scratch_regs(rdev);
>>>>>>        /* block TTM */
>>>>>>        resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>>> +
>>>>>> +      mutex_lock(&rdev->ring_lock);
>>>>>> +      radeon_fence_driver_force_completion(rdev);
>>>>>> +      mutex_unlock(&rdev->ring_lock);
>>>>>> +
>>>>>>        radeon_pm_suspend(rdev);
>>>>>>        radeon_suspend(rdev);
>>>>>>
>>>>>> -- 
>>>>>> 1.8.1.2
>>>>>>
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-radeon-rework-and-fix-reset-detection.patch
Type: text/x-diff
Size: 17294 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/dri-devel/attachments/20131008/c9f9f50b/attachment-0001.patch>

