[PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Christian König deathsimple at vodafone.de
Sun Oct 13 14:47:22 CEST 2013


I've figured out what was wrong with the patch. We need to reset the
"needs_reset" flag earlier, otherwise the IB test might think we are in
a lockup and abort the reset after waiting for the minimum timeout period.
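
The idea, as a heavily simplified sketch (this is not the attached patch
itself, only an illustration of where the flag clearing roughly belongs):

int radeon_gpu_reset(struct radeon_device *rdev)
{
        int r;

        down_write(&rdev->exclusive_lock);

        /* Clear the flag before the ASIC reset and the IB tests run.
         * Otherwise the fence waits inside the IB tests still see
         * needs_reset, treat it as a fresh lockup and give up after
         * the minimum timeout, which aborts the reset. */
        rdev->needs_reset = false;

        /* ... save scratch regs, suspend, reset, resume as before ... */
        r = radeon_asic_reset(rdev);
        if (!r)
                r = radeon_ib_ring_tests(rdev);

        up_write(&rdev->exclusive_lock);
        return r;
}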

Please try the attached patch instead.

Thanks,
Christian.

On 09.10.2013 14:04, Marek Olšák wrote:
> The ring test of the first compute ring always fails and it shouldn't
> affect the GPU reset in any way.
>
> I can't tell if the deadlock issue is fixed, because the GPU reset
> usually fails with your patch. It always succeeded without your patch.
>
> Marek
>
> On Wed, Oct 9, 2013 at 1:09 PM, Christian König <deathsimple at vodafone.de> wrote:
>> Hmm, that doesn't look like anything related, but more like the reset of
>> the compute ring didn't work.
>>
>> How often does that happen? And do you still get the problem where X waits
>> for a fence that never comes back?
>>
>> Christian.
>>
>> On 09.10.2013 12:36, Marek Olšák wrote:
>>
>>> I'm afraid your patch sometimes causes the GPU reset to fail, which
>>> had never happened before IIRC.
>>>
>>> The dmesg log from the failure is attached.
>>>
>>> Marek
>>>
>>> On Tue, Oct 8, 2013 at 6:21 PM, Christian König <deathsimple at vodafone.de>
>>> wrote:
>>>> Hi Marek,
>>>>
>>>> please try the attached patch as a replacement for your signaling all
>>>> fences patch. I'm not 100% sure if it fixes all issues, but it's at
>>>> least a start.
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> On 07.10.2013 13:08, Christian König wrote:
>>>>
>>>>>> First of all, I can't complain about the reliability of the hardware
>>>>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>>>>> deadlock at the same time.
>>>>>
>>>>> Alex and I spent quite some time making this reliable again after
>>>>> activating more rings and adding VM support. The main problem is that I
>>>>> couldn't figure out where the CPU deadlock comes from, because I
>>>>> couldn't reliably reproduce the issue.
>>>>>
>>>>> What is the content of /proc/<pid of X server>/task/*/stack and
>>>>> /sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in
>>>>> the deadlock situation?
>>>>>
>>>>> I'm pretty sure that we nearly always have a problem when two threads
>>>>> are waiting for fences and one of them detects that we have a lockup
>>>>> while the other one keeps holding the exclusive lock. Signaling all
>>>>> fences might work around that problem, but it would probably be better
>>>>> to fix the underlying issue.
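>>>>>
>>>>> Roughly the suspected interaction, as a simplified sketch (the flow is
>>>>> illustrative, not a trace of the actual code paths):
>>>>>
>>>>> /*
>>>>>  * Thread A (e.g. radeon_cs_ioctl):
>>>>>  *     down_read(&rdev->exclusive_lock);
>>>>>  *     radeon_fence_wait(fence_a, false);     // blocks, GPU is hung
>>>>>  *
>>>>>  * Thread B (another fence waiter):
>>>>>  *     radeon_fence_wait(fence_b, false);     // detects the lockup
>>>>>  *     radeon_gpu_reset(rdev);
>>>>>  *         down_write(&rdev->exclusive_lock); // blocks behind thread A
>>>>>  *
>>>>>  * Thread A never returns from its wait, because the reset that would
>>>>>  * signal (or force-complete) its fence can never start.
>>>>>  */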
>>>>>
>>>>> Going to take a deeper look into it.
>>>>>
>>>>> Christian.
>>>>>
>>>>> On 03.10.2013 02:45, Marek Olšák wrote:
>>>>>> First of all, I can't complain about the reliability of the hardware
>>>>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>>>>> deadlock at the same time.
>>>>>>
>>>>>> Regarding the issue with fences, the problem is that the GPU reset
>>>>>> completes successfully according to dmesg, but X doesn't respond. I
>>>>>> can move the cursor on the screen, but I can't do anything else and
>>>>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
>>>>>> easily reproduce this, because it's the most common reason why a GPU
>>>>>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I
>>>>>> can't tell whether the fences are just not signalled or whether there
>>>>>> is actually a real CPU deadlock I can't see.
>>>>>>
>>>>>> This patch makes the problem go away and GPU resets are successful
>>>>>> (except for extreme cases, see below). With a small enough lockup
>>>>>> timeout, the lockups are just a minor annoyance and I thought I could
>>>>>> get through a piglit run just with a few tens or hundreds of GPU
>>>>>> resets...
>>>>>>
>>>>>> A different type of deadlock showed up, though it needs a lot of
>>>>>> concurrently-running apps like piglit. What happened is that the
>>>>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due
>>>>>> to a GPU hang while holding onto the exclusive lock, and another
>>>>>> thread wanting to do the GPU reset was unable to acquire the lock.
>>>>>>
>>>>>> That said, I will use the patch locally, because it helps a lot. I got
>>>>>> a few lockups while writing this email and I'm glad I didn't have to
>>>>>> reboot.
>>>>>>
>>>>>> Marek
>>>>>>
>>>>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian König
>>>>>> <deathsimple at vodafone.de>
>>>>>> wrote:
>>>>>>> Possible, but I would rather guess that this doesn't work because the
>>>>>>> IB test runs into a deadlock situation and so the GPU reset never
>>>>>>> fully completes.
>>>>>>>
>>>>>>> Can you reproduce the problem?
>>>>>>>
>>>>>>> If you want to make GPU resets more reliable, I would rather suggest
>>>>>>> removing the ring lock dependency.
>>>>>>> Then we should try to give all the fence wait functions a (reliable)
>>>>>>> timeout and move reset handling a layer up into the ioctl functions.
>>>>>>> But for this you need to rip out the old PM code first.
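>>>>>>>
>>>>>>> Roughly that direction, as a hypothetical sketch: a helper like
>>>>>>> radeon_fence_wait_timeout() does not exist yet, so that name, its
>>>>>>> semantics and the retry loop are made up here purely for illustration:
>>>>>>>
>>>>>>> static int example_submit_ioctl(struct radeon_device *rdev,
>>>>>>>                                 struct radeon_fence *fence)
>>>>>>> {
>>>>>>>         long r;
>>>>>>>
>>>>>>>         for (;;) {
>>>>>>>                 /* hypothetical bounded wait instead of blocking
>>>>>>>                  * forever: >0 when signalled, 0 on timeout,
>>>>>>>                  * negative on error */
>>>>>>>                 r = radeon_fence_wait_timeout(fence, false,
>>>>>>>                                               msecs_to_jiffies(2000));
>>>>>>>                 if (r > 0)
>>>>>>>                         return 0;       /* fence signalled */
>>>>>>>                 if (r < 0 && r != -EDEADLK)
>>>>>>>                         return r;       /* unrelated error */
>>>>>>>
>>>>>>>                 /* timeout or detected lockup: handle the reset here
>>>>>>>                  * in the ioctl, a layer above the fence code, then
>>>>>>>                  * retry the wait */
>>>>>>>                 r = radeon_gpu_reset(rdev);
>>>>>>>                 if (r)
>>>>>>>                         return r;
>>>>>>>         }
>>>>>>> }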
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Marek Olšák <maraeo at gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm afraid signalling the fences with an IB test is not reliable.
>>>>>>>>
>>>>>>>> Marek
>>>>>>>>
>>>>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König
>>>>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>>>> NAK, after recovering from a lockup the first thing we do is signal
>>>>>>>>> all remaining fences with an IB test.
>>>>>>>>>
>>>>>>>>> If we don't recover we indeed signal all fences manually.
>>>>>>>>>
>>>>>>>>> Signalling all fences regardless of the outcome of the reset creates
>>>>>>>>> problems with both types of partial resets.
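>>>>>>>>>
>>>>>>>>> As a simplified sketch of the above (the wrapper function is made up
>>>>>>>>> for illustration; only the two driver calls are real):
>>>>>>>>>
>>>>>>>>> static void example_post_reset(struct radeon_device *rdev,
>>>>>>>>>                                bool reset_worked)
>>>>>>>>> {
>>>>>>>>>         if (reset_worked) {
>>>>>>>>>                 /* the IB tests submit real work, which signals the
>>>>>>>>>                  * outstanding fences the normal way */
>>>>>>>>>                 radeon_ib_ring_tests(rdev);
>>>>>>>>>         } else {
>>>>>>>>>                 /* nothing will signal the fences anymore, so
>>>>>>>>>                  * complete them manually to unblock the waiters */
>>>>>>>>>                 radeon_fence_driver_force_completion(rdev);
>>>>>>>>>         }
>>>>>>>>> }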
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> Marek Olšák <maraeo at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>>>>>>>
>>>>>>>>>> After a lockup, fences are not signalled sometimes, causing
>>>>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>>>>>>> in an X server freeze.
>>>>>>>>>>
>>>>>>>>>> This fixes only one of many deadlocks which can occur during a
>>>>>>>>>> lockup.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>> index 841d0e0..7b97baa 100644
>>>>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
>>>>>>>>>>          radeon_save_bios_scratch_regs(rdev);
>>>>>>>>>>          /* block TTM */
>>>>>>>>>>          resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>>>>>>> +
>>>>>>>>>> +      mutex_lock(&rdev->ring_lock);
>>>>>>>>>> +      radeon_fence_driver_force_completion(rdev);
>>>>>>>>>> +      mutex_unlock(&rdev->ring_lock);
>>>>>>>>>> +
>>>>>>>>>>          radeon_pm_suspend(rdev);
>>>>>>>>>>          radeon_suspend(rdev);
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> 1.8.1.2
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> dri-devel mailing list
>>>>>>>>>> dri-devel at lists.freedesktop.org
>>>>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-radeon-rework-and-fix-reset-detection-v2.patch
Type: text/x-diff
Size: 17226 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/dri-devel/attachments/20131013/3187a906/attachment-0001.patch>

