[PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Marek Olšák maraeo at gmail.com
Wed Oct 9 14:04:59 CEST 2013


The ring test of the first compute ring always fails and it shouldn't
affect the GPU reset in any way.

I can't tell if the deadlock issue is fixed, because the GPU reset
usually fails with your patch. It always succeeded without your patch.

Marek

On Wed, Oct 9, 2013 at 1:09 PM, Christian König <deathsimple at vodafone.de> wrote:
> Mhm, that doesn't look like anything related, but more like the reset of
> the compute ring didn't work.
>
> How often does that happen? And do you still get the problem where X waits
> for a fence that never comes back?
>
> Christian.
>
> Am 09.10.2013 12:36, schrieb Marek Olšák:
>
>> I'm afraid your patch sometimes causes the GPU reset to fail, which
>> had never happened before IIRC.
>>
>> The dmesg log from the failure is attached.
>>
>> Marek
>>
>> On Tue, Oct 8, 2013 at 6:21 PM, Christian König <deathsimple at vodafone.de>
>> wrote:
>>>
>>> Hi Marek,
>>>
>>> please try the attached patch as a replacement for your signaling all
>>> fences
>>> patch. I'm not 100% sure if it fixes all issues, but it's at least a
>>> start.
>>>
>>> Thanks,
>>> Christian.
>>>
>>> Am 07.10.2013 13:08, schrieb Christian König:
>>>
>>>>> First of all, I can't complain about the reliability of the hardware
>>>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>>>> deadlock at the same time.
>>>>
>>>>
>>>> Alex and I spent quite some time on making this reliable again after
>>>> activating more rings and adding VM support. The main problem is that I
>>>> couldn't figure out where the CPU deadlock comes from, because I
>>>> couldn't reliably reproduce the issue.
>>>>
>>>> What is the content of /proc/<pid of X server>/task/*/stack and
>>>> /sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck
>>>> in the deadlock situation?
>>>>
>>>> I'm pretty sure that we nearly always have a problem when two threads
>>>> are waiting for fences and one of them detects that we have a lockup
>>>> while the other one keeps holding the exclusive lock. Signaling all
>>>> fences might work around that problem, but it would probably be better
>>>> to fix the underlying issue.
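
As a rough illustration of the pattern described above (two fence waiters, one
holding the exclusive lock, the other detecting the lockup and needing that
lock for the reset), here is a minimal userspace sketch. All names here
(exclusive_lock, fence_signaled, ...) are made up for illustration; this is
not the radeon kernel code.

/* Build with: gcc -pthread deadlock_sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t exclusive_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile bool fence_signaled;     /* never set after the lockup */

/* Waiter A: holds the exclusive lock and polls its fence forever,
 * with no timeout and no lockup check. */
static void *waiter_holding_lock(void *arg)
{
	pthread_mutex_lock(&exclusive_lock);
	while (!fence_signaled)
		usleep(100 * 1000);
	pthread_mutex_unlock(&exclusive_lock);
	return NULL;
}

/* Waiter B: detects the lockup and wants to reset, but the reset
 * needs the exclusive lock that waiter A still holds. */
static void *waiter_detecting_lockup(void *arg)
{
	sleep(1);                              /* pretend a lockup was detected */
	printf("lockup detected, trying to reset\n");
	pthread_mutex_lock(&exclusive_lock);   /* blocks forever */
	fence_signaled = true;                 /* the reset would signal fences */
	pthread_mutex_unlock(&exclusive_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, waiter_holding_lock, NULL);
	pthread_create(&b, NULL, waiter_detecting_lockup, NULL);
	pthread_join(a, NULL);                 /* never returns */
	pthread_join(b, NULL);
	return 0;
}
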
>>>>
>>>> Going to take a deeper look into it.
>>>>
>>>> Christian.
>>>>
>>>> Am 03.10.2013 02:45, schrieb Marek Olšák:
>>>>>
>>>>> First of all, I can't complain about the reliability of the hardware
>>>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>>>> deadlock at the same time.
>>>>>
>>>>> Regarding the issue with fences, the problem is that the GPU reset
>>>>> completes successfully according to dmesg, but X doesn't respond. I
>>>>> can move the cursor on the screen, but I can't do anything else and
>>>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
>>>>> easily reproduce this, because it's the most common reason why a GPU
>>>>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I
>>>>> can't tell whether the fences are just not signalled or whether there
>>>>> is actually a real CPU deadlock I can't see.
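
For reference, the ioctl mentioned here is reached from userspace roughly as
in the sketch below, using libdrm; error handling and the surrounding DDX code
are omitted, and the exact includes may differ in a real build.

#include <stdint.h>
#include <xf86drm.h>
#include <radeon_drm.h>

/* Blocks in the kernel until every fence on the buffer has signalled.
 * If a fence is lost in the lockup, this call simply never returns,
 * which matches the stuck-in-GEM_WAIT_IDLE backtrace seen in gdb. */
static void wait_buffer_idle(int drm_fd, uint32_t bo_handle)
{
	struct drm_radeon_gem_wait_idle args = { .handle = bo_handle };

	drmIoctl(drm_fd, DRM_IOCTL_RADEON_GEM_WAIT_IDLE, &args);
}
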
>>>>>
>>>>> This patch makes the problem go away and GPU resets are successful
>>>>> (except for extreme cases, see below). With a small enough lockup
>>>>> timeout, the lockups are just a minor annoyance and I thought I could
>>>>> get through a piglit run just with a few tens or hundreds of GPU
>>>>> resets...
>>>>>
>>>>> A different type of deadlock showed up, though it needs a lot of
>>>>> concurrently-running apps like piglit. What happened is that the
>>>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due
>>>>> to a GPU hang while holding onto the exclusive lock, and another
>>>>> thread wanting to do the GPU reset was unable to acquire the lock.
>>>>>
>>>>> That said, I will use the patch locally, because it helps a lot. I got
>>>>> a few lockups while writing this email and I'm glad I didn't have to
>>>>> reboot.
>>>>>
>>>>> Marek
>>>>>
>>>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian König
>>>>> <deathsimple at vodafone.de>
>>>>> wrote:
>>>>>>
>>>>>> Possible, but I would rather guess that this doesn't work because the
>>>>>> IB
>>>>>> test runs into a deadlock situation and so the GPU reset never fully
>>>>>> completes.
>>>>>>
>>>>>> Can you reproduce the problem?
>>>>>>
>>>>>> If you want to make GPU resets more reliable, I would rather suggest
>>>>>> removing the ring lock dependency.
>>>>>> Then we should try to give all the fence wait functions a (reliable)
>>>>>> timeout and move reset handling a layer up into the ioctl functions.
>>>>>> But for this you need to rip out the old PM code first.
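
A very rough sketch of that direction (hypothetical names throughout, not the
actual radeon interfaces): every fence wait gets a real timeout, and the
lockup check plus reset happen at the ioctl level, where no low-level locks
are held.

#include <errno.h>
#include <stdbool.h>

struct device;
struct fence;

/* Hypothetical helpers, assumed for the sketch. */
int  fence_wait_timeout(struct fence *f, unsigned int timeout_ms);
bool gpu_is_locked_up(struct device *dev);
int  gpu_reset(struct device *dev);

#define WAIT_TIMEOUT_MS 500

static int some_wait_ioctl(struct device *dev, struct fence *f)
{
	for (;;) {
		int r = fence_wait_timeout(f, WAIT_TIMEOUT_MS);

		if (r != -ETIMEDOUT)
			return r;        /* signalled, or a real error */

		/* The wait timed out: check for a lockup and reset here,
		 * a layer up from the wait function itself. */
		if (gpu_is_locked_up(dev)) {
			r = gpu_reset(dev);
			if (r)
				return r;
		}
	}
}
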
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Marek Olšák <maraeo at gmail.com> schrieb:
>>>>>>
>>>>>>> I'm afraid signalling the fences with an IB test is not reliable.
>>>>>>>
>>>>>>> Marek
>>>>>>>
>>>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König
>>>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>>>
>>>>>>>> NAK; after recovering from a lockup, the first thing we do is
>>>>>>>> signal all remaining fences with an IB test.
>>>>>>>>
>>>>>>>> If we don't recover we indeed signal all fences manually.
>>>>>>>>
>>>>>>>> Signalling all fences regardless of the outcome of the reset creates
>>>>>>>> problems with both types of partial resets.
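
In other words, the intended post-reset flow looks roughly like the sketch
below (hypothetical names and ring count, not the actual radeon functions): a
successful IB test lets each ring's fences signal through the normal path, and
only a failed recovery falls back to completing them by hand.

#include <stdbool.h>

struct device;

/* Hypothetical helpers, assumed for the sketch. */
int  ib_test_ring(struct device *dev, int ring);
void force_complete_fences(struct device *dev, int ring);

#define NUM_RINGS 6

static void finish_reset(struct device *dev, bool reset_succeeded)
{
	for (int ring = 0; ring < NUM_RINGS; ring++) {
		/* A working ring executes the IB test, so its remaining
		 * fences signal through the normal path. */
		if (reset_succeeded && ib_test_ring(dev, ring) == 0)
			continue;

		/* Recovery failed (or this ring did not come back):
		 * signal whatever is left manually. */
		force_complete_fences(dev, ring);
	}
}
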
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Marek Olšák <maraeo at gmail.com> schrieb:
>>>>>>>>
>>>>>>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>>>>>>
>>>>>>>>> After a lockup, fences are sometimes not signalled, causing
>>>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>>>>>> in an X server freeze.
>>>>>>>>>
>>>>>>>>> This fixes only one of many deadlocks which can occur during a
>>>>>>>>> lockup.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>> b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>> index 841d0e0..7b97baa 100644
>>>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device
>>>>>>>>> *rdev)
>>>>>>>>>         radeon_save_bios_scratch_regs(rdev);
>>>>>>>>>         /* block TTM */
>>>>>>>>>         resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>>>>>> +
>>>>>>>>> +      mutex_lock(&rdev->ring_lock);
>>>>>>>>> +      radeon_fence_driver_force_completion(rdev);
>>>>>>>>> +      mutex_unlock(&rdev->ring_lock);
>>>>>>>>> +
>>>>>>>>>         radeon_pm_suspend(rdev);
>>>>>>>>>         radeon_suspend(rdev);
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 1.8.1.2
>>>>>>>>>
>>>>
>>>>
>>>
>>>
>

