[PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Wed Oct 9 12:36:04 CEST 2013

I'm afraid your patch sometimes causes the GPU reset to fail, which
had never happened before IIRC.

The dmesg log from the failure is attached.

Marek

On Tue, Oct 8, 2013 at 6:21 PM, Christian König <deathsimple at vodafone.de> wrote:
> Hi Marek,
>
> please try the attached patch as a replacement for your signaling all fences
> patch. I'm not 100% sure if it fixes all issues, but it's at least a start.
>
> Thanks,
> Christian.
>
> On 07.10.2013 13:08, Christian König wrote:
>
>>> First of all, I can't complain about the reliability of the hardware
>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>> deadlock at the same time.
>>
>>
>> Alex and I spent quite some time on making this reliable again after
>> activating more rings and adding VM support. The main problem is that I
>> couldn't figure out where the CPU deadlock comes from, because I couldn't
>> reliably reproduce the issue.
>>
>> What is the content of /proc/<pid of X server>/task/*/stack and
>> /sys/kernel/debug/dri/0/radeon_fence_info when the X server is stuck in the
>> deadlock situation?
>>
>> I'm pretty sure that we nearly always have a problem when two threads are
>> waiting for fences and one of them detects that we have a lockup while the
>> other one keeps holding the exclusive lock. Signaling all fences might work
>> around that problem, but it probably would be better to fix the underlying
>> issue.
>>
>> Going to take a deeper look into it.
>>
>> Christian.
>>
>> On 03.10.2013 02:45, Marek Olšák wrote:
>>>
>>> First of all, I can't complain about the reliability of the hardware
>>> GPU reset. It's mostly the kernel driver that happens to run into a
>>> deadlock at the same time.
>>>
>>> Regarding the issue with fences, the problem is that the GPU reset
>>> completes successfully according to dmesg, but X doesn't respond. I
>>> can move the cursor on the screen, but I can't do anything else and
>>> the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
>>> easily reproduce this, because it's the most common reason why a GPU
>>> lockup leads to frozen X. The GPU actually recovers, but X is hung. I
>>> can't tell whether the fences are just not signalled or whether there
>>> is actually a real CPU deadlock I can't see.
>>>
>>> This patch makes the problem go away and GPU resets are successful
>>> (except for extreme cases, see below). With a small enough lockup
>>> timeout, the lockups are just a minor annoyance and I thought I could
>>> get through a piglit run just with a few tens or hundreds of GPU
>>> resets...
>>>
>>> A different type of deadlock showed up, though it needs a lot of
>>> concurrently-running apps like piglit. What happened is that the
>>> kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due
>>> to a GPU hang while holding onto the exclusive lock, and another
>>> thread wanting to do the GPU reset was unable to acquire the lock.
>>>
>>> That said, I will use the patch locally, because it helps a lot. I got
>>> a few lockups while writing this email and I'm glad I didn't have to
>>> reboot.
>>>
>>> Marek
>>>
>>> On Wed, Oct 2, 2013 at 4:50 PM, Christian König <deathsimple at vodafone.de>
>>> wrote:
>>>>
>>>> Possible, but I would rather guess that this doesn't work because the IB
>>>> test runs into a deadlock situation and so the GPU reset never fully
>>>> completes.
>>>>
>>>> Can you reproduce the problem?
>>>>
>>>> If you want to make GPU resets more reliable I would rather suggest
>>>> removing the ring lock dependency.
>>>> Then we should try to give all the fence wait functions a (reliable)
>>>> timeout and move reset handling a layer up into the ioctl functions. But for
>>>> this you need to rip out the old PM code first.
>>>>
>>>> Christian.
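
The suggestion above (give fence waits a reliable timeout and handle the reset one layer up, in the ioctl) can be sketched in userspace as follows. This is only a minimal model under stated assumptions, not the radeon code: `struct model_ring`, `model_fence_wait_timeout()` and `model_wait_idle_ioctl()` are hypothetical names, the "reset" is a stand-in that only counts calls, and returning -EAGAIN after a reset is an assumption. The -EDEADLK timeout value is chosen to match the -35 in the attached log, which is EDEADLK on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical model of one ring's fence state; not the radeon structures. */
struct model_ring {
    uint64_t emitted_seq;   /* last sequence number submitted */
    uint64_t signaled_seq;  /* last sequence number reported as finished */
};

/* Monotonic clock in milliseconds. */
static int64_t model_now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll until seq has signaled or timeout_ms has elapsed.
 * Returns 0 on success and -EDEADLK on timeout. No lock is held here.
 */
static int model_fence_wait_timeout(const struct model_ring *ring,
                                    uint64_t seq, int64_t timeout_ms)
{
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };
    int64_t deadline;

    assert(timeout_ms >= 0);
    deadline = model_now_ms() + timeout_ms;

    while (ring->signaled_seq < seq) {
        if (model_now_ms() >= deadline)
            return -EDEADLK;
        nanosleep(&nap, NULL);
    }
    return 0;
}

/*
 * ioctl-level caller: the wait itself never resets anything, so the reset
 * decision is made here, one layer up. The "reset" is a stand-in that
 * only counts calls and marks everything emitted as finished.
 */
static int model_wait_idle_ioctl(struct model_ring *ring, int64_t timeout_ms,
                                 int *reset_count)
{
    int r = model_fence_wait_timeout(ring, ring->emitted_seq, timeout_ms);

    if (r == -EDEADLK) {
        (*reset_count)++;
        ring->signaled_seq = ring->emitted_seq;
        return -EAGAIN; /* assumption: ask the caller to retry */
    }
    return r;
}
```

In this model the wait function can only succeed or time out, so a caller that holds no lock while waiting is always in a position to start the reset itself.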
>>>>
>>>> Marek Olšák <maraeo at gmail.com> wrote:
>>>>
>>>>> I'm afraid signalling the fences with an IB test is not reliable.
>>>>>
>>>>> Marek
>>>>>
>>>>> On Wed, Oct 2, 2013 at 3:52 PM, Christian König
>>>>> <deathsimple at vodafone.de> wrote:
>>>>>>
>>>>>> NAK, after recovering from a lockup the first thing we do is
>>>>>> signal all remaining fences with an IB test.
>>>>>>
>>>>>> If we don't recover we indeed signal all fences manually.
>>>>>>
>>>>>> Signalling all fences regardless of the outcome of the reset creates
>>>>>> problems with both types of partial resets.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> Marek Olšák <maraeo at gmail.com> wrote:
>>>>>>
>>>>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>>>>
>>>>>>> After a lockup, fences are not signalled sometimes, causing
>>>>>>> the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>>>> in an X server freeze.
>>>>>>>
>>>>>>> This fixes only one of many deadlocks which can occur during a
>>>>>>> lockup.
>>>>>>>
>>>>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>> index 841d0e0..7b97baa 100644
>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>>>> @@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
>>>>>>>        radeon_save_bios_scratch_regs(rdev);
>>>>>>>        /* block TTM */
>>>>>>>        resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>>>> +
>>>>>>> +      mutex_lock(&rdev->ring_lock);
>>>>>>> +      radeon_fence_driver_force_completion(rdev);
>>>>>>> +      mutex_unlock(&rdev->ring_lock);
>>>>>>> +
>>>>>>>        radeon_pm_suspend(rdev);
>>>>>>>        radeon_suspend(rdev);
>>>>>>>
>>>>>>> --
>>>>>>> 1.8.1.2
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel at lists.freedesktop.org
>>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>>
>
>
-------------- next part --------------
[  104.836967] radeon 0000:01:00.0: GPU lockup CP stall for more than 1000msec
[  104.836974] radeon 0000:01:00.0: GPU lockup (waiting for 0x000000000000fcee last fence id 0x000000000000fcd8 on ring 0)
[  105.001288] [drm] Disabling audio 0 support
[  105.001289] [drm] Disabling audio 1 support
[  105.001290] [drm] Disabling audio 2 support
[  105.001291] [drm] Disabling audio 3 support
[  105.001292] [drm] Disabling audio 4 support
[  105.001292] [drm] Disabling audio 5 support
[  105.001293] [drm] Disabling audio 6 support
[  105.001496] radeon 0000:01:00.0: sa_manager is not empty, clearing anyway
[  105.008773] radeon 0000:01:00.0: Saved 15990 dwords of commands on ring 0.
[  105.008776] radeon 0000:01:00.0: Saved 121 dwords of commands on ring 2.
[  105.008787] radeon 0000:01:00.0: GPU softreset: 0x00000008
[  105.008788] radeon 0000:01:00.0:   GRBM_STATUS=0xA0003028
[  105.008790] radeon 0000:01:00.0:   GRBM_STATUS2=0x70000008
[  105.008791] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
[  105.008792] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
[  105.008794] radeon 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
[  105.008795] radeon 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
[  105.008796] radeon 0000:01:00.0:   SRBM_STATUS=0x20000A40
[  105.008797] radeon 0000:01:00.0:   SRBM_STATUS2=0x00000000
[  105.008799] radeon 0000:01:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
[  105.008801] radeon 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46CEED57
[  105.008802] radeon 0000:01:00.0:   CP_STAT = 0x84010200
[  105.008803] radeon 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000c00
[  105.008805] radeon 0000:01:00.0:   CP_STALLED_STAT2 = 0x00010000
[  105.008806] radeon 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000400
[  105.008807] radeon 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x48440002
[  105.008809] radeon 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000001
[  105.008810] radeon 0000:01:00.0:   CP_CPF_STATUS = 0x80008023
[  105.008811] radeon 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000408
[  105.008813] radeon 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00002100
[  105.008814] radeon 0000:01:00.0:   CP_CPC_STATUS = 0xa0000041
[  105.008816] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  105.008817] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  105.018070] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00010001
[  105.018122] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[  105.019267] radeon 0000:01:00.0:   GRBM_STATUS=0xA0003028
[  105.019268] radeon 0000:01:00.0:   GRBM_STATUS2=0x30000008
[  105.019269] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
[  105.019271] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
[  105.019272] radeon 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
[  105.019273] radeon 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
[  105.019275] radeon 0000:01:00.0:   SRBM_STATUS=0x20000040
[  105.019276] radeon 0000:01:00.0:   SRBM_STATUS2=0x00000000
[  105.019277] radeon 0000:01:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
[  105.019279] radeon 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46CEED57
[  105.019280] radeon 0000:01:00.0:   CP_STAT = 0x00000000
[  105.019281] radeon 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000000
[  105.019283] radeon 0000:01:00.0:   CP_STALLED_STAT2 = 0x00000000
[  105.019284] radeon 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000000
[  105.019285] radeon 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x48440000
[  105.019287] radeon 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
[  105.019288] radeon 0000:01:00.0:   CP_CPF_STATUS = 0x80008003
[  105.019289] radeon 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000408
[  105.019291] radeon 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
[  105.019292] radeon 0000:01:00.0:   CP_CPC_STATUS = 0xa0000041
[  105.019299] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  105.036864] [drm] probing gen 2 caps for device 8086:151 = 261ad03/e
[  105.036867] [drm] PCIE gen 3 link speeds already enabled
[  105.039262] [drm] PCIE GART of 1024M enabled (table at 0x0000000000277000).
[  105.039340] radeon 0000:01:00.0: WB enabled
[  105.039344] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffc7cc00
[  105.039346] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000040000c04 and cpu addr 0xffc7cc04
[  105.039347] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000040000c08 and cpu addr 0xffc7cc08
[  105.039348] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffc7cc0c
[  105.039349] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000040000c10 and cpu addr 0xffc7cc10
[  105.039727] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000076c98 and cpu addr 0xf8bb6c98
[  105.041638] [drm] ring test on 0 succeeded in 4 usecs
[  105.316804] [drm:cik_ring_test] *ERROR* radeon: ring 1 test failed (scratch(0x30118)=0xCAFEDEAD)
[  105.316952] [drm] ring test on 3 succeeded in 2 usecs
[  105.316962] [drm] ring test on 4 succeeded in 2 usecs
[  105.362635] [drm] ring test on 5 succeeded in 2 usecs
[  105.382493] [drm] UVD initialized successfully.
[  105.383105] [drm] Enabling audio 0 support
[  105.383106] [drm] Enabling audio 1 support
[  105.383106] [drm] Enabling audio 2 support
[  105.383107] [drm] Enabling audio 3 support
[  105.383108] [drm] Enabling audio 4 support
[  105.383109] [drm] Enabling audio 5 support
[  105.383109] [drm] Enabling audio 6 support
[  105.383294] [drm:cik_ib_test] *ERROR* radeon: fence wait failed (-35).
[  105.383296] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
[  105.383297] radeon 0000:01:00.0: ib ring test failed (-35).
[  105.383298] [drm] Disabling audio 0 support
[  105.383299] [drm] Disabling audio 1 support
[  105.383299] [drm] Disabling audio 2 support
[  105.383300] [drm] Disabling audio 3 support
[  105.383301] [drm] Disabling audio 4 support
[  105.383302] [drm] Disabling audio 5 support
[  105.383302] [drm] Disabling audio 6 support
[  105.390445] radeon 0000:01:00.0: GPU softreset: 0x00000008
[  105.390446] radeon 0000:01:00.0:   GRBM_STATUS=0xA0003028
[  105.390447] radeon 0000:01:00.0:   GRBM_STATUS2=0x70000008
[  105.390449] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
[  105.390450] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
[  105.390451] radeon 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
[  105.390453] radeon 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
[  105.390454] radeon 0000:01:00.0:   SRBM_STATUS=0x20000040
[  105.390455] radeon 0000:01:00.0:   SRBM_STATUS2=0x00000000
[  105.390457] radeon 0000:01:00.0:   SDMA0_STATUS_REG   = 0x46CEED57
[  105.390458] radeon 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46CEED57
[  105.390460] radeon 0000:01:00.0:   CP_STAT = 0x80010200
[  105.390461] radeon 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000c00
[  105.390462] radeon 0000:01:00.0:   CP_STALLED_STAT2 = 0x00010000
[  105.390464] radeon 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000000
[  105.390465] radeon 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x48440002
[  105.390466] radeon 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000001
[  105.390468] radeon 0000:01:00.0:   CP_CPF_STATUS = 0x80008023
[  105.390469] radeon 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000408
[  105.390471] radeon 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00002100
[  105.390472] radeon 0000:01:00.0:   CP_CPC_STATUS = 0xa0000041
[  105.390473] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  105.390475] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
[  105.390583] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00010001
[  105.390635] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[  105.391779] radeon 0000:01:00.0:   GRBM_STATUS=0xA0003028
[  105.391781] radeon 0000:01:00.0:   GRBM_STATUS2=0x30000008
[  105.391782] radeon 0000:01:00.0:   GRBM_STATUS_SE0=0x00000006
[  105.391783] radeon 0000:01:00.0:   GRBM_STATUS_SE1=0x00000006
[  105.391785] radeon 0000:01:00.0:   GRBM_STATUS_SE2=0x00000006
[  105.391786] radeon 0000:01:00.0:   GRBM_STATUS_SE3=0x00000006
[  105.391787] radeon 0000:01:00.0:   SRBM_STATUS=0x20000040
[  105.391788] radeon 0000:01:00.0:   SRBM_STATUS2=0x00000000
[  105.391790] radeon 0000:01:00.0:   SDMA0_STATUS_REG   = 0x46CEED57
[  105.391791] radeon 0000:01:00.0:   SDMA1_STATUS_REG   = 0x46CEED57
[  105.391793] radeon 0000:01:00.0:   CP_STAT = 0x00000000
[  105.391794] radeon 0000:01:00.0:   CP_STALLED_STAT1 = 0x00000000
[  105.391795] radeon 0000:01:00.0:   CP_STALLED_STAT2 = 0x00000000
[  105.391797] radeon 0000:01:00.0:   CP_STALLED_STAT3 = 0x00000000
[  105.391798] radeon 0000:01:00.0:   CP_CPF_BUSY_STAT = 0x48440000
[  105.391799] radeon 0000:01:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
[  105.391801] radeon 0000:01:00.0:   CP_CPF_STATUS = 0x80008003
[  105.391802] radeon 0000:01:00.0:   CP_CPC_BUSY_STAT = 0x00000408
[  105.391803] radeon 0000:01:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
[  105.391805] radeon 0000:01:00.0:   CP_CPC_STATUS = 0xa0000041
[  105.391811] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[  105.394327] [drm] probing gen 2 caps for device 8086:151 = 261ad03/e
[  105.394329] [drm] PCIE gen 3 link speeds already enabled
[  105.396699] [drm] PCIE GART of 1024M enabled (table at 0x0000000000277000).
[  105.396772] radeon 0000:01:00.0: WB enabled
[  105.396775] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffc7cc00
[  105.396776] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000040000c04 and cpu addr 0xffc7cc04
[  105.396778] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000040000c08 and cpu addr 0xffc7cc08
[  105.396779] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffc7cc0c
[  105.396780] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000040000c10 and cpu addr 0xffc7cc10
[  105.397157] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000076c98 and cpu addr 0xf8bb6c98
[  105.399061] [drm:cik_ring_test] *ERROR* radeon: cp failed to get scratch reg (-22).
[  105.399062] [drm:cik_resume] *ERROR* cik startup failed on resume
[  105.399193] [drm:cik_sdma_ib_test] *ERROR* radeon: fence wait failed (-35).
[  105.399194] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 3 (-35).
[  105.399325] [drm:cik_sdma_ib_test] *ERROR* radeon: fence wait failed (-35).
[  105.399326] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 4 (-35).
[  105.536423] [drm:ci_dpm_enable] *ERROR* ci_start_dpm failed
[  105.536424] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed