"ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)
Grodzovsky, Andrey
Andrey.Grodzovsky at amd.com
Wed Feb 20 15:39:16 UTC 2019
On 2/20/19 12:28 AM, Mikhail Gavrilov wrote:
> On Tue, 19 Feb 2019 at 20:24, Grodzovsky, Andrey
> <Andrey.Grodzovsky at amd.com> wrote:
>> Just pull in the latest drm-next from here -
>> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
>>
>> Andrey
> I tested this kernel and the result is not good for me.
> 1) "amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0070113C"
> happens again. I thought this would be fixed.
No, we only fixed the original deadlock with the display driver during
GPU reset. I still haven't had time to go over your captures for the
GPU page fault.
The deadlock we see here is a different one from the one already fixed.
I suggest you open a Bugzilla ticket for this and add me to it so we
can track it and take care of it.
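To make the inversion in the lockdep report below easier to see: the
interrupt path holds ring->fence_drv.lock while dma_fence_signal() calls
back into drm_sched_process_job(), which takes sched->job_list_lock;
the reset path in drm_sched_stop() holds sched->job_list_lock and then
takes the fence lock in dma_fence_remove_callback(). Here is a minimal
userspace sketch of the same AB-BA pattern (pthread mutexes stand in
for the kernel spinlocks; the names are illustrative, not the actual
driver code):

/* ab_ba_sketch.c - illustrative only; build with: cc ab_ba_sketch.c -lpthread
 * Run it a few times: with the sleeps in place it will usually deadlock,
 * which is exactly the scenario lockdep warns about. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fence_drv_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t job_list_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Mirrors the interrupt path: amdgpu_fence_process() holds the fence
 * lock and dma_fence_signal() calls back into drm_sched_process_job(),
 * which takes the job list lock. */
static void *irq_path(void *unused)
{
	pthread_mutex_lock(&fence_drv_lock);
	usleep(1000);			/* widen the race window */
	pthread_mutex_lock(&job_list_lock);
	pthread_mutex_unlock(&job_list_lock);
	pthread_mutex_unlock(&fence_drv_lock);
	return NULL;
}

/* Mirrors the reset path: drm_sched_stop() holds the job list lock and
 * dma_fence_remove_callback() takes the fence lock - opposite order. */
static void *reset_path(void *unused)
{
	pthread_mutex_lock(&job_list_lock);
	usleep(1000);
	pthread_mutex_lock(&fence_drv_lock);
	pthread_mutex_unlock(&fence_drv_lock);
	pthread_mutex_unlock(&job_list_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	pthread_create(&a, NULL, irq_path, NULL);
	pthread_create(&b, NULL, reset_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("no deadlock this run");
	return 0;
}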
Andrey
>
> 2) After it "WARNING: possible circular locking dependency detected" happens.
>
> [ 302.266337] ======================================================
> [ 302.266338] WARNING: possible circular locking dependency detected
> [ 302.266340] 5.0.0-rc1-drm-next-kernel+ #1 Tainted: G C
> [ 302.266341] ------------------------------------------------------
> [ 302.266343] kworker/5:2/871 is trying to acquire lock:
> [ 302.266345] 000000000abbb16a (&(&ring->fence_drv.lock)->rlock){-.-.}, at: dma_fence_remove_callback+0x1a/0x60
> [ 302.266352] but task is already holding lock:
> [ 302.266353] 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140 [gpu_sched]
> [ 302.266358] which lock already depends on the new lock.
>
> [ 302.266360] the existing dependency chain (in reverse order) is:
> [ 302.266361] -> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
> [ 302.266366] drm_sched_process_job+0x4d/0x180 [gpu_sched]
> [ 302.266368] dma_fence_signal+0x111/0x1a0
> [ 302.266414] amdgpu_fence_process+0xa3/0x100 [amdgpu]
> [ 302.266470] sdma_v4_0_process_trap_irq+0x6e/0xa0 [amdgpu]
> [ 302.266523] amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
> [ 302.266576] amdgpu_ih_process+0x84/0xf0 [amdgpu]
> [ 302.266628] amdgpu_irq_handler+0x1b/0x50 [amdgpu]
> [ 302.266632] __handle_irq_event_percpu+0x3f/0x290
> [ 302.266635] handle_irq_event_percpu+0x31/0x80
> [ 302.266637] handle_irq_event+0x34/0x51
> [ 302.266639] handle_edge_irq+0x7c/0x1a0
> [ 302.266643] handle_irq+0xbf/0x100
> [ 302.266646] do_IRQ+0x61/0x120
> [ 302.266648] ret_from_intr+0x0/0x22
> [ 302.266651] cpuidle_enter_state+0xbf/0x470
> [ 302.266654] do_idle+0x1ec/0x280
> [ 302.266657] cpu_startup_entry+0x19/0x20
> [ 302.266660] start_secondary+0x1b3/0x200
> [ 302.266663] secondary_startup_64+0xa4/0xb0
> [ 302.266664] -> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
> [ 302.266668] _raw_spin_lock_irqsave+0x49/0x83
> [ 302.266670] dma_fence_remove_callback+0x1a/0x60
> [ 302.266673] drm_sched_stop+0x59/0x140 [gpu_sched]
> [ 302.266717] amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
> [ 302.266761] amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
> [ 302.266822] amdgpu_job_timedout+0x109/0x130 [amdgpu]
> [ 302.266827] drm_sched_job_timedout+0x40/0x70 [gpu_sched]
> [ 302.266831] process_one_work+0x272/0x5d0
> [ 302.266834] worker_thread+0x50/0x3b0
> [ 302.266836] kthread+0x108/0x140
> [ 302.266839] ret_from_fork+0x27/0x50
> [ 302.266840] other info that might help us debug this:
>
> [ 302.266841] Possible unsafe locking scenario:
>
> [ 302.266842]        CPU0                    CPU1
> [ 302.266843]        ----                    ----
> [ 302.266844]   lock(&(&sched->job_list_lock)->rlock);
> [ 302.266846]                                lock(&(&ring->fence_drv.lock)->rlock);
> [ 302.266847]                                lock(&(&sched->job_list_lock)->rlock);
> [ 302.266849]   lock(&(&ring->fence_drv.lock)->rlock);
> [ 302.266850] *** DEADLOCK ***
>
> [ 302.266852] 5 locks held by kworker/5:2/871:
> [ 302.266853] #0: 00000000d133fb6e ((wq_completion)"events"){+.+.}, at: process_one_work+0x1e9/0x5d0
> [ 302.266857] #1: 000000008a5c3f7e ((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at: process_one_work+0x1e9/0x5d0
> [ 302.266862] #2: 00000000b9b2c76f (&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x40 [amdgpu]
> [ 302.266908] #3: 00000000ac637728 (&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
> [ 302.266965] #4: 000000006e32ba38 (&(&sched->job_list_lock)->rlock){-.-.}, at: drm_sched_stop+0x34/0x140 [gpu_sched]
> [ 302.266971] stack backtrace:
> [ 302.266975] CPU: 5 PID: 871 Comm: kworker/5:2 Tainted: G C 5.0.0-rc1-drm-next-kernel+ #1
> [ 302.266976] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
> [ 302.266980] Workqueue: events drm_sched_job_timedout [gpu_sched]
> [ 302.266982] Call Trace:
> [ 302.266987] dump_stack+0x85/0xc0
> [ 302.266991] print_circular_bug.isra.0.cold+0x15c/0x195
> [ 302.266994] __lock_acquire+0x134c/0x1660
> [ 302.266998] ? add_lock_to_list.isra.0+0x67/0xb0
> [ 302.267003] lock_acquire+0xa2/0x1b0
> [ 302.267006] ? dma_fence_remove_callback+0x1a/0x60
> [ 302.267011] _raw_spin_lock_irqsave+0x49/0x83
> [ 302.267013] ? dma_fence_remove_callback+0x1a/0x60
> [ 302.267016] dma_fence_remove_callback+0x1a/0x60
> [ 302.267020] drm_sched_stop+0x59/0x140 [gpu_sched]
> [ 302.267065] amdgpu_device_pre_asic_reset+0x4f/0x240 [amdgpu]
> [ 302.267110] amdgpu_device_gpu_recover+0x88/0x7d0 [amdgpu]
> [ 302.267173] amdgpu_job_timedout+0x109/0x130 [amdgpu]
> [ 302.267178] drm_sched_job_timedout+0x40/0x70 [gpu_sched]
> [ 302.267183] process_one_work+0x272/0x5d0
> [ 302.267188] worker_thread+0x50/0x3b0
> [ 302.267191] kthread+0x108/0x140
> [ 302.267194] ? process_one_work+0x5d0/0x5d0
> [ 302.267196] ? kthread_park+0x90/0x90
> [ 302.267199] ret_from_fork+0x27/0x50
>
> 3) The kernel log is flooded with the message "[drm:amdgpu_cs_ioctl
> [amdgpu]] *ERROR* Failed to initialize parser -125!" until I kill the
> gdm-wayland-session process.
> `# systemctl restart gdm` could not bring back a working Wayland session.
>
> 4) Finally, after `# init 6` the system actually hung.
>
> --
> Best Regards,
> Mike Gavrilov.
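On item 3: error 125 is ECANCELED on Linux, which, if I read the
amdgpu_cs code right, is what command submission returns once a context
has been marked guilty after a reset, so the flood itself is expected
until the compositor gives up. A quick way to confirm the errno name
(a standalone check, not driver code):

/* errno_check.c - prints the symbolic meaning of error 125 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* -125 from amdgpu_cs_ioctl is -ECANCELED ("Operation canceled"). */
	printf("errno 125: %s\n", strerror(125));
	return 0;
}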