Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

Wed Jan 31 12:47:55 UTC 2018

Hi Alexander,

I've cherry-picked the patch you pointed out into the kernel from
amd-drm-next-4.17-wip at commit
9ab2894122275a6d636bb2654a157e88a0f7b9e2 (drm/amdgpu: set
DRIVER_ATOMIC flag early) and tested it on ARMv7l, and the problem is
indeed gone.

It is working great on ARMv7l with the AMD RX 460.
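
For anyone who wants to check their own logs for the ring error quoted
further down in this thread, here is a minimal sketch that counts the
"writing more dwords to the ring than expected!" messages per reporting
function. It is only an illustration: the helper names (unwrap_records,
count_ring_overflows) are hypothetical and not part of the kernel or any
amdgpu tooling, and it assumes the input contains only log lines,
possibly wrapped and prefixed with ">" by a mail client.

```python
import re
from collections import Counter

# Start of a dmesg record, e.g. "[  591.729558] [drm:... [amdgpu]] *ERROR* ..."
RECORD_START = re.compile(r"^\[\s*\d+\.\d+\]\s")
# Captures the function name that reported the amdgpu error.
DRM_ERROR = re.compile(r"\[drm:(\w+) \[amdgpu\]\] \*ERROR\*")
RING_OVERFLOW = "writing more dwords to the ring than expected!"


def unwrap_records(lines):
    """Strip mail quote markers and rejoin records that were line-wrapped.

    Assumption: every line that does not begin with a "[seconds.micros]"
    timestamp is a continuation of the previous record.
    """
    records = []
    for raw in lines:
        # Remove leading ">" quote markers, then surrounding whitespace.
        line = re.sub(r"^(?:>\s?)+", "", raw.rstrip("\n")).strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def count_ring_overflows(lines):
    """Return a Counter mapping reporting function -> number of ring errors."""
    counts = Counter()
    for record in unwrap_records(lines):
        match = DRM_ERROR.search(record)
        if match and RING_OVERFLOW in record:
            counts[match.group(1)] += 1
    return counts
```

Passing any iterable of lines works, for example an open file object
holding saved dmesg output; an empty Counter means the message did not
appear in that input.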

Thanks,
Luís Mendes

On Tue, Jan 30, 2018 at 6:44 PM, Deucher, Alexander
<Alexander.Deucher at amd.com> wrote:
> Fixed with this patch:
>
> https://lists.freedesktop.org/archives/amd-gfx/2018-January/018472.html
>
>
> Alex
>
> ________________________________
> From: Luís Mendes <luis.p.mendes at gmail.com>
> Sent: Tuesday, January 30, 2018 1:30 PM
> To: Michel Dänzer; Koenig, Christian
> Cc: Deucher, Alexander; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
> Subject: Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 -
> Update 2
>
> Hi everyone,
>
> I've tested the kernel from amd-drm-next-4.17-wip at commit
> 9ab2894122275a6d636bb2654a157e88a0f7b9e2
> (drm/amdgpu: set DRIVER_ATOMIC flag early) on ARMv7l, and the reported
> issues now seem to be gone. I haven't checked which commit fixed
> them, but they are fixed now! I also noticed a performance
> improvement in one of the glmark2 tests.
>
> There seem to be some other small issues, possibly unrelated:
> occasionally the screen goes black and the sound stops for a second or
> less during video playback, and then normal playback resumes. This
> happens rarely, at most once per power cycle, while using X and Kodi,
> even though I have played many individual videos and power cycled the
> machine a few times.
>
> I've also observed what was already reported, when watching non-VP9 videos:
> [  591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.740255] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.750968] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.761628] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.782672] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.793172] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.803681] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.814129] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.824560] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.835054] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.845437] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.855860] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.866415] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.876945] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
> [  591.887454] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
> writing more dwords to the ring than expected!
>
> Regards,
> Luís Mendes
>
> On Wed, Jan 3, 2018 at 11:08 PM, Luís Mendes <luis.p.mendes at gmail.com>
> wrote:
>> Hi Michel, Christian,
>>
>> Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
>> only init the apertures used by KGD (v2)" -
>> 0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
>> on ARMv7 and on x86 amd64.
>>
>> Christian, in fact, if I replay the apitraces obtained on the ARMv7
>> platform on the AMD64 machine, I can also reproduce the GPU hang! So it
>> is not specific to the ARM platform. Should I send/upload the apitraces?
>> I have two of them; typically, when one doesn't hang the GPU, the other
>> does. One takes about 1GB of disk space while the other takes 2.3GB.
>> ...
>> [   69.019381] ISO 9660 Extensions: RRIP_1991A
>> [  213.292094] DMAR: DRHD: handling fault status reg 2
>> [  213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index
>> 1c [fault reason 38] Blocked an interrupt request due to source-id
>> verification failure
>> [  223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
>> timeout, last signaled seq=25158, last emitted seq=25160
>> [  223.406926] [drm] IP block:tonga_ih is hung!
>> [  223.407167] [drm] GPU recovery disabled.
>>
>> Regards,
>> Luís
>>
>>
>> On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes <luis.p.mendes at gmail.com>
>> wrote:
>>> Hi Michel, Christian,
>>>
>>> Christian, I have followed your suggestion and I have just submitted a
>>> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
>>> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
>>> platforms while playing video.
>>>
>>> Michel, amdgpu.dc=0 seems to make no difference. I will try
>>> amd-staging-drm-next and report back.
>>>
>>> Regards,
>>> Luís
>>>
>>> On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer <michel at daenzer.net> wrote:
>>>> On 2018-01-03 12:02 PM, Luís Mendes wrote:
>>>>>
>>>>> What I believe to be the case is that the GPU lockup only
>>>>> happens when doing a page flip, since the kernel locks up with:
>>>>> [  243.693200] kworker/u4:3    D    0    89      2 0x00000000
>>>>> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
>>>>> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>]
>>>>> (schedule+0x4c/0xac)
>>>>> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
>>>>> (schedule_timeout+0x228/0x444)
>>>>> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
>>>>> (dma_fence_default_wait+0x2b4/0x2d8)
>>>>> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
>>>>> (dma_fence_wait_timeout+0x40/0x150)
>>>>> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
>>>>> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
>>>>> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
>>>>> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
>>>>> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
>>>>> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
>>>>> ...
>>>>
>>>> Does the problem also occur if you disable DC with amdgpu.dc=0 on the
>>>> kernel command line?
>>>>
>>>> Does it also happen with a kernel built from the amd-staging-drm-next
>>>> branch instead of drm-next-4.16?
>>>>
>>>>
>>>> --
>>>> Earthling Michel Dänzer               |               http://www.amd.com
>>>> Libre software enthusiast             |             Mesa and X developer