Regression on gfx8 with ring init

Tue Sep 18 14:40:46 UTC 2018

On 2018-09-18 10:31 a.m., Christian König wrote:
> Well, it looks like interrupt processing is working perfectly fine.
> 
> But looking at the error message once more I see that this actually 
> affects ring number 9 and not the GFX ring.
> 
> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the 
> number?
> 
> That must be one of the compute rings.

That's a bingo.

[   32.231734] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:01:00.0 on minor 0
[   32.233803] modprobe (3816) used greatest stack depth: 12464 bytes left
[   35.266007] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[   35.266373] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring (kiq_2.1.0) 9 (-110).
[   35.403034] [drm:process_one_work] *ERROR* ib ring test failed (-110).
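
The ring name in the log comes from printing ring->name next to the ring index, as Christian asked. A minimal userspace sketch of that formatting follows. It assumes a mock structure: `struct mock_ring` and `format_ib_test_failure()` are hypothetical names used for illustration only, not the kernel's actual types or code, and the message layout is simply copied from the log line. The -110 in the log is -ETIMEDOUT on Linux.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's ring structure; only the two
 * fields that appear in the log message are modelled here. */
struct mock_ring {
    const char *name;   /* ring name, e.g. "kiq_2.1.0" */
    unsigned int idx;   /* ring index, e.g. 9 */
};

/* Format the failure message with the ring name ahead of the index,
 * matching the layout of the dmesg line. Returns snprintf()'s result,
 * i.e. the number of characters the full message needs. */
static int format_ib_test_failure(char *buf, size_t len,
                                  const struct mock_ring *ring, int err)
{
    assert(buf != NULL && ring != NULL && ring->name != NULL);
    return snprintf(buf, len,
                    "failed testing IB on ring (%s) %u (%d).",
                    ring->name, ring->idx, err);
}
```

Calling `format_ib_test_failure()` with a ring named "kiq_2.1.0" at index 9 and an error of -110 reproduces the text of the message, which is what identifies the failing ring as the KIQ rather than a compute ring.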

I should point out that KFD still has the old fence logic:

[root@raven amd]# git grep enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: *  nofity when the BO is free to move. fence_add_callback --> enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: *  --> amdgpu_amdkfd_fence.enable_signaling
amdgpu/amdgpu_amdkfd_fence.c: * amdgpu_amdkfd_fence.enable_signaling - Start a work item that will quiesce
amdgpu/amdgpu_amdkfd_fence.c: * amdkfd_fence_enable_signaling - This gets called when TTM wants to evict
amdgpu/amdgpu_amdkfd_fence.c:static bool amdkfd_fence_enable_signaling(struct dma_fence *f)
amdgpu/amdgpu_amdkfd_fence.c:   .enable_signaling = amdkfd_fence_enable_signaling,

Tom

> 
> Thanks,
> Christian.
> 
> Am 18.09.2018 um 16:20 schrieb Tom St Denis:
>> On 2018-09-18 10:13 a.m., Christian König wrote:
>>> Mhm, there is no failed IB test in there any more, is there?
>>
>> Oh sorry, I thought you wanted to test HEAD~ ... Attached is a log from
>> the tip of drm-next.
>>
>> Tom
>>
>>>
>>> Christian.
>>>
>>> Am 18.09.2018 um 16:09 schrieb Tom St Denis:
>>>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>>
>>>> Here's the log.
>>>>
>>>> Tom
>>>>
>>>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>>>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>>>>> panicked before loading the network stack.
>>>>>
>>>>> Bizarre.
>>>>>
>>>>> I'll keep trying.
>>>>>
>>>>> Tom
>>>>>
>>>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>>> Am 18.09.2018 um 15:32 schrieb Tom St Denis:
>>>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>>>> Great, not sure if that is good or bad news.
>>>>>>>>
>>>>>>>> Anyway, I'm going to revert the change for now. Does anybody
>>>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>>>> correctly on Raven?
>>>>>>>
>>>>>>> What does "doesn't work correctly" mean?  My workstation is a Raven1
>>>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been 
>>>>>>> perfectly stable (through suspend/resumes too I might add).
>>>>>>>
>>>>>>> Anything I could test with my devel raven?
>>>>>>
>>>>>> The problem seems to be that on some boards IH handling doesn't 
>>>>>> work as it should.
>>>>>>
>>>>>> Can you try to disable the onboard graphics and try again?
>>>>>>
>>>>>> If that still doesn't work there is a DRM_DEBUG in 
>>>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the 
>>>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Tom
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 18.09.2018 um 15:27 schrieb Tom St Denis:
>>>>>>>>> This commit:
>>>>>>>>>
>>>>>>>>> [root@raven linux]# git bisect good
>>>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>>>>>>>
>>>>>>>>>     drm/amdgpu: remove fence fallback
>>>>>>>>>
>>>>>>>>>     DC doesn't seem to have a fallback path either.
>>>>>>>>>
>>>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>>>>>>>>>     matter what.
>>>>>>>>>
>>>>>>>>>     Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou at amd.com>
>>>>>>>>>
>>>>>>>>> Results in this:
>>>>>>>>>
>>>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>>>>>>>> [   28.506708] fuse init (API version 7.27)
>>>>>>>>>
>>>>>>>>> On init with my polaris/raven1 system.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Tom
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx