Regression on gfx8 with ring init

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Thu Sep 20 20:35:41 UTC 2018


What's the status of this error and the suggested patch to fix it? It 
impacts GPU reset on Polaris11.

Do we want to investigate why the original patch breaks it, or just 
disable the KIQ IB test with the proposed patch?


P.S. Suspend/resume also stopped working on the latest branch; I will 
bisect it later today or tomorrow.


Andrey


On 09/18/2018 11:00 AM, Christian König wrote:
> Tom,
>
> can you check whether the following gets it working again?
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> index b6160de70d12..d65f5ba92fc5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> @@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
>         return r;
>  }
>
> +static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
> +{
> +       return 0;
> +}
>
>  static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
>  {
> @@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
>         .emit_ib = gfx_v8_0_ring_emit_ib_compute,
>         .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
>         .test_ring = gfx_v8_0_ring_test_ring,
> -       .test_ib = gfx_v8_0_ring_test_ib,
> +       .test_ib = gfx_v8_0_kiq_ring_test_ib,
>         .insert_nop = amdgpu_ring_insert_nop,
>         .pad_ib = amdgpu_ring_generic_pad_ib,
>         .emit_rreg = gfx_v8_0_ring_emit_rreg,
>
>
> Thanks,
> Christian.
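A minimal sketch of what the patch above changes, assuming a heavily simplified ring model: each ring carries a table of function pointers and the IB test is dispatched through `test_ib`. The types `sketch_ring` and `sketch_ring_funcs`, the `fence_irq_delivered` flag, and both test functions are hypothetical stand-ins, not the real amdgpu definitions. Pointing the KIQ table at a stub that returns 0 makes that ring's IB test pass unconditionally, while other rings still run the real test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for amdgpu_ring / amdgpu_ring_funcs. */
struct sketch_ring;

struct sketch_ring_funcs {
	int (*test_ib)(struct sketch_ring *ring, long timeout);
};

struct sketch_ring {
	const char *name;
	const struct sketch_ring_funcs *funcs;
	bool fence_irq_delivered; /* models whether the EOP interrupt arrives */
};

/* Models the real IB test: it passes only if the fence gets signaled by an
 * interrupt. The timeout itself is not simulated in this model. */
static int sketch_ring_test_ib(struct sketch_ring *ring, long timeout)
{
	assert(ring != NULL);
	(void)timeout;
	return ring->fence_irq_delivered ? 0 : -ETIMEDOUT;
}

/* Models the proposed KIQ stub: skip the test and report success. */
static int sketch_kiq_ring_test_ib(struct sketch_ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

static const struct sketch_ring_funcs sketch_compute_funcs = {
	.test_ib = sketch_ring_test_ib,
};

static const struct sketch_ring_funcs sketch_kiq_funcs = {
	.test_ib = sketch_kiq_ring_test_ib,
};
```

With `fence_irq_delivered` false, a ring using `sketch_compute_funcs` reports `-ETIMEDOUT` while a ring using `sketch_kiq_funcs` reports 0. In this model the stub hides the symptom for the KIQ ring without explaining why its fence interrupt is lost.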
>
> Am 18.09.2018 um 16:41 schrieb Christian König:
>> CRTC and GFX interrupts seem to be working perfectly fine.
>>
>> The problem here seems to be that only EOP interrupts from the compute
>> queue are not handled correctly.
>>
>> Most likely there is a bug somewhere in gfx_v8_0_eop_irq().
>>
>> Christian.
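To illustrate the kind of bug suspected here, the following is a hypothetical sketch of an EOP interrupt dispatcher: it decodes the micro engine (ME), pipe, and queue from the interrupt's ring ID and signals only the ring whose identifiers match. The bit layout (pipe in bits 0-1, ME in bits 2-3, queue in bits 4-6) is an assumption for illustration and should be checked against gfx_v8_0_eop_irq() itself; all type and function names are invented. The point is that an interrupt whose (ME, pipe, queue) matches no ring in the searched list is silently dropped, so its fence never signals and the IB test eventually times out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compute ring identity; not the real struct amdgpu_ring. */
struct sketch_compute_ring {
	unsigned me, pipe, queue;
	unsigned fences_processed; /* incremented when an EOP is dispatched to it */
};

/* ASSUMED ring_id bit layout, for illustration only. */
static unsigned sketch_me(uint8_t ring_id)    { return (ring_id & 0x0c) >> 2; }
static unsigned sketch_pipe(uint8_t ring_id)  { return ring_id & 0x03; }
static unsigned sketch_queue(uint8_t ring_id) { return (ring_id & 0x70) >> 4; }

/* Dispatch one EOP interrupt to every matching ring in 'rings'.
 * Returns how many rings were signaled; 0 means the event was dropped. */
static int sketch_eop_dispatch(struct sketch_compute_ring *rings, size_t n,
			       uint8_t ring_id)
{
	int hits = 0;

	assert(rings != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (rings[i].me == sketch_me(ring_id) &&
		    rings[i].pipe == sketch_pipe(ring_id) &&
		    rings[i].queue == sketch_queue(ring_id)) {
			rings[i].fences_processed++;
			hits++;
		}
	}
	return hits;
}
```

In this model, an event for a ring that is kept outside the searched array (for example, a ring tracked separately from the compute rings) returns 0: nothing is signaled and nothing is logged. Whether that is what actually happens on this hardware is not established in the thread.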
>>
>> Am 18.09.2018 um 16:36 schrieb Deucher, Alexander:
>>>
>>> FWIW, a number of consumer Raven boards have bad IVRS tables
>>> (Windows doesn't use interrupt remapping, so the tables are sometimes
>>> wrong and probably not validated). There are a number of workarounds to
>>> manually override the IVRS tables to make interrupts work. I think
>>> specifying pci=noacpi is also a possible workaround.
>>>
>>>
>>> Alex
>>>
>>> ------------------------------------------------------------------------
>>> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of 
>>> Christian König <christian.koenig at amd.com>
>>> *Sent:* Tuesday, September 18, 2018 10:31:16 AM
>>> *To:* StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
>>> *Subject:* Re: Regression on gfx8 with ring init
>>> Well, it looks like interrupt processing is working perfectly fine.
>>>
>>> But looking at the error message once more I see that this actually
>>> affects ring number 9 and not the GFX ring.
>>>
>>> Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
>>> number?
>>>
>>> That must be one of the compute rings.
>>>
>>> Thanks,
>>> Christian.
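The request to print ring->name instead of the ring number can be sketched as follows. The struct `sketch_named_ring` and the function `sketch_format_ib_failure()` are hypothetical; the real amdgpu_ib_ring_tests() reports through DRM_ERROR (as the log shows), so this only models the message content. The sketch falls back to the index when a ring has no name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical minimal ring descriptor: index plus human-readable name. */
struct sketch_named_ring {
	unsigned idx;
	const char *name;
};

/* Format the IB-test failure message, preferring the ring's name over its
 * index. Returns the snprintf() result (length of the untruncated message). */
static int sketch_format_ib_failure(char *buf, size_t len,
				    const struct sketch_named_ring *ring, int err)
{
	assert(buf != NULL && ring != NULL);
	if (ring->name == NULL || strlen(ring->name) == 0)
		return snprintf(buf, len,
				"amdgpu: failed testing IB on ring %u (%d).",
				ring->idx, err);
	return snprintf(buf, len, "amdgpu: failed testing IB on ring %s (%d).",
			ring->name, err);
}
```

With names in the message, a failure on "ring 9" would identify the specific compute or KIQ ring directly instead of requiring a lookup of the ring index.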
>>>
>>> Am 18.09.2018 um 16:20 schrieb Tom St Denis:
>>> > On 2018-09-18 10:13 a.m., Christian König wrote:
>>> >> Mhm, there is no failed IB test in there anymore, is there?
>>> >
>>> > Oh, sorry, I thought you wanted to test HEAD~. Attached is a log
>>> > from the tip of drm-next.
>>> >
>>> > Tom
>>> >
>>> >>
>>> >> Christian.
>>> >>
>>> >> Am 18.09.2018 um 16:09 schrieb Tom St Denis:
>>> >>> Disabling the IOMMU in the BIOS resulted in a correct boot...
>>> >>>
>>> >>> Here's the log.
>>> >>>
>>> >>> Tom
>>> >>>
>>> >>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>> >>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>> >>>> rebuilding the kernel. It got hung up in the IOMMU driver (loads
>>> >>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>>> >>>> panicked before loading the network stack.
>>> >>>>
>>> >>>> Bizarre.
>>> >>>>
>>> >>>> I'll keep trying.
>>> >>>>
>>> >>>> Tom
>>> >>>>
>>> >>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>> >>>>> Am 18.09.2018 um 15:32 schrieb Tom St Denis:
>>> >>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>> >>>>>>> Great, not sure if that is good or bad news.
>>> >>>>>>>
>>> >>>>>>> Anyway, I'm going to revert the change for now. Does anybody
>>> >>>>>>> volunteer to figure out why interrupts sometimes don't work
>>> >>>>>>> correctly on Raven?
>>> >>>>>>
>>> >>>>>> What does "doesn't work correctly" mean? My workstation is a Raven1
>>> >>>>>> (Ryzen 2400G) and, other than the TTM bulk move issue, has been
>>> >>>>>> perfectly stable (through suspend/resumes too, I might add).
>>> >>>>>>
>>> >>>>>> Is there anything I could test with my devel Raven?
>>> >>>>>
>>> >>>>> The problem seems to be that on some boards IH handling doesn't
>>> >>>>> work as it should.
>>> >>>>>
>>> >>>>> Can you try disabling the onboard graphics and testing again?
>>> >>>>>
>>> >>>>> If that still doesn't work there is a DRM_DEBUG in
>>> >>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>> >>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Christian.
>>> >>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Tom
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>> Christian.
>>> >>>>>>>
>>> >>>>>>> Am 18.09.2018 um 15:27 schrieb Tom St Denis:
>>> >>>>>>>> This commit:
>>> >>>>>>>>
>>> >>>>>>>> [root at raven linux]# git bisect good
>>> >>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>> >>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>> >>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>> >>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>> >>>>>>>>
>>> >>>>>>>>     drm/amdgpu: remove fence fallback
>>> >>>>>>>>
>>> >>>>>>>>     DC doesn't seem to have a fallback path either.
>>> >>>>>>>>
>>> >>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>>> >>>>>>>>     matter what.
>>> >>>>>>>>
>>> >>>>>>>>     Signed-off-by: Christian König <christian.koenig at amd.com>
>>> >>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou at amd.com>
>>> >>>>>>>>
>>> >>>>>>>> Results in this:
>>> >>>>>>>>
>>> >>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>> >>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>> >>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>> >>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>> >>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>> >>>>>>>> [   28.506708] fuse init (API version 7.27)
>>> >>>>>>>> [   28.506708] fuse init (API version 7.27)
>>> >>>>>>>>
>>> >>>>>>>> On init with my Polaris/Raven1 system.
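The -110 in the log above is a negative errno value, following the kernel convention of returning -errno on failure; on Linux, 110 is ETIMEDOUT, consistent with the "IB test timed out." line just before it. A minimal check, assuming a Linux host (errno numbering differs on other systems); `sketch_is_timeout()` is a hypothetical helper:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel functions report failure as a negative errno value, so the
 * "(-110)" in the log decodes to -ETIMEDOUT on Linux. */
static bool sketch_is_timeout(int kernel_ret)
{
	return kernel_ret < 0 && -kernel_ret == ETIMEDOUT;
}
```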
>>> >>>>>>>>
>>> >>>>>>>> Cheers,
>>> >>>>>>>> Tom
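The bisected commit removed the fence fallback. As a rough, hypothetical model of why that matters: with a fallback, a timer periodically re-reads the sequence number the GPU writes back, so a fence still signals even if its interrupt is never delivered; without it, signaling depends entirely on the interrupt, and a lost EOP interrupt turns into a timeout like the one in the log. All names below are illustrative, and the fallback is polled every tick purely for simplicity; this is not the amdgpu fence code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of fence state on one ring. */
struct sketch_fence_ring {
	unsigned hw_seq;       /* sequence number the GPU has written back */
	unsigned signaled_seq; /* last sequence number the driver has processed */
	bool irq_works;        /* whether EOP interrupts reach the driver */
};

/* Driver-side processing: adopt whatever sequence number the GPU wrote. */
static void sketch_fence_process(struct sketch_fence_ring *r)
{
	r->signaled_seq = r->hw_seq;
}

/* Wait up to 'ticks' ticks for 'seq'; the GPU finishes at 'done_tick'.
 * Processing runs from the interrupt (if it works) and, when
 * 'use_fallback' is set, from a fallback poll on every tick. */
static int sketch_fence_wait(struct sketch_fence_ring *r, unsigned seq,
			     unsigned done_tick, unsigned ticks, bool use_fallback)
{
	assert(r != NULL);
	for (unsigned t = 0; t < ticks; t++) {
		if (t == done_tick) {
			r->hw_seq = seq;                 /* GPU writes the fence value */
			if (r->irq_works)
				sketch_fence_process(r); /* EOP interrupt handler */
		}
		if (use_fallback)
			sketch_fence_process(r);         /* fallback timer poll */
		if (r->signaled_seq >= seq)
			return 0;
	}
	return -ETIMEDOUT;
}
```

In this model, a ring whose interrupts are lost returns `-ETIMEDOUT` without the fallback but 0 with it, which is why a broken interrupt path can stay hidden until the fallback is removed.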
>>> >>>>>>>> _______________________________________________
>>> >>>>>>>> amd-gfx mailing list
>>> >>>>>>>> amd-gfx at lists.freedesktop.org
>>> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
>
>
>
