[PATCH] drm/amdgpu: fix a kcq hang issue for SRIOV

Liu, Monk Monk.Liu at amd.com
Wed Mar 28 04:22:05 UTC 2018


Thank you all

Let's keep pushing the hardware team ... hopefully the MC or RLCV side can discover something interesting

/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] 
Sent: March 28, 2018 1:18
To: Alex Deucher <alexdeucher at gmail.com>; Koenig, Christian <Christian.Koenig at amd.com>
Cc: Deng, Emily <Emily.Deng at amd.com>; amd-gfx list <amd-gfx at lists.freedesktop.org>; Liu, Monk <Monk.Liu at amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix a kcq hang issue for SRIOV

On 27.03.2018 at 18:56, Alex Deucher wrote:
> On Tue, Mar 27, 2018 at 12:30 PM, Christian König 
> <christian.koenig at amd.com> wrote:
>> On 27.03.2018 at 17:52, Alex Deucher wrote:
>>> [SNIP]
>>>>> 2. add the new callback implementation to gfx9 and gfx8 (I think 
>>>>> gfx8 will need this as well since we support sr-iov there too)
>>>>
>>>> gfx8 doesn't have the hardware bug which seems to make this 
>>>> necessary, nor does it have the same VMHUB design as gfx9.
>>> Oh, right, in this case it's the req/ack engines which were new for 
>>> soc15.  We may want the same fix for sdma4 though.
>>
>> And exactly that is one of the reasons why this workaround doesn't 
>> work correctly.
>>
>> The SDMA is not directly connected to the GFXHUB, so even if the SDMA 
>> would provide a single command for this the write/wait would still be 
>> executed as two operations.
> I'm not sure I follow.  I think there are two issues: the hw bug you 
> are referring to and the SR-IOV requirement that the req and the ack 
> can't be split by a world switch.  I believe the world switch happens 
> at at least packet granularity so I think for the SR-IOV requirement 
> using a single packet should handle it.

The problem is that, as far as I can see, there is no SR-IOV requirement to not split the req and ack. The hardware is duplicated per VF, and I suggested that Emily test my multiple-write workaround.

Since I haven't heard back, I strongly assume that this worked as well, and that can only mean that we are indeed running into the same hw issue again.

That in turn means that not only the GFXHUB is affected, but ANY (GRBM?) register write could be silently dropped. I can't imagine how we want to build a stable driver around this.

Unfortunately, I can't reliably reproduce the issue on bare metal any more. It would probably be best if we could set up a call with some of the hardware guys to come up with a plan to narrow down this issue further.

Regards,
Christian.

>
>> In other words we can again run into the problem and the same thing 
>> applies for CPU based updates.
> yeah, CPU based updates could indeed be an issue for the SR-IOV 
> requirement, but in that case it's easier to read back and retry.
>
> Alex
>
>
>> The only real workaround would be to write the request, read the 
>> register back and if the write didn't succeed write it again.
>>
>> But seriously remember that this issue is not limited to the VMHUB 
>> registers. Do you want to write and read back every register to make 
>> sure the write succeeded?
>>
>> Regards,
>> Christian.
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
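
The write, read back, and retry workaround described in the quoted text above could look roughly like the sketch below. This is only an illustration under stated assumptions, not amdgpu code: the register is simulated by a plain variable, and fake_reg_write(), fake_reg_read(), write_reg_verified() and the writes_to_drop counter are hypothetical names invented here. It also assumes the register reads back exactly the value that was written, which does not hold for write-only or self-clearing registers.

```c
#include <stdint.h>

/* Simulated register; stands in for a real MMIO register (hypothetical). */
static uint32_t fake_reg;

/* Number of upcoming writes the simulated hardware silently drops. */
static unsigned int writes_to_drop;

/* Simulated register write that may be lost without any error. */
static void fake_reg_write(uint32_t val)
{
	if (writes_to_drop > 0) {
		writes_to_drop--;	/* write is lost, register unchanged */
		return;
	}
	fake_reg = val;
}

/* Simulated register read. */
static uint32_t fake_reg_read(void)
{
	return fake_reg;
}

/*
 * Write val, read the register back, and write again if the value did
 * not stick. Gives up after max_tries writes.
 *
 * Returns the number of writes issued on success, or -1 on failure.
 * Note: if the register already held val, a dropped write cannot be
 * detected this way.
 */
static int write_reg_verified(uint32_t val, int max_tries)
{
	for (int i = 1; i <= max_tries; i++) {
		fake_reg_write(val);
		if (fake_reg_read() == val)
			return i;
	}
	return -1;
}
```

For example, with writes_to_drop set to 2 and the register initially 0, write_reg_verified(0x1, 5) issues three writes before the read back matches. The extra read on every write is the cost Christian is pointing at when he asks whether every register should be written and read back.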


