[PATCH] drm/amdgpu: fix a kcq hang issue for SRIOV

Christian König ckoenig.leichtzumerken at gmail.com
Tue Mar 27 17:18:14 UTC 2018


On 27.03.2018 at 18:56, Alex Deucher wrote:
> On Tue, Mar 27, 2018 at 12:30 PM, Christian König
> <christian.koenig at amd.com> wrote:
>> On 27.03.2018 at 17:52, Alex Deucher wrote:
>>> [SNIP]
>>>>> 2. add the new callback implementation to gfx9 and gfx8 (I think gfx8
>>>>> will need this as well since we support sr-iov there too)
>>>>
>>>> gfx8 doesn't have the hardware bug which seems to make this necessary,
>>>> nor does it have the same VMHUB design as gfx9.
>>> Oh, right, in this case it's the req/ack engines which were new for
>>> soc15.  We may want the same fix for sdma4 though.
>>
>> And exactly that is one of the reasons why this workaround doesn't work
>> correctly.
>>
>> The SDMA is not directly connected to the GFXHUB, so even if the SDMA would
>> provide a single command for this the write/wait would still be executed as
>> two operations.
> I'm not sure I follow.  I think there are two issues: the hw bug you
> are referring to and the SR-IOV requirement that the req and the ack
> can't be split by a world switch.  I believe the world switch happens
> at packet granularity or coarser, so I think for the SR-IOV requirement
> using a single packet should handle it.

The problem is that, to me, it looks like there is no SR-IOV requirement
against splitting the req and ack. The hardware is duplicated per VF, and I
suggested to Emily that she test my multiple write workaround.

Since I haven't heard back, I strongly assume that this worked as well, and
that can only mean that we are indeed running into the same hw issue again.

That in turn means that not only the GFXHUB is affected, but ANY (GRBM?)
register write could be silently dropped. I can't imagine how we want to
build a stable driver around this.

Unfortunately, I can't reliably reproduce the issue on bare metal any
more. It would probably be best if we could set up a call with some of the
hardware guys to come up with a plan to narrow down this issue further.

Regards,
Christian.

>
>> In other words we can again run into the problem and the same thing applies
>> for CPU based updates.
> yeah, CPU based updates could indeed be an issue for the SR-IOV
> requirement, but in that case it's easier to read back and retry.
>
> Alex
>
>
>> The only real workaround would be to write the request, read the register
>> back and, if the write didn't succeed, write it again.
>>
>> But seriously remember that this issue is not limited to the VMHUB
>> registers. Do you want to write and read back every register to make sure
>> the write succeeded?
>>
>> Regards,
>> Christian.
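
For reference, the write/read-back/retry idea quoted above could look
roughly like the sketch below. This is only an illustration under stated
assumptions: `struct fake_reg`, `fake_reg_write()`, `fake_reg_read()` and
`write_verified()` are hypothetical names and not amdgpu code, and the
"register" is a plain struct that simulates dropped writes. It also assumes
the register reads back exactly what was written, which is not a given for
request registers with side effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a register whose writes can be silently lost. */
struct fake_reg {
	uint32_t value;        /* what a read-back returns */
	unsigned drops_left;   /* number of upcoming writes that get dropped */
	unsigned writes_seen;  /* total write attempts, for inspection */
};

/* Simulated write: dropped while drops_left > 0, applied otherwise. */
static void fake_reg_write(struct fake_reg *r, uint32_t v)
{
	r->writes_seen++;
	if (r->drops_left > 0) {
		r->drops_left--;   /* write lost, value stays unchanged */
		return;
	}
	r->value = v;
}

static uint32_t fake_reg_read(const struct fake_reg *r)
{
	return r->value;
}

/*
 * Write v, read the register back and retry until the read-back matches
 * or max_tries attempts were made. Returns true on a verified write.
 */
static bool write_verified(struct fake_reg *r, uint32_t v, unsigned max_tries)
{
	assert(max_tries > 0);
	for (unsigned i = 0; i < max_tries; i++) {
		fake_reg_write(r, v);
		if (fake_reg_read(r) == v)
			return true;
	}
	return false;
}
```

Note that the check cannot detect a dropped write when the register already
held the requested value, and every verified write costs at least one extra
read, which is exactly why doing this for every register is unattractive.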