[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2
Felix Kuehling
felix.kuehling at amd.com
Thu Aug 10 21:56:14 UTC 2023
I think Jon is suggesting that the UNMAP_QUEUES command should clear the
address watch registers. Requesting such a change from the HWS team
may take a long time.
That said, when was this workaround implemented and reviewed? Did I
review it as part of Jon's debugger upstreaming patch series? Or did
this come later? This patch only enables the workaround for v9.4.2.
Regards,
Felix
On 2023-08-10 17:52, Eric Huang wrote:
> The problem is that the queue is suspended before the clear address
> watch call in KFD; there is no queue preemption or queue resume after
> the clearing call, and the test ends. So there is no chance to send
> MAP_PROCESS to HWS. At this point FW has nothing to do. We have
> several test FWs from Tej, none of which work, so I went back to the
> kernel debug log and found the problem.
>
> GFX11 has a different scheduler: when clear address watch is called,
> KFD sends MES_MISC_OP_SET_SHADER_DEBUGGER directly to MES, regardless
> of whether the queue is suspended. So GFX11 doesn't have this issue.
>
> Regards,
> Eric
>
> On 2023-08-10 17:27, Kim, Jonathan wrote:
>> [AMD Official Use Only - General]
>>
>> This is a strange solution because the MEC should set watch controls
>> as non-valid automatically on queue preemption to avoid this kind of
>> issue in the first place by design. MAP_PROCESS on resume will take
>> whatever the driver requests.
>> GFX11 has no issue with letting the HWS do this.
>>
>> Are we sure we're not working around some HWS bug?
>>
>> Thanks,
>>
>> Jon
>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>> Sent: Thursday, August 10, 2023 5:03 PM
>>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
>>> gfx at lists.freedesktop.org
>>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for
>>> gfx v9.4.2
>>>
>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
>>> different because it needs to support multiple XCCs.
>>>
>>> That said, this patch is
>>>
>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>
>>>
>>> On 2023-08-10 16:47, Eric Huang wrote:
>>>> KFD currently relies on MEC FW to clear the tcp watch control
>>>> register by sending a MAP_PROCESS packet with the tcp_watch_cntl
>>>> field set to 0 to HWS, but if the queue is suspended, the
>>>> packet will not be sent and the previous value will be
>>>> left in the register, which will affect subsequent apps.
>>>> So the solution is to clear the register in KFD, as gfx v9 does.
>>>>
>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-------
>>>> 1 file changed, 1 insertion(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>> index e2fed6edbdd0..aff08321e976 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>> @@ -163,12 +163,6 @@ static uint32_t
>>> kgd_gfx_aldebaran_set_address_watch(
>>>> return watch_address_cntl;
>>>> }
>>>>
>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
>>> amdgpu_device *adev,
>>>> - uint32_t watch_id)
>>>> -{
>>>> - return 0;
>>>> -}
>>>> -
>>>> const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>> .program_sh_mem_settings =
>>> kgd_gfx_v9_program_sh_mem_settings,
>>>> .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>> .set_wave_launch_trap_override =
>>> kgd_aldebaran_set_wave_launch_trap_override,
>>>> .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
>>>> .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
>>>> - .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
>>>> + .clear_address_watch = kgd_gfx_v9_clear_address_watch,
>>>> .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
>>>> .build_grace_period_packet_info =
>>> kgd_gfx_v9_build_grace_period_packet_info,
>>>> .program_trap_handler_settings =
>>> kgd_gfx_v9_program_trap_handler_settings,
>
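
The failure mode discussed in this thread can be sketched with a toy model.
This is not the amdgpu/amdkfd code: struct model_dev and every model_*
function are hypothetical names invented for illustration, and the sketch
assumes, as described above, that a MAP_PROCESS only reaches the firmware
when the queue is not suspended or when it is later resumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not the real driver: one watch-control register,
 * the value the driver wants programmed, and a queue state flag. */
struct model_dev {
	uint32_t tcp_watch_cntl;  /* stand-in for the hardware register */
	uint32_t requested_cntl;  /* value the next MAP_PROCESS would carry */
	bool queue_suspended;
};

/* Resume path: in this model, resuming is what delivers MAP_PROCESS,
 * so the register takes whatever value the driver last requested. */
static void model_resume_queue(struct model_dev *dev)
{
	dev->queue_suspended = false;
	dev->tcp_watch_cntl = dev->requested_cntl;
}

/* Deferred clear: only records the request. The register changes
 * right away only if the queue is active (MAP_PROCESS can be sent). */
static void model_clear_watch_deferred(struct model_dev *dev)
{
	dev->requested_cntl = 0;
	if (!dev->queue_suspended)
		dev->tcp_watch_cntl = 0;
}

/* Direct clear: writes the register immediately, independent of the
 * queue state, which is the behaviour the patch switches to. */
static void model_clear_watch_direct(struct model_dev *dev)
{
	dev->requested_cntl = 0;
	dev->tcp_watch_cntl = 0;
}
```

With a suspended queue and no later resume, model_clear_watch_deferred()
leaves the old value in tcp_watch_cntl, while model_clear_watch_direct()
clears it at once. The model ignores multiple watch IDs and multiple XCCs.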