[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

Felix Kuehling felix.kuehling at amd.com
Thu Aug 10 21:56:14 UTC 2023


I think Jon is suggesting that the UNMAP_QUEUES command should clear the 
address watch registers. Requesting such a change from the HWS team 
may take a long time.

That said, when was this workaround implemented and reviewed? Did I 
review it as part of Jon's debugger upstreaming patch series? Or did 
this come later? This patch only enables the workaround for v9.4.2.

Regards,
   Felix


On 2023-08-10 17:52, Eric Huang wrote:
> The problem is that the queue is suspended before the clear address 
> watch call in KFD. There is no queue preemption or queue resume after 
> the clearing call before the test ends, so there is no chance to send 
> MAP_PROCESS to HWS. At this point the FW has nothing to do. We tried 
> several test FWs from Tej and none of them worked, so I went back to 
> the kernel debug log and found the problem.
>
> GFX11 has a different scheduler. When clear address watch is called, 
> KFD sends MES_MISC_OP_SET_SHADER_DEBUGGER directly to MES without 
> considering whether the queue is suspended, so GFX11 doesn't have this 
> issue.
>
> Regards,
> Eric
>
> On 2023-08-10 17:27, Kim, Jonathan wrote:
>> [AMD Official Use Only - General]
>>
>> This is a strange solution because the MEC should set watch controls 
>> as non-valid automatically on queue preemption to avoid this kind of 
>> issue in the first place by design.  MAP_PROCESS on resume will take 
>> whatever the driver requests.
>> GFX11 has no issue with letting the HWS do this.
>>
>> Are we sure we're not working around some HWS bug?
>>
>> Thanks,
>>
>> Jon
>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>> Sent: Thursday, August 10, 2023 5:03 PM
>>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
>>> gfx at lists.freedesktop.org
>>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for 
>>> gfx v9.4.2
>>>
>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
>>> different because it needs to support multiple XCCs.
>>>
>>> That said, this patch is
>>>
>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>
>>>
>>> On 2023-08-10 16:47, Eric Huang wrote:
>>>> KFD currently relies on MEC FW to clear the tcp watch control
>>>> register by sending a MAP_PROCESS packet with the tcp_watch_cntl
>>>> field set to 0 to HWS. If the queue is suspended, the packet is
>>>> not sent and the previous value is left in the register, which
>>>> affects subsequent apps. The solution is to clear the register
>>>> directly in KFD, as gfx v9 does.
>>>>
>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-------
>>>>    1 file changed, 1 insertion(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>> index e2fed6edbdd0..aff08321e976 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>> @@ -163,12 +163,6 @@ static uint32_t
>>> kgd_gfx_aldebaran_set_address_watch(
>>>>      return watch_address_cntl;
>>>>    }
>>>>
>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
>>> amdgpu_device *adev,
>>>> - uint32_t watch_id)
>>>> -{
>>>> -   return 0;
>>>> -}
>>>> -
>>>>    const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>      .program_sh_mem_settings =
>>> kgd_gfx_v9_program_sh_mem_settings,
>>>>      .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>      .set_wave_launch_trap_override =
>>> kgd_aldebaran_set_wave_launch_trap_override,
>>>>      .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
>>>>      .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
>>>> -   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
>>>> +   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
>>>>      .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
>>>>      .build_grace_period_packet_info =
>>> kgd_gfx_v9_build_grace_period_packet_info,
>>>>      .program_trap_handler_settings =
>>> kgd_gfx_v9_program_trap_handler_settings,
>
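
For readers following the thread, the failure mode described in the commit message can be modelled in a few lines of C. This is only a toy sketch under stated assumptions, not driver code: `struct toy_gpu`, `toy_clear_via_map_process()`, `toy_clear_direct()` and `TOY_MAX_WATCH_POINTS` are hypothetical names invented for illustration, and the array stands in for the tcp watch control registers. It shows why a clear that depends on a MAP_PROCESS packet leaves a stale value when the queue is suspended, while a direct register clear does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: number of watch points is illustrative only. */
#define TOY_MAX_WATCH_POINTS 4

/* Hypothetical stand-in for the device state discussed in the thread. */
struct toy_gpu {
	uint32_t tcp_watch_cntl[TOY_MAX_WATCH_POINTS]; /* models the watch control registers */
	bool queue_suspended;                          /* true once the queue is suspended */
};

/*
 * Models the old behaviour: the register is only cleared as a side
 * effect of a MAP_PROCESS packet, and no packet is sent while the
 * queue is suspended. Returns true if the register was cleared.
 */
bool toy_clear_via_map_process(struct toy_gpu *gpu, uint32_t watch_id)
{
	assert(watch_id < TOY_MAX_WATCH_POINTS);

	if (gpu->queue_suspended)
		return false; /* nothing is sent, the old value stays behind */

	gpu->tcp_watch_cntl[watch_id] = 0;
	return true;
}

/*
 * Models the behaviour after the patch: the driver clears the register
 * itself, independent of the queue state.
 */
void toy_clear_direct(struct toy_gpu *gpu, uint32_t watch_id)
{
	assert(watch_id < TOY_MAX_WATCH_POINTS);

	gpu->tcp_watch_cntl[watch_id] = 0;
}
```

With `queue_suspended` set and a non-zero value in one slot, `toy_clear_via_map_process()` returns false and leaves the value in place, which is the stale state that affects the next app. `toy_clear_direct()` zeroes the slot in the same situation.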

