[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

Felix Kuehling felix.kuehling at amd.com
Fri Aug 11 13:26:51 UTC 2023


Am 2023-08-10 um 18:27 schrieb Eric Huang:
> There is not UNMAP_QUEUES command sending for queue preemption because 
> the queue is suspended and test is closed to the end. Function 
> unmap_queue_cpsch will do nothing after that.

How do you suspend queues without sending an UNMAP_QUEUES command?

Regards,
   Felix


>
> The workaround is new and only for gfx v9.4.2, because debugger tests 
> has changed to check if all address watch points are correctly set, 
> i.e. test A sets more than one watchpoint and leave, the following 
> test B only sets one watchpoint, and test A's setting will cause more 
> than one watchpoint event, so test B check out and report error on 
> second or third watchpoint not set by itself.
>
> Regards,
> Eric
>
> On 2023-08-10 17:56, Felix Kuehling wrote:
>> I think Jon is suggesting that the UNMAP_QUEUES command should clear 
>> the address watch registers. Requesting such a change from the the 
>> HWS team may take a long time.
>>
>> That said, when was this workaround implemented and reviewed? Did I 
>> review it as part of Jon's debugger upstreaming patch series? Or did 
>> this come later? This patch only enables the workaround for v9.4.2.
>>
>> Regards,
>>   Felix
>>
>>
>> On 2023-08-10 17:52, Eric Huang wrote:
>>> The problem is the queue is suspended before clearing address watch 
>>> call in KFD, there is not queue preemption and queue resume after 
>>> clearing call, and the test ends. So there is not chance to send 
>>> MAP_PROCESS to HWS. At this point FW has nothing to do. We have 
>>> several test FWs from Tej, none of them works, so I recalled the 
>>> kernel debug log and found out the problem.
>>>
>>> GFX11 has different scheduler, when calling clear address watch, KFD 
>>> directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it 
>>> doesn't consider if the queue is suspended. So GFX11 doesn't have 
>>> this issue.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 2023-08-10 17:27, Kim, Jonathan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> This is a strange solution because the MEC should set watch 
>>>> controls as non-valid automatically on queue preemption to avoid 
>>>> this kind of issue in the first place by design. MAP_PROCESS on 
>>>> resume will take whatever the driver requests.
>>>> GFX11 has no issue with letting the HWS do this.
>>>>
>>>> Are we sure we're not working around some HWS bug?
>>>>
>>>> Thanks,
>>>>
>>>> Jon
>>>>
>>>>> -----Original Message-----
>>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>>> Sent: Thursday, August 10, 2023 5:03 PM
>>>>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
>>>>> gfx at lists.freedesktop.org
>>>>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug 
>>>>> for gfx v9.4.2
>>>>>
>>>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
>>>>> different because it needs to support multiple XCCs.
>>>>>
>>>>> That said, this patch is
>>>>>
>>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>>>
>>>>>
>>>>> On 2023-08-10 16:47, Eric Huang wrote:
>>>>>> KFD currently relies on MEC FW to clear tcp watch control
>>>>>> register by sending MAP_PROCESS packet with 0 of field
>>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the
>>>>>> packet will not be sent and the previous value will be
>>>>>> left on the register, that will affect the following apps.
>>>>>> So the solution is to clear the register as gfx v9 in KFD.
>>>>>>
>>>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-------
>>>>>>    1 file changed, 1 insertion(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> index e2fed6edbdd0..aff08321e976 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> @@ -163,12 +163,6 @@ static uint32_t
>>>>> kgd_gfx_aldebaran_set_address_watch(
>>>>>>      return watch_address_cntl;
>>>>>>    }
>>>>>>
>>>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
>>>>> amdgpu_device *adev,
>>>>>> - uint32_t watch_id)
>>>>>> -{
>>>>>> -   return 0;
>>>>>> -}
>>>>>> -
>>>>>>    const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>>>      .program_sh_mem_settings =
>>>>> kgd_gfx_v9_program_sh_mem_settings,
>>>>>>      .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
>>>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>>>      .set_wave_launch_trap_override =
>>>>> kgd_aldebaran_set_wave_launch_trap_override,
>>>>>>      .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
>>>>>>      .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
>>>>>> -   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
>>>>>> +   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
>>>>>>      .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
>>>>>>      .build_grace_period_packet_info =
>>>>> kgd_gfx_v9_build_grace_period_packet_info,
>>>>>> .program_trap_handler_settings =
>>>>> kgd_gfx_v9_program_trap_handler_settings,
>>>
>


More information about the amd-gfx mailing list