[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

Eric Huang jinhuieric.huang at amd.com
Fri Aug 11 14:07:32 UTC 2023


On 2023-08-11 09:26, Felix Kuehling wrote:
> Am 2023-08-10 um 18:27 schrieb Eric Huang:
>> There is not UNMAP_QUEUES command sending for queue preemption 
>> because the queue is suspended and test is closed to the end. 
>> Function unmap_queue_cpsch will do nothing after that.
>
> How do you suspend queues without sending an UNMAP_QUEUES command?
Now I understand what you mean, I was only thinking of UNMAP_QUEUES 
sending after clearing call. So MEC FW should clear the control register 
unconditionally on every UNMAP_QUEUES command. We can request it for gfx 
v9.4.3 to avoid the awkward workaround in KFD.

Thanks,
Eric
>
> Regards,
>   Felix
>
>
>>
>> The workaround is new and only for gfx v9.4.2, because debugger tests 
>> has changed to check if all address watch points are correctly set, 
>> i.e. test A sets more than one watchpoint and leave, the following 
>> test B only sets one watchpoint, and test A's setting will cause more 
>> than one watchpoint event, so test B check out and report error on 
>> second or third watchpoint not set by itself.
>>
>> Regards,
>> Eric
>>
>> On 2023-08-10 17:56, Felix Kuehling wrote:
>>> I think Jon is suggesting that the UNMAP_QUEUES command should clear 
>>> the address watch registers. Requesting such a change from the the 
>>> HWS team may take a long time.
>>>
>>> That said, when was this workaround implemented and reviewed? Did I 
>>> review it as part of Jon's debugger upstreaming patch series? Or did 
>>> this come later? This patch only enables the workaround for v9.4.2.
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>> On 2023-08-10 17:52, Eric Huang wrote:
>>>> The problem is the queue is suspended before clearing address watch 
>>>> call in KFD, there is not queue preemption and queue resume after 
>>>> clearing call, and the test ends. So there is not chance to send 
>>>> MAP_PROCESS to HWS. At this point FW has nothing to do. We have 
>>>> several test FWs from Tej, none of them works, so I recalled the 
>>>> kernel debug log and found out the problem.
>>>>
>>>> GFX11 has different scheduler, when calling clear address watch, 
>>>> KFD directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it 
>>>> doesn't consider if the queue is suspended. So GFX11 doesn't have 
>>>> this issue.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 2023-08-10 17:27, Kim, Jonathan wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> This is a strange solution because the MEC should set watch 
>>>>> controls as non-valid automatically on queue preemption to avoid 
>>>>> this kind of issue in the first place by design. MAP_PROCESS on 
>>>>> resume will take whatever the driver requests.
>>>>> GFX11 has no issue with letting the HWS do this.
>>>>>
>>>>> Are we sure we're not working around some HWS bug?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jon
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>>>> Sent: Thursday, August 10, 2023 5:03 PM
>>>>>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
>>>>>> gfx at lists.freedesktop.org
>>>>>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
>>>>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug 
>>>>>> for gfx v9.4.2
>>>>>>
>>>>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a 
>>>>>> bit
>>>>>> different because it needs to support multiple XCCs.
>>>>>>
>>>>>> That said, this patch is
>>>>>>
>>>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>>>>
>>>>>>
>>>>>> On 2023-08-10 16:47, Eric Huang wrote:
>>>>>>> KFD currently relies on MEC FW to clear tcp watch control
>>>>>>> register by sending MAP_PROCESS packet with 0 of field
>>>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the
>>>>>>> packet will not be sent and the previous value will be
>>>>>>> left on the register, that will affect the following apps.
>>>>>>> So the solution is to clear the register as gfx v9 in KFD.
>>>>>>>
>>>>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-------
>>>>>>>    1 file changed, 1 insertion(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>>> index e2fed6edbdd0..aff08321e976 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>>> @@ -163,12 +163,6 @@ static uint32_t
>>>>>> kgd_gfx_aldebaran_set_address_watch(
>>>>>>>      return watch_address_cntl;
>>>>>>>    }
>>>>>>>
>>>>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
>>>>>> amdgpu_device *adev,
>>>>>>> - uint32_t watch_id)
>>>>>>> -{
>>>>>>> -   return 0;
>>>>>>> -}
>>>>>>> -
>>>>>>>    const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>>>>      .program_sh_mem_settings =
>>>>>> kgd_gfx_v9_program_sh_mem_settings,
>>>>>>>      .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
>>>>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd 
>>>>>>> = {
>>>>>>>      .set_wave_launch_trap_override =
>>>>>> kgd_aldebaran_set_wave_launch_trap_override,
>>>>>>>      .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
>>>>>>>      .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
>>>>>>> -   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
>>>>>>> +   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
>>>>>>>      .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
>>>>>>>      .build_grace_period_packet_info =
>>>>>> kgd_gfx_v9_build_grace_period_packet_info,
>>>>>>> .program_trap_handler_settings =
>>>>>> kgd_gfx_v9_program_trap_handler_settings,
>>>>
>>



More information about the amd-gfx mailing list