[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2
Felix Kuehling
felix.kuehling at amd.com
Fri Aug 11 13:26:51 UTC 2023
Am 2023-08-10 um 18:27 schrieb Eric Huang:
> There is not UNMAP_QUEUES command sending for queue preemption because
> the queue is suspended and test is closed to the end. Function
> unmap_queue_cpsch will do nothing after that.
How do you suspend queues without sending an UNMAP_QUEUES command?
Regards,
Felix
>
> The workaround is new and only for gfx v9.4.2, because debugger tests
> has changed to check if all address watch points are correctly set,
> i.e. test A sets more than one watchpoint and leave, the following
> test B only sets one watchpoint, and test A's setting will cause more
> than one watchpoint event, so test B check out and report error on
> second or third watchpoint not set by itself.
>
> Regards,
> Eric
>
> On 2023-08-10 17:56, Felix Kuehling wrote:
>> I think Jon is suggesting that the UNMAP_QUEUES command should clear
>> the address watch registers. Requesting such a change from the the
>> HWS team may take a long time.
>>
>> That said, when was this workaround implemented and reviewed? Did I
>> review it as part of Jon's debugger upstreaming patch series? Or did
>> this come later? This patch only enables the workaround for v9.4.2.
>>
>> Regards,
>> Felix
>>
>>
>> On 2023-08-10 17:52, Eric Huang wrote:
>>> The problem is the queue is suspended before clearing address watch
>>> call in KFD, there is not queue preemption and queue resume after
>>> clearing call, and the test ends. So there is not chance to send
>>> MAP_PROCESS to HWS. At this point FW has nothing to do. We have
>>> several test FWs from Tej, none of them works, so I recalled the
>>> kernel debug log and found out the problem.
>>>
>>> GFX11 has different scheduler, when calling clear address watch, KFD
>>> directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it
>>> doesn't consider if the queue is suspended. So GFX11 doesn't have
>>> this issue.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 2023-08-10 17:27, Kim, Jonathan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> This is a strange solution because the MEC should set watch
>>>> controls as non-valid automatically on queue preemption to avoid
>>>> this kind of issue in the first place by design. MAP_PROCESS on
>>>> resume will take whatever the driver requests.
>>>> GFX11 has no issue with letting the HWS do this.
>>>>
>>>> Are we sure we're not working around some HWS bug?
>>>>
>>>> Thanks,
>>>>
>>>> Jon
>>>>
>>>>> -----Original Message-----
>>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>>> Sent: Thursday, August 10, 2023 5:03 PM
>>>>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
>>>>> gfx at lists.freedesktop.org
>>>>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug
>>>>> for gfx v9.4.2
>>>>>
>>>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
>>>>> different because it needs to support multiple XCCs.
>>>>>
>>>>> That said, this patch is
>>>>>
>>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>>>
>>>>>
>>>>> On 2023-08-10 16:47, Eric Huang wrote:
>>>>>> KFD currently relies on MEC FW to clear tcp watch control
>>>>>> register by sending MAP_PROCESS packet with 0 of field
>>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the
>>>>>> packet will not be sent and the previous value will be
>>>>>> left on the register, that will affect the following apps.
>>>>>> So the solution is to clear the register as gfx v9 in KFD.
>>>>>>
>>>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-------
>>>>>> 1 file changed, 1 insertion(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> index e2fed6edbdd0..aff08321e976 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
>>>>>> @@ -163,12 +163,6 @@ static uint32_t
>>>>> kgd_gfx_aldebaran_set_address_watch(
>>>>>> return watch_address_cntl;
>>>>>> }
>>>>>>
>>>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
>>>>> amdgpu_device *adev,
>>>>>> - uint32_t watch_id)
>>>>>> -{
>>>>>> - return 0;
>>>>>> -}
>>>>>> -
>>>>>> const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>>> .program_sh_mem_settings =
>>>>> kgd_gfx_v9_program_sh_mem_settings,
>>>>>> .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
>>>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
>>>>>> .set_wave_launch_trap_override =
>>>>> kgd_aldebaran_set_wave_launch_trap_override,
>>>>>> .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
>>>>>> .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
>>>>>> - .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
>>>>>> + .clear_address_watch = kgd_gfx_v9_clear_address_watch,
>>>>>> .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
>>>>>> .build_grace_period_packet_info =
>>>>> kgd_gfx_v9_build_grace_period_packet_info,
>>>>>> .program_trap_handler_settings =
>>>>> kgd_gfx_v9_program_trap_handler_settings,
>>>
>
More information about the amd-gfx
mailing list