[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

Kim, Jonathan Jonathan.Kim at amd.com
Thu Aug 10 22:25:28 UTC 2023


[Public]

Yeah this is a recent bug so this workaround is new.  More rigorous tests revealed this is probably a miss on the FW side.  We explicitly requested UNMAP_QUEUES unconditionally invalidate watch controls during the beginning of design to prevent any watch point racing.

Note GFX11 MES calls are different on the surface but under the hood it's the same (registers get invalidated on unmap then get updated on map.  Only difference it's at the queue level).

I'm fine with this solution but I think it'd be good to describe this as a workaround somewhere (as opposed to a driver issue) so that folks aren't scratching their heads later on looking at code for GFX11 and up and wondering why we don't nuke the control setting with the KFD for those devices.

Thanks,

Jon

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Thursday, August 10, 2023 5:56 PM
> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2
>
> I think Jon is suggesting that the UNMAP_QUEUES command should clear the
> address watch registers. Requesting such a change from the the HWS team
> may take a long time.
>
> That said, when was this workaround implemented and reviewed? Did I
> review it as part of Jon's debugger upstreaming patch series? Or did
> this come later? This patch only enables the workaround for v9.4.2.
>
> Regards,
>    Felix
>
>
> On 2023-08-10 17:52, Eric Huang wrote:
> > The problem is the queue is suspended before clearing address watch
> > call in KFD, there is not queue preemption and queue resume after
> > clearing call, and the test ends. So there is not chance to send
> > MAP_PROCESS to HWS. At this point FW has nothing to do. We have
> > several test FWs from Tej, none of them works, so I recalled the
> > kernel debug log and found out the problem.
> >
> > GFX11 has different scheduler, when calling clear address watch, KFD
> > directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it
> doesn't
> > consider if the queue is suspended. So GFX11 doesn't have this issue.
> >
> > Regards,
> > Eric
> >
> > On 2023-08-10 17:27, Kim, Jonathan wrote:
> >> [AMD Official Use Only - General]
> >>
> >> This is a strange solution because the MEC should set watch controls
> >> as non-valid automatically on queue preemption to avoid this kind of
> >> issue in the first place by design.  MAP_PROCESS on resume will take
> >> whatever the driver requests.
> >> GFX11 has no issue with letting the HWS do this.
> >>
> >> Are we sure we're not working around some HWS bug?
> >>
> >> Thanks,
> >>
> >> Jon
> >>
> >>> -----Original Message-----
> >>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> >>> Sent: Thursday, August 10, 2023 5:03 PM
> >>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
> >>> gfx at lists.freedesktop.org
> >>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
> >>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for
> >>> gfx v9.4.2
> >>>
> >>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
> >>> different because it needs to support multiple XCCs.
> >>>
> >>> That said, this patch is
> >>>
> >>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
> >>>
> >>>
> >>> On 2023-08-10 16:47, Eric Huang wrote:
> >>>> KFD currently relies on MEC FW to clear tcp watch control
> >>>> register by sending MAP_PROCESS packet with 0 of field
> >>>> tcp_watch_cntl to HWS, but if the queue is suspended, the
> >>>> packet will not be sent and the previous value will be
> >>>> left on the register, that will affect the following apps.
> >>>> So the solution is to clear the register as gfx v9 in KFD.
> >>>>
> >>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +-----
> --
> >>>>    1 file changed, 1 insertion(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>> index e2fed6edbdd0..aff08321e976 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>> @@ -163,12 +163,6 @@ static uint32_t
> >>> kgd_gfx_aldebaran_set_address_watch(
> >>>>      return watch_address_cntl;
> >>>>    }
> >>>>
> >>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
> >>> amdgpu_device *adev,
> >>>> - uint32_t watch_id)
> >>>> -{
> >>>> -   return 0;
> >>>> -}
> >>>> -
> >>>>    const struct kfd2kgd_calls aldebaran_kfd2kgd = {
> >>>>      .program_sh_mem_settings =
> >>> kgd_gfx_v9_program_sh_mem_settings,
> >>>>      .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
> >>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd =
> {
> >>>>      .set_wave_launch_trap_override =
> >>> kgd_aldebaran_set_wave_launch_trap_override,
> >>>>      .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
> >>>>      .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
> >>>> -   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
> >>>> +   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
> >>>>      .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
> >>>>      .build_grace_period_packet_info =
> >>> kgd_gfx_v9_build_grace_period_packet_info,
> >>>>      .program_trap_handler_settings =
> >>> kgd_gfx_v9_program_trap_handler_settings,
> >


More information about the amd-gfx mailing list