[PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

Kim, Jonathan Jonathan.Kim at amd.com
Thu Aug 10 23:04:55 UTC 2023


[Public]

Sounds good.

I'd also change:

>>>>>> KFD currently relies on MEC FW to clear tcp watch control
>>>>>> register by sending MAP_PROCESS packet with 0 of field
>>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the
>>>>>> packet will not be sent and the previous value will be
>>>>>> left on the register, that will affect the following apps.
>>>>>> So the solution is to clear the register as gfx v9 in KFD.

To something like:

KFD currently relies on MEC FW to clear tcp watch control
register on UNMAP_QUEUES.  Due to a FW bug, MEC does not
do this.
So the solution is to clear the register as gfx v9 in KFD.

With those fixed, this patch is Reviewed-by: Jonathan Kim <jonathan.kim at amd.com>

Hopefully we can get away with this since every watch instance register is supposed to be 1-1 to a process ...
And that there's no race scenarios with trailing exceptions on dynamic watch point address changes ...

Thanks,

Jon

> -----Original Message-----
> From: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>
> Sent: Thursday, August 10, 2023 6:31 PM
> To: Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2
>
> I will change title to "drm/amdkfd: workaround address watch clearing
> bug for gfx v9.4.2". is it OK?
>
> Regards,
> Eric
>
> On 2023-08-10 18:25, Kim, Jonathan wrote:
> > [Public]
> >
> > Yeah this is a recent bug so this workaround is new.  More rigorous tests
> revealed this is probably a miss on the FW side.  We explicitly requested
> UNMAP_QUEUES unconditionally invalidate watch controls during the
> beginning of design to prevent any watch point racing.
> >
> > Note GFX11 MES calls are different on the surface but under the hood it's
> the same (registers get invalidated on unmap then get updated on map.
> Only difference it's at the queue level).
> >
> > I'm fine with this solution but I think it'd be good to describe this as a
> workaround somewhere (as opposed to a driver issue) so that folks aren't
> scratching their heads later on looking at code for GFX11 and up and
> wondering why we don't nuke the control setting with the KFD for those
> devices.
> >
> > Thanks,
> >
> > Jon
> >
> >> -----Original Message-----
> >> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> >> Sent: Thursday, August 10, 2023 5:56 PM
> >> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; Kim, Jonathan
> >> <Jonathan.Kim at amd.com>; amd-gfx at lists.freedesktop.org
> >> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx
> v9.4.2
> >>
> >> I think Jon is suggesting that the UNMAP_QUEUES command should clear
> the
> >> address watch registers. Requesting such a change from the the HWS team
> >> may take a long time.
> >>
> >> That said, when was this workaround implemented and reviewed? Did I
> >> review it as part of Jon's debugger upstreaming patch series? Or did
> >> this come later? This patch only enables the workaround for v9.4.2.
> >>
> >> Regards,
> >>     Felix
> >>
> >>
> >> On 2023-08-10 17:52, Eric Huang wrote:
> >>> The problem is the queue is suspended before clearing address watch
> >>> call in KFD, there is not queue preemption and queue resume after
> >>> clearing call, and the test ends. So there is not chance to send
> >>> MAP_PROCESS to HWS. At this point FW has nothing to do. We have
> >>> several test FWs from Tej, none of them works, so I recalled the
> >>> kernel debug log and found out the problem.
> >>>
> >>> GFX11 has different scheduler, when calling clear address watch, KFD
> >>> directly sends the MES_MISC_OP_SET_SHADER_DEBUGGER to MES, it
> >> doesn't
> >>> consider if the queue is suspended. So GFX11 doesn't have this issue.
> >>>
> >>> Regards,
> >>> Eric
> >>>
> >>> On 2023-08-10 17:27, Kim, Jonathan wrote:
> >>>> [AMD Official Use Only - General]
> >>>>
> >>>> This is a strange solution because the MEC should set watch controls
> >>>> as non-valid automatically on queue preemption to avoid this kind of
> >>>> issue in the first place by design.  MAP_PROCESS on resume will take
> >>>> whatever the driver requests.
> >>>> GFX11 has no issue with letting the HWS do this.
> >>>>
> >>>> Are we sure we're not working around some HWS bug?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Jon
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> >>>>> Sent: Thursday, August 10, 2023 5:03 PM
> >>>>> To: Huang, JinHuiEric <JinHuiEric.Huang at amd.com>; amd-
> >>>>> gfx at lists.freedesktop.org
> >>>>> Cc: Kim, Jonathan <Jonathan.Kim at amd.com>
> >>>>> Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for
> >>>>> gfx v9.4.2
> >>>>>
> >>>>> I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a
> bit
> >>>>> different because it needs to support multiple XCCs.
> >>>>>
> >>>>> That said, this patch is
> >>>>>
> >>>>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
> >>>>>
> >>>>>
> >>>>> On 2023-08-10 16:47, Eric Huang wrote:
> >>>>>> KFD currently relies on MEC FW to clear tcp watch control
> >>>>>> register by sending MAP_PROCESS packet with 0 of field
> >>>>>> tcp_watch_cntl to HWS, but if the queue is suspended, the
> >>>>>> packet will not be sent and the previous value will be
> >>>>>> left on the register, that will affect the following apps.
> >>>>>> So the solution is to clear the register as gfx v9 in KFD.
> >>>>>>
> >>>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8
> +-----
> >> --
> >>>>>>     1 file changed, 1 insertion(+), 7 deletions(-)
> >>>>>>
> >>>>>> diff --git
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>>>> index e2fed6edbdd0..aff08321e976 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
> >>>>>> @@ -163,12 +163,6 @@ static uint32_t
> >>>>> kgd_gfx_aldebaran_set_address_watch(
> >>>>>>       return watch_address_cntl;
> >>>>>>     }
> >>>>>>
> >>>>>> -static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct
> >>>>> amdgpu_device *adev,
> >>>>>> - uint32_t watch_id)
> >>>>>> -{
> >>>>>> -   return 0;
> >>>>>> -}
> >>>>>> -
> >>>>>>     const struct kfd2kgd_calls aldebaran_kfd2kgd = {
> >>>>>>       .program_sh_mem_settings =
> >>>>> kgd_gfx_v9_program_sh_mem_settings,
> >>>>>>       .set_pasid_vmid_mapping =
> kgd_gfx_v9_set_pasid_vmid_mapping,
> >>>>>> @@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd
> =
> >> {
> >>>>>>       .set_wave_launch_trap_override =
> >>>>> kgd_aldebaran_set_wave_launch_trap_override,
> >>>>>>       .set_wave_launch_mode =
> kgd_aldebaran_set_wave_launch_mode,
> >>>>>>       .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
> >>>>>> -   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
> >>>>>> +   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
> >>>>>>       .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
> >>>>>>       .build_grace_period_packet_info =
> >>>>> kgd_gfx_v9_build_grace_period_packet_info,
> >>>>>>       .program_trap_handler_settings =
> >>>>> kgd_gfx_v9_program_trap_handler_settings,



More information about the amd-gfx mailing list