[PATCH] drm/amdkfd: Fix the shift-out-of-bounds warning

Kim, Jonathan Jonathan.Kim at amd.com
Thu Jan 11 17:25:09 UTC 2024


[Public]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: Thursday, January 11, 2024 11:03 AM
> To: Ma, Jun <Jun.Ma2 at amd.com>; Ma, Jun <Jun.Ma2 at amd.com>; amd-
> gfx at lists.freedesktop.org; Kim, Jonathan <Jonathan.Kim at amd.com>
> Subject: Re: [PATCH] drm/amdkfd: Fix the shift-out-of-bounds warning
>
> [+Jon]
>
> On 2024-01-11 01:05, Ma, Jun wrote:
>
> > Hi Felix,
> >
> > On 1/10/2024 11:57 PM, Felix Kuehling wrote:
> >> On 2024-01-10 04:39, Ma Jun wrote:
> >>> There is following shift-out-of-bounds warning if ecode=0.
> >>> "shift exponent 4294967295 is too large for 64-bit type 'long long
> unsigned int'"
> >>>
> >>> Signed-off-by: Ma Jun <Jun.Ma2 at amd.com>
> >>> ---
> >>>    include/uapi/linux/kfd_ioctl.h | 2 +-
> >>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> >>> index 2aa88afe305b..129325b02a91 100644
> >>> --- a/include/uapi/linux/kfd_ioctl.h
> >>> +++ b/include/uapi/linux/kfd_ioctl.h
> >>> @@ -1004,7 +1004,7 @@ enum kfd_dbg_trap_exception_code {
> >>>    };
> >>>
> >>>    /* Mask generated by ecode in kfd_dbg_trap_exception_code */
> >>> -#define KFD_EC_MASK(ecode)       (1ULL << (ecode - 1))
> >>> +#define KFD_EC_MASK(ecode)       (BIT(ecode) - 1)
> >> This is not the same thing. We want a bit mask with one bit set. And
> >> ecode=1 should set bit 0. ecode=0 is not a valid code and doesn't have a
> >> valid mask. You could use BIT((ecode) - 1), but I think that would give
> >> you the same warning for ecode=0. I also don't see BIT defined anywhere
> >> under include/uapi, so I think using this in the API header would break
> >> the build in user mode.
> >>
> >> Where are you seeing the warning about the bad shift exponent? Looks
> >> like someone is using the KFD_EC_MASK macro incorrectly. Or if there is
> >> a legitimate use of it with ecode=0, then the correct fix would be
> >>
> > This warning is caused by following code in function
> event_interrupt_wq_v10()
> >
> > else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
> >     kfd_set_dbg_ev_from_interrupt(dev, pasid,
> >     KFD_DEBUG_DOORBELL_ID(context_id0),
> >     KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),
> >     NULL,
> >     0);
> > }
>
> This looks OK. The compiler must be warning about a potential problem
> here, not a definite one.
>
> Question for Jon, how does the firmware encode the error code in the
> context ID? I see these macros:
>
> #define KFD_DEBUG_CP_BAD_OP_ECODE_MASK          0x3fffc00
> #define KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT         10
> #define KFD_DEBUG_CP_BAD_OP_ECODE(ctxid0) (((ctxid0) &                  \
>                                  KFD_DEBUG_CP_BAD_OP_ECODE_MASK)         \
>                                  >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT)
>
> It looks like we have 16 bits for the ECODE. That's enough to have a bit
> mask. Do we really need KFD_EC_MASK to convert an error number into a
> bitmask here?

Added Jay for confirmation.
I could be wrong but IIRC (and I'm quite fuzzy on this ... probably should document this), unlike the wave trap code interrupt mask (bit mask) the CP bad op code is a single error code that directly points to one of the exception code enums that we defined in the user API header.
If that's the case, the KFD_EC_MASK is convenient for the kfd debugger code to mask the payload to send to the debugger or runtime.
If that's been wrong this whole time (i.e. the bad ops code is actually a bitwise mask of ecodes), then I'm not sure how we were able to get away with running the runtime negative tests for as long as we have and we'd need to recheck those tests.

>
>
> >
> >
> >> #define KFD_EC_MASK(ecode) ((ecode) ? 1ULL << (ecode - 1) : 0ULL)
> > This can fix the warning.
>
> In the code above, if ecode is 0, that would lead to calling
> kfd_set_dbg_ev_from_interrupt with a event mask of 0. Not sure if that
> even makes sense. Jon, so we need special handling of cases where the
> error code is 0 or out of range, so we can warn about buggy firmware
> rather than creating nonsensical events for the debugger?

That makes sense.  Again, deferring to Jay if a NULL cp bad op code is expected under any circumstances.
Either way, raising undefined events to the debugger or runtime isn't useful so range checking to filter out non-encoded cp bad op interrupts would be needed.

Thanks,

Jon

>
> Regards,
>    Felix
>
>
> >
> > Regards
> > Ma Jun
> >> Regards,
> >>     Felix
> >>
> >>
> >>>
> >>>    /* Masks for exception code type checks below */
> >>>    #define KFD_EC_MASK_QUEUE
>       (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) |     \
-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 17984 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20240111/42409ff1/attachment-0001.bin>


More information about the amd-gfx mailing list