[PATCH] drm/amdkfd: Fix the shift-out-of-bounds warning

Thu Jan 11 16:03:23 UTC 2024

[+Jon]

On 2024-01-11 01:05, Ma, Jun wrote:

> Hi Felix,
>
> On 1/10/2024 11:57 PM, Felix Kuehling wrote:
>> On 2024-01-10 04:39, Ma Jun wrote:
>>> There is following shift-out-of-bounds warning if ecode=0.
>>> "shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int'"
>>>
>>> Signed-off-by: Ma Jun <Jun.Ma2 at amd.com>
>>> ---
>>>    include/uapi/linux/kfd_ioctl.h | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
>>> index 2aa88afe305b..129325b02a91 100644
>>> --- a/include/uapi/linux/kfd_ioctl.h
>>> +++ b/include/uapi/linux/kfd_ioctl.h
>>> @@ -1004,7 +1004,7 @@ enum kfd_dbg_trap_exception_code {
>>>    };
>>>    
>>>    /* Mask generated by ecode in kfd_dbg_trap_exception_code */
>>> -#define KFD_EC_MASK(ecode)	(1ULL << (ecode - 1))
>>> +#define KFD_EC_MASK(ecode)	(BIT(ecode) - 1)
>> This is not the same thing. We want a bit mask with one bit set. And
>> ecode=1 should set bit 0. ecode=0 is not a valid code and doesn't have a
>> valid mask. You could use BIT((ecode) - 1), but I think that would give
>> you the same warning for ecode=0. I also don't see BIT defined anywhere
>> under include/uapi, so I think using this in the API header would break
>> the build in user mode.
>>
>> Where are you seeing the warning about the bad shift exponent? Looks
>> like someone is using the KFD_EC_MASK macro incorrectly. Or if there is
>> a legitimate use of it with ecode=0, then the correct fix would be
>>
> This warning is caused by following code in function event_interrupt_wq_v10()
> 		
> else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
> 	kfd_set_dbg_ev_from_interrupt(dev, pasid,
> 	KFD_DEBUG_DOORBELL_ID(context_id0),
> 	KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),
> 	NULL,
> 	0);
> }

This looks OK. The compiler must be warning about a potential problem 
here, not a definite one.

Question for Jon, how does the firmware encode the error code in the 
context ID? I see these macros:

#define KFD_DEBUG_CP_BAD_OP_ECODE_MASK          0x3fffc00
#define KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT         10
#define KFD_DEBUG_CP_BAD_OP_ECODE(ctxid0) (((ctxid0) &                  \
                                 KFD_DEBUG_CP_BAD_OP_ECODE_MASK)         \
                                 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT)

It looks like we have 16 bits for the ECODE. That's enough to have a bit 
mask. Do we really need KFD_EC_MASK to convert an error number into a 
bitmask here?

>
>
>> #define KFD_EC_MASK(ecode)	((ecode) ? 1ULL << (ecode - 1) : 0ULL)
> This can fix the warning.

In the code above, if ecode is 0, that would lead to calling 
kfd_set_dbg_ev_from_interrupt with a event mask of 0. Not sure if that 
even makes sense. Jon, so we need special handling of cases where the 
error code is 0 or out of range, so we can warn about buggy firmware 
rather than creating nonsensical events for the debugger?

Regards,
   Felix

>
> Regards
> Ma Jun
>> Regards,
>>     Felix
>>
>>
>>>    
>>>    /* Masks for exception code type checks below */
>>>    #define KFD_EC_MASK_QUEUE	(KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) |	\