[PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
Kim, Jonathan
Jonathan.Kim at amd.com
Fri Mar 1 06:35:49 UTC 2024
[Public]
> -----Original Message-----
> From: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>
> Sent: Friday, March 1, 2024 12:50 AM
> To: Kim, Jonathan <Jonathan.Kim at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Zhang, Yifan <Yifan1.Zhang at amd.com>
> Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
> [Public]
>
> Hi Jon,
>
> -----Original Message-----
> From: Kim, Jonathan <Jonathan.Kim at amd.com>
> Sent: Thursday, February 29, 2024 11:58 PM
> To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; Zhang, Yifan <Yifan1.Zhang at amd.com>; Zhang,
> Jesse(Jie) <Jesse.Zhang at amd.com>; Zhang, Jesse(Jie)
> <Jesse.Zhang at amd.com>
> Subject: RE: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
>
> [Public]
>
> I think this was discussed in another thread.
> Exception codes should be range checked prior to applying the mask. Raising
> null events to the debugger or runtime isn't useful.
> I haven't gotten around to fixing this yet. I should have time this week.
> Just to double check, the out of bounds shift is because of a CP interrupt that
> generates a null exception code?
>
> [Zhang, Jesse(Jie)] Thanks for your reminder, I saw that discussion.
> In this interrupt, the other fields (such as source id, client id and pasid) are correct;
> only the value of context_id0 (0xf) is invalid.
> How about adding a check like this:
> } else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
> + /* filter out the invalid context_id0 */
> + if (!(context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) ||
> + (context_id0 >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) > EC_MAX)
> + return;
The range check should probably treat any exception code prefixed with EC_QUEUE_PACKET_*, as defined in kfd_dbg_trap_exception_code, as valid:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L857
+ Jay to confirm this is the correct exception range for CP_BAD_OPCODE
If that's the case, then I think we can define a KFD_DBG_EC_TYPE_IS_QUEUE_PACKET macro similar to:
https://github.com/torvalds/linux/blob/master/include/uapi/linux/kfd_ioctl.h#L917
That way, KFD process interrupts v9, v10, v11 can use that check prior to mask conversion and user space may find it useful as well.
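A minimal sketch of what such a check might look like, assuming the packet codes form one contiguous range. The EX_* names and the numeric bounds below are hypothetical placeholders, not the values from kfd_ioctl.h, and the real range still needs the confirmation requested above.

```c
#include <stdint.h>

/* Hypothetical bounds of the EC_QUEUE_PACKET_* range (illustrative only). */
#define EX_EC_QUEUE_PACKET_FIRST 16u
#define EX_EC_QUEUE_PACKET_LAST  23u

/* Same shape as KFD_EC_MASK: bit (ecode - 1) of a 64-bit mask. */
#define EX_EC_MASK(ecode) (1ULL << ((ecode) - 1u))

/* Hypothetical type check: true only for codes inside the packet range. */
#define EX_EC_TYPE_IS_QUEUE_PACKET(ecode) \
	((ecode) >= EX_EC_QUEUE_PACKET_FIRST && (ecode) <= EX_EC_QUEUE_PACKET_LAST)

/*
 * Range check first, mask conversion second. A code outside the range
 * yields an empty mask, so the caller can drop the interrupt instead of
 * raising a null event.
 */
static uint64_t ex_bad_opcode_mask(uint32_t ecode)
{
	if (!EX_EC_TYPE_IS_QUEUE_PACKET(ecode))
		return 0;

	return EX_EC_MASK(ecode);
}
```

With a check of this shape, a zero code never reaches the shift, and neither does a code too large for the 64-bit mask.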
Jon
> kfd_set_dbg_ev_from_interrupt(dev, pasid,
> KFD_DEBUG_DOORBELL_ID(context_id0),
> KFD_EC_MASK(KFD_DEBUG_CP_BAD_OP_ECODE(context_id0)),
> Thanks
> Jesse
> Jon
>
> > -----Original Message-----
> > From: Jesse Zhang <jesse.zhang at amd.com>
> > Sent: Thursday, February 29, 2024 3:45 AM
> > To: amd-gfx at lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Kuehling, Felix
> > <Felix.Kuehling at amd.com>; Kim, Jonathan <Jonathan.Kim at amd.com>; Zhang,
> > Yifan <Yifan1.Zhang at amd.com>; Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>;
> > Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>
> > Subject: [PATCH] drm/amdkfd: fix shift out of bounds about gpu debug
> >
> > The issue is:
> > [ 388.151802] UBSAN: shift-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_int_process_v10.c:346:5
> > [ 388.151807] shift exponent 4294967295 is too large for 64-bit type 'long long unsigned int'
> > [ 388.151812] CPU: 6 PID: 347 Comm: kworker/6:1H Tainted: G E 6.7.0+ #1
> > [ 388.151814] Hardware name: AMD Splinter/Splinter-GNR, BIOS WS54117N_140 01/16/2024
> > [ 388.151816] Workqueue: KFD IH interrupt_wq [amdgpu]
> > [ 388.152084] Call Trace:
> > [ 388.152086] <TASK>
> > [ 388.152089] dump_stack_lvl+0x4c/0x70
> > [ 388.152096] dump_stack+0x14/0x20
> > [ 388.152098] ubsan_epilogue+0x9/0x40
> > [ 388.152101] __ubsan_handle_shift_out_of_bounds+0x113/0x170
> > [ 388.152103] ? vprintk+0x40/0x70
> > [ 388.152106] ? swsusp_check+0x131/0x190
> > [ 388.152110] event_interrupt_wq_v10.cold+0x16/0x1e [amdgpu]
> > [ 388.152411] ? raw_spin_rq_unlock+0x14/0x40
> > [ 388.152415] ? finish_task_switch+0x85/0x2a0
> > [ 388.152417] ? kfifo_copy_out+0x5f/0x70
> > [ 388.152420] interrupt_wq+0xb2/0x120 [amdgpu]
> > [ 388.152642] ? interrupt_wq+0xb2/0x120 [amdgpu]
> > [ 388.152728] process_scheduled_works+0x9a/0x3a0
> > [ 388.152731] ? __pfx_worker_thread+0x10/0x10
> > [ 388.152732] worker_thread+0x15f/0x2d0
> > [ 388.152733] ? __pfx_worker_thread+0x10/0x10
> > [ 388.152734] kthread+0xfb/0x130
> > [ 388.152735] ? __pfx_kthread+0x10/0x10
> > [ 388.152736] ret_from_fork+0x3d/0x60
> > [ 388.152738] ? __pfx_kthread+0x10/0x10
> > [ 388.152739] ret_from_fork_asm+0x1b/0x30
> > [ 388.152742] </TASK>
> >
> > Signed-off-by: Jesse Zhang <Jesse.Zhang at amd.com>
> > ---
> > include/uapi/linux/kfd_ioctl.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
> > index 9ce46edc62a5..3d5867df17e8 100644
> > --- a/include/uapi/linux/kfd_ioctl.h
> > +++ b/include/uapi/linux/kfd_ioctl.h
> > @@ -887,7 +887,7 @@ enum kfd_dbg_trap_exception_code {
> >  };
> >
> > /* Mask generated by ecode in kfd_dbg_trap_exception_code */
> > -#define KFD_EC_MASK(ecode) (1ULL << (ecode - 1))
> > +#define KFD_EC_MASK(ecode) (ecode ? (1ULL << (ecode - 1)) : 0ULL)
> >
> > /* Masks for exception code type checks below */
> > #define KFD_EC_MASK_QUEUE (KFD_EC_MASK(EC_QUEUE_WAVE_ABORT) | \
> > --
> > 2.25.1
>
>
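For reference, the reported shift exponent of 4294967295 is what a 32-bit unsigned code of 0 yields after the `ecode - 1` in KFD_EC_MASK. The sketch below reproduces that arithmetic without performing the undefined shift. The ex_* names are hypothetical, and it assumes the code is held in a 32-bit unsigned value. It also illustrates that a zero-only guard, as in the patch above, would still leave codes above 64 unguarded, since shifting a 64-bit value by 64 or more is undefined in C.

```c
#include <stdint.h>

/* Hypothetical helper: the exponent an unguarded mask would shift by. */
static uint32_t ex_shift_exponent(uint32_t ecode)
{
	return ecode - 1u; /* unsigned wrap-around when ecode == 0 */
}

/* Hypothetical helper: nonzero when 1ULL << (ecode - 1) is well defined. */
static int ex_shift_is_defined(uint32_t ecode)
{
	/* ecode == 0 wraps to a huge exponent, so it fails this test too. */
	return ex_shift_exponent(ecode) < 64u;
}

/* Mask that is empty for any code whose shift would be out of bounds. */
static uint64_t ex_ec_mask_checked(uint32_t ecode)
{
	return ex_shift_is_defined(ecode) ? (1ULL << (ecode - 1u)) : 0ULL;
}
```

Here `ex_shift_exponent(0)` evaluates to 4294967295, matching the UBSAN report, while `ex_ec_mask_checked` returns an empty mask for both 0 and 65.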