[PATCH v3 2/2] drm/xe: Don't suspend device upon wedge

Tue Jul 16 21:41:24 UTC 2024

On Tue, Jul 16, 2024 at 03:26:03PM -0600, Cavitt, Jonathan wrote:
> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of Matthew Brost
> Sent: Monday, July 15, 2024 11:39 PM
> To: intel-xe at lists.freedesktop.org
> Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Subject: [PATCH v3 2/2] drm/xe: Don't suspend device upon wedge
> > 
> > When a device is wedged, we shouldn't suspend it, because the state
> > needed for debugging will be lost.
> > 
> > Suspending also appears not to work: the stack trace below shows up when
> > trying to resume a wedged device:
> > 
> > [  304.245044] INFO: task cat:12115 blocked for more than 151 seconds.
> > [  304.251333]       Tainted: G        W          6.10.0-rc7-xe+ #3518
> > [  304.257617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  304.265459] task:cat             state:D stack:13384 pid:12115 tgid:12115 ppid:3986   flags:0x00000006
> > [  304.265465] Call Trace:
> > [  304.265467]  <TASK>
> > [  304.265469]  __schedule+0x3c4/0xdf0
> > [  304.265478]  schedule+0x3c/0x140
> > [  304.265481]  rpm_resume+0x1cc/0x740
> > [  304.265484]  ? __pfx_autoremove_wake_function+0x10/0x10
> > [  304.265489]  __pm_runtime_resume+0x49/0x80
> > [  304.265494]  guc_info+0x6b/0xb0 [xe]
> > [  304.265538]  ? __pfx___drm_printfn_seq_file+0x10/0x10
> > [  304.265541]  ? __pfx___drm_puts_seq_file+0x10/0x10
> > [  304.265545]  seq_read_iter+0x111/0x4c0
> > [  304.265551]  seq_read+0xfc/0x140
> > [  304.265556]  full_proxy_read+0x58/0x80
> > [  304.265560]  vfs_read+0xa7/0x360
> > [  304.265563]  ? find_held_lock+0x2b/0x80
> > [  304.265568]  ksys_read+0x64/0xe0
> > [  304.265571]  do_syscall_64+0x68/0x140
> > [  304.265575]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [  304.265578] RIP: 0033:0x7f4254d14992
> > [  304.265580] RSP: 002b:00007ffc558666f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > [  304.265583] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f4254d14992
> > [  304.265584] RDX: 0000000000020000 RSI: 00007f4254ebb000 RDI: 0000000000000003
> > [  304.265586] RBP: 00007f4254ebb000 R08: 00007f4254eba010 R09: 00007f4254eba010
> > [  304.265587] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000022000
> > [  304.265588] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> > [  304.265593]  </TASK>
> > [  304.265594]
> >                Showing all locks held in the system:
> > [  304.265598] 1 lock held by khungtaskd/57:
> > [  304.265599]  #0: ffffffff8273b860 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x36/0x1c0
> > [  304.265607] 3 locks held by kworker/6:1/90:
> > [  304.265610] 1 lock held by in:imklog/547:
> > [  304.265611]  #0: ffff88810498cd88 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x76/0xc0
> > [  304.265620] 1 lock held by dmesg/1310:
> > 
> > Fixes: 8ed9aaae39f3 ("drm/xe: Force wedged state and block GT reset upon any GPU hang")
> > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 1e3d3a7e74d5..07aedbaf1821 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -893,6 +893,13 @@ u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address)
> >  	return address & GENMASK_ULL(xe->info.va_bits - 1, 0);
> >  }
> >  
> > +static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
> > +{
> > +	struct xe_device *xe = arg;
> > +
> > +	xe_pm_runtime_put(xe);
> > +}
> > +
> >  /**
> >   * xe_device_declare_wedged - Declare device wedged
> >   * @xe: xe device instance
> > @@ -911,12 +918,21 @@ void xe_device_declare_wedged(struct xe_device *xe)
> >  {
> >  	struct xe_gt *gt;
> >  	u8 id;
> > +	int err;
> >  
> >  	if (xe->wedged.mode == 0) {
> >  		drm_dbg(&xe->drm, "Wedged mode is forcibly disabled\n");
> >  		return;
> >  	}
> >  
> > +	err = drmm_add_action_or_reset(&xe->drm, xe_device_wedged_fini, xe);
> > +	if (err) {
> 
> If we aren't reporting the error value, we can probably just test the
> call's return value directly, which reduces the size of the change:
> 
> 	if (drmm_add_action_or_reset(&xe->drm, xe_device_wedged_fini, xe)) {
> 
> Either that, or we should report the error value as a part
> of the drm_err report.
> 
> The current implementation is still good, however, and this
> is just a suggestion.  I won't block on this.
> 

Good suggestion, will change.

Matt

> Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> -Jonathan Cavitt
> 
> > +		drm_err(&xe->drm, "Failed to register xe_device_wedged_fini clean-up. Although device is wedged.\n");
> > +		return;
> > +	}
> > +
> > +	xe_pm_runtime_get_noresume(xe);
> > +
> >  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
> >  		xe->needs_flr_on_fini = true;
> >  		drm_err(&xe->drm,
> > -- 
> > 2.34.1
> > 
> >