[PATCH 3/3] drm/xe: Force wedged state and block GT reset upon any GPU hang

Teres Alexis, Alan Previn alan.previn.teres.alexis at intel.com
Fri Mar 15 06:28:50 UTC 2024


alan:snip

On Thu, 2024-03-14 at 21:06 -0400, Rodrigo Vivi wrote:
> On Wed, Mar 13, 2024 at 11:00:06PM -0500, Lucas De Marchi wrote:
> > On Wed, Mar 13, 2024 at 06:06:14PM -0400, Rodrigo Vivi wrote:
> > > On Wed, Mar 13, 2024 at 04:54:38PM -0500, Lucas De Marchi wrote:
> > > > On Wed, Mar 13, 2024 at 05:44:00PM -0400, Rodrigo Vivi wrote:
> > > > > > On Wed, Mar 13, 2024 at 03:49:56PM -0500, Lucas De Marchi wrote:
> > > > > > > On Wed, Mar 13, 2024 at 03:54:59PM -0400, Rodrigo Vivi wrote:
> > > > > > 
> > 
> > I think we can use the modparam on probe and already put it in ads.
> > That dictates the default behavior for the _module_ regardless of
> > the device.
> 
> agreed. I already sent the 3 patches that accomplished that.
alan: personal opinion - we really ought to have runtime controls
(i.e. debugfs) for the case where an integrated + discrete combination
needs to be debugged.
> 
> > Then we allow either setting the param to change the default behavior
> > like above or we create a debugfs so we can set it per-device after
> > the probe.
> 
> I have the 4th patch in here:
> https://github.com/rodrigovivi/linux/commits/xe-busted
> that is targeting this goal. However I'm still dealing with trying to
> change the guc sched policy on the fly.
> 
> I'm not convinced that i915 code around that 0x506 command is the
> right code, so I'm still investigating the spec and doing some
> experiments.
> 
> But I'd like to move forward with this default behavior with module
> parameter so we unblock our sv teams.
> 
> thoughts?
alan: IIRC guc has preemption timing per context, and it can be changed
at runtime (but it may only get updated the next time the context is
scheduled onto the engine?).

Btw, I haven't had time to go through all the patches on the above
github thoroughly, but based on this series, I don't see us also
preventing the runtime guc/gt-reset (which the targeted use-case also
needs to avoid). A few functions seem to be involved in this runtime
guc/gt-reset (when guc fails to reset an engine), but we must be careful
not to also block the guc/gt-resets for the post-hw-config readup after
the early stage guc load. Basically we need to find the paths that write
to GDRST and block them (except for that early boot, and also
suspend-resume and shutdown).
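
A rough sketch of the gating described above, only to illustrate the idea.
This is not xe driver code: every name here (reset_reason, fake_device,
wedged_on_hang, gt_reset_allowed) is hypothetical, and where the real
GDRST write paths live still has to be worked out.

```c
#include <stdbool.h>

/* Hypothetical reasons for requesting a GT reset (a GDRST write). */
enum reset_reason {
	RESET_EARLY_PROBE,	/* post-hw-config readup after early guc load */
	RESET_SUSPEND_RESUME,
	RESET_SHUTDOWN,
	RESET_RUNTIME_HANG,	/* guc failed to reset an engine at runtime */
};

/* Hypothetical per-device state; stands in for a modparam/debugfs knob. */
struct fake_device {
	bool wedged_on_hang;	/* policy: wedge instead of resetting on a hang */
	bool wedged;		/* set once the device has been declared wedged */
};

/* Returns true if a GDRST write may proceed for the given reason. */
static bool gt_reset_allowed(struct fake_device *dev, enum reset_reason reason)
{
	switch (reason) {
	case RESET_EARLY_PROBE:
	case RESET_SUSPEND_RESUME:
	case RESET_SHUTDOWN:
		/* These resets are required for normal operation; never block. */
		return true;
	case RESET_RUNTIME_HANG:
		if (dev->wedged_on_hang) {
			/* Preserve the hung state for debug: wedge, do not reset. */
			dev->wedged = true;
			return false;
		}
		return true;
	}
	return false;
}
```

The point of a single predicate is that every path that ends up writing
GDRST would have to consult it, so the runtime case cannot slip through
while the boot, suspend-resume and shutdown cases keep working.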


