[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Upadhyay, Tejas tejas.upadhyay at intel.com
Mon Apr 8 05:23:31 UTC 2024



> -----Original Message-----
> From: Harrison, John C <john.c.harrison at intel.com>
> Sent: Saturday, April 6, 2024 5:13 AM
> To: Brost, Matthew <matthew.brost at intel.com>
> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> <lucas.demarchi at intel.com>; Roper, Matthew D
> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> xe at lists.freedesktop.org
> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> On 4/5/2024 16:33, Matthew Brost wrote:
> > On Fri, Apr 05, 2024 at 11:15:14AM -0700, John Harrison wrote:
> >> On 4/4/2024 21:47, Upadhyay, Tejas wrote:
> >>>> -----Original Message-----
> >>>> From: Harrison, John C <john.c.harrison at intel.com>
> >>>> Sent: Friday, April 5, 2024 4:53 AM
> >>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> >>>> <lucas.demarchi at intel.com>; Roper, Matthew D
> >>>> <matthew.d.roper at intel.com>
> >>>> Cc: igt-dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
> >>>> Brost, Matthew <matthew.brost at intel.com>
> >>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> >>>> tests reset domain aware
> >>>>
> >>>> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
> >>>>>> -----Original Message-----
> >>>>>> From: De Marchi, Lucas <lucas.demarchi at intel.com>
> >>>>>> Sent: Wednesday, April 3, 2024 2:26 AM
> >>>>>> To: Roper, Matthew D <matthew.d.roper at intel.com>
> >>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>;
> >>>>>> igt-dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
> >>>>>> Brost, Matthew <matthew.brost at intel.com>
> >>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> >>>>>> tests reset domain aware
> >>>>>>
> >>>>>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
> >>>>>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
> >>>>>>>> RCS/CCS are dependent engines as they share a reset domain.
> >>>>>>>> Whenever there is a reset from CCS, all the exec queues running
> >>>>>>>> on RCS are victimized, mainly on Lunar Lake.
> >>>>>>>>
> >>>>>>>> Let's skip parallel execution on CCS with RCS.
> >>>>>>> I haven't really looked at this specific test in detail, but
> >>>>>>> based on your explanation here, you're also going to run into
> >>>>>>> problems with multiple CCS engines since they all share the same
> >>>>>>> reset.  You won't see that on platforms like LNL that only have
> >>>>>>> a single CCS, but platforms
> >>>>>> but it is seen on LNL because of having both RCS and CCS.
> >>>>>>
> >>>>>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a
> >>>>>>> reset on one kills anything running on the others.
> >>>>>>>
> >>>>>>>
> >>>>>>> Matt
> >>>>>>>
> >>>>>>>> It helps in fixing the following errors:
> >>>>>>>> 1. Test assertion failure function test_legacy_mode, file,
> >>>>>>>> Failed assertion: data[i].data == 0xc0ffee
> >>>>>>>>
> >>>>>>>> 2. Test assertion failure function xe_exec, file
> >>>>>>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec) == 0,
> >>>>>>>> error: -125 != 0
> >>>>>>>>
> >>>>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>>>>>>> ---
> >>>>>>>>     tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
> >>>>>>>>     1 file changed, 25 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/tests/intel/xe_exec_threads.c
> >>>>>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9
> >>>>>>>> 100644
> >>>>>>>> --- a/tests/intel/xe_exec_threads.c
> >>>>>>>> +++ b/tests/intel/xe_exec_threads.c
> >>>>>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
> >>>>>>>>     	return NULL;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int flags)
> >>>>>>>> +{
> >>>>>>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
> >>>>>>>> +		return false;
> >>>>>> As above, I don't think we should add any platform check here.
> >>>>>> It's impossible to keep it up to date and it's also testing the
> >>>>>> wrong thing. AFAIU you don't want parallel submission on engines
> >>>>>> that share the same reset domain, so that is what should actually
> >>>>>> be tested.
> >>>>> Platforms like PVC, ATS-M, DG2, etc. have some kind of workaround
> >>>>> that helps run things in parallel on engines in the same reset
> >>>>> domain, and apparently BMG/LNL does not have that kind of support,
> >>>>> so this is applicable only to LNL/BMG with parallel submission on
> >>>>> RCS/CCS.
> >>>>> @Harrison, John C please reply if you have any other input here.
> >>>> I don't get what you mean by 'have some kind of WA/noWA'. All
> >>>> platforms with compute engines have shared reset domains. That is
> >>>> all there is to it, i.e. everything from TGL onwards. That includes
> >>>> RCS and all CCS engines. So RCS + CCS, CCS0 + CCS1, RCS + CCS0 +
> >>>> CCS1, etc. Any platform with multiple engines that talk to EUs will
> >>>> reset all of those engines in parallel.
> >>>>
> >>>> There are w/a's which make the situation even worse. E.g. on
> >>>> DG2/MTL you are not allowed to context switch one of those engines
> >>>> while another is busy. Which means that if one hangs, they all hang
> >>>> - you cannot just wait for other workloads to complete and/or
> >>>> pre-empt them off the engine prior to doing the shared reset. But
> >>>> there is nothing that makes it better.
> >>>>
> >>>> I assume we are talking about GuC triggered engine resets here? As
> >>>> opposed to driver triggered full GT resets?
> >>>>
> >>>> The GuC will attempt to idle all other connected engines first by
> >>>> pre-empting out any executing contexts. If those contexts are
> >>>> pre-emptible then they will survive - GuC will automatically
> >>>> restart them once the reset is complete. If they are not (or at
> >>>> least not pre-emptible within the pre-emption timeout
> >>>> limit) then they will be killed as collateral damage.
> >>>>
> >>>> What are the workloads being submitted by this test? Are they
> >>>> pre-emptible spinners? If so, then they should survive (assuming
> >>>> you don't have the DG2/MTL RCS/CCS w/a in effect). If they are
> >>>> non-preemptible spinners then they are toast.
> >>> The main question here was whether this fix should be applied to all
> >>> platforms that have both RCS and CCS, or just LNL/BMG. The reason to
> >>> ask is that only LNL/BMG are hitting this issue; with the same tests,
> >>> PVC and other platforms are not hitting the issue we are addressing
> >>> here.
> >> And the answer is that yes, shared reset domains are common to all
> >> platforms with compute engines. So if only LNL/BMG are failing then
> >> the problem is not understood. Which is not helped by this test code
> >> being extremely complex and having almost zero explanation in it at all :(.
> >>
> > Let me explain what this test is doing...
> >
> > - It creates a thread per hardware engine instance
> > - Within each thread it creates many exec queues targeting 1 hardware
> >    instance
> > - It submits a bunch of batches which do a dword write via the exec
> >    queues
> > - If the HANG flag is set, one exec queue per thread will insert a
> >    non-preemptable spinner. It is expected the GuC will reset this exec
> >    queue only. Cross CCS / RCS resets will break this as one of the
> >    'good' exec queues from another thread could also be reset.
> If the 'good' workloads are pre-emptible then they should not be reset.
> The GuC will attempt to pre-empt all shared domain engines prior to
> triggering any resets. If they are being killed then something is broken and
> needs to be fixed.
> 
> > - I think the HANG sections can fail on PVC too (VLK-57725)
> > - This is racy: if the resets occur when all the 'bad' exec queues
> >    are running, the test will still pass with cross CCS / RCS resets
> Can this description be added to the test?
> 
> > As the author of this test, I am fine with the compute class just being
> > skipped if the HANG flag is set. It is not testing individual engine
> > instance resets (we had tests for that but they might be temporarily
> > removed); rather it falls into the class of tests which try to do a
> > bunch of things in parallel to stress the KMD.
> Note that some platforms don't have RCS or media engines, which means
> only running on BCS engines. Is that sufficient coverage?

If you look at this patch, we skip compute engines only if RCS is present, otherwise not. So far I have not seen a failure when two compute instances happen to enter this race.

Thanks,
Tejas

> 
> And if this is not meant to be a reset test, why does it test resets at all? If
> the concern is that we don't have a stress test involving resets and this is
> the only coverage, then it seems like we should not be crippling it.
> 
> John.
> 
> 
> >
> > Matt
> >
> >> As noted, PVC has multiple compute engines but no RCS engine. If any
> >> compute engine is reset then all are reset. So if the test is running
> >> correctly and passing on PVC then it cannot be failing on LNL/BMG
> >> purely due to shared domain resets.
> >>
> >> Is the reset not happening on PVC? Is the test not actually running
> >> multiple contexts in parallel on PVC? Or are the spinners
> >> pre-emptible and therefore supposed to survive the reset of a
> >> shared domain engine by being swapped out first? In which case
> >> LNL/BMG are broken because the killed contexts are not supposed to
> >> be killed even though the engine is reset?
> >>
> >> John.
> >>
> >>> Thanks,
> >>> Tejas
> >>>> John.
> >>>>
> >>>>
> >>>>> Thanks,
> >>>>> Tejas
> >>>>>> Lucas De Marchi
> >>>>>>
> >>>>>>>> +
> >>>>>>>> +	if (flags & HANG)
> >>>>>>>> +		return true;
> >>>>>>>> +
> >>>>>>>> +	return false;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>     /**
> >>>>>>>>      * SUBTEST: threads-%s
> >>>>>>>>      * Description: Run threads %arg[1] test with multi threads
> >>>>>>>> @@ -955,9 +966,13 @@ static void threads(int fd, int flags)
> >>>>>>>>     	bool go = false;
> >>>>>>>>     	int n_threads = 0;
> >>>>>>>>     	int gt;
> >>>>>>>> +	bool has_rcs = false;
> >>>>>>>>
> >>>>>>>> -	xe_for_each_engine(fd, hwe)
> >>>>>>>> +	xe_for_each_engine(fd, hwe) {
> >>>>>>>> +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
> >>>>>>>> +			has_rcs = true;
> >>>>>>>>     		++n_engines;
> >>>>>>>> +	}
> >>>>>>>>
> >>>>>>>>     	if (flags & BALANCER) {
> >>>>>>>>     		xe_for_each_gt(fd, gt)
> >>>>>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
> >>>>>>>>     	}
> >>>>>>>>
> >>>>>>>>     	xe_for_each_engine(fd, hwe) {
> >>>>>>>> +		/* RCS/CCS share a reset domain, hence dependent engines.
> >>>>>>>> +		 * When CCS is doing a reset, all the contexts on RCS are
> >>>>>>>> +		 * victimized, so skip the compute engine, avoiding
> >>>>>>>> +		 * parallel execution with RCS.
> >>>>>>>> +		 */
> >>>>>>>> +		if (has_rcs && hwe->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE &&
> >>>>>>>> +		    is_engine_contexts_victimized(fd, flags))
> >>>>>>>> +			continue;
> >>>>>>>> +
> >>>>>>>>     		threads_data[i].mutex = &mutex;
> >>>>>>>>     		threads_data[i].cond = &cond;
> >>>>>>>>     #define ADDRESS_SHIFT	39
> >>>>>>>> --
> >>>>>>>> 2.25.1
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Matt Roper
> >>>>>>> Graphics Software Engineer
> >>>>>>> Linux GPU Platform Enablement
> >>>>>>> Intel Corporation