[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Upadhyay, Tejas tejas.upadhyay at intel.com
Mon Apr 8 12:00:05 UTC 2024



> -----Original Message-----
> From: Upadhyay, Tejas
> Sent: Monday, April 8, 2024 10:54 AM
> To: Harrison, John C <john.c.harrison at intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> xe at lists.freedesktop.org
> Subject: RE: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> 
> 
> > -----Original Message-----
> > From: Harrison, John C <john.c.harrison at intel.com>
> > Sent: Saturday, April 6, 2024 5:13 AM
> > To: Brost, Matthew <matthew.brost at intel.com>
> > Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> > <lucas.demarchi at intel.com>; Roper, Matthew D
> > <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> > xe at lists.freedesktop.org
> > Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
> > reset domain aware
> >
> > On 4/5/2024 16:33, Matthew Brost wrote:
> > > On Fri, Apr 05, 2024 at 11:15:14AM -0700, John Harrison wrote:
> > >> On 4/4/2024 21:47, Upadhyay, Tejas wrote:
> > >>>> -----Original Message-----
> > >>>> From: Harrison, John C<john.c.harrison at intel.com>
> > >>>> Sent: Friday, April 5, 2024 4:53 AM
> > >>>> To: Upadhyay, Tejas<tejas.upadhyay at intel.com>; De Marchi, Lucas
> > >>>> <lucas.demarchi at intel.com>; Roper, Matthew D
> > >>>> <matthew.d.roper at intel.com>
> > >>>> Cc:igt-dev at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
> > >>>> Brost, Matthew<matthew.brost at intel.com>
> > >>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> > >>>> tests reset domain aware
> > >>>>
> > >>>> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
> > >>>>>> -----Original Message-----
> > >>>>>> From: De Marchi, Lucas<lucas.demarchi at intel.com>
> > >>>>>> Sent: Wednesday, April 3, 2024 2:26 AM
> > >>>>>> To: Roper, Matthew D<matthew.d.roper at intel.com>
> > >>>>>> Cc: Upadhyay, Tejas<tejas.upadhyay at intel.com>; igt-
> > >>>>>> dev at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
> > >>>>>> Brost, Matthew<matthew.brost at intel.com>
> > >>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> > >>>>>> tests reset domain aware
> > >>>>>>
> > >>>>>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
> > >>>>>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
> > >>>>>>>> RCS and CCS are dependent engines because they share a reset
> > >>>>>>>> domain. Whenever there is a reset from CCS, all the exec queues
> > >>>>>>>> running on RCS are victimized, mainly on Lunarlake.
> > >>>>>>>>
> > >>>>>>>> Let's skip parallel execution on CCS with RCS.
> > >>>>>>> I haven't really looked at this specific test in detail, but
> > >>>>>>> based on your explanation here, you're also going to run into
> > >>>>>>> problems with multiple CCS engines since they all share the
> > >>>>>>> same reset.  You won't see that on platforms like LNL that
> > >>>>>>> only have a single CCS, but platforms
> > >>>>>> but it is seen on LNL because of having both RCS and CCS.
> > >>>>>>
> > >>>>>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a
> > >>>>>>> reset on one kills anything running on the others.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Matt
> > >>>>>>>
> > >>>>>>>> It helps in fixing the following errors:
> > >>>>>>>> 1. Test assertion failure function test_legacy_mode, file,
> > >>>>>>>> Failed assertion: data[i].data == 0xc0ffee
> > >>>>>>>>
> > >>>>>>>> 2. Test assertion failure function xe_exec, file
> > >>>>>>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec) == 0,
> > >>>>>>>> error: -125 != 0
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Tejas Upadhyay<tejas.upadhyay at intel.com>
> > >>>>>>>> ---
> > >>>>>>>>     tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
> > >>>>>>>>     1 file changed, 25 insertions(+), 1 deletion(-)
> > >>>>>>>>
> > >>>>>>>> diff --git a/tests/intel/xe_exec_threads.c
> > >>>>>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9
> > >>>>>>>> 100644
> > >>>>>>>> --- a/tests/intel/xe_exec_threads.c
> > >>>>>>>> +++ b/tests/intel/xe_exec_threads.c
> > >>>>>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
> > >>>>>>>>     	return NULL;
> > >>>>>>>>     }
> > >>>>>>>>
> > >>>>>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int flags)
> > >>>>>>>> +{
> > >>>>>>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
> > >>>>>>>> +		return false;
> > >>>>>> as above, I don't think we should add any platform check here.
> > >>>>>> It's impossible to keep it up to date and it's also testing the
> > >>>>>> wrong thing.
> > >>>>>> AFAIU you don't want parallel submission on engines that share
> > >>>>>> the same reset domain. So, this is actually what should be tested.
> > >>>>> Platforms like PVC, ATS-M, DG2, etc. have some kind of WA (or
> > >>>>> absence of one) which helps run things in parallel on engines in
> > >>>>> the same reset domain, and apparently BMG/LNL do not have that
> > >>>>> kind of support, so this is applicable only to LNL/BMG with
> > >>>>> parallel submission on RCS/CCS.
> > >>>>> @Harrison, John C please reply if you have any other input here.
> > >>>> I don't get what you mean by 'have some kind of WA/noWA'. All
> > >>>> platforms with compute engines have shared reset domains. That is
> > >>>> all there is to it, i.e. everything from TGL onwards. That includes
> > >>>> RCS and all CCS engines. So RCS + CCS, CCS0 + CCS1, RCS + CCS0 +
> > >>>> CCS1, etc. Any platform with multiple engines that talk to EUs will
> > >>>> reset all of those engines in parallel.
> > >>>>
> > >>>> There are w/a's which make the situation even worse. E.g. on
> > >>>> DG2/MTL you are not allowed to context switch one of those engines
> > >>>> while another is busy. Which means that if one hangs, they all hang
> > >>>> - you cannot just wait for other workloads to complete and/or
> > >>>> pre-empt them off the engine prior to doing the shared reset. But
> > >>>> there is nothing that makes it better.
> > >>>>
> > >>>> I assume we are talking about GuC triggered engine resets here?
> > >>>> As opposed to driver triggered full GT resets?
> > >>>>
> > >>>> The GuC will attempt to idle all other connected engines first by
> > >>>> pre-empting out any executing contexts. If those contexts are
> > >>>> pre-emptible then they will survive - GuC will automatically
> > >>>> restart them once the reset is complete. If they are not (or at
> > >>>> least not pre-emptible within the pre-emption timeout
> > >>>> limit) then they will be killed as collateral damage.
> > >>>>
> > >>>> What are the workloads being submitted by this test? Are they
> > >>>> pre-emptible spinners? If so, then they should survive (assuming
> > >>>> you don't have the DG2/MTL RCS/CCS w/a in effect). If they are
> > >>>> non-preemptible spinners then they are toast.
> > >>> The main question here was whether this fix should be applied to
> > >>> all platforms that have both RCS and CCS, or just to LNL/BMG. The
> > >>> reason to ask is that only LNL/BMG are hitting this issue; with the
> > >>> same tests, PVC and other platforms are not hitting the issue we
> > >>> are addressing here.
> > >> And the answer is that yes, shared reset domains are common to all
> > >> platforms with compute engines. So if only LNL/BMG are failing then
> > >> the problem is not understood. Which is not helped by this test code
> > >> being extremely complex and having almost zero explanation in it at
> > >> all :(.
> > >>
> > > Let me explain what this test is doing...
> > >
> > > - It creates a thread per hardware engine instance
> > > - Within each thread it creates many exec queues targeting 1 hardware
> > >    instance
> > > - It submits a bunch of batches which do a dword write on the exec
> > >    queues
> > > - If the HANG flag is set, one exec queue per thread will insert a
> > >    non-preemptable spinner. It is expected the GuC will reset this exec
> > >    queue only. Cross CCS / RCS resets will break this, as one of the
> > >    'good' exec queues from another thread could also be reset.
> > If the 'good' workloads are pre-emptible then they should not be reset.
> > The GuC will attempt to pre-empt all shared domain engines prior to
> > triggering any resets. If they are being killed then something is
> > broken and needs to be fixed.
> >
> > > - I think the HANG sections can fail on PVC too (VLK-57725)
> > > - This is racy: if the resets occur when all the 'bad' exec queues
> > >    are running, the test will still pass even with cross CCS / RCS resets
> > Can this description be added to the test?
> >
> > > As the author of this test, I am fine with the compute class just being
> > > skipped if the HANG flag is set. It is not testing individual engine
> > > instance resets (we had tests for that but they might be temporarily
> > > removed); rather it falls into the class of tests that try to do a
> > > bunch of things in parallel to stress the KMD.
> > Note that some platforms don't have RCS or media engines. Which means
> > only running on BCS engines. Is that sufficient coverage?
> 
> If you look at this patch, we skip compute only if RCS is present, otherwise not.
> So far I don't see a failure when two compute instances happen to enter this race.
> 

I can modify the test so that HANG tests are also skipped in the case of two CCS engines and no RCS, even though the issue has not been seen anywhere else, if the change needs to be generic.
 
> Thanks,
> Tejas
> 
> >
> > And if this is not meant to be a reset test, why does it test resets at
> > all? If the concern is that we don't have a stress test involving
> > resets and this is the only coverage, then it seems like we should not
> > be crippling it.
> >
> > John.
> >
> >
> > >
> > > Matt
> > >
> > >> As noted, PVC has multiple compute engines but no RCS engine. If
> > >> any compute engine is reset then all are reset. So if the test is
> > >> running correctly and passing on PVC then it cannot be failing on
> > >> LNL/BMG purely due to shared domain resets.
> > >>
> > >> Is the reset not happening on PVC? Is the test not actually running
> > >> multiple contexts in parallel on PVC? Or are the spinners
> > >> pre-emptible and therefore supposed to survive the reset of a shared
> > >> domain engine by being swapped out first? In which case LNL/BMG are
> > >> broken because the killed contexts are not supposed to be killed
> > >> even though the engine is reset?
> > >>
> > >> John.
> > >>
> > >>> Thanks,
> > >>> Tejas
> > >>>> John.
> > >>>>
> > >>>>
> > >>>>> Thanks,
> > >>>>> Tejas
> > >>>>>> Lucas De Marchi
> > >>>>>>
> > >>>>>>>> +
> > >>>>>>>> +	if (flags & HANG)
> > >>>>>>>> +		return true;
> > >>>>>>>> +
> > >>>>>>>> +	return false;
> > >>>>>>>> +}
> > >>>>>>>> +
> > >>>>>>>>     /**
> > >>>>>>>>      * SUBTEST: threads-%s
> > >>>>>>>>      * Description: Run threads %arg[1] test with multi threads
> > >>>>>>>> @@ -955,9 +966,13 @@ static void threads(int fd, int flags)
> > >>>>>>>>     	bool go = false;
> > >>>>>>>>     	int n_threads = 0;
> > >>>>>>>>     	int gt;
> > >>>>>>>> +	bool has_rcs = false;
> > >>>>>>>>
> > >>>>>>>> -	xe_for_each_engine(fd, hwe)
> > >>>>>>>> +	xe_for_each_engine(fd, hwe) {
> > >>>>>>>> +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
> > >>>>>>>> +			has_rcs = true;
> > >>>>>>>>     		++n_engines;
> > >>>>>>>> +	}
> > >>>>>>>>
> > >>>>>>>>     	if (flags & BALANCER) {
> > >>>>>>>>     		xe_for_each_gt(fd, gt)
> > >>>>>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
> > >>>>>>>>     	}
> > >>>>>>>>
> > >>>>>>>>     	xe_for_each_engine(fd, hwe) {
> > >>>>>>>> +		/* RCS/CCS sharing reset domain hence dependent engines.
> > >>>>>>>> +		 * When CCS is doing reset, all the contexts of RCS are
> > >>>>>>>> +		 * victimized, so skip the compute engine avoiding
> > >>>>>>>> +		 * parallel execution with RCS
> > >>>>>>>> +		 */
> > >>>>>>>> +		if (has_rcs && hwe->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE &&
> > >>>>>>>> +		    is_engine_contexts_victimized(fd, flags))
> > >>>>>>>> +			continue;
> > >>>>>>>> +
> > >>>>>>>>     		threads_data[i].mutex = &mutex;
> > >>>>>>>>     		threads_data[i].cond = &cond;
> > >>>>>>>>     #define ADDRESS_SHIFT	39
> > >>>>>>>> --
> > >>>>>>>> 2.25.1
> > >>>>>>>>
> > >>>>>>> --
> > >>>>>>> Matt Roper
> > >>>>>>> Graphics Software Engineer
> > >>>>>>> Linux GPU Platform Enablement
> > >>>>>>> Intel Corporation


