[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Upadhyay, Tejas tejas.upadhyay at intel.com
Thu Apr 11 05:12:18 UTC 2024



> -----Original Message-----
> From: Harrison, John C <john.c.harrison at intel.com>
> Sent: Thursday, April 11, 2024 12:52 AM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> xe at lists.freedesktop.org
> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> On 4/8/2024 05:00, Upadhyay, Tejas wrote:
> >> -----Original Message-----
> >> From: Upadhyay, Tejas
> >> Sent: Monday, April 8, 2024 10:54 AM
> >> To: Harrison, John C <john.c.harrison at intel.com>; Brost, Matthew
> >> <matthew.brost at intel.com>
> >> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> >> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> >> xe at lists.freedesktop.org
> >> Subject: RE: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
> >> reset domain aware
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Harrison, John C <john.c.harrison at intel.com>
> >>> Sent: Saturday, April 6, 2024 5:13 AM
> >>> To: Brost, Matthew <matthew.brost at intel.com>
> >>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> >>> <lucas.demarchi at intel.com>; Roper, Matthew D
> >>> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> >>> xe at lists.freedesktop.org
> >>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
> >>> reset domain aware
> >>>
> >>> On 4/5/2024 16:33, Matthew Brost wrote:
> >>>> On Fri, Apr 05, 2024 at 11:15:14AM -0700, John Harrison wrote:
> >>>>> On 4/4/2024 21:47, Upadhyay, Tejas wrote:
> >>>>>>> -----Original Message-----
> >>>>>>> From: Harrison, John C<john.c.harrison at intel.com>
> >>>>>>> Sent: Friday, April 5, 2024 4:53 AM
> >>>>>>> To: Upadhyay, Tejas<tejas.upadhyay at intel.com>; De Marchi, Lucas
> >>>>>>> <lucas.demarchi at intel.com>; Roper, Matthew D
> >>>>>>> <matthew.d.roper at intel.com>
> >>>>>>> Cc:igt-dev at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
> >>>>>>> Brost, Matthew<matthew.brost at intel.com>
> >>>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> >>>>>>> tests reset domain aware
> >>>>>>>
> >>>>>>> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: De Marchi, Lucas<lucas.demarchi at intel.com>
> >>>>>>>>> Sent: Wednesday, April 3, 2024 2:26 AM
> >>>>>>>>> To: Roper, Matthew D<matthew.d.roper at intel.com>
> >>>>>>>>> Cc: Upadhyay, Tejas<tejas.upadhyay at intel.com>; igt-
> >>>>>>>>> dev at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
> >>>>>>>>> Brost, Matthew<matthew.brost at intel.com>
> >>>>>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> >>>>>>>>> tests reset domain aware
> >>>>>>>>>
> >>>>>>>>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
> >>>>>>>>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
> >>>>>>>>>>> RCS/CCS are dependent engines as they share a reset domain.
> >>>>>>>>>>> Whenever there is a reset from CCS, all the exec queues
> >>>>>>>>>>> running on RCS are victimised, mainly on Lunarlake.
> >>>>>>>>>>>
> >>>>>>>>>>> Let's skip parallel execution on CCS with RCS.
> >>>>>>>>>> I haven't really looked at this specific test in detail, but
> >>>>>>>>>> based on your explanation here, you're also going to run into
> >>>>>>>>>> problems with multiple CCS engines since they all share the
> >>>>>>>>>> same reset.  You won't see that on platforms like LNL that
> >>>>>>>>>> only have a single CCS, but platforms
> >>>>>>>>> but it is seen on LNL because of having both RCS and CCS.
> >>>>>>>>>
> >>>>>>>>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a
> >>>>>>>>>> reset on one kills anything running on the others.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Matt
> >>>>>>>>>>
> >>>>>>>>>>> It helps in fixing following errors:
> >>>>>>>>>>> 1. Test assertion failure function test_legacy_mode, file,
> >>>>>>>>>>> Failed assertion: data[i].data == 0xc0ffee
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Test assertion failure function xe_exec, file
> >>>>>>>>>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec)
> >>>>>>>>>>> == 0,
> >>>>>>>>>>> error: -125 != 0
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Tejas Upadhyay<tejas.upadhyay at intel.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>      tests/intel/xe_exec_threads.c | 26 +++++++++++++-
> >>>>>>>>>>>      1 file changed, 25 insertions(+), 1 deletion(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/tests/intel/xe_exec_threads.c
> >>>>>>>>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9
> >>>>>>>>>>> 100644
> >>>>>>>>>>> --- a/tests/intel/xe_exec_threads.c
> >>>>>>>>>>> +++ b/tests/intel/xe_exec_threads.c
> >>>>>>>>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
> >>>>>>>>>>>      	return NULL;
> >>>>>>>>>>>      }
> >>>>>>>>>>>
> >>>>>>>>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int flags)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
> >>>>>>>>>>> +		return false;
> >>>>>>>>> as above, I don't think we should add any platform check here.
> >>>>>>>>> It's impossible to keep it up to date and it's also testing
> >>>>>>>>> the wrong thing.
> >>>>>>>>> AFAIU you don't want parallel submission on engines that share
> >>>>>>>>> the same reset domain. So, this is actually what should be tested.
> >>>>>>>> Platforms like  PVC, ATS-M, DG2, etc. have some kind of WA/noWA
> >>>>>>>> which
> >>>>>>>> helps to run things parallelly on engines in same reset domain
> >>>>>>>> and apparently BMG/LNL does not have that kind of support so
> >>>>>>>> applicable for LNL/BMG with parallel submission on RCS/CCS only.
> >>>>>>>> @Harrison, John C please reply if you have any other input here.
> >>>>>>> I don't get what you mean by 'have some kind of WA/noWA'. All
> >>>>>>> platforms with compute engines have shared reset domains. That
> >>>>>>> is all there is to it. I.e.
> >>>>>>> everything from TGL onwards. That includes RCS and all CCS engines.
> >>>>>>> So RCS + CCS, CCS0 + CCS1, RCS + CCS0 + CCS1, etc. Any platform
> >>>>>>> with multiple engines that talk to EUs will reset all of those
> >>>>>>> engines in parallel.
> >>>>>>> There are w/a's which make the situation even worse. E.g. on
> >>>>>>> DG2/MTL you are not allowed to context switch one of those
> >>>>>>> engines while another is busy.
> >>>>>>> Which means that if one hangs, they all hang - you cannot just
> >>>>>>> wait for other workloads to complete and/or pre-empt them off
> >>>>>>> the engine prior to doing the shared reset. But there is nothing
> >>>>>>> that makes it better.
> >>>>>>> I assume we are talking about GuC triggered engine resets here?
> >>>>>>> As opposed to driver triggered full GT resets?
> >>>>>>>
> >>>>>>> The GuC will attempt to idle all other connected engines first
> >>>>>>> by pre-empting out any executing contexts. If those contexts are
> >>>>>>> pre-emptible then they will survive - GuC will automatically
> >>>>>>> restart them once the reset is complete. If they are not (or at
> >>>>>>> least not pre-emptible within the pre-emption timeout
> >>>>>>> limit) then they will be killed as collateral damage.
> >>>>>>>
> >>>>>>> What are the workloads being submitted by this test? Are they
> >>>>>>> pre-emptible spinners? If so, then they should survive (assuming
> >>>>>>> you don't have the DG2/MTL RCS/CCS w/a in effect). If they are
> >>>>>>> non-preemptible spinners then they are toast.
> >>>>>> Main question here was, if this fix should be applied to all
> >>>>>> platforms that have both RCS and CCS, or just LNL/BMG. Reason to ask
> >>>>>> is, only LNL/BMG are hitting this issue; with the same tests, PVC and
> >>>>>> other platforms are not hitting the issue which we are addressing here.
> >>>>> And the answer is that yes, shared reset domains are common to all
> >>>>> platforms with compute engines. So if only LNL/BMG are failing
> >>>>> then the problem is not understood. Which is not helped by this
> >>>>> test code being extremely complex and having almost zero
> >>>>> explanation in it at all :(.
> >>>> Let me explain what this test is doing...
> >>>>
> >>>> - It creates a thread per hardware engine instance
> >>>> - Within each thread it creates many exec queues targeting 1 hardware
> >>>>     instance
> >>>> - It submits a bunch of batches which do dword writes on the exec queues
> >>>> - If the HANG flag is set, 1 of the exec queues per thread will insert a
> >>>>     non-preemptable spinner. It is expected the GuC will reset this exec
> >>>>     queue only. Cross CCS / RCS resets will break this as one of the
> >>>>     'good' exec queues from another thread could also be reset.
> >>> If the 'good' workloads are pre-emptible then they should not be reset.
> >>> The GuC will attempt to pre-empt all shared domain engines prior to
> >>> triggering any resets. If they are being killed then something is
> >>> broken and needs to be fixed.
> >>>
> >>>> - I think the HANG sections can fail on PVC too (VLK-57725)
> >>>> - This is racy, as if the resets occur when all the 'bad' exec queues
> >>>>     are running, the test will still work with cross CCS / RCS resets
> >>> Can this description be added to the test?
> >>>
> >>>> As the author of this test, I am fine with compute class just being
> >>>> skipped if the HANG flag is set. It is not testing individual engine
> >>>> instance resets (we had tests for that but they might be temporarily
> >>>> removed); rather it falls into the class of tests which try to do a
> >>>> bunch of things in parallel to stress the KMD.
> >>> Note that some platforms don't have RCS or media engines. Which
> >>> means only running on BCS engines. Is that sufficient coverage?
> >> If you look at this patch, we skip compute only if RCS is present,
> >> otherwise not. So far I don't see a failure when 2 compute instances
> >> happen to enter this race.
> >>
> > I can modify the test to skip in the case of 2 CCS and no RCS with the
> > HANG tests, even though the issue is not seen anywhere else, if the change
> > needs to be generic.
> My point is not to make random and unexplained changes to the test but to
> understand why the test is failing the way it is. So far, the explanation does
> not make sense.
> 
> See above about pre-emptible workloads should not be killed. LNL/BMG do
> not have the RCS/CCS workaround of DG2 that prevents pre-emptions and
> context switches while the other side is busy. So I am not seeing a reason why
> the test is failing. That needs to be explained before simply making it skip on
> those platforms.
> 
> John.
> 
> >
> >> Thanks,
> >> Tejas
> >>
> >>> And if this is not meant to be a reset test, why does it test resets
> >>> at all? If the concern is that we don't have a stress test involving
> >>> resets and this is the only coverage then it seems like we should
> >>> not be crippling it.
> >>> John.
> >>>
> >>>
> >>>> Matt
> >>>>
> >>>>> As noted, PVC has multiple compute engines but no RCS engine. If
> >>>>> any compute engine is reset then all are reset. So if the test is
> >>>>> running correctly and passing on PVC then it cannot be failing on
> >>>>> LNL/BMG purely due to shared domain resets.
> >>>>>
> >>>>> Is the reset not happening on PVC? Is the test not actually
> >>>>> running multiple contexts in parallel on PVC? Or are the spinners
> >>>>> pre-emptible and are therefore supposed to survive the reset of a
> >>>>> shared domain engine by being swapped out first? In which case
> >>>>> LNL/BMG are broken because the killed contexts are not supposed to
> >>>>> be killed even though the engine is reset?

Ok, I saw that in the test the spinners were marked as not preemptible, but then I made them preemptible with the change below, and I still see the failure when the CCS gets reset.

@@ -477,7 +477,7 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr, uint64_t userptr,
                uint64_t pad;
                uint32_t data;
        } *data;
-       struct xe_spin_opts spin_opts = { .preempt = false };
+       struct xe_spin_opts spin_opts = { .preempt = true };

Is this expected? If not, could this be a test, KMD, or GuC issue?
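
If the skip is still wanted, I could also rework the check along the lines suggested in the review, keying off shared reset domains rather than the platform. Below is a rough sketch only, not the posted patch; the helper name and the gt_id comparison are my assumptions:

static bool shares_reset_domain(int fd, struct drm_xe_engine_class_instance *hwe)
{
	struct drm_xe_engine_class_instance *other;

	/* Only RCS/CCS talk to the EUs and share a reset domain */
	if (hwe->engine_class != DRM_XE_ENGINE_CLASS_RENDER &&
	    hwe->engine_class != DRM_XE_ENGINE_CLASS_COMPUTE)
		return false;

	xe_for_each_engine(fd, other) {
		/* Assumption: reset domains do not span GTs */
		if (other->gt_id != hwe->gt_id)
			continue;

		/* Skip the engine itself */
		if (other->engine_class == hwe->engine_class &&
		    other->engine_instance == hwe->engine_instance)
			continue;

		/* Any other RCS/CCS on this GT shares the reset domain */
		if (other->engine_class == DRM_XE_ENGINE_CLASS_RENDER ||
		    other->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE)
			return true;
	}

	return false;
}

The skip in threads() would then be "if ((flags & HANG) && shares_reset_domain(fd, hwe)) continue;" so that RCS + CCS and multi-CCS configurations (PVC, ATS-M, DG2, etc.) are covered, not just LNL/BMG.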

Thanks,
Tejas

> >>>>> John.
> >>>>>
> >>>>>> Thanks,
> >>>>>> Tejas
> >>>>>>> John.
> >>>>>>>
> >>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Tejas
> >>>>>>>>> Lucas De Marchi
> >>>>>>>>>
> >>>>>>>>>>> +
> >>>>>>>>>>> +	if (flags & HANG)
> >>>>>>>>>>> +		return true;
> >>>>>>>>>>> +
> >>>>>>>>>>> +	return false;
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>>      /**
> >>>>>>>>>>>       * SUBTEST: threads-%s
> >>>>>>>>>>>       * Description: Run threads %arg[1] test with multi threads
> >>>>>>>>>>> @@ -955,9 +966,13 @@ static void threads(int fd, int flags)
> >>>>>>>>>>>      	bool go = false;
> >>>>>>>>>>>      	int n_threads = 0;
> >>>>>>>>>>>      	int gt;
> >>>>>>>>>>> +	bool has_rcs = false;
> >>>>>>>>>>>
> >>>>>>>>>>> -	xe_for_each_engine(fd, hwe)
> >>>>>>>>>>> +	xe_for_each_engine(fd, hwe) {
> >>>>>>>>>>> +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
> >>>>>>>>>>> +			has_rcs = true;
> >>>>>>>>>>>      		++n_engines;
> >>>>>>>>>>> +	}
> >>>>>>>>>>>
> >>>>>>>>>>>      	if (flags & BALANCER) {
> >>>>>>>>>>>      		xe_for_each_gt(fd, gt)
> >>>>>>>>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
> >>>>>>>>>>>      	}
> >>>>>>>>>>>
> >>>>>>>>>>>      	xe_for_each_engine(fd, hwe) {
> >>>>>>>>>>> +		/* RCS/CCS sharing reset domain hence dependent engines.
> >>>>>>>>>>> +		 * When CCS is doing reset, all the contexts of RCS are
> >>>>>>>>>>> +		 * victimized, so skip the compute engine avoiding
> >>>>>>>>>>> +		 * parallel execution with RCS
> >>>>>>>>>>> +		 */
> >>>>>>>>>>> +		if (has_rcs && hwe->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE &&
> >>>>>>>>>>> +		    is_engine_contexts_victimized(fd, flags))
> >>>>>>>>>>> +			continue;
> >>>>>>>>>>> +
> >>>>>>>>>>>      		threads_data[i].mutex = &mutex;
> >>>>>>>>>>>      		threads_data[i].cond = &cond;
> >>>>>>>>>>>      #define ADDRESS_SHIFT	39
> >>>>>>>>>>> --
> >>>>>>>>>>> 2.25.1
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Matt Roper
> >>>>>>>>>> Graphics Software Engineer
> >>>>>>>>>> Linux GPU Platform Enablement Intel Corporation


