[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Thu Apr 11 05:37:45 UTC 2024


> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> Upadhyay, Tejas
> Sent: Thursday, April 11, 2024 10:42 AM
> To: Harrison, John C <john.c.harrison at intel.com>; Brost, Matthew
> <matthew.brost at intel.com>
> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> xe at lists.freedesktop.org
> Subject: RE: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> 
> 
> > -----Original Message-----
> > From: Harrison, John C <john.c.harrison at intel.com>
> > Sent: Thursday, April 11, 2024 12:52 AM
> > To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; Brost, Matthew
> > <matthew.brost at intel.com>
> > Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> > <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> > xe at lists.freedesktop.org
> > Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
> > reset domain aware
> >
> > On 4/8/2024 05:00, Upadhyay, Tejas wrote:
> > >> -----Original Message-----
> > >> From: Upadhyay, Tejas
> > >> Sent: Monday, April 8, 2024 10:54 AM
> > >> To: Harrison, John C <john.c.harrison at intel.com>; Brost, Matthew
> > >> <matthew.brost at intel.com>
> > >> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> > >> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> > >> xe at lists.freedesktop.org
> > >> Subject: RE: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> > >> tests reset domain aware
> > >>
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Harrison, John C <john.c.harrison at intel.com>
> > >>> Sent: Saturday, April 6, 2024 5:13 AM
> > >>> To: Brost, Matthew <matthew.brost at intel.com>
> > >>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> > >>> <lucas.demarchi at intel.com>; Roper, Matthew D
> > >>> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org; intel-
> > >>> xe at lists.freedesktop.org
> > >>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> > >>> tests reset domain aware
> > >>>
> > >>> On 4/5/2024 16:33, Matthew Brost wrote:
> > >>>> On Fri, Apr 05, 2024 at 11:15:14AM -0700, John Harrison wrote:
> > >>>>> On 4/4/2024 21:47, Upadhyay, Tejas wrote:
> > >>>>>>> -----Original Message-----
> > >>>>>>> From: Harrison, John C<john.c.harrison at intel.com>
> > >>>>>>> Sent: Friday, April 5, 2024 4:53 AM
> > >>>>>>> To: Upadhyay, Tejas<tejas.upadhyay at intel.com>; De Marchi,
> > >>>>>>> Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
> > >>>>>>> <matthew.d.roper at intel.com>
> > >>>>>>> Cc:igt-dev at lists.freedesktop.org;intel-xe at lists.freedesktop.or
> > >>>>>>> g; Brost, Matthew<matthew.brost at intel.com>
> > >>>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
> > >>>>>>> tests reset domain aware
> > >>>>>>>
> > >>>>>>> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
> > >>>>>>>>> -----Original Message-----
> > >>>>>>>>> From: De Marchi, Lucas<lucas.demarchi at intel.com>
> > >>>>>>>>> Sent: Wednesday, April 3, 2024 2:26 AM
> > >>>>>>>>> To: Roper, Matthew D<matthew.d.roper at intel.com>
> > >>>>>>>>> Cc: Upadhyay, Tejas<tejas.upadhyay at intel.com>; igt-
> > >>>>>>>>> dev at lists.freedesktop.org;intel-xe at lists.freedesktop.org;
> > >>>>>>>>> Brost, Matthew<matthew.brost at intel.com>
> > >>>>>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make
> > >>>>>>>>> hang tests reset domain aware
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
> > >>>>>>>>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay
> > wrote:
> > >>>>>>>>>>> RCS/CCS are dependent engines as they are sharing reset
> > domain.
> > >>>>>>>>>>> Whenever there is reset from CCS, all the exec queues
> > >>>>>>>>>>> running on RCS are victimised mainly on Lunarlake.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Lets skip parallel execution on CCS with RCS.
> > >>>>>>>>>> I haven't really looked at this specific test in detail,
> > >>>>>>>>>> but based on your explanation here, you're also going to
> > >>>>>>>>>> run into problems with multiple CCS engines since they all
> > >>>>>>>>>> share the same reset.  You won't see that on platforms like
> > >>>>>>>>>> LNL that only have a single CCS, but platforms
> > >>>>>>>>> but it is seen on LNL because of having both RCS and CCS.
> > >>>>>>>>>
> > >>>>>>>>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where
> > >>>>>>>>>> a reset on one kills anything running on the others.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Matt
> > >>>>>>>>>>
> > >>>>>>>>>>> It helps in fixing following errors:
> > >>>>>>>>>>> 1. Test assertion failure function test_legacy_mode, file,
> > >>>>>>>>>>> Failed
> > >>>>>>>>>>> assertion: data[i].data == 0xc0ffee
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2.Test assertion failure function xe_exec, file
> > >>>>>>>>>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd,
> > >>>>>>>>>>> exec) == 0,
> > >>>>>>>>>>> error: -125 != 0
> > >>>>>>>>>>>
> > >>>>>>>>>>> Signed-off-by: Tejas Upadhyay<tejas.upadhyay at intel.com>
> > >>>>>>>>>>> ---
> > >>>>>>>>>>>      tests/intel/xe_exec_threads.c | 26
> > >> +++++++++++++++++++++++++-
> > >>>>>>>>>>>      1 file changed, 25 insertions(+), 1 deletion(-)
> > >>>>>>>>>>>
> > >>>>>>>>>>> diff --git a/tests/intel/xe_exec_threads.c
> > >>>>>>>>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9
> > >>>>>>>>>>> 100644
> > >>>>>>>>>>> --- a/tests/intel/xe_exec_threads.c
> > >>>>>>>>>>> +++ b/tests/intel/xe_exec_threads.c
> > >>>>>>>>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
> > >>>>>>>>>>>      	return NULL;
> > >>>>>>>>>>>      }
> > >>>>>>>>>>>
> > >>>>>>>>>>> +static bool is_engine_contexts_victimized(int fd,
> > >>>>>>>>>>> +unsigned int
> > >>>>>>>>>>> +flags) {
> > >>>>>>>>>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
> > >>>>>>>>>>> +		return false;
> > >>>>>>>>> as above, I don't think we should add any platform check here.
> > >>>>>>>>> It's impossible to keep it up to date and it's also testing
> > >>>>>>>>> the wrong
> > >>> thing.
> > >>>>>>>>> AFAIU you don't want parallel submission on engines that
> > >>>>>>>>> share the same reset domain. So, this is actually what should be
> tested.
> > >>>>>>>> Platforms like  PVC, ATS-M, DG2, etc. have some kind of
> > >>>>>>>> WA/noWA which
> > >>>>>>> helps to run things parallelly on engines in same reset domain
> > >>>>>>> and apparently BMG/LNL does not have that kind of support so
> > >>>>>>> applicable for LNL/BMG with parallel submission on RCS/CCS only.
> > >>>>>>>> @Harrison, John C please reply if you have any other input here.
> > >>>>>>> I don't get what you mean by 'have some kind of WA/noWA'. All
> > >>>>>>> platforms with compute engines have shared reset domains. That
> > >>>>>>> is all
> > >>> there is to it. I.e.
> > >>>>>>> everything from TGL onwards. That includes RCS and all CCS
> engines.
> > >>>>>>> So RCS + CCS, CCS0 + CCS1, RCS + CC0 + CCS1, etc. Any platform
> > >>>>>>> with multiple engines that talk to EUs will reset all of those
> > >>>>>>> engines in
> > >>> parallel.
> > >>>>>>> There are w/a's which make the situation even worse. E.g. on
> > >>>>>>> DG2/MTL you are not allowed to context switch one of those
> > >>>>>>> engines
> > >>> while another is busy.
> > >>>>>>> Which means that if one hangs, they all hang - you cannot just
> > >>>>>>> wait for other workloads to complete and/or pre-empt them off
> > >>>>>>> the engine prior to doing the shared reset. But there is
> > >>>>>>> nothing that makes
> > >> it better.
> > >>>>>>> I assume we are talking about GuC triggered engine resets here?
> > >>>>>>> As opposed to driver triggered full GT resets?
> > >>>>>>>
> > >>>>>>> The GuC will attempt to idle all other connected engines first
> > >>>>>>> by pre-empting out any executing contexts. If those contexts
> > >>>>>>> are pre-emptible then they will survive - GuC will
> > >>>>>>> automatically restart them once the reset is complete. If they
> > >>>>>>> are not (or at least not pre-emptible within the pre-emption
> > >>>>>>> timeout
> > >>>>>>> limit) then they will be killed as collateral damage.
> > >>>>>>>
> > >>>>>>> What are the workloads being submitted by this test? Are the
> > >>>>>>> pre-emptible spinners? If so, then they should survive
> > >>>>>>> (assuming you don't have the DG2/MTL RCS/CCS w/a in effect).
> > >>>>>>> If they are non-preemptible spinners then they are toast.
> > >>>>>> Main question here was, if this fix should be applied to all
> > >>>>>> platforms who
> > >>> has RCS and CCS both or just LNL/BMG. Reason to ask is, only
> > >>> LNL/BMG are hitting this issue, with same tests PVC and other
> > >>> platforms are not hitting issue which we are addressing here.
> > >>>>> And the answer is that yes, shared reset domains are common to
> > >>>>> all platforms with compute engines. So if only LNL/BMG are
> > >>>>> failing then the problem is not understood. Which is not helped
> > >>>>> by this test code being extremely complex and having almost zero
> > >>>>> explanation in it at
> > >> all :(.
> > >>>> Let me explain what this test is doing...
> > >>>>
> > >>>> - It creates a thread per hardware engine instance
> > >>>> - Within each thread it creates many exec queues targeting 1 hardware
> > >>>>     instance
> > >>>> - It submits a bunch of batches which do dword write the exec
> > >>>> queues
> > >>>> - If the HANG flag is set, 1 of the exec queue per thread will insert a
> > >>>>     non-preemptable spinner. It is expected the GuC will reset this exec
> > >>>>     queue only. Cross CCS / RCS resets will break this as one of the
> > >>>>     'good' exec queues from another thread could also be reset.
> > >>> If the 'good' workloads are pre-emptible then they should not be reset.
> > >>> The GuC will attempt to pre-empt all shared domain engines prior
> > >>> to triggering any resets. If they are being killed then something
> > >>> is broken and needs to be fixed.
> > >>>
> > >>>> - I think the HANG sections can fail on PVC too (VLK-57725)
> > >>>> - This is racey, as if the resets occur when all the 'bad' exec queues
> > >>>>     are running the test will still work with cross CCS / RCS
> > >>>> resets
> > >>> Can this description be added to the test?
> > >>>
> > >>>> As the author of this test, I am fine with compute class just
> > >>>> being skipped if HANF flag set. It is not testing individual
> > >>>> engine instance resets (we had tests for that but might be
> > >>>> temporally
> > >>>> removed) rather falls into the class of tests which trying to do
> > >>>> a bunch of things in parallel to stress the KMD.
> > >>> Note that some platforms don't have RCS or media engines. Which
> > >>> means only running on BCS engines. Is that sufficient coverage?
> > >> If you look at this patch, we skip compute only if rcs is present,
> > >> otherwise
> > not.
> > >> So far I don’t see failure when 2 compute instances happen to enter
> > >> this
> > race.
> > >>
> > > I can modify test to skip in case 2 ccs and no rcs with HANG tests.
> > > Even
> > though issue not seen anywhere else if change needs to be generic.
> > My point is not to make random and unexplained changes to the test but
> > to understand why the test is failing the way it is. So far, the
> > explanation does not make sense.
> >
> > See above about pre-emptible workloads should not be killed. LNL/BMG
> > do not have the RCS/CCS workaround of DG2 that prevents pre-emptions
> > and context switches while the other side is busy. So I am not seeing
> > a reason why the test is failing. That needs to be explained before
> > simply making it skip on those platforms.
> >
> > John.
> >
> > >
> > >> Thanks,
> > >> Tejas
> > >>
> > >>> And if this is not meant to be a reset test, why does it test
> > >>> resets at all? If the concern is that we don't have a stress test
> > >>> involving resets and this is the only coverage then it seems like
> > >>> we should not be
> > >> crippling it.
> > >>> John.
> > >>>
> > >>>
> > >>>> Matt
> > >>>>
> > >>>>> As noted, PVC has multiple compute engines but no RCS engine. If
> > >>>>> any compute engine is reset then all are reset. So if the test
> > >>>>> is running correctly and passing on PVC then it cannot be
> > >>>>> failing on LNL/BMG purely due to shared domain resets.
> > >>>>>
> > >>>>> Is the reset not happening on PVC? Is the test not actually
> > >>>>> running multiple contexts in parallel on PVC? Or are the
> > >>>>> spinners pre-emptible and are therefore supposed to survive the
> > >>>>> reset of a shared domain engine by being swapped out first? In
> > >>>>> which case LNL/BMG are broken because the killed contexts are
> > >>>>> not supposed to be
> > >>> killed even though the engine is reset?
> 
> Ok, I saw in test spinner were marked not preemptible, but then I made them
> preemptible with below change, still see the failure when CCS goes reset.
> 
> @@ -477,7 +477,7 @@ test_legacy_mode(int fd, uint32_t vm, uint64_t addr,
> uint64_t userptr,
>                 uint64_t pad;
>                 uint32_t data;
>         } *data;
> -       struct xe_spin_opts spin_opts = { .preempt = false };
> +       struct xe_spin_opts spin_opts = { .preempt = true };

I see MI_ARB_CHECK is added in test spin batch if pre-empt is true below ,
        if (opts->preempt)
                spin->batch[b++] = MI_ARB_CHECK;

Tejas
> 
> Is this expected if not could this be test or KMD or GuC issue?
> 
> Thanks,
> Tejas
> 
> > >>>>> John.
> > >>>>>
> > >>>>>> Thanks,
> > >>>>>> Tejas
> > >>>>>>> John.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Tejas
> > >>>>>>>>> Lucas De Marchi
> > >>>>>>>>>
> > >>>>>>>>>>> +
> > >>>>>>>>>>> +	if (flags & HANG)
> > >>>>>>>>>>> +		return true;
> > >>>>>>>>>>> +
> > >>>>>>>>>>> +	return false;
> > >>>>>>>>>>> +}
> > >>>>>>>>>>> +
> > >>>>>>>>>>>      /**
> > >>>>>>>>>>>       * SUBTEST: threads-%s
> > >>>>>>>>>>>       * Description: Run threads %arg[1] test with multi
> > >>>>>>>>>>> threads @@
> > >>>>>>>>>>> -955,9 +966,13 @@ static void threads(int fd, int flags)
> > >>>>>>>>>>>      	bool go = false;
> > >>>>>>>>>>>      	int n_threads = 0;
> > >>>>>>>>>>>      	int gt;
> > >>>>>>>>>>> +	bool has_rcs = false;
> > >>>>>>>>>>>
> > >>>>>>>>>>> -	xe_for_each_engine(fd, hwe)
> > >>>>>>>>>>> +	xe_for_each_engine(fd, hwe) {
> > >>>>>>>>>>> +		if (hwe->engine_class ==
> > >>> DRM_XE_ENGINE_CLASS_RENDER)
> > >>>>>>>>>>> +			has_rcs = true;
> > >>>>>>>>>>>      		++n_engines;
> > >>>>>>>>>>> +	}
> > >>>>>>>>>>>
> > >>>>>>>>>>>      	if (flags & BALANCER) {
> > >>>>>>>>>>>      		xe_for_each_gt(fd, gt) @@ -990,6 +1005,15
> > @@ static
> > >>>>>>>>>>> void threads(int fd, int flags)
> > >>>>>>>>>>>      	}
> > >>>>>>>>>>>
> > >>>>>>>>>>>      	xe_for_each_engine(fd, hwe) {
> > >>>>>>>>>>> +		/* RCS/CCS sharing reset domain hence
> dependent
> > >>> engines.
> > >>>>>>>>>>> +		 * When CCS is doing reset, all the contexts of
> RCS are
> > >>>>>>>>>>> +		 * victimized, so skip the compute engine
> avoiding
> > >>>>>>>>>>> +		 * parallel execution with RCS
> > >>>>>>>>>>> +		 */
> > >>>>>>>>>>> +		if (has_rcs && hwe->engine_class ==
> > >>>>>>>>> DRM_XE_ENGINE_CLASS_COMPUTE &&
> > >>>>>>>>>>> +		    is_engine_contexts_victimized(fd, flags))
> > >>>>>>>>>>> +			continue;
> > >>>>>>>>>>> +
> > >>>>>>>>>>>      		threads_data[i].mutex = &mutex;
> > >>>>>>>>>>>      		threads_data[i].cond = &cond;
> > >>>>>>>>>>>      #define ADDRESS_SHIFT	39
> > >>>>>>>>>>> --
> > >>>>>>>>>>> 2.25.1
> > >>>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Matt Roper
> > >>>>>>>>>> Graphics Software Engineer
> > >>>>>>>>>> Linux GPU Platform Enablement Intel Corporation