[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Fri Apr 5 04:47:08 UTC 2024

> -----Original Message-----
> From: Harrison, John C <john.c.harrison at intel.com>
> Sent: Friday, April 5, 2024 4:53 AM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> <lucas.demarchi at intel.com>; Roper, Matthew D
> <matthew.d.roper at intel.com>
> Cc: igt-dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost,
> Matthew <matthew.brost at intel.com>
> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
> >> -----Original Message-----
> >> From: De Marchi, Lucas <lucas.demarchi at intel.com>
> >> Sent: Wednesday, April 3, 2024 2:26 AM
> >> To: Roper, Matthew D <matthew.d.roper at intel.com>
> >> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; igt-
> >> dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost,
> >> Matthew <matthew.brost at intel.com>
> >> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
> >> reset domain aware
> >>
> >> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
> >>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
> >>>> RCS/CCS are dependent engines as they are sharing reset domain.
> >>>> Whenever there is reset from CCS, all the exec queues running on
> >>>> RCS are victimised mainly on Lunarlake.
> >>>>
> >>>> Lets skip parallel execution on CCS with RCS.
> >>> I haven't really looked at this specific test in detail, but based
> >>> on your explanation here, you're also going to run into problems
> >>> with multiple CCS engines since they all share the same reset.  You
> >>> won't see that on platforms like LNL that only have a single CCS,
> >>> but platforms
> >> but it is seen on LNL because of having both RCS and CCS.
> >>
> >>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a reset
> >>> on one kills anything running on the others.
> >>>
> >>>
> >>> Matt
> >>>
> >>>> It helps in fixing following errors:
> >>>> 1. Test assertion failure function test_legacy_mode, file, Failed
> >>>> assertion: data[i].data == 0xc0ffee
> >>>>
> >>>> 2.Test assertion failure function xe_exec, file
> >>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec) == 0,
> >>>> error: -125 != 0
> >>>>
> >>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>>> ---
> >>>>   tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
> >>>>   1 file changed, 25 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/tests/intel/xe_exec_threads.c
> >>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9 100644
> >>>> --- a/tests/intel/xe_exec_threads.c
> >>>> +++ b/tests/intel/xe_exec_threads.c
> >>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
> >>>>   	return NULL;
> >>>>   }
> >>>>
> >>>> +static bool is_engine_contexts_victimized(int fd, unsigned int
> >>>> +flags) {
> >>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
> >>>> +		return false;
> >> as above, I don't think we should add any platform check here. It's
> >> impossible to keep it up to date and it's also testing the wrong thing.
> >> AFAIU you don't want parallel submission on engines that share the
> >> same reset domain. So, this is actually what should be tested.
> > Platforms like  PVC, ATS-M, DG2, etc. have some kind of WA/noWA which
> helps to run things parallelly on engines in same reset domain and apparently
> BMG/LNL does not have that kind of support so applicable for LNL/BMG with
> parallel submission on RCS/CCS only.
> >
> > @Harrison, John C please reply if you have any other input here.
> I don't get what you mean by 'have some kind of WA/noWA'. All platforms
> with compute engines have shared reset domains. That is all there is to it. I.e.
> everything from TGL onwards. That includes RCS and all CCS engines. So RCS +
> CCS, CCS0 + CCS1, RCS + CC0 + CCS1, etc. Any platform with multiple engines
> that talk to EUs will reset all of those engines in parallel.
> 
> There are w/a's which make the situation even worse. E.g. on DG2/MTL you
> are not allowed to context switch one of those engines while another is busy.
> Which means that if one hangs, they all hang - you cannot just wait for other
> workloads to complete and/or pre-empt them off the engine prior to doing
> the shared reset. But there is nothing that makes it better.
> 
> I assume we are talking about GuC triggered engine resets here? As opposed
> to driver triggered full GT resets?
> 
> The GuC will attempt to idle all other connected engines first by pre-empting
> out any executing contexts. If those contexts are pre-emptible then they will
> survive - GuC will automatically restart them once the reset is complete. If
> they are not (or at least not pre-emptible within the pre-emption timeout
> limit) then they will be killed as collateral damage.
> 
> What are the workloads being submitted by this test? Are the pre-emptible
> spinners? If so, then they should survive (assuming you don't have the
> DG2/MTL RCS/CCS w/a in effect). If they are non-preemptible spinners then
> they are toast.

Main question here was, if this fix should be applied to all platforms who has RCS and CCS both or just LNL/BMG. Reason to ask is, only LNL/BMG are hitting this issue, with same tests PVC and other platforms are not hitting issue which we are addressing here.

Thanks,
Tejas
> 
> John.
> 
> 
> >
> > Thanks,
> > Tejas
> >> Lucas De Marchi
> >>
> >>>> +
> >>>> +	if (flags & HANG)
> >>>> +		return true;
> >>>> +
> >>>> +	return false;
> >>>> +}
> >>>> +
> >>>>   /**
> >>>>    * SUBTEST: threads-%s
> >>>>    * Description: Run threads %arg[1] test with multi threads @@
> >>>> -955,9 +966,13 @@ static void threads(int fd, int flags)
> >>>>   	bool go = false;
> >>>>   	int n_threads = 0;
> >>>>   	int gt;
> >>>> +	bool has_rcs = false;
> >>>>
> >>>> -	xe_for_each_engine(fd, hwe)
> >>>> +	xe_for_each_engine(fd, hwe) {
> >>>> +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
> >>>> +			has_rcs = true;
> >>>>   		++n_engines;
> >>>> +	}
> >>>>
> >>>>   	if (flags & BALANCER) {
> >>>>   		xe_for_each_gt(fd, gt)
> >>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
> >>>>   	}
> >>>>
> >>>>   	xe_for_each_engine(fd, hwe) {
> >>>> +		/* RCS/CCS sharing reset domain hence dependent engines.
> >>>> +		 * When CCS is doing reset, all the contexts of RCS are
> >>>> +		 * victimized, so skip the compute engine avoiding
> >>>> +		 * parallel execution with RCS
> >>>> +		 */
> >>>> +		if (has_rcs && hwe->engine_class ==
> >> DRM_XE_ENGINE_CLASS_COMPUTE &&
> >>>> +		    is_engine_contexts_victimized(fd, flags))
> >>>> +			continue;
> >>>> +
> >>>>   		threads_data[i].mutex = &mutex;
> >>>>   		threads_data[i].cond = &cond;
> >>>>   #define ADDRESS_SHIFT	39
> >>>> --
> >>>> 2.25.1
> >>>>
> >>> --
> >>> Matt Roper
> >>> Graphics Software Engineer
> >>> Linux GPU Platform Enablement
> >>> Intel Corporation