[PATCH V3 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Upadhyay, Tejas tejas.upadhyay at intel.com
Thu Jun 6 14:16:21 UTC 2024



> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> Upadhyay, Tejas
> Sent: Wednesday, May 8, 2024 10:55 PM
> To: Harrison, John C <john.c.harrison at intel.com>; igt-
> dev at lists.freedesktop.org; Brost, Matthew <matthew.brost at intel.com>
> Cc: intel-xe at lists.freedesktop.org; De Marchi, Lucas
> <lucas.demarchi at intel.com>
> Subject: RE: [PATCH V3 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> 
> 
> > -----Original Message-----
> > From: Harrison, John C <john.c.harrison at intel.com>
> > Sent: Tuesday, April 30, 2024 4:08 AM
> > To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; igt-
> > dev at lists.freedesktop.org
> > Cc: intel-xe at lists.freedesktop.org; Brost, Matthew
> > <matthew.brost at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>
> > Subject: Re: [PATCH V3 i-g-t] tests/xe_exec_threads: Make hang tests
> > reset domain aware
> >
> > On 4/10/2024 02:40, Tejas Upadhyay wrote:
> > > RCS/CCS are dependent engines as they are sharing reset domain.
> > > Whenever there is reset from CCS, all the exec queues running on RCS
> > > are victimised mainly on Lunarlake.
> > >
> > > Lets skip parallel execution on CCS with RCS.
> > But why? In a real-world scenario, this should not happen.
> >
> > As per earlier comments, before changing the test we need to
> > understand more about what is actually happening and why. The previous
> > descriptions of what is failing and on what platforms do not make sense.
> >
> > Also, the request for more explanation was not just for the email
> > discussion but to actually add that description into the test. I even
> > explicitly asked for that in one of the review emails. The
> > xe_exec_threads executable reports 114 separate subtests, none of
> > which has any documentation at all. The comments within the code
> > basically come down to saying 'wait for completion' immediately before
> waiting.
> >
> > At one point Matthew Brost stated:
> >  > If the HANG flag is set, 1 of the exec queue per thread will insert
> > a  > non- preemptable spinner. It is expected the GuC will reset this exec  >
> queue only.
> > Cross CCS / RCS resets will break this as one of the  > 'good' exec
> > queues from another thread could also be reset.
> >
> > To which my replay was that if only one spinner is non-pre-emptible
> > then all the other spinners should get automatically switched out
> > prior to the reset being triggered and should therefore not be killed. The
> 'good'
> > queues should survive and be totally oblivious to the reset occurring.
> 
> As far as I am seeing test code, each spinner that is created is non-
> preemptible in this test. So all spinners will not get automatically switched out
> prior to reset and hence we will see test failure as 'good' queues will also not
> survive on reset. @Brost, Matthew please help to confirm behaviour. We
> need this to move further towards final solution.

Any update here please? @Summers, Stuart FYI

Thanks,
Tejas

> 
> Thanks,
> Tejas
> >
> > So either the above description is inaccurate and all the spinners are
> > actually non-pre-emptible. In which case, if that is not the intention
> > of the test then the correct fix is to make them pre-emptible and it
> > should then pass. Or if the description is accurate, then the test is
> > correctly failing because it has found a bug which needs to be fixed.
> > Either way, the correct solution is not to simply skip the failing
> > case and hide it under the rug.
> >
> > John.
> >
> >
> > >
> > > It helps in fixing following errors:
> > > 1. Test assertion failure function test_legacy_mode, file, Failed
> > > assertion: data[i].data == 0xc0ffee
> > >
> > > 2.Test assertion failure function xe_exec, file
> > > ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec) == 0,
> > > error: -125 != 0
> > >
> > > V3:
> > >    - Check victimization reset domain wise irrespective of platform
> > > - Lucas/MattR
> > > V2:
> > >    - Add error details
> > >
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > > ---
> > >   tests/intel/xe_exec_threads.c | 43
> > ++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 42 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tests/intel/xe_exec_threads.c
> > > b/tests/intel/xe_exec_threads.c index 8083980f9..24eac39de 100644
> > > --- a/tests/intel/xe_exec_threads.c
> > > +++ b/tests/intel/xe_exec_threads.c
> > > @@ -710,6 +710,24 @@ static void *thread(void *data)
> > >   	return NULL;
> > >   }
> > >
> > > +static bool is_engine_contexts_victimized(__u16 eclass, unsigned int
> flags,
> > > +					  bool has_rcs, bool multi_ccs,
> > > +					  bool *ccs0_created)
> > > +{
> > > +	if (!(eclass == DRM_XE_ENGINE_CLASS_COMPUTE && flags & HANG))
> > > +		return false;
> > > +	if (has_rcs) {
> > > +		return true;
> > > +	} else if (multi_ccs) {
> > > +		/* In case of multi ccs, allow only 1 ccs to run HANG test*/
> > > +		if (*ccs0_created)
> > > +			return true;
> > > +		*ccs0_created = true;
> > > +	}
> > > +
> > > +	return false;
> > > +}
> > > +
> > >   /**
> > >    * SUBTEST: threads-%s
> > >    * Description: Run threads %arg[1] test with multi threads @@
> > > -955,9 +973,20 @@ static void threads(int fd, int flags)
> > >   	bool go = false;
> > >   	int n_threads = 0;
> > >   	int gt;
> > > +	bool has_rcs = false;
> > > +	bool multi_ccs = false, has_ccs = false, ccs0_created = false;
> > >
> > > -	xe_for_each_engine(fd, hwe)
> > > +	xe_for_each_engine(fd, hwe) {
> > > +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER) {
> > > +			has_rcs = true;
> > > +		} else if (hwe->engine_class ==
> > DRM_XE_ENGINE_CLASS_COMPUTE) {
> > > +			if (has_ccs)
> > > +				multi_ccs = true;
> > > +			else
> > > +				has_ccs = true;
> > > +		}
> > >   		++n_engines;
> > > +	}
> > >
> > >   	if (flags & BALANCER) {
> > >   		xe_for_each_gt(fd, gt)
> > > @@ -990,6 +1019,18 @@ static void threads(int fd, int flags)
> > >   	}
> > >
> > >   	xe_for_each_engine(fd, hwe) {
> > > +		/* RCS/CCS sharing reset domain hence dependent engines.
> > > +		 * When CCS is doing reset, all the contexts of RCS are
> > > +		 * victimized. Also in case of multiple CCS instances
> > > +		 * contexts running on other CCS engine are victimized.
> > > +		 * so skip the compute engine avoiding parallel execution
> > > +		 * with RCS and other CCS.
> > > +		 */
> > > +		if (is_engine_contexts_victimized(hwe->engine_class, flags,
> > > +						  has_rcs, multi_ccs,
> > > +						  &ccs0_created))
> > > +			continue;
> > > +
> > >   		threads_data[i].mutex = &mutex;
> > >   		threads_data[i].cond = &cond;
> > >   #define ADDRESS_SHIFT	39



More information about the igt-dev mailing list