[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Fri Apr 5 04:42:28 UTC 2024

> -----Original Message-----
> From: Harrison, John C <john.c.harrison at intel.com>
> Sent: Friday, April 5, 2024 5:15 AM
> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
> <lucas.demarchi at intel.com>; Roper, Matthew D
> <matthew.d.roper at intel.com>
> Cc: igt-dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost,
> Matthew <matthew.brost at intel.com>
> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
> domain aware
> 
> On 4/4/2024 16:22, John Harrison wrote:
> > On 4/2/2024 22:35, Upadhyay, Tejas wrote:
> >>> -----Original Message-----
> >>> From: De Marchi, Lucas <lucas.demarchi at intel.com>
> >>> Sent: Wednesday, April 3, 2024 2:26 AM
> >>> To: Roper, Matthew D <matthew.d.roper at intel.com>
> >>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; igt-
> >>> dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost,
> >>> Matthew <matthew.brost at intel.com>
> >>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
> >>> reset domain aware
> >>>
> >>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
> >>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
> >>>>> RCS/CCS are dependent engines as they are sharing reset domain.
> >>>>> Whenever there is reset from CCS, all the exec queues running on
> >>>>> RCS are victimised mainly on Lunarlake.
> >>>>>
> >>>>> Lets skip parallel execution on CCS with RCS.
> >>>> I haven't really looked at this specific test in detail, but based
> >>>> on your explanation here, you're also going to run into problems
> >>>> with multiple CCS engines since they all share the same reset. You
> >>>> won't see that on platforms like LNL that only have a single CCS,
> >>>> but platforms
> >>> but it is seen on LNL because of having both RCS and CCS.
> >>>
> >>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a reset
> >>>> on one kills anything running on the others.
> >>>>
> >>>>
> >>>> Matt
> >>>>
> >>>>> It helps in fixing following errors:
> >>>>> 1. Test assertion failure function test_legacy_mode, file, Failed
> >>>>> assertion: data[i].data == 0xc0ffee
> >>>>>
> >>>>> 2.Test assertion failure function xe_exec, file
> >>>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec) == 0,
> >>>>> error: -125 != 0
> >>>>>
> >>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> >>>>> ---
> >>>>>   tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
> >>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/tests/intel/xe_exec_threads.c
> >>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9 100644
> >>>>> --- a/tests/intel/xe_exec_threads.c
> >>>>> +++ b/tests/intel/xe_exec_threads.c
> >>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
> >>>>>       return NULL;
> >>>>>   }
> >>>>>
> >>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int
> >>>>> +flags) {
> >>>>> +    if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
> >>>>> +        return false;
> >>> as above, I don't think we should add any platform check here. It's
> >>> impossible to keep it up to date and it's also testing the wrong
> >>> thing.
> >>> AFAIU you don't want parallel submission on engines that share the
> >>> same reset domain. So, this is actually what should be tested.
> >> Platforms like  PVC, ATS-M, DG2, etc. have some kind of WA/noWA which
> >> helps to run things parallelly on engines in same reset domain and
> >> apparently BMG/LNL does not have that kind of support so applicable
> >> for LNL/BMG with parallel submission on RCS/CCS only.
> >>
> >> @Harrison, John C please reply if you have any other input here.
> > I don't get what you mean by 'have some kind of WA/noWA'. All
> > platforms with compute engines have shared reset domains. That is all
> > there is to it. I.e. everything from TGL onwards. That includes RCS
> > and all CCS engines. So RCS + CCS, CCS0 + CCS1, RCS + CC0 + CCS1, etc.
> > Any platform with multiple engines that talk to EUs will reset all of
> > those engines in parallel.
> >
> > There are w/a's which make the situation even worse. E.g. on DG2/MTL
> > you are not allowed to context switch one of those engines while
> > another is busy. Which means that if one hangs, they all hang - you
> > cannot just wait for other workloads to complete and/or pre-empt them
> > off the engine prior to doing the shared reset. But there is nothing
> > that makes it better.
> >
> > I assume we are talking about GuC triggered engine resets here? As
> > opposed to driver triggered full GT resets?
> >
> > The GuC will attempt to idle all other connected engines first by
> > pre-empting out any executing contexts. If those contexts are
> > pre-emptible then they will survive - GuC will automatically restart
> > them once the reset is complete. If they are not (or at least not
> > pre-emptible within the pre-emption timeout limit) then they will be
> > killed as collateral damage.
> >
> > What are the workloads being submitted by this test? Are the
> > pre-emptible spinners? If so, then they should survive (assuming you
> > don't have the DG2/MTL RCS/CCS w/a in effect). If they are
> > non-preemptible spinners then they are toast.
> >
> > John.
> >
> >
> >>
> >> Thanks,
> >> Tejas
> >>> Lucas De Marchi
> >>>
> >>>>> +
> >>>>> +    if (flags & HANG)
> >>>>> +        return true;
> >>>>> +
> >>>>> +    return false;
> >>>>> +}
> >>>>> +
> >>>>>   /**
> >>>>>    * SUBTEST: threads-%s
> >>>>>    * Description: Run threads %arg[1] test with multi threads @@
> >>>>> -955,9 +966,13 @@ static void threads(int fd, int flags)
> >>>>>       bool go = false;
> >>>>>       int n_threads = 0;
> >>>>>       int gt;
> >>>>> +    bool has_rcs = false;
> >>>>>
> >>>>> -    xe_for_each_engine(fd, hwe)
> >>>>> +    xe_for_each_engine(fd, hwe) {
> >>>>> +        if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
> >>>>> +            has_rcs = true;
> >>>>>           ++n_engines;
> >>>>> +    }
> >>>>>
> >>>>>       if (flags & BALANCER) {
> >>>>>           xe_for_each_gt(fd, gt)
> >>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
> PS: There is nothing in the function name that suggests this is a reset specific
> test. If this is common code for multiple tests including some that do not
> expect to hit resets, then removing all testing of compute engines is a bad
> idea.

We skip tests where rcs/ccs both are involved and HANG flag is passed, HANG flag means we will have reset involved. Otherwise all other compute engine test will run as is.

Thanks,
Tejas 
> 
> John.
> 
> 
> >>>>>       }
> >>>>>
> >>>>>       xe_for_each_engine(fd, hwe) {
> >>>>> +        /* RCS/CCS sharing reset domain hence dependent engines.
> >>>>> +         * When CCS is doing reset, all the contexts of RCS are
> >>>>> +         * victimized, so skip the compute engine avoiding
> >>>>> +         * parallel execution with RCS
> >>>>> +         */
> >>>>> +        if (has_rcs && hwe->engine_class ==
> >>> DRM_XE_ENGINE_CLASS_COMPUTE &&
> >>>>> + is_engine_contexts_victimized(fd, flags))
> >>>>> +            continue;
> >>>>> +
> >>>>>           threads_data[i].mutex = &mutex;
> >>>>>           threads_data[i].cond = &cond;
> >>>>>   #define ADDRESS_SHIFT    39
> >>>>> --
> >>>>> 2.25.1
> >>>>>
> >>>> --
> >>>> Matt Roper
> >>>> Graphics Software Engineer
> >>>> Linux GPU Platform Enablement
> >>>> Intel Corporation
> >