[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

John Harrison john.c.harrison at intel.com
Wed Apr 10 19:22:14 UTC 2024


On 4/8/2024 05:00, Upadhyay, Tejas wrote:
>> -----Original Message-----
>> From: Upadhyay, Tejas
>> Sent: Monday, April 8, 2024 10:54 AM
>> To: Harrison, John C <john.c.harrison at intel.com>; Brost, Matthew
>> <matthew.brost at intel.com>
>> Cc: De Marchi, Lucas <lucas.demarchi at intel.com>; Roper, Matthew D
>> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org;
>> intel-xe at lists.freedesktop.org
>> Subject: RE: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
>> domain aware
>>
>>
>>
>>> -----Original Message-----
>>> From: Harrison, John C <john.c.harrison at intel.com>
>>> Sent: Saturday, April 6, 2024 5:13 AM
>>> To: Brost, Matthew <matthew.brost at intel.com>
>>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
>>> <lucas.demarchi at intel.com>; Roper, Matthew D
>>> <matthew.d.roper at intel.com>; igt-dev at lists.freedesktop.org;
>>> intel-xe at lists.freedesktop.org
>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests
>>> reset domain aware
>>>
>>> On 4/5/2024 16:33, Matthew Brost wrote:
>>>> On Fri, Apr 05, 2024 at 11:15:14AM -0700, John Harrison wrote:
>>>>> On 4/4/2024 21:47, Upadhyay, Tejas wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Harrison, John C <john.c.harrison at intel.com>
>>>>>>> Sent: Friday, April 5, 2024 4:53 AM
>>>>>>> To: Upadhyay, Tejas <tejas.upadhyay at intel.com>; De Marchi, Lucas
>>>>>>> <lucas.demarchi at intel.com>; Roper, Matthew D
>>>>>>> <matthew.d.roper at intel.com>
>>>>>>> Cc: igt-dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
>>>>>>> Brost, Matthew <matthew.brost at intel.com>
>>>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
>>>>>>> tests reset domain aware
>>>>>>>
>>>>>>> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: De Marchi, Lucas <lucas.demarchi at intel.com>
>>>>>>>>> Sent: Wednesday, April 3, 2024 2:26 AM
>>>>>>>>> To: Roper, Matthew D <matthew.d.roper at intel.com>
>>>>>>>>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>;
>>>>>>>>> igt-dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org;
>>>>>>>>> Brost, Matthew <matthew.brost at intel.com>
>>>>>>>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang
>>>>>>>>> tests reset domain aware
>>>>>>>>>
>>>>>>>>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
>>>>>>>>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
>>>>>>>>>>> RCS/CCS are dependent engines as they share a reset domain.
>>>>>>>>>>> Whenever there is a reset from CCS, all the exec queues running
>>>>>>>>>>> on RCS are victimised, mainly on Lunarlake.
>>>>>>>>>>>
>>>>>>>>>>> Let's skip parallel execution on CCS with RCS.
>>>>>>>>>> I haven't really looked at this specific test in detail, but
>>>>>>>>>> based on your explanation here, you're also going to run into
>>>>>>>>>> problems with multiple CCS engines since they all share the
>>>>>>>>>> same reset.  You won't see that on platforms like LNL that
>>>>>>>>>> only have a single CCS, but platforms
>>>>>>>>> but it is seen on LNL because it has both RCS and CCS.
>>>>>>>>>
>>>>>>>>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a
>>>>>>>>>> reset on one kills anything running on the others.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>> It helps in fixing following errors:
>>>>>>>>>>> 1. Test assertion failure function test_legacy_mode, file,
>>>>>>>>>>> Failed assertion: data[i].data == 0xc0ffee
>>>>>>>>>>>
>>>>>>>>>>> 2. Test assertion failure function xe_exec, file
>>>>>>>>>>> ../lib/xe/xe_ioctl.c, Failed assertion: __xe_exec(fd, exec) == 0,
>>>>>>>>>>> error: -125 != 0
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Tejas Upadhyay<tejas.upadhyay at intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>      tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
>>>>>>>>>>>      1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/tests/intel/xe_exec_threads.c b/tests/intel/xe_exec_threads.c
>>>>>>>>>>> index 8083980f9..31af61dc9 100644
>>>>>>>>>>> --- a/tests/intel/xe_exec_threads.c
>>>>>>>>>>> +++ b/tests/intel/xe_exec_threads.c
>>>>>>>>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
>>>>>>>>>>>      	return NULL;
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int flags)
>>>>>>>>>>> +{
>>>>>>>>>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
>>>>>>>>>>> +		return false;
>>>>>>>>> as above, I don't think we should add any platform check here.
>>>>>>>>> It's impossible to keep it up to date and it's also testing the
>>>>>>>>> wrong thing.
>>>>>>>>> AFAIU you don't want parallel submission on engines that share
>>>>>>>>> the same reset domain. So, this is actually what should be tested.
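>>>>>>>>>
>>>>>>>>> Something like this untested sketch (helper name invented here)
>>>>>>>>> would key off the engine class instead of the platform:
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>>  * Any engine that talks to the EUs (render or compute) sits in
>>>>>>>>>  * the shared reset domain on platforms with compute engines.
>>>>>>>>>  */
>>>>>>>>> static bool
>>>>>>>>> hwe_in_shared_reset_domain(const struct drm_xe_engine_class_instance *hwe)
>>>>>>>>> {
>>>>>>>>> 	return hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER ||
>>>>>>>>> 	       hwe->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE;
>>>>>>>>> }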
>>>>>>>> Platforms like PVC, ATS-M, DG2, etc. have some kind of WA/noWA
>>>>>>>> which helps to run things in parallel on engines in the same reset
>>>>>>>> domain, and apparently BMG/LNL does not have that kind of support,
>>>>>>>> so this is applicable only to LNL/BMG with parallel submission on
>>>>>>>> RCS/CCS.
>>>>>>>> @Harrison, John C please reply if you have any other input here.
>>>>>>> I don't get what you mean by 'have some kind of WA/noWA'. All
>>>>>>> platforms with compute engines have shared reset domains. That is
>>>>>>> all there is to it. I.e. everything from TGL onwards. That includes
>>>>>>> RCS and all CCS engines. So RCS + CCS, CCS0 + CCS1,
>>>>>>> RCS + CCS0 + CCS1, etc. Any platform with multiple engines that
>>>>>>> talk to EUs will reset all of those engines in parallel.
>>>>>>>
>>>>>>> There are w/a's which make the situation even worse. E.g. on
>>>>>>> DG2/MTL you are not allowed to context switch one of those engines
>>>>>>> while another is busy. Which means that if one hangs, they all
>>>>>>> hang - you cannot just wait for other workloads to complete and/or
>>>>>>> pre-empt them off the engine prior to doing the shared reset. But
>>>>>>> there is nothing that makes it better.
>>>>>>> I assume we are talking about GuC-triggered engine resets here,
>>>>>>> as opposed to driver-triggered full GT resets?
>>>>>>>
>>>>>>> The GuC will attempt to idle all other connected engines first by
>>>>>>> pre-empting out any executing contexts. If those contexts are
>>>>>>> pre-emptible then they will survive - GuC will automatically
>>>>>>> restart them once the reset is complete. If they are not (or at
>>>>>>> least not pre-emptible within the pre-emption timeout
>>>>>>> limit) then they will be killed as collateral damage.
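>>>>>>>
>>>>>>> In IGT terms the difference is just the spinner's preempt flag.
>>>>>>> Roughly (a sketch from memory of lib/xe/xe_spin.h, so double-check
>>>>>>> the field names):
>>>>>>>
>>>>>>> 	struct xe_spin_opts opts = {
>>>>>>> 		.addr = spin_addr,
>>>>>>> 		.preempt = true,  /* GuC can pre-empt it out: survives */
>>>>>>> 	};
>>>>>>> 	xe_spin_init(&spin, &opts);
>>>>>>>
>>>>>>> 	/* .preempt = false => non-preemptible: killed as collateral
>>>>>>> 	 * damage on any shared-domain reset */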
>>>>>>>
>>>>>>> What are the workloads being submitted by this test? Are they
>>>>>>> pre-emptible spinners? If so, then they should survive (assuming
>>>>>>> you don't have the DG2/MTL RCS/CCS w/a in effect). If they are
>>>>>>> non-preemptible spinners then they are toast.
>>>>>> Main question here was whether this fix should be applied to all
>>>>>> platforms which have both RCS and CCS, or just LNL/BMG. Reason to
>>>>>> ask is, only LNL/BMG are hitting this issue; with the same tests,
>>>>>> PVC and other platforms are not hitting the issue which we are
>>>>>> addressing here.
>>>>> And the answer is that yes, shared reset domains are common to all
>>>>> platforms with compute engines. So if only LNL/BMG are failing then
>>>>> the problem is not understood. Which is not helped by this test
>>>>> code being extremely complex and having almost zero explanation in
>>>>> it at all :(.
>>>> Let me explain what this test is doing...
>>>>
>>>> - It creates a thread per hardware engine instance
>>>> - Within each thread it creates many exec queues targeting 1 hardware
>>>>     instance
>>>> - It submits a bunch of batches which do a dword write on those exec
>>>>     queues (rough sketch below)
>>>> - If the HANG flag is set, 1 of the exec queue per thread will insert a
>>>>     non-preemptable spinner. It is expected the GuC will reset this exec
>>>>     queue only. Cross CCS / RCS resets will break this as one of the
>>>>     'good' exec queues from another thread could also be reset.
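>>>>
>>>> Roughly, each thread's submission loop is (paraphrased pseudo-code,
>>>> helper names invented here rather than quoted from the test):
>>>>
>>>> 	for (i = 0; i < n_execs; i++) {
>>>> 		uint32_t q = exec_queues[i % n_exec_queues];
>>>>
>>>> 		if ((flags & HANG) && i == hang_exec)
>>>> 			emit_nonpreemptible_spinner(&data[i]);	/* the 'bad' queue */
>>>> 		else
>>>> 			emit_store_dword(&data[i], 0xc0ffee);	/* checked later */
>>>>
>>>> 		exec_batch_on_queue(fd, q, &data[i]);
>>>> 	}
>>>>
>>>> 	/* ...then wait and igt_assert(data[i].data == 0xc0ffee) for every
>>>> 	 * non-hanging exec */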
>>> If the 'good' workloads are pre-emptible then they should not be reset.
>>> The GuC will attempt to pre-empt all shared domain engines prior to
>>> triggering any resets. If they are being killed then something is
>>> broken and needs to be fixed.
>>>
>>>> - I think the HANG sections can fail on PVC too (VLK-57725)
>>>> - This is racy: if the resets occur while the 'bad' exec queues are the
>>>>     ones running, the test will still pass despite cross CCS / RCS resets
>>> Can this description be added to the test?
>>>
>>>> As the author of this test, I am fine with compute class just being
>>>> skipped if the HANG flag is set. It is not testing individual engine
>>>> instance resets (we had tests for that but they might be temporarily
>>>> removed); rather it falls into the class of tests which try to do a
>>>> bunch of things in parallel to stress the KMD.
>>> Note that some platforms don't have RCS or media engines. Which means
>>> only running on BCS engines. Is that sufficient coverage?
>> If you look at this patch, we skip compute only if RCS is present, otherwise not.
>> So far I don't see a failure when 2 compute instances happen to enter this race.
>>
> I can modify the test to also skip the case of 2 CCS and no RCS with HANG tests, even though the issue is not seen anywhere else, if the change needs to be generic.
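>
> Roughly, on top of this patch (untested sketch; the skip condition would
> then become has_rcs || n_compute > 1 in the HANG case):
>
> 	int n_compute = 0;
>
> 	xe_for_each_engine(fd, hwe) {
> 		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
> 			has_rcs = true;
> 		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE)
> 			n_compute++;
> 		++n_engines;
> 	}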
My point is not to make random and unexplained changes to the test but 
to understand why the test is failing the way it is. So far, the 
explanation does not make sense.

See above regarding pre-emptible workloads not being killed. LNL/BMG do 
not have the RCS/CCS workaround of DG2 that prevents pre-emptions and 
context switches while the other side is busy. So I am not seeing a 
reason why the test is failing. That needs to be explained before simply 
making it skip on those platforms.
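
To be clear about what I would expect: something like the following should
pass on LNL/BMG (a pseudo-sketch only, every helper name below is invented,
not real IGT API):

	/* A pre-emptible spinner on RCS should survive a GuC engine reset
	 * triggered by a non-preemptible hang on a CCS queue, because the
	 * GuC pre-empts the RCS context out before resetting the shared
	 * domain. */
	rcs_q = create_queue_on_class(fd, DRM_XE_ENGINE_CLASS_RENDER);
	ccs_q = create_queue_on_class(fd, DRM_XE_ENGINE_CLASS_COMPUTE);

	start_preemptible_spinner(fd, rcs_q);		/* the 'good' workload */
	start_nonpreemptible_spinner(fd, ccs_q);	/* forces the reset */

	wait_for_engine_reset(fd, ccs_q);

	/* The RCS queue must still be alive and able to complete work */
	igt_assert(!queue_was_reset(fd, rcs_q));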

John.

>   
>> Thanks,
>> Tejas
>>
>>> And if this is not meant to be a reset test, why does it test resets
>>> at all? If the concern is that we don't have a stress test involving
>>> resets and this is the only coverage then it seems like we should not be
>>> crippling it.
>>> John.
>>>
>>>
>>>> Matt
>>>>
>>>>> As noted, PVC has multiple compute engines but no RCS engine. If
>>>>> any compute engine is reset then all are reset. So if the test is
>>>>> running correctly and passing on PVC then it cannot be failing on
>>>>> LNL/BMG purely due to shared domain resets.
>>>>>
>>>>> Is the reset not happening on PVC? Is the test not actually running
>>>>> multiple contexts in parallel on PVC? Or are the spinners
>>>>> pre-emptible and are therefore supposed to survive the reset of a
>>>>> shared domain engine by being swapped out first? In which case
>>>>> LNL/BMG are broken because the killed contexts are not supposed to
>>>>> be killed even though the engine is reset?
>>>>> John.
>>>>>
>>>>>> Thanks,
>>>>>> Tejas
>>>>>>> John.
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tejas
>>>>>>>>> Lucas De Marchi
>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +	if (flags & HANG)
>>>>>>>>>>> +		return true;
>>>>>>>>>>> +
>>>>>>>>>>> +	return false;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>      /**
>>>>>>>>>>>       * SUBTEST: threads-%s
>>>>>>>>>>>       * Description: Run threads %arg[1] test with multi threads
>>>>>>>>>>> @@ -955,9 +966,13 @@ static void threads(int fd, int flags)
>>>>>>>>>>>      	bool go = false;
>>>>>>>>>>>      	int n_threads = 0;
>>>>>>>>>>>      	int gt;
>>>>>>>>>>> +	bool has_rcs = false;
>>>>>>>>>>>
>>>>>>>>>>> -	xe_for_each_engine(fd, hwe)
>>>>>>>>>>> +	xe_for_each_engine(fd, hwe) {
>>>>>>>>>>> +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
>>>>>>>>>>> +			has_rcs = true;
>>>>>>>>>>>      		++n_engines;
>>>>>>>>>>> +	}
>>>>>>>>>>>
>>>>>>>>>>>      	if (flags & BALANCER) {
>>>>>>>>>>>      		xe_for_each_gt(fd, gt)
>>>>>>>>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
>>>>>>>>>>>      	}
>>>>>>>>>>>
>>>>>>>>>>>      	xe_for_each_engine(fd, hwe) {
>>>>>>>>>>> +		/* RCS/CCS sharing reset domain hence dependent engines.
>>>>>>>>>>> +		 * When CCS is doing reset, all the contexts of RCS are
>>>>>>>>>>> +		 * victimized, so skip the compute engine avoiding
>>>>>>>>>>> +		 * parallel execution with RCS
>>>>>>>>>>> +		 */
>>>>>>>>>>> +		if (has_rcs && hwe->engine_class == DRM_XE_ENGINE_CLASS_COMPUTE &&
>>>>>>>>>>> +		    is_engine_contexts_victimized(fd, flags))
>>>>>>>>>>> +			continue;
>>>>>>>>>>> +
>>>>>>>>>>>      		threads_data[i].mutex = &mutex;
>>>>>>>>>>>      		threads_data[i].cond = &cond;
>>>>>>>>>>>      #define ADDRESS_SHIFT	39
>>>>>>>>>>> --
>>>>>>>>>>> 2.25.1
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Matt Roper
>>>>>>>>>> Graphics Software Engineer
>>>>>>>>>> Linux GPU Platform Enablement
>>>>>>>>>> Intel Corporation


