[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Thu Apr 4 23:22:50 UTC 2024

On 4/2/2024 22:35, Upadhyay, Tejas wrote:
>> -----Original Message-----
>> From: De Marchi, Lucas <lucas.demarchi at intel.com>
>> Sent: Wednesday, April 3, 2024 2:26 AM
>> To: Roper, Matthew D <matthew.d.roper at intel.com>
>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; igt-
>> dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost, Matthew
>> <matthew.brost at intel.com>
>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset
>> domain aware
>>
>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
>>>> RCS/CCS are dependent engines as they are sharing reset domain.
>>>> Whenever there is reset from CCS, all the exec queues running on RCS
>>>> are victimised mainly on Lunarlake.
>>>>
>>>> Lets skip parallel execution on CCS with RCS.
>>> I haven't really looked at this specific test in detail, but based on
>>> your explanation here, you're also going to run into problems with
>>> multiple CCS engines since they all share the same reset.  You won't
>>> see that on platforms like LNL that only have a single CCS, but
>>> platforms
>> but it is seen on LNL because of having both RCS and CCS.
>>
>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a reset on
>>> one kills anything running on the others.
>>>
>>>
>>> Matt
>>>
>>>> It helps in fixing following errors:
>>>> 1. Test assertion failure function test_legacy_mode, file, Failed
>>>> assertion: data[i].data == 0xc0ffee
>>>>
>>>> 2.Test assertion failure function xe_exec, file ../lib/xe/xe_ioctl.c,
>>>> Failed assertion: __xe_exec(fd, exec) == 0, error: -125 != 0
>>>>
>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
>>>> ---
>>>>   tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/tests/intel/xe_exec_threads.c
>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9 100644
>>>> --- a/tests/intel/xe_exec_threads.c
>>>> +++ b/tests/intel/xe_exec_threads.c
>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
>>>>   	return NULL;
>>>>   }
>>>>
>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int
>>>> +flags) {
>>>> +	if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
>>>> +		return false;
>> as above, I don't think we should add any platform check here. It's impossible
>> to keep it up to date and it's also testing the wrong thing.
>> AFAIU you don't want parallel submission on engines that share the same
>> reset domain. So, this is actually what should be tested.
> Platforms like  PVC, ATS-M, DG2, etc. have some kind of WA/noWA which helps to run things parallelly on engines in same reset domain and apparently BMG/LNL does not have that kind of support so applicable for LNL/BMG with parallel submission on RCS/CCS only.
>
> @Harrison, John C please reply if you have any other input here.
I don't get what you mean by 'have some kind of WA/noWA'. All platforms 
with compute engines have shared reset domains. That is all there is to 
it. I.e. everything from TGL onwards. That includes RCS and all CCS 
engines. So RCS + CCS, CCS0 + CCS1, RCS + CC0 + CCS1, etc. Any platform 
with multiple engines that talk to EUs will reset all of those engines 
in parallel.

There are w/a's which make the situation even worse. E.g. on DG2/MTL you 
are not allowed to context switch one of those engines while another is 
busy. Which means that if one hangs, they all hang - you cannot just 
wait for other workloads to complete and/or pre-empt them off the engine 
prior to doing the shared reset. But there is nothing that makes it better.

I assume we are talking about GuC triggered engine resets here? As 
opposed to driver triggered full GT resets?

The GuC will attempt to idle all other connected engines first by 
pre-empting out any executing contexts. If those contexts are 
pre-emptible then they will survive - GuC will automatically restart 
them once the reset is complete. If they are not (or at least not 
pre-emptible within the pre-emption timeout limit) then they will be 
killed as collateral damage.

What are the workloads being submitted by this test? Are the 
pre-emptible spinners? If so, then they should survive (assuming you 
don't have the DG2/MTL RCS/CCS w/a in effect). If they are 
non-preemptible spinners then they are toast.

John.

>
> Thanks,
> Tejas
>> Lucas De Marchi
>>
>>>> +
>>>> +	if (flags & HANG)
>>>> +		return true;
>>>> +
>>>> +	return false;
>>>> +}
>>>> +
>>>>   /**
>>>>    * SUBTEST: threads-%s
>>>>    * Description: Run threads %arg[1] test with multi threads @@
>>>> -955,9 +966,13 @@ static void threads(int fd, int flags)
>>>>   	bool go = false;
>>>>   	int n_threads = 0;
>>>>   	int gt;
>>>> +	bool has_rcs = false;
>>>>
>>>> -	xe_for_each_engine(fd, hwe)
>>>> +	xe_for_each_engine(fd, hwe) {
>>>> +		if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
>>>> +			has_rcs = true;
>>>>   		++n_engines;
>>>> +	}
>>>>
>>>>   	if (flags & BALANCER) {
>>>>   		xe_for_each_gt(fd, gt)
>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
>>>>   	}
>>>>
>>>>   	xe_for_each_engine(fd, hwe) {
>>>> +		/* RCS/CCS sharing reset domain hence dependent engines.
>>>> +		 * When CCS is doing reset, all the contexts of RCS are
>>>> +		 * victimized, so skip the compute engine avoiding
>>>> +		 * parallel execution with RCS
>>>> +		 */
>>>> +		if (has_rcs && hwe->engine_class ==
>> DRM_XE_ENGINE_CLASS_COMPUTE &&
>>>> +		    is_engine_contexts_victimized(fd, flags))
>>>> +			continue;
>>>> +
>>>>   		threads_data[i].mutex = &mutex;
>>>>   		threads_data[i].cond = &cond;
>>>>   #define ADDRESS_SHIFT	39
>>>> --
>>>> 2.25.1
>>>>
>>> --
>>> Matt Roper
>>> Graphics Software Engineer
>>> Linux GPU Platform Enablement
>>> Intel Corporation