[PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests reset domain aware

Thu Apr 4 23:45:04 UTC 2024

On 4/4/2024 16:22, John Harrison wrote:
> On 4/2/2024 22:35, Upadhyay, Tejas wrote:
>>> -----Original Message-----
>>> From: De Marchi, Lucas <lucas.demarchi at intel.com>
>>> Sent: Wednesday, April 3, 2024 2:26 AM
>>> To: Roper, Matthew D <matthew.d.roper at intel.com>
>>> Cc: Upadhyay, Tejas <tejas.upadhyay at intel.com>; igt-
>>> dev at lists.freedesktop.org; intel-xe at lists.freedesktop.org; Brost, 
>>> Matthew
>>> <matthew.brost at intel.com>
>>> Subject: Re: [PATCH V2 i-g-t] tests/xe_exec_threads: Make hang tests 
>>> reset
>>> domain aware
>>>
>>> On Tue, Apr 02, 2024 at 12:40:17PM -0700, Matt Roper wrote:
>>>> On Tue, Apr 02, 2024 at 05:52:23PM +0530, Tejas Upadhyay wrote:
>>>>> RCS/CCS are dependent engines as they are sharing reset domain.
>>>>> Whenever there is reset from CCS, all the exec queues running on RCS
>>>>> are victimised mainly on Lunarlake.
>>>>>
>>>>> Lets skip parallel execution on CCS with RCS.
>>>> I haven't really looked at this specific test in detail, but based on
>>>> your explanation here, you're also going to run into problems with
>>>> multiple CCS engines since they all share the same reset. You won't
>>>> see that on platforms like LNL that only have a single CCS, but
>>>> platforms
>>> but it is seen on LNL because of having both RCS and CCS.
>>>
>>>> like PVC, ATS-M, DG2, etc. can all have multiple CCS where a reset on
>>>> one kills anything running on the others.
>>>>
>>>>
>>>> Matt
>>>>
>>>>> It helps in fixing following errors:
>>>>> 1. Test assertion failure function test_legacy_mode, file, Failed
>>>>> assertion: data[i].data == 0xc0ffee
>>>>>
>>>>> 2.Test assertion failure function xe_exec, file ../lib/xe/xe_ioctl.c,
>>>>> Failed assertion: __xe_exec(fd, exec) == 0, error: -125 != 0
>>>>>
>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
>>>>> ---
>>>>>   tests/intel/xe_exec_threads.c | 26 +++++++++++++++++++++++++-
>>>>>   1 file changed, 25 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/tests/intel/xe_exec_threads.c
>>>>> b/tests/intel/xe_exec_threads.c index 8083980f9..31af61dc9 100644
>>>>> --- a/tests/intel/xe_exec_threads.c
>>>>> +++ b/tests/intel/xe_exec_threads.c
>>>>> @@ -710,6 +710,17 @@ static void *thread(void *data)
>>>>>       return NULL;
>>>>>   }
>>>>>
>>>>> +static bool is_engine_contexts_victimized(int fd, unsigned int
>>>>> +flags) {
>>>>> +    if (!IS_LUNARLAKE(intel_get_drm_devid(fd)))
>>>>> +        return false;
>>> as above, I don't think we should add any platform check here. It's 
>>> impossible
>>> to keep it up to date and it's also testing the wrong thing.
>>> AFAIU you don't want parallel submission on engines that share the same
>>> reset domain. So, this is actually what should be tested.
>> Platforms like  PVC, ATS-M, DG2, etc. have some kind of WA/noWA which 
>> helps to run things parallelly on engines in same reset domain and 
>> apparently BMG/LNL does not have that kind of support so applicable 
>> for LNL/BMG with parallel submission on RCS/CCS only.
>>
>> @Harrison, John C please reply if you have any other input here.
> I don't get what you mean by 'have some kind of WA/noWA'. All 
> platforms with compute engines have shared reset domains. That is all 
> there is to it. I.e. everything from TGL onwards. That includes RCS 
> and all CCS engines. So RCS + CCS, CCS0 + CCS1, RCS + CC0 + CCS1, etc. 
> Any platform with multiple engines that talk to EUs will reset all of 
> those engines in parallel.
>
> There are w/a's which make the situation even worse. E.g. on DG2/MTL 
> you are not allowed to context switch one of those engines while 
> another is busy. Which means that if one hangs, they all hang - you 
> cannot just wait for other workloads to complete and/or pre-empt them 
> off the engine prior to doing the shared reset. But there is nothing 
> that makes it better.
>
> I assume we are talking about GuC triggered engine resets here? As 
> opposed to driver triggered full GT resets?
>
> The GuC will attempt to idle all other connected engines first by 
> pre-empting out any executing contexts. If those contexts are 
> pre-emptible then they will survive - GuC will automatically restart 
> them once the reset is complete. If they are not (or at least not 
> pre-emptible within the pre-emption timeout limit) then they will be 
> killed as collateral damage.
>
> What are the workloads being submitted by this test? Are the 
> pre-emptible spinners? If so, then they should survive (assuming you 
> don't have the DG2/MTL RCS/CCS w/a in effect). If they are 
> non-preemptible spinners then they are toast.
>
> John.
>
>
>>
>> Thanks,
>> Tejas
>>> Lucas De Marchi
>>>
>>>>> +
>>>>> +    if (flags & HANG)
>>>>> +        return true;
>>>>> +
>>>>> +    return false;
>>>>> +}
>>>>> +
>>>>>   /**
>>>>>    * SUBTEST: threads-%s
>>>>>    * Description: Run threads %arg[1] test with multi threads @@
>>>>> -955,9 +966,13 @@ static void threads(int fd, int flags)
>>>>>       bool go = false;
>>>>>       int n_threads = 0;
>>>>>       int gt;
>>>>> +    bool has_rcs = false;
>>>>>
>>>>> -    xe_for_each_engine(fd, hwe)
>>>>> +    xe_for_each_engine(fd, hwe) {
>>>>> +        if (hwe->engine_class == DRM_XE_ENGINE_CLASS_RENDER)
>>>>> +            has_rcs = true;
>>>>>           ++n_engines;
>>>>> +    }
>>>>>
>>>>>       if (flags & BALANCER) {
>>>>>           xe_for_each_gt(fd, gt)
>>>>> @@ -990,6 +1005,15 @@ static void threads(int fd, int flags)
PS: There is nothing in the function name that suggests this is a reset 
specific test. If this is common code for multiple tests including some 
that do not expect to hit resets, then removing all testing of compute 
engines is a bad idea.

John.

>>>>>       }
>>>>>
>>>>>       xe_for_each_engine(fd, hwe) {
>>>>> +        /* RCS/CCS sharing reset domain hence dependent engines.
>>>>> +         * When CCS is doing reset, all the contexts of RCS are
>>>>> +         * victimized, so skip the compute engine avoiding
>>>>> +         * parallel execution with RCS
>>>>> +         */
>>>>> +        if (has_rcs && hwe->engine_class ==
>>> DRM_XE_ENGINE_CLASS_COMPUTE &&
>>>>> + is_engine_contexts_victimized(fd, flags))
>>>>> +            continue;
>>>>> +
>>>>>           threads_data[i].mutex = &mutex;
>>>>>           threads_data[i].cond = &cond;
>>>>>   #define ADDRESS_SHIFT    39
>>>>> -- 
>>>>> 2.25.1
>>>>>
>>>> -- 
>>>> Matt Roper
>>>> Graphics Software Engineer
>>>> Linux GPU Platform Enablement
>>>> Intel Corporation
>