[Intel-gfx] [PATCH 01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark

Fri Jul 22 19:09:38 UTC 2022

On 7/21/2022 02:24, Tvrtko Ursulin wrote:
> On 21/07/2022 01:54, John Harrison wrote:
>> On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>>> On 19/07/2022 01:05, John Harrison wrote:
>>>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 13/07/2022 00:31, John.C.Harrison at Intel.com wrote:
>>>>>> From: Matthew Brost <matthew.brost at intel.com>
>>>>>>
>>>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline 
>>>>>> seqno to
>>>>>> seqno in memory on engine PM unpark. If a GT reset occurred these 
>>>>>> values
>>>>>> might not match as a kernel context could be skipped. This bug was
>>>>>> hidden by always switching to a kernel context on park (execlists
>>>>>> requirement).
>>>>>
>>>>> Reset of the kernel context? Under which circumstances does that 
>>>>> happen?
>>>> As per description, the issue is with full GT reset.
>>>>
>>>>>
>>>>> It is unclear if the claim is this to be a general problem or the 
>>>>> assert is only invalid with the GuC. Lack of a CI reported issue 
>>>>> suggests it is not a generic problem?
>>>> Currently it is not an issue because we always switch to the kernel 
>>>> context because that's how execlists works and the entire driver is 
>>>> fundamentally based on execlist operation. When we stop using the 
>>>> kernel context as a (non-functional) barrier when using GuC 
>>>> submission, then you would see an issue without this fix.
>
> Let me pick this point to try again - I am simply asking for a clear 
> description of steps which lead to the problem, instead of, what I 
> find are, generic and hard to penetrate statements of invalid 
> assumptions etc.
>
> I picked this spot because of this: "When we stop using the kernel 
> context as a (non-functional) barrier when using GuC submission, then 
> you would see an issue without this fix."
>
> I point to 363324292710 ("drm/i915/guc: Don't call 
> switch_to_kernel_context with GuC submission"). Hence it is not when 
> but it already happened. Which in my mind then does not compute - I 
> can't grok the explanation which appears to fall over on the first claim.
>
> Or perhaps the bug on is already firing today on every GuC enabled 
> machine in the CI? In which case there is a Fixes: link to be added?
>
> I have asked about, if we have 363324292710, and if this patch is 
> removing the seqno bug on, why it is not removing something more in 
> __engine_unpark, gated on "is guc"? Like ss there a point to 
> sanitizing the context which wasn't lost, because it wasn't used to 
> park the engine with?
>
> Or if the problem can't be hit with execlists (in case reset claim 
> from the commit message misleading), why shouldn't the bug on be 
> changed to contain the !guc condition instead of being remove?
>
> I am simply asking for a clear explanation of the conditions and steps 
> which lead to the bug on incorrectly firing. It doesn't have to be 
> long text or anything like that, just clear so we can close this and 
> move on.
>
> Regards,
>
> Tvrtko
@Matthew Brost, it's your patch, do you recall the details of when it 
was going bang? I vaguely recall something about it being hit in local 
testing pre-merge rather than by CI post-merge.

John.

>
>>>
>>> Issue is with GuC, GuC and full reset, or with full reset regardless 
>>> of the backend?
>> The issue is with code making invalid assumptions. The assumption is 
>> currently not failing because the execlist backend requires the use 
>> of a barrier context for a bunch of operations. The GuC backend does 
>> not require this. In fact, the barrier context does not function as a 
>> barrier when the scheduler is external to i915. Hence the desire to 
>> remove the use of the barrier context from generic i915 operation and 
>> make it only used when in execlist mode. At that point, the invalid 
>> assumption will no longer work and the BUG will fire.
>>
>>>
>>> If issue is only with GuC patch should have drm/i915/guc prefix as 
>>> minimum. But if it actually only becomes a problem when GuC backend 
>>> stops parking with the kernel context when I think the whole unpark 
>>> code should be refactored in a cleaner way than just removing the 
>>> one assert. Otherwise what is the point of leaving everything else 
>>> in there?
>>>
>>> Or if the issue is backend agnostic, *if* full reset happens to hit 
>>> during parking, then it is different. Wouldn't that be a race with 
>>> parking and reset which probably shouldn't happen to start with.
>>>
>> The issue is neither with GuC nor with resets, GT or otherwise. The 
>> issue is with generic i915 code making assumptions about backend 
>> implementations that are only correct for the execlist implementation.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>>>   1 file changed, 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref 
>>>>>> *wf)
>>>>>>                ce->timeline->seqno,
>>>>>> READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>>>                ce->ring->emit);
>>>>>> -        GEM_BUG_ON(ce->timeline->seqno !=
>>>>>> - READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>>>       }
>>>>>>         if (engine->unpark)
>>>>
>>