[Intel-gfx] [PATCH 01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark
John Harrison
john.c.harrison at intel.com
Fri Jul 22 19:09:38 UTC 2022
On 7/21/2022 02:24, Tvrtko Ursulin wrote:
> On 21/07/2022 01:54, John Harrison wrote:
>> On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>>> On 19/07/2022 01:05, John Harrison wrote:
>>>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 13/07/2022 00:31, John.C.Harrison at Intel.com wrote:
>>>>>> From: Matthew Brost <matthew.brost at intel.com>
>>>>>>
>>>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline
>>>>>> seqno to
>>>>>> seqno in memory on engine PM unpark. If a GT reset occurred these
>>>>>> values
>>>>>> might not match as a kernel context could be skipped. This bug was
>>>>>> hidden by always switching to a kernel context on park (execlists
>>>>>> requirement).
>>>>>
>>>>> Reset of the kernel context? Under which circumstances does that
>>>>> happen?
>>>> As per description, the issue is with full GT reset.
>>>>
>>>>>
>>>>> It is unclear if the claim is this to be a general problem or the
>>>>> assert is only invalid with the GuC. Lack of a CI reported issue
>>>>> suggests it is not a generic problem?
>>>> Currently it is not an issue because we always switch to the kernel
>>>> context because that's how execlists works and the entire driver is
>>>> fundamentally based on execlist operation. When we stop using the
>>>> kernel context as a (non-functional) barrier when using GuC
>>>> submission, then you would see an issue without this fix.
>
> Let me pick this point to try again - I am simply asking for a clear
> description of steps which lead to the problem, instead of, what I
> find are, generic and hard to penetrate statements of invalid
> assumptions etc.
>
> I picked this spot because of this: "When we stop using the kernel
> context as a (non-functional) barrier when using GuC submission, then
> you would see an issue without this fix."
>
> I point to 363324292710 ("drm/i915/guc: Don't call
> switch_to_kernel_context with GuC submission"). Hence it is not when
> but it already happened. Which in my mind then does not compute - I
> can't grok the explanation which appears to fall over on the first claim.
>
> Or perhaps the bug on is already firing today on every GuC enabled
> machine in the CI? In which case there is a Fixes: link to be added?
>
> I have asked about, if we have 363324292710, and if this patch is
> removing the seqno bug on, why it is not removing something more in
> __engine_unpark, gated on "is guc"? Like ss there a point to
> sanitizing the context which wasn't lost, because it wasn't used to
> park the engine with?
>
> Or if the problem can't be hit with execlists (in case reset claim
> from the commit message misleading), why shouldn't the bug on be
> changed to contain the !guc condition instead of being remove?
>
> I am simply asking for a clear explanation of the conditions and steps
> which lead to the bug on incorrectly firing. It doesn't have to be
> long text or anything like that, just clear so we can close this and
> move on.
>
> Regards,
>
> Tvrtko
@Matthew Brost, it's your patch, do you recall the details of when it
was going bang? I vaguely recall something about it being hit in local
testing pre-merge rather than by CI post-merge.
John.
>
>>>
>>> Issue is with GuC, GuC and full reset, or with full reset regardless
>>> of the backend?
>> The issue is with code making invalid assumptions. The assumption is
>> currently not failing because the execlist backend requires the use
>> of a barrier context for a bunch of operations. The GuC backend does
>> not require this. In fact, the barrier context does not function as a
>> barrier when the scheduler is external to i915. Hence the desire to
>> remove the use of the barrier context from generic i915 operation and
>> make it only used when in execlist mode. At that point, the invalid
>> assumption will no longer work and the BUG will fire.
>>
>>>
>>> If issue is only with GuC patch should have drm/i915/guc prefix as
>>> minimum. But if it actually only becomes a problem when GuC backend
>>> stops parking with the kernel context when I think the whole unpark
>>> code should be refactored in a cleaner way than just removing the
>>> one assert. Otherwise what is the point of leaving everything else
>>> in there?
>>>
>>> Or if the issue is backend agnostic, *if* full reset happens to hit
>>> during parking, then it is different. Wouldn't that be a race with
>>> parking and reset which probably shouldn't happen to start with.
>>>
>> The issue is neither with GuC nor with resets, GT or otherwise. The
>> issue is with generic i915 code making assumptions about backend
>> implementations that are only correct for the execlist implementation.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>>> 1 file changed, 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref
>>>>>> *wf)
>>>>>> ce->timeline->seqno,
>>>>>> READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>>> ce->ring->emit);
>>>>>> - GEM_BUG_ON(ce->timeline->seqno !=
>>>>>> - READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>>> }
>>>>>> if (engine->unpark)
>>>>
>>
More information about the Intel-gfx
mailing list