[PATCH 5/6] drm/i915/gt: Serialize GRDOM access between multiple engine resets
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Jun 28 15:49:23 UTC 2022
Hi,
On 27/06/2022 10:00, Mauro Carvalho Chehab (by way of Mauro Carvalho
Chehab <mauro.chehab at linux.intel.com>) wrote:
> Hi Tvrtko,
>
> On Fri, 24 Jun 2022 09:34:21 +0100
> Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com> wrote:
>
>> On 23/06/2022 12:17, Andi Shyti wrote:
>>> Hi Mauro,
>>>
>>> On Wed, Jun 15, 2022 at 04:27:39PM +0100, Mauro Carvalho Chehab wrote:
>>>> From: Chris Wilson <chris.p.wilson at intel.com>
>>>>
>>>> Don't allow two engines to be reset in parallel, as they would both
>>>> try to select a reset bit (and send requests to common registers)
>>>> and wait on that register, at the same time. Serialize control of
>>>> the reset requests/acks using the uncore->lock, which will also ensure
>>>> that no other GT state changes at the same time as the actual reset.
>>>>
>>>> Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
>>>>
>>>> Reported-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>> Cc: Andi Shyti <andi.shyti at intel.com>
>>>> Cc: stable at vger.kernel.org
>>>> Acked-by: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>>> Signed-off-by: Mauro Carvalho Chehab <mchehab at kernel.org>
>>>
>>> Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>
>>
>> Notice I had a bunch of questions and asks in this series so please do
>> not merge until those are addressed.
>>
>> In this particular patch (and some others) for instance Fixes: tag, at
>> least against that sha, shouldn't be there.
>
> Hmm... I sent an answer to your points, but I can't see it at:
>
> https://lore.kernel.org/all/160e613f-a0a8-18ff-5d4b-249d4280caa8@linux.intel.com/
>
> Maybe it got lost somewhere, I dunno.
Yeah, no replies received on my end I'm afraid.
>
> Yeah, indeed the fixes tag on patch 5/6 should be removed as this is not
> directly related to changeset 7938d61591d3. Yet, this one is required for
> patch 6 to work.
>
> The other patches on this series, though, are modifying the code
> introduced by changeset 7938d61591d3.
Modifying the code does not strictly means something is a fix for a
certain patch.
> Patch 2 is clearly a workaround needed for TLB cache invalidation to
> work on some GPUs. So, while not related to Broadwell, they're also
> fixing some TLB cache issues. So, IMO, it should keep the fixes.
Umesh commented that patch 2 is not needed - who is right then? :)
> I tried to port just the two serialize patches to drm-tip, in order
> to solve the issues on Broadwell, but it didn't work, as the logic
> inside the spinlock could be calling schedule() with a spinlock hold:
>
> Jun 14 17:38:48 silver kernel: [ 23.227813] BUG: sleeping function called from invalid context at drivers/gpu/drm/i915/intel_uncore.c:2496
> Jun 14 17:38:48 silver kernel: [ 23.227816] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 37, name: kworker/u8:1
> Jun 14 17:38:48 silver kernel: [ 23.227818] preempt_count: 1, expected: 0
> Jun 14 17:38:48 silver kernel: [ 23.227819] RCU nest depth: 0, expected: 0
> Jun 14 17:38:48 silver kernel: [ 23.227820] 5 locks held by kworker/u8:1/37:
> Jun 14 17:38:48 silver kernel: [ 23.227822] #0: ffff88811159b538 ((wq_completion)i915){+.+.}-{0:0}, at: process_one_work+0x1e0/0x580
> Jun 14 17:38:48 silver kernel: [ 23.227831] #1: ffffc90000183e60 ((work_completion)(&(&i915->mm.free_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1e0/0x580
> Jun 14 17:38:48 silver kernel: [ 23.227837] #2: ffff88811b34c5e8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: __i915_gem_free_objects+0xba/0x210 [i915]
> Jun 14 17:38:48 silver kernel: [ 23.228283] #3: ffff88810a66c2d8 (>->tlb_invalidate_lock){+.+.}-{3:3}, at: intel_gt_invalidate_tlbs+0xe7/0x4d0 [i915]
> Jun 14 17:38:48 silver kernel: [ 23.228663] #4: ffff88810a668f28 (&uncore->lock){-.-.}-{2:2}, at: intel_gt_invalidate_tlbs+0x115/0x4d0 [i915]
>
> I didn't investigate the root cause, but it seems related to PM, so
> patches 1 and 3 seem to be required for the serialization logic
> to actually work.
Yes that is clear, what is needed is the split of the for_each_engine
loop into request and wait.
But question is how much backporting trouble will the _extra_ changes
patch 1 brings create.
In the ideal world patch 1 wouldn't be an optimising one, I mean adding
skipping of TLB invalidations on idle engines but just the loop split.
That would make it smaller and more suitable for Cc: stable. Because
both i915_gem_pages.c and intel_gt_pm.h hunks wouldn't even be there.
And the refactor in intel_gt_invalidate_tlbs would be smaller since it
wouldn't be adding the engine awake checks...
> So, I would keep the Fixes: tag mentioning changeset 7938d61591d3
> on patches: 1, 2, 3 and 6.
... which for me means a different patch 1, followed by patch 6 (moved
to be patch 2) would be ideal stable material.
Then we have the current patch 2 which is open/unknown (to me at least).
And the rest seem like optimisations which shouldn't be tagged as fixes.
Apart from patch 5 which should be cc: stable, but no fixes as agreed.
Could you please double check if what I am suggesting here is feasible
to implement and if it is just send those minimal patches out alone?
Maybe it even makes sense to squash such 1&2 into a single patch.
Again, since the original TLB flush was backported quite far back into
long term stable releases I think it would be much easier to really have
a minimal patch/series to fix Broadwell in those kernels.
Regards,
Tvrtko
>
> Yet, IMO the entire series should be merged on -stable.
>
> If that's OK for you and there's no additional issues to be
> addressed, I'll submit a v2 of this series removing the Fixes tag
> from patches 4 and 5.
>
> Regards,
> Mauro
More information about the dri-devel
mailing list