[Intel-gfx] [PATCH] drm/i915: Verify the engine after acquiring the active.lock

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Wed Sep 18 16:28:53 UTC 2019


On 18/09/2019 17:10, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-09-18 16:54:36)
>>
>> On 17/09/2019 16:17, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-09-17 15:59:25)
>>>>
>>>> On 16/09/2019 12:38, Chris Wilson wrote:
>>>>> When using virtual engines, the rq->engine is not stable until we hold
>>>>> the engine->active.lock (as the virtual engine may be exchanged with the
>>>>> sibling). Since commit 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy"),
>>>>> we may retire a request concurrently with resubmitting it to HW, so we need
>>>>> to be extra careful to verify that we are holding the correct lock for the
>>>>> request's active list. This is similar to the issue we saw with
>>>>> rescheduling the virtual requests, see sched_lock_engine().
>>>>>
>>>>> Or else:
>>>>>
>>>>> <4> [876.736126] list_add corruption. prev->next should be next (ffff8883f931a1f8), but was dead000000000100. (prev=ffff888361ffa610).
> ...
>>>>> <4> [876.736415] list_del corruption. prev->next should be ffff888361ffca10, but was ffff88840ac2c730
> 
>>> Yes. So preempt-to-busy introduces a window where the request is still
>>> on HW but we have returned it back to the submission queue. We catch up
>>> with the HW on the next process_csb, but it may have completed the
>>> request in the meantime (it is just not allowed to advance beyond the
>>> subsequent breadcrumb, and so is prevented from overtaking our knowledge of
>>> RING_TAIL, which is how we avoid telling the HW to go "backwards").
>>
>> Would it be sufficient to do:
>>
>>     engine = READ_ONCE(rq->engine);
>>     spin_lock(...);
>>     list_del(...);
>>     spin_unlock(engine->active.lock);
>>
>> To ensure the same engine is used? Although the oops is not about the
>> spinlock but about list corruption. How does the list get corrupted though?
>> list_del does not care which list the request is on.. If it is really
>> key to have the correct lock, then why is it enough to re-check the
>> engine after taking the lock? Why couldn't rq->engine change under the
>> lock again? rq->engine does get updated under that very lock, no?
> 
> Don't forget that list_del changes the list around it:
> list_del() {
> 	list->prev->next = list->next;
> 	list->next->prev = list->prev;
> }
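> 
> (Incidentally, the dead000000000100 in the splat above is just
> LIST_POISON1 -- the real list_del() in include/linux/list.h also poisons
> the removed entry, roughly:
> 
> list_del() {
> 	__list_del(list->prev, list->next);
> 	list->next = LIST_POISON1;
> 	list->prev = LIST_POISON2;
> }
> 
> so a later list_add() next to an entry that was already deleted under a
> different lock trips over the poison.)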
> 
> rq->engine can't change under the real->active.lock, as the assignment
> to rq->engine = (virtual, real) is made under the real->active.lock.

Sheesh.. I am re-inventing paradigms here.. :(
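
So if I understand the ploy in the patch correctly, it is roughly the
below (just a sketch on my side, the exact helper and list member names
may differ from the actual hunk):

	locked = READ_ONCE(rq->engine);
	spin_lock(&locked->active.lock);
	while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) {
		/* Raced with resubmission moving rq to another engine;
		 * chase the new engine's lock instead. */
		spin_unlock(&locked->active.lock);
		locked = engine;
		spin_lock(&locked->active.lock);
	}
	/* rq->engine is now stable: it only changes under this lock. */
	list_del(&rq->sched.link);
	spin_unlock(&locked->active.lock);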

> execlists_dequeue:
> 	real->active.lock
> 	ve->active.lock
> 
> __unwind_incomplete_requests:
> 	real->active.lock
> 
> Hmm. I trust the trick employed in the patch is well proven by this
> point, but if we took the nested ve lock inside __unwind, do we need to
> worry? Hmm.

Might be nicer indeed. We would only have ve->real nesting, never the 
opposite, right?
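
Something along these lines inside __unwind_incomplete_requests then, I
suppose (completely untested sketch just to illustrate the nesting, with
illustrative variable names; the caller already holds the physical
engine->active.lock per the above):

	if (rq->engine != engine) {
		struct intel_engine_cs *ve = rq->engine;

		/* Nest the virtual engine's lock under the physical
		 * engine's lock, same order as execlists_dequeue, and
		 * tell lockdep about the nesting. */
		spin_lock_nested(&ve->active.lock, SINGLE_DEPTH_NESTING);
		/* ... move the request back onto ve's list ... */
		spin_unlock(&ve->active.lock);
	}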

Regards,

Tvrtko

