[Intel-gfx] [PATCH v2] drm/i915/gt: update request engine before removing virtual GuC engine
John Harrison
john.c.harrison at intel.com
Mon Jul 24 19:40:10 UTC 2023
On 7/19/2023 05:43, Tvrtko Ursulin wrote:
> On 19/07/2023 11:41, Andrzej Hajda wrote:
>> On 18.07.2023 17:48, Tvrtko Ursulin wrote:
>>> On 17/07/2023 19:03, John Harrison wrote:
>>>> On 7/13/2023 05:11, Tvrtko Ursulin wrote:
>>>>> On 13/07/2023 12:09, Andrzej Hajda wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 13.07.2023 09:39, Tvrtko Ursulin wrote:
>>>>>>> On 12/07/2023 19:54, John Harrison wrote:
>>>>>>>> On 7/12/2023 09:27, Andrzej Hajda wrote:
>>>>>>>>> On 12.07.2023 14:35, Tvrtko Ursulin wrote:
>>>>>>>>>> On 12/07/2023 13:18, Andrzej Hajda wrote:
>>>>>>>>>>> On 11.07.2023 17:27, Tvrtko Ursulin wrote:
>>>>>>>>>>>> On 11/07/2023 14:58, Andrzej Hajda wrote:
>>>>>>>>>>>>> On 11.07.2023 13:34, Andi Shyti wrote:
>>>>>>>>>>>>>> Hi Andrzej,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
>>>>>>>>>>>>>>>  1 file changed, 11 insertions(+)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>>>>>> index a0e3ef1c65d246..2c877ea5eda6f0 100644
>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>>>>>> @@ -3461,6 +3461,8 @@ static void guc_prio_fini(struct i915_request *rq, struct intel_context *ce)
>>>>>>>>>>>>>>>  static void remove_from_context(struct i915_request *rq)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>  	struct intel_context *ce = request_to_scheduling_context(rq);
>>>>>>>>>>>>>>> +	struct intel_engine_cs *engine;
>>>>>>>>>>>>>>> +	intel_engine_mask_t tmp;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  	GEM_BUG_ON(intel_context_is_child(ce));
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @@ -3478,6 +3480,15 @@ static void remove_from_context(struct i915_request *rq)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  	atomic_dec(&ce->guc_id.ref);
>>>>>>>>>>>>>>>  	i915_request_notify_execute_cb_imm(rq);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>> +	 * GuC virtual engine can disappear after this call, so let's assign
>>>>>>>>>>>>>>> +	 * something valid, as driver expects this to be always valid pointer.
>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>> +	for_each_engine_masked(engine, rq->engine->gt, rq->execution_mask, tmp) {
>>>>>>>>>>>>>>> +		rq->engine = engine;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> yes... here the context might lose the virtual engine... I
>>>>>>>>>>>>>>> wonder whether this is the right solution, though. Maybe we
>>>>>>>>>>>>>>> should set rq->engine = NULL; and check for NULL? Don't know.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Setting NULL causes occasional NULL pointer dereferences in
>>>>>>>>>>>>>>> i915_request_wait_timeout:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> mutex_release(&rq->engine->gt->reset.mutex.dep_map, _THIS_IP_)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rq->engine after removing rq from the context is (IMHO) used as
>>>>>>>>>>>>>>> a set of aliases for gt and i915 (despite rq itself containing
>>>>>>>>>>>>>>> an alias to i915).
>>>>>>>>>>>>>> Without investigating further: maybe that code is not even
>>>>>>>>>>>>>> supposed to be executed at this point, if the request's
>>>>>>>>>>>>>> assigned virtual engine has been removed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Real tests show it is executed, and i915_request_wait_timeout
>>>>>>>>>>>>> is quite a generic function. I guess this is quite a typical
>>>>>>>>>>>>> use-case; the only question is about timing - what happens
>>>>>>>>>>>>> earlier: finalization of i915_request_wait_timeout or context
>>>>>>>>>>>>> removal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The other place rq->engine is accessed after context removal is
>>>>>>>>>>>>> i915_fence_release - there is a long comment there regarding
>>>>>>>>>>>>> virtual contexts and reusing retired requests. Anyway, calling
>>>>>>>>>>>>> intel_engine_is_virtual(rq->engine) there is risky without this
>>>>>>>>>>>>> patch, and KASAN complains clearly about it:
>>>>>>>>>>>>> http://gfx-ci.igk.intel.com/tree/drm-tip/kasan.html?testfilter=gem_exec_balancer
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like a bug introduced in bcb9aa45d5a0 ("Revert
>>>>>>>>>>>> "drm/i915: Hold reference to intel_context over life of
>>>>>>>>>>>> i915_request""), which was a partial revert of 1e98d8c52ed5
>>>>>>>>>>>> ("drm/i915: Hold reference to intel_context over life of
>>>>>>>>>>>> i915_request").
>>>>>>>>>>>>
>>>>>>>>>>>> I.e. if 1e98d8c52ed5 recognised the problem with the
>>>>>>>>>>>> disappearing rq->engine, then I am confused how bcb9aa45d5a0
>>>>>>>>>>>> left the rq->engine dereference in there after removing the
>>>>>>>>>>>> extra reference.
>>>>>>>>>>>>
>>>>>>>>>>>> Could it be that the intel_engine_is_virtual check simply
>>>>>>>>>>>> needs to be removed from i915_fence_release, restoring
>>>>>>>>>>>> things to how they were before 1e98d8c52ed5? Could you try
>>>>>>>>>>>> it out?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have already tried something similar [1] and the KASAN bugs
>>>>>>>>>>> disappeared, or more precisely the gem_exec_balancer tests
>>>>>>>>>>> passed. But I have been warned by Nirmoy that GuC virtual
>>>>>>>>>>> engines can be created for only one real engine (i.e.
>>>>>>>>>>> is_power_of_2(rq->execution_mask) is true but rq->engine
>>>>>>>>>>> points to a virtual engine).
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://patchwork.freedesktop.org/series/118879/
>>>>>>>>>>
>>>>>>>>>> Ugh.. Try involving media UMD folks to see if they need a
>>>>>>>>>> single-engine virtual engine? Or we could always just not
>>>>>>>>>> create it in the driver, I mean just use the physical one.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In case there is a single physical engine,
>>>>>>>>> intel_engine_create_virtual falls back to intel_context_create
>>>>>>>>> (no virtual engine), but in the case of parallel contexts there
>>>>>>>>> is a special KMD flag, FORCE_VIRTUAL, which enforces a virtual
>>>>>>>>> engine even for a single physical engine. So it seems to be a
>>>>>>>>> KMD concept.
>>>>>>>>>
>>>>>>>>> Anyway, is it worth investigating how to make
>>>>>>>>> "is_power_of_2(rq->execution_mask)" an indication of a dangling
>>>>>>>>> engine pointer? It will not help in the 1st case:
>>>>>>>>> mutex_release(&rq->engine->gt->reset.mutex.dep_map, _THIS_IP_)
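>>>>>>>>>
>>>>>>>>> To illustrate the hole (sketch only, concrete engine picked for
>>>>>>>>> the example): a parallel context created with FORCE_VIRTUAL on
>>>>>>>>> top of a single physical engine gives us something like
>>>>>>>>>
>>>>>>>>> 	rq->execution_mask == BIT(VCS0);	/* power of 2... */
>>>>>>>>> 	intel_engine_is_virtual(rq->engine);	/* ...yet true */
>>>>>>>>>
>>>>>>>>> so is_power_of_2(rq->execution_mask) no longer proves that
>>>>>>>>> rq->engine points at a stable physical engine.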
>>>>>>>>>
>>>>>>>>>
>>>>>>>> There seems to be a fundamental problem here. Object 1 (rq) is
>>>>>>>> holding a pointer to a reference-counted and transient object 2
>>>>>>>> (engine) without taking a reference for itself. That is a Bad
>>>>>>>> Thing(tm).
>>>>>>
>>>>>> Engines are not ref counted (at least directly); hardware engines
>>>>>> even seem to be persistent across the whole life of i915.
>>>>>> I guess before the introduction of virtual engines the assumption
>>>>>> about the validity of rq->engine was correct, and developers used
>>>>>> it to access rq->engine->gt, rq->engine->i915, etc.
>>>>>> So the problems described here are probably leftovers of that
>>>>>> change.
>>>>>> After virtual engines were introduced,
>>>>>> "is_power_of_2(rq->execution_mask)" was used to detect that
>>>>>> rq->engine is virtual. And after adding parallel engines it does
>>>>>> not seem to be a valid check anymore.
>>>>>>
>>>>>>>> I'm not following the description in the revert patch as to why
>>>>>>>> rq can't ref count the context/engine. Is there actually a
>>>>>>>> recursive counting problem? Or is it just a lifetime issue
>>>>>>>> caused by requests hanging around indefinitely because they are
>>>>>>>> locked by a user process?
>>>>>>>>
>>>>>>>> Either way, jumping through convoluted hoops to ensure the code
>>>>>>>> does not attempt to dereference a dangling pointer seems like
>>>>>>>> the wrong fix. Removing the engine pointer when the request is
>>>>>>>> completed and no longer dependent upon an engine (but before
>>>>>>>> the engine can possibly be destroyed) seems like a much better
>>>>>>>> solution. And then making the request handling code check for
>>>>>>>> and cope with a null engine pointer. It sounds like the only
>>>>>>>> problem there is the above mutex, but there is an alternate
>>>>>>>> route to that? Although why a completed request would need
>>>>>>>> access to a GT reset mutex seems confusing. If the request is
>>>>>>>> done, then what connection does it still have to the GT?
>>>>>>>
>>>>>>> Agreed in principle, but the question is how invasive it would
>>>>>>> be to change the rules.
>>>>>>>
>>>>>>> With the latest info that the issue is really just the GuC
>>>>>>> _parallel_ engine setup, and looking at the code, I wonder if we
>>>>>>> couldn't just mark rq->flags with a "kernel context request"
>>>>>>> flag. The code in i915_fence_release claims the rq pool is only
>>>>>>> relevant for those, so it sounds like it would be safe to skip
>>>>>>> everything else based on that new flag.
>>>>>>>
>>>>>>> For the mutex_release path, presumably the bad deref is only
>>>>>>> _after_ the wait, right? (Only once the request has been retired.)
>>>>>>>
>>>>>>> In which case caching the gt pointer at the start of
>>>>>>> i915_request_wait_timeout would be sufficient.
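>>>>>>>
>>>>>>> Something like this (untested sketch; signature from memory):
>>>>>>>
>>>>>>> 	long i915_request_wait_timeout(struct i915_request *rq,
>>>>>>> 				       unsigned int flags, long timeout)
>>>>>>> 	{
>>>>>>> 		/* rq->engine is valid on entry; cache the gt so the
>>>>>>> 		 * lockdep annotations no longer chase rq->engine
>>>>>>> 		 * after the request has been retired */
>>>>>>> 		struct intel_gt *gt = rq->engine->gt;
>>>>>>> 		...
>>>>>>> 		mutex_acquire(&gt->reset.mutex.dep_map, 0, 0, _THIS_IP_);
>>>>>>> 		...
>>>>>>> 		mutex_release(&gt->reset.mutex.dep_map, _THIS_IP_);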
>>>>>>>
>>>>>>> That should be a few lines of fixup overall, and then the idea
>>>>>>> of allowing rq->engine to be reset to NULL can be explored more
>>>>>>> leisurely.
>>>>>>
>>>>>> I guess:
>>>>>> - setting engine to NULL in remove_from_context,
>>>>>> - caching gt pointer,
>>>>>> - checking for null pointer in i915_fence_release
>>>>>>
>>>>>> should be enough to solve the current issue. However, I am not
>>>>>> sure there are no more dragons hidden in other execution paths. I
>>>>>> will try to inspect the code.
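>>>>>>
>>>>>> Roughly (untested sketch, i915_fence_release condition elided):
>>>>>>
>>>>>> 	/* intel_guc_submission.c: remove_from_context() */
>>>>>> 	atomic_dec(&ce->guc_id.ref);
>>>>>> 	i915_request_notify_execute_cb_imm(rq);
>>>>>> +	/* the (virtual) engine may be freed from this point on */
>>>>>> +	WRITE_ONCE(rq->engine, NULL);
>>>>>>
>>>>>> 	/* i915_request.c: i915_fence_release() */
>>>>>> -	if (!intel_engine_is_virtual(rq->engine) && ...)
>>>>>> +	if (rq->engine && !intel_engine_is_virtual(rq->engine) && ...)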
>>>>>
>>>>> What you have in this patch - cheating by replacing the engine -
>>>>> *might* work for the short term, *if* you can make sure all
>>>>> parallel readers will see the updated rq->engine pointer in time,
>>>>> before the old one gets freed.
>>>>>
>>>>> For the longer term solution - maybe making the engine reference
>>>>> counted?
>>>> That was the point of the original solution - having the request
>>>> reference count the context. The context is what owns the virtual
>>>> engine. So ensuring that the context can't be destroyed while there
>>>> are requests outstanding on it ensured that rq->engine would always
>>>> be valid. Nice, simple and clean solution. So why did that get
>>>> reverted? What is the problem with VM_BIND and having requests
>>>> ensure that their resources stay alive for the lifetime of the
>>>> request?
>>>
>>> Don't ask me, but perhaps it does read a bit vague from the commit
>>> message on its own:
>>>
>>> commit bcb9aa45d5a0e11ef91245330c53cde214d15e8d (tag: intel/CI_DIGN_563)
>>> Author: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
>>> Date:   Wed Jun 15 00:13:47 2022 +0530
>>>
>>>     Revert "drm/i915: Hold reference to intel_context over life of i915_request"
>>>
>>>     This reverts commit 1e98d8c52ed5dfbaf273c4423c636525c2ce59e7.
>>>
>>>     The problem with this patch is that it makes i915_request to hold a
>>>     reference to intel_context, which in turn holds a reference on the VM.
>>>     This strong back referencing can lead to reference loops which leads
>>>     to resource leak.
>>>
>>>     An example is the upcoming VM_BIND work which requires VM to hold
>>>     a reference to some shared VM specific BO. But this BO's dma-resv
>>>     fences holds reference to the i915_request thus leading to reference
>>>     loop.
>>>
>>>     v2:
>>>       Do not use reserved requests for virtual engines
>>>
>>> So I don't know what the leak was, or whether it was a permanent leak (?!) or not.
>>>
>>> Given VM_BIND has been abandoned, maybe this could even be
>>> un-reverted. I don't see a problem with holding a reference for the
>>> request lifetime right now, but I could be wrong..
>>
>> Un-revert, or alternatively hold a reference to the context only in
>> the case of virtual engines, as in that case their lifetimes are the
>> same?
>
> IMO it is simpler to go for always, especially if we have doubts that
> execlists virtual engines might have the same issue, just harder to
> hit due to the RCU free path. (I have doubts at least.)
>
> It also probably does not make sense to have both
> intel_engine_is_virtual and is_power_of_2 checks in
> i915_fence_release. Since intel_engine_is_virtual will be safe with a
> reference, the is_power_of_2 hack can go. So not a direct un-revert,
> but an un-revert with edits/improvements.
Sounds good to me. The 'power of 2 means virtual' thing does sound
quite hacky.
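
For reference, the shape of the un-revert (going from memory of what
1e98d8c52ed5 did, so a sketch rather than the exact patch) is to pin
the context, and with it the virtual engine, for the life of the
request:

	/* __i915_request_create(): the request holds the context... */
	rq->context = intel_context_get(ce);

	/* ...and i915_fence_release() drops it, so rq->engine stays
	 * valid for as long as the request itself is alive */
	intel_context_put(rq->context);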
>
> First thing though would be to get hold of Niranjana and John to bless
> the whole plan, given they were involved in the original reference
> counting and the revert.
I believe it was actually Matthew Brost that wrote the original
reference counting patch. I recall that I reviewed it and it looked good
to me at the time. I was not involved in the revert. I only realised the
revert had happened because I saw this thread and got confused as to why
there was a problem at all!
John.
>
> Regards,
>
> Tvrtko
>
>>
>> Regards
>> Andrzej
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> John.
>>>>
>>>>
>>>>>
>>>>> Or, if setting rq->engine to NULL, then evaluating the paths which
>>>>> dereference it. AFAIR these request retirement races have been
>>>>> generally tricky/annoying.
>>>>>
>>>>> For instance under the i915_request_wait_timeout chain.
>>>>>
>>>>> 1)
>>>>> __i915_spin_request - could be racy if the request gets retired
>>>>> between the i915_request_wait_timeout/dma_fence_is_signaled check
>>>>> and __i915_spin_request dereferencing rq->engine->props?
>>>>>
>>>>> 2)
>>>>> intel_rps_boost - claims to be safe by serialising via
>>>>> i915_request_retire/I915_FENCE_FLAG_BOOST, but is it? What prevents
>>>>> retirement tearing down the engine between:
>>>>>
>>>>> if (!test_and_set_bit(I915_FENCE_FLAG_BOOST, &rq->fence.flags)) {
>>>>>
>>>>> ---> HERE - if whole retirement happens here <----
>>>>>
>>>>> struct intel_rps *rps = &READ_ONCE(rq->engine)->gt->rps;
>>>>>
>>>>> 3)
>>>>> __intel_engine_flush_submission - could be racy too? What if the
>>>>> whole process of consuming the request by the backend and retiring
>>>>> it happens between these two lines:
>>>>>
>>>>> if (i915_request_is_ready(rq))
>>>>>
>>>>> ---> HERE <---
>>>>> __intel_engine_flush_submission(rq->engine, false);
>>>>>
>>>>> Then "is ready" can be true, but by the 2nd line request retired
>>>>> and rq->engine freed/NULL already.
>>>>>
>>>>> Let's hope I am just creating unwarranted panic by having been too
>>>>> far away from this area of the driver for a while. :) But if I am
>>>>> not, then it could be that rq->engine is just too problematic
>>>>> overall, and perhaps we should have a look into making it
>>>>> reference counted by the request.
>>>>>
>>>>>> Btw, there is rq->i915 but the code only uses rq->engine->i915 -
>>>>>> which way shall we go? Remove the former or the latter?
>>>>>
>>>>> Simpler/shorter option should be better.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>
>>