[PATCH v2 01/23] drm/xe: Error handling in xe_force_wake_get()

Jani Nikula jani.nikula at linux.intel.com
Wed Sep 18 07:19:05 UTC 2024


On Wed, 18 Sep 2024, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray at intel.com> wrote:
> On 18-09-2024 00:20, Matthew Brost wrote:
>> On Tue, Sep 17, 2024 at 11:18:47AM +0530, Nilawar, Badal wrote:
>>>
>>>
>>> On 13-09-2024 18:47, Ghimiray, Himal Prasad wrote:
>>>>
>>>>
>>>> On 13-09-2024 16:56, Michal Wajdeczko wrote:
>>>>>
>>>>>
>>>>> On 13.09.2024 05:59, Ghimiray, Himal Prasad wrote:
>>>>>>
>>>>>>
>>>>>> On 13-09-2024 03:01, Michal Wajdeczko wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 12.09.2024 21:15, Himal Prasad Ghimiray wrote:
>>>>>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>>>>>> increment the reference count for the domain. This ensures that
>>>>>>>> subsequent _get calls do not incorrectly assume the
>>>>>>>> domain is awake. The
>>>>>>>> return value is a mask of domains whose reference counts were
>>>>>>>> incremented, and these domains need to be released using
>>>>>>>> xe_force_wake_put.
>>>>>>>>
>>>>>>>> The caller needs to compare the return value with the input domains to
>>>>>>>> determine the success or failure of the operation and
>>>>>>>> decide whether to
>>>>>>>> continue or return accordingly.
>>>>>>>>
>>>>>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>>>>>
>>>>>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/xe/xe_force_wake.c | 35
>>>>>>>> ++++++++++++++++++++++++ +-----
>>>>>>>>     1 file changed, 29 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>> index a64c14757c84..fa42d652d23f 100644
>>>>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>> @@ -150,26 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>>>>>                          (ffs(tmp__) - 1))) && \
>>>>>>>>                          domain__->reg_ctl.addr)
>>>>>>>>     +/**
>>>>>>>> + * xe_force_wake_get : Increase the domain refcount; if it was 0
>>>>>>>> initially, wake the domain
>>>>>>>
>>>>>>> while likely this is still recognized by the kernel-doc tool, this is
>>>>>>> not correct notation for the function() documentation
>>>>>>
>>>>>>
>>>>>> I assume you are suggesting %s/xe_force_wake_get/xe_force_wake_get()
>>>>>> will fix it.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.kernel.org/doc-guide/kernel-doc.html#function-
>>>>>>> documentation
>>>>>>>
>>>>>>>> + * @fw: struct xe_force_wake
>>>>>>>> + * @domains: forcewake domains to get refcount on
>>>>>>>> + *
>>>>>>>> + * Increment refcount for the force-wake domain. If the domain is
>>>>>>>> + * asleep, awaken it and wait for acknowledgment within the specified
>>>>>>>> + * timeout. If a timeout occurs, decrement the refcount.
>>>>>>>
>>>>>>> not sure if doc shall be 1:1 of low level implementation details
>>>>>>
>>>>>> Does this sound okay ?
>>>>>> This function takes references for the input @domains and wakes them if
>>>>>> they are asleep.
>>>>>>
>>>>>>>
>>>>>>>> + * The caller should compare the return value with the @domains to
>>>>>>>> + * determine the success or failure of the operation.
>>>>>>>> + *
>>>>>>>> + * Return: mask of refcount increased domains.
>>>>>>>
>>>>>>> if we return a 'mask' then maybe it should be of 'unsigned int' type?
>>>>>>
>>>>>> Agreed. Will fix in next version.
>>>>>>
>>>>>>>
>>>>>>>> If the return value is
>>>>>>>> + * equal to the input parameter @domains, the operation is considered
>>>>>>>> + * successful. Otherwise, the operation is considered a failure, and
>>>>>>>> + * the caller should handle the failure case, potentially returning
>>>>>>>> + * -ETIMEDOUT.
>>>>>>>
>>>>>>> it looks that all problems with the nice API is due to the
>>>>>>> XE_FORCEWAKE_ALL that is not a single domain ID and requires extra care
>>>>>>>
>>>>>>> maybe there should be different pair of functions:
>>>>>>
>>>>>> I am not convinced with different pair of functions:
>>>>>>
>>>>>> In current implementation:
>>>>>>
>>>>>> int mask = xe_force_wake_get(fw, domains)
>>>>>> if (mask != domains) {
>>>>>>       Non critical path continue with warning;
>>>>>>        or
>>>>>>       critical path:
>>>>>>           xe_force_wake_put(fw, mask);
>>>>>>           return -ETIMEDOUT;
>>>>>> }
>>>>>>
>>>>>> do_ops;
>>>>>> xe_force_wake_put(fw, mask);
>>>>>> return err;
>>>>>>
>>>>>> Above flow remains intact irrespective of individual domains or
>>>>>> FORCEWAKE_ALL.
>>>>>>
>>>>>> In case of individual domains if (mask != domains) can be replaced with
>>>>>> (!mask) and user can avoid xe_force_wake_put(fw, mask) in failure path
>>>>>> since mask is 0;
>>>>>
>>>>> so maybe we should have (by reinventing i915?):
>>>>>
>>>>> // opaque, but zero means failure/no domains are awake
>>>>> typedef unsigned long xe_wakeref_t;
>>>>>
>>>>>
>>>>> // caller should test for ref != 0
>>>>> // but shall call put if ref != 0
>>>>> xe_wakeref_t xe_force_wake_get(fw, enum xe_force_wake_domains d)
>>>>>
>>>>> // safe to call with ref == 0
>>>>> void xe_force_wake_put(fw, xe_wakeref_t ref)
>>>>>
>>>>>
>>>>> // helpers for critical work that must be sure about domain
>>>>>
>>>>> // compares opaque ref with explicit domain != ALL
>>>>> // can be used by the code that obtained the ref
>>>>> bool xe_wakeref_has_domain(xe_wakeref_t, enum xe_force_wake_domains d)
>>>>>
>>>>> // compares fw with explicit domain != ALL
>>>>> // can be used by the code that does not have direct access to the ref
>>>>> bool xe_force_wake_is_awake(fw, enum xe_force_wake_domains d)
>>>>>
>>>>>
>>>>> // helpers for checking correctness
>>>>> void xe_force_wake_assert_held(fw, enum xe_force_wake_domains d)
>>>>>
>>>>>
>>>>> then usage would be:
>>>>>
>>>>> xe_wakeref_t ref;
>>>>>
>>>>> ref = xe_force_wake_get(fw, d);
>>>>> if (ref) {
>>>>>      // ...
>>>>>      xe_force_wake_put(fw, ref);
>>>>> }
>>>>>
>>>>> or:
>>>>>
>>>>> xe_wakeref_t ref;
>>>>>
>>>>> ref = xe_force_wake_get(fw, ALL);
>>>>> if (xe_wakeref_has_domain(ref, d1))
>>>>>      // ... critical work1
>>>>> if (xe_wakeref_has_domain(ref, d2))
>>>>>      // ... critical work2
>>>>> xe_force_wake_put(fw, ref);
>>>>>
>>>>>
>>>>> so above will be very similar to what you have but by having explicit
>>>>> types IMO it will help connect all functions into proper use-case flow
>>>>
>>>>
>>>> Agreed implementation/usage will be same, will use explicit type for
>>>> clarity.
>>>> IMO typedef unsigned int xe_wakeref_t is sufficient instead of
>>>> typedef unsigned long xe_wakeref_t;
>>>
>>> I agree with this.
>>>
>> 
>> What? Really? I thought it was pretty clear rule in kernel programing
>> not use typedefs [1]. Reading through conditions acceptable and I don't
>> use anything applies to this series, maybe a) applies but not really
>> convinced. The example in a) is a pte_t which can likely change based on
>> platform target whereas here we only have one target and see no reason
>> this needs to be opaque.
>> 
>> Matt
>> 
>> [1] https://www.kernel.org/doc/html/v4.14/process/coding-style.html#typedefs
>
>
> While running checkpatch on my changes, patchwork had also issued a 
> WARNING: NEW_TYPEDEFS: do not add new typedefs. I reviewed the usage in 
> the Linux kernel tree and found it used in many places, which led me to 
> assume it was safe. I now realize that I should have been more careful 
> in understanding the context of its usage and referred to the kernel 
> coding guidelines. This was an oversight on my part.
>
> Since this doesn’t impact the CI or runtime, I will avoid reverting to 
> unsigned int immediately and will hold off until I receive the other 
> review comments. I will incorporate the changes to revert it in 
> subsequent versions while also addressing the other review comments.
> Thank you for bringing this to the attention.

If you end up replicating intel_wakeref_t from i915, and go as deep as
the rabbit hole goes, you'll realize intel_wakeref_t is a pointer
disguised as an unsigned long. It's a struct ref_tracker * when you have
certain configs enabled.

You could just use struct ref_tracker * everywhere. It's an opaque type
to start with.

BR,
Jani.


>
> BR
> Himal
>
>> 
>>> Regards,
>>> Badal
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> // for single domain where ret=0 is success, ret<0 is error
>>>>>>
>>>>>> This leads to caller only calling xe_force_wake_put incase of get
>>>>>> success. so in case of caller continuing with failure, he will need to
>>>>>> ensure the put is not called.
>>>>>>
>>>>>> for example:
>>>>>> int ret;
>>>>>>
>>>>>> ret = xe_force_wake_get(fw, DOMAIN_GT);
>>>>>> XE_WARN_ON(ret)
>>>>>> if(!ret)
>>>>>>       xe_force_wake_put(fw, DOMAIN_GT);
>>>>>>
>>>>>>> int xe_force_wake_get(fw, enum xe_force_wake_domain_id id);
>>>>>>> void xe_force_wake_put(fw, enum xe_force_wake_domain_id id);
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> // for all domain where ret=0 is success, ret<0 is error
>>>>>>> int int xe_force_wake_get_all(fw);
>>>>>>> void xe_force_wake_put_all(fw);
>>>>>>
>>>>>> In case of xe_force_wake_get_all(fw) failure, how the caller will know
>>>>>> which domains got awake and which failed ?
>>>>>>
>>>>>> ret = xe_force_wake_get_all(fw);
>>>>>> if(!ret)
>>>>>>      No way to put awake domains to sleep
>>>>>
>>>>> in case of failure, it would be the responsibility of the
>>>>> xe_force_wake_get_all() to put all partial awakes immediately, since it
>>>>> failed to awake all requested domains (same as in single domain case)
>>>>>
>>>>> but let's drop this idea
>>>>>
>>>>>>
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> // input: mask of domains, return: mask of domain
>>>>>>> unsigned int xe_force_wake_get_mask(fw, mask);
>>>>>>> void xe_force_wake_put_mask(fw, mask);
>>>>>>>
>>>>>>> this last one can be just main implementation (static or public if we
>>>>>>> really want to continue with random set of enabled domains)
>>>>>>>
>>>>>>>> + */
>>>>>>>>     int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>>>>                   enum xe_force_wake_domains domains)
>>>>>>>>     {
>>>>>>>>         struct xe_gt *gt = fw->gt;
>>>>>>>>         struct xe_force_wake_domain *domain;
>>>>>>>> -    enum xe_force_wake_domains tmp, woken = 0;
>>>>>>>> +    enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>>>>>>
>>>>>>> it looks that you're abusing even more all enum variables by treating
>>>>>>> them as plain integers
>>>>>>
>>>>>> Miss at my end. Will address them in next version.
>>>>>>
>>>>>>>
>>>>>>>>         unsigned long flags;
>>>>>>>> -    int ret = 0;
>>>>>>>> +    int ret = domains;
>>>>>>>>           spin_lock_irqsave(&fw->lock, flags);
>>>>>>>>         for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>>>>>             if (!domain->ref++) {
>>>>>>>> -            woken |= BIT(domain->id);
>>>>>>>> +            awake_rqst |= BIT(domain->id);
>>>>>>>>                 domain_wake(gt, domain);
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>> -    for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>>>>>> -        ret |= domain_wake_wait(gt, domain);
>>>>>>>> +    for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>>>>>> +        if (domain_wake_wait(gt, domain) == 0) {
>>>>>>>> +            awake_ack |= BIT(domain->id);
>>>>>>>> +        } else {
>>>>>>>> +            ret &= ~BIT(domain->id);
>>>>>>>> +            --domain->ref;
>>>>>>>> +        }
>>>>>>>>         }
>>>>>>>> -    fw->awake_domains |= woken;
>>>>>>>> +
>>>>>>>> +    fw->awake_domains |= awake_ack;
>>>>>>>>         spin_unlock_irqrestore(&fw->lock, flags);
>>>>>>>>           return ret;
>>>

-- 
Jani Nikula, Intel


More information about the Intel-xe mailing list