[PATCH v2 01/23] drm/xe: Error handling in xe_force_wake_get()

Ghimiray, Himal Prasad himal.prasad.ghimiray at intel.com
Fri Sep 13 13:17:37 UTC 2024



On 13-09-2024 16:56, Michal Wajdeczko wrote:
> 
> 
> On 13.09.2024 05:59, Ghimiray, Himal Prasad wrote:
>>
>>
>> On 13-09-2024 03:01, Michal Wajdeczko wrote:
>>>
>>>
>>> On 12.09.2024 21:15, Himal Prasad Ghimiray wrote:
>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>> increment the reference count for the domain. This ensures that
>>>> subsequent _get calls do not incorrectly assume the domain is awake. The
>>>> return value is a mask of domains whose reference counts were
>>>> incremented, and these domains need to be released using
>>>> xe_force_wake_put.
>>>>
>>>> The caller needs to compare the return value with the input domains to
>>>> determine the success or failure of the operation and decide whether to
>>>> continue or return accordingly.
>>>>
>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>
>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_force_wake.c | 35 +++++++++++++++++++++++++-----
>>>>    1 file changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
>>>> b/drivers/gpu/drm/xe/xe_force_wake.c
>>>> index a64c14757c84..fa42d652d23f 100644
>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>> @@ -150,26 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>                         (ffs(tmp__) - 1))) && \
>>>>                         domain__->reg_ctl.addr)
>>>>    +/**
>>>> + * xe_force_wake_get : Increase the domain refcount; if it was 0
>>>> initially, wake the domain
>>>
>>> while likely this is still recognized by the kernel-doc tool, this is
>>> not correct notation for the function() documentation
>>
>>
>> I assume you are suggesting %s/xe_force_wake_get/xe_force_wake_get()
>> will fix it.
>>
>>
>>>
>>> [1]
>>> https://docs.kernel.org/doc-guide/kernel-doc.html#function-documentation
>>>
>>>> + * @fw: struct xe_force_wake
>>>> + * @domains: forcewake domains to get refcount on
>>>> + *
>>>> + * Increment refcount for the force-wake domain. If the domain is
>>>> + * asleep, awaken it and wait for acknowledgment within the specified
>>>> + * timeout. If a timeout occurs, decrement the refcount.
>>>
>>> not sure if doc shall be 1:1 of low level implementation details
>>
>> Does this sound okay ?
>> This function takes references for the input @domains and wakes them if
>> they are asleep.
>>
>>>
>>>> + * The caller should compare the return value with the @domains to
>>>> + * determine the success or failure of the operation.
>>>> + *
>>>> + * Return: mask of refcount increased domains.
>>>
>>> if we return a 'mask' then maybe it should be of 'unsigned int' type?
>>
>> Agreed. Will fix in next version.
>>
>>>
>>>> If the return value is
>>>> + * equal to the input parameter @domains, the operation is considered
>>>> + * successful. Otherwise, the operation is considered a failure, and
>>>> + * the caller should handle the failure case, potentially returning
>>>> + * -ETIMEDOUT.
>>>
>>> it looks like all the problems with a nice API are due to
>>> XE_FORCEWAKE_ALL, which is not a single domain ID and requires extra care
>>>
>>> maybe there should be different pair of functions:
>>
>> I am not convinced by a different pair of functions:
>>
>> In current implementation:
>>
>> int mask = xe_force_wake_get(fw, domains);
>> if (mask != domains) {
>>      Non critical path continue with warning;
>>       or
>>      critical path:
>>          xe_force_wake_put(fw, mask);
>>          return -ETIMEDOUT;
>> }
>>
>> do_ops;
>> xe_force_wake_put(fw, mask);
>> return err;
>>
>> Above flow remains intact irrespective of individual domains or
>> FORCEWAKE_ALL.
>>
>> In case of individual domains, if (mask != domains) can be replaced with
>> if (!mask), and the user can skip xe_force_wake_put(fw, mask) in the
>> failure path since mask is 0.
> 
> so maybe we should have (by reinventing i915?):
> 
> // opaque, but zero means failure/no domains are awake
> typedef unsigned long xe_wakeref_t;
> 
> 
> // caller should test for ref != 0
> // but shall call put if ref != 0
> xe_wakeref_t xe_force_wake_get(fw, enum xe_force_wake_domains d)
> 
> // safe to call with ref == 0
> void xe_force_wake_put(fw, xe_wakeref_t ref)
> 
> 
> // helpers for critical work that must be sure about domain
> 
> // compares opaque ref with explicit domain != ALL
> // can be used by the code that obtained the ref
> bool xe_wakeref_has_domain(xe_wakeref_t, enum xe_force_wake_domains d)
> 
> // compares fw with explicit domain != ALL
> // can be used by the code that does not have direct access to the ref
> bool xe_force_wake_is_awake(fw, enum xe_force_wake_domains d)
> 
> 
> // helpers for checking correctness
> void xe_force_wake_assert_held(fw, enum xe_force_wake_domains d)
> 
> 
> then usage would be:
> 
> xe_wakeref_t ref;
> 
> ref = xe_force_wake_get(fw, d);
> if (ref) {
> 	// ...
> 	xe_force_wake_put(fw, ref);
> }
> 
> or:
> 
> xe_wakeref_t ref;
> 
> ref = xe_force_wake_get(fw, ALL);
> if (xe_wakeref_has_domain(ref, d1))
> 	// ... critical work1
> if (xe_wakeref_has_domain(ref, d2))
> 	// ... critical work2
> xe_force_wake_put(fw, ref);
> 
> 
> so above will be very similar to what you have but by having explicit
> types IMO it will help connect all functions into proper use-case flow


Agreed, the implementation and usage will stay the same; I will use an
explicit type for clarity.
IMO typedef unsigned int xe_wakeref_t is sufficient instead of
typedef unsigned long xe_wakeref_t;
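
To make the agreed flow concrete, here is a minimal userspace sketch of
the typed-reference idea, assuming an unsigned int mask as the opaque
reference. All mock_* names, the three domains, and the "broken" field
that simulates an acknowledgment timeout are hypothetical and exist only
for illustration; this is not the driver code and it omits locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver types; not the real xe definitions. */
enum mock_fw_domains {
	MOCK_FW_GT     = 1 << 0,
	MOCK_FW_RENDER = 1 << 1,
	MOCK_FW_MEDIA  = 1 << 2,
	MOCK_FW_ALL    = MOCK_FW_GT | MOCK_FW_RENDER | MOCK_FW_MEDIA,
};
#define MOCK_FW_COUNT 3

/* Opaque to callers; zero means no domain reference is held. */
typedef unsigned int mock_wakeref_t;

struct mock_force_wake {
	unsigned int ref[MOCK_FW_COUNT];
	unsigned int awake_domains;
	unsigned int broken;	/* assumption: domains whose ack times out */
};

/* Returns the mask of domains on which a reference was actually taken. */
static mock_wakeref_t mock_force_wake_get(struct mock_force_wake *fw,
					  unsigned int domains)
{
	mock_wakeref_t held = 0;

	for (unsigned int i = 0; i < MOCK_FW_COUNT; i++) {
		unsigned int bit = 1u << i;

		if (!(domains & bit))
			continue;
		/* A sleeping domain that never acks gets no reference. */
		if (fw->ref[i] == 0 && (fw->broken & bit))
			continue;
		fw->ref[i]++;
		fw->awake_domains |= bit;
		held |= bit;
	}
	return held;
}

/* Safe to call with ref == 0: the loop then touches nothing. */
static void mock_force_wake_put(struct mock_force_wake *fw, mock_wakeref_t ref)
{
	for (unsigned int i = 0; i < MOCK_FW_COUNT; i++) {
		unsigned int bit = 1u << i;

		if (!(ref & bit))
			continue;
		assert(fw->ref[i] > 0);
		if (--fw->ref[i] == 0)
			fw->awake_domains &= ~bit;
	}
}

/* Intended for a single explicit domain, not MOCK_FW_ALL. */
static bool mock_wakeref_has_domain(mock_wakeref_t ref, enum mock_fw_domains d)
{
	return (ref & (unsigned int)d) != 0;
}
```

With this shape, a caller asking for all domains can gate each piece of
critical work on mock_wakeref_has_domain() and then hand the same
reference back to put, without tracking which domains failed.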


> 
>>
>>
>>>
>>> // for single domain where ret=0 is success, ret<0 is error
>>
>> This leads to the caller calling xe_force_wake_put only in case of get
>> success. So if the caller continues after a failure, it will need to
>> ensure that put is not called.
>>
>> for example:
>> int ret;
>>
>> ret = xe_force_wake_get(fw, DOMAIN_GT);
>> XE_WARN_ON(ret);
>> if (!ret)
>>      xe_force_wake_put(fw, DOMAIN_GT);
>>
>>> int xe_force_wake_get(fw, enum xe_force_wake_domain_id id);
>>> void xe_force_wake_put(fw, enum xe_force_wake_domain_id id);
>>>
>>> and
>>>
>>> // for all domain where ret=0 is success, ret<0 is error
>>> int xe_force_wake_get_all(fw);
>>> void xe_force_wake_put_all(fw);
>>
>> In case of xe_force_wake_get_all(fw) failure, how will the caller know
>> which domains were woken and which failed?
>>
>> ret = xe_force_wake_get_all(fw);
>> if (ret)
>>     No way to put awake domains to sleep
> 
> in case of failure, it would be the responsibility of the
> xe_force_wake_get_all() to put all partial awakes immediately, since it
> failed to awake all requested domains (same as in single domain case)
> 
> but let's drop this idea
> 
>>
>>>
>>> and
>>>
>>> // input: mask of domains, return: mask of domain
>>> unsigned int xe_force_wake_get_mask(fw, mask);
>>> void xe_force_wake_put_mask(fw, mask);
>>>
>>> this last one can be just main implementation (static or public if we
>>> really want to continue with random set of enabled domains)
>>>
>>>> + */
>>>>    int xe_force_wake_get(struct xe_force_wake *fw,
>>>>                  enum xe_force_wake_domains domains)
>>>>    {
>>>>        struct xe_gt *gt = fw->gt;
>>>>        struct xe_force_wake_domain *domain;
>>>> -    enum xe_force_wake_domains tmp, woken = 0;
>>>> +    enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>>
>>> it looks that you're abusing even more all enum variables by treating
>>> them as plain integers
>>
>> Miss at my end. Will address them in next version.
>>
>>>
>>>>        unsigned long flags;
>>>> -    int ret = 0;
>>>> +    int ret = domains;
>>>>          spin_lock_irqsave(&fw->lock, flags);
>>>>        for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>            if (!domain->ref++) {
>>>> -            woken |= BIT(domain->id);
>>>> +            awake_rqst |= BIT(domain->id);
>>>>                domain_wake(gt, domain);
>>>>            }
>>>>        }
>>>> -    for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>> -        ret |= domain_wake_wait(gt, domain);
>>>> +    for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>> +        if (domain_wake_wait(gt, domain) == 0) {
>>>> +            awake_ack |= BIT(domain->id);
>>>> +        } else {
>>>> +            ret &= ~BIT(domain->id);
>>>> +            --domain->ref;
>>>> +        }
>>>>        }
>>>> -    fw->awake_domains |= woken;
>>>> +
>>>> +    fw->awake_domains |= awake_ack;
>>>>        spin_unlock_irqrestore(&fw->lock, flags);
>>>>          return ret;
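
For reference, the two-pass logic in the hunk above can be modelled in
userspace as below. This is only a sketch under assumptions: sketch_fw,
sketch_wake_wait() and the ack_fails field are hypothetical stand-ins for
the real structures and for domain_wake_wait(), the spinlock is left out,
and the input is masked to the known domains for tidiness.

```c
#include <assert.h>

#define SKETCH_NUM_DOMAINS 4
#define SKETCH_ALL ((1u << SKETCH_NUM_DOMAINS) - 1)

struct sketch_fw {
	unsigned int ref[SKETCH_NUM_DOMAINS];
	unsigned int awake_domains;
	unsigned int ack_fails;	/* assumption: bits whose wake ack times out */
};

/* Hypothetical stand-in for domain_wake_wait(): 0 on ack, -1 on timeout. */
static int sketch_wake_wait(const struct sketch_fw *fw, unsigned int id)
{
	return (fw->ack_fails & (1u << id)) ? -1 : 0;
}

/* Returns the mask of requested domains that still hold a reference. */
static unsigned int sketch_fw_get(struct sketch_fw *fw, unsigned int domains)
{
	unsigned int awake_rqst = 0, awake_ack = 0;
	unsigned int ret = domains & SKETCH_ALL;
	unsigned int id;

	/* Pass 1: take a reference; request a wake only on the 0 -> 1 edge. */
	for (id = 0; id < SKETCH_NUM_DOMAINS; id++) {
		if (!(domains & (1u << id)))
			continue;
		if (!fw->ref[id]++)
			awake_rqst |= 1u << id;
	}

	/* Pass 2: wait for each requested wake; roll back on timeout. */
	for (id = 0; id < SKETCH_NUM_DOMAINS; id++) {
		if (!(awake_rqst & (1u << id)))
			continue;
		if (sketch_wake_wait(fw, id) == 0) {
			awake_ack |= 1u << id;
		} else {
			ret &= ~(1u << id);
			assert(fw->ref[id] > 0);
			--fw->ref[id];
		}
	}

	fw->awake_domains |= awake_ack;
	return ret;
}
```

The point of the rollback is that a later get on a domain that timed out
sees a zero refcount again and retries the wake, instead of assuming the
domain is already awake.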

