[PATCH v2 01/23] drm/xe: Error handling in xe_force_wake_get()

Michal Wajdeczko michal.wajdeczko at intel.com
Fri Sep 13 11:26:06 UTC 2024



On 13.09.2024 05:59, Ghimiray, Himal Prasad wrote:
> 
> 
> On 13-09-2024 03:01, Michal Wajdeczko wrote:
>>
>>
>> On 12.09.2024 21:15, Himal Prasad Ghimiray wrote:
>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>> increment the reference count for the domain. This ensures that
>>> subsequent _get calls do not incorrectly assume the domain is awake. The
>>> return value is a mask of domains whose reference counts were
>>> incremented, and these domains need to be released using
>>> xe_force_wake_put.
>>>
>>> The caller needs to compare the return value with the input domains to
>>> determine the success or failure of the operation and decide whether to
>>> continue or return accordingly.
>>>
>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>
>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>> ---
>>>   drivers/gpu/drm/xe/xe_force_wake.c | 35 +++++++++++++++++++++++++-----
>>>   1 file changed, 29 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
>>> b/drivers/gpu/drm/xe/xe_force_wake.c
>>> index a64c14757c84..fa42d652d23f 100644
>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>> @@ -150,26 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>                        (ffs(tmp__) - 1))) && \
>>>                        domain__->reg_ctl.addr)
>>>   +/**
>>> + * xe_force_wake_get : Increase the domain refcount; if it was 0
>>> initially, wake the domain
>>
>> while likely this is still recognized by the kernel-doc tool, this is
>> not correct notation for the function() documentation
> 
> 
> I assume you are suggesting %s/xe_force_wake_get/xe_force_wake_get()
> will fix it.
> 
> 
>>
>> [1]
>> https://docs.kernel.org/doc-guide/kernel-doc.html#function-documentation
>>
>>> + * @fw: struct xe_force_wake
>>> + * @domains: forcewake domains to get refcount on
>>> + *
>>> + * Increment refcount for the force-wake domain. If the domain is
>>> + * asleep, awaken it and wait for acknowledgment within the specified
>>> + * timeout. If a timeout occurs, decrement the refcount.
>>
>> not sure if the doc should mirror low-level implementation details 1:1
> 
> Does this sound okay ?
> This function takes references for the input @domains and wakes them if
> they are asleep.
> 
>>
>>> + * The caller should compare the return value with the @domains to
>>> + * determine the success or failure of the operation.
>>> + *
>>> + * Return: mask of refcount increased domains.
>>
>> if we return a 'mask' then maybe it should be of 'unsigned int' type?
> 
> Agreed. Will fix in next version.
> 
>>
>>> If the return value is
>>> + * equal to the input parameter @domains, the operation is considered
>>> + * successful. Otherwise, the operation is considered a failure, and
>>> + * the caller should handle the failure case, potentially returning
>>> + * -ETIMEDOUT.
>>
>> it looks like all the problems with getting a nice API are due to
>> XE_FORCEWAKE_ALL, which is not a single domain ID and requires extra care
>>
>> maybe there should be different pair of functions:
> 
> I am not convinced with different pair of functions:
> 
> In current implementation:
> 
> int mask = xe_force_wake_get(fw, domains)
> if (mask != domains) {
>     Non critical path continue with warning;
>      or
>     critical path:
>         xe_force_wake_put(fw, mask);
>         return -ETIMEDOUT;
> }
> 
> do_ops;
> xe_force_wake_put(fw, mask);
> return err;
> 
> Above flow remains intact irrespective of individual domains or
> FORCEWAKE_ALL.
> 
> In case of individual domains if (mask != domains) can be replaced with
> (!mask) and user can avoid xe_force_wake_put(fw, mask) in failure path
> since mask is 0;

so maybe we should have (by reinventing i915?):

// opaque, but zero means failure/no domains are awake
typedef unsigned long xe_wakeref_t;


// caller should test for ref != 0
// but shall call put if ref != 0
xe_wakeref_t xe_force_wake_get(fw, enum xe_force_wake_domains d)

// safe to call with ref == 0
void xe_force_wake_put(fw, xe_wakeref_t ref)


// helpers for critical work that must be sure about domain

// compares opaque ref with explicit domain != ALL
// can be used by the code that obtained the ref
bool xe_wakeref_has_domain(xe_wakeref_t, enum xe_force_wake_domains d)

// compares fw with explicit domain != ALL
// can be used by the code that does not have direct access to the ref
bool xe_force_wake_is_awake(fw, enum xe_force_wake_domains d)


// helpers for checking correctness
void xe_force_wake_assert_held(fw, enum xe_force_wake_domains d)


then usage would be:

xe_wakeref_t ref;

ref = xe_force_wake_get(fw, d);
if (ref) {
	// ...
	xe_force_wake_put(fw, ref);
}

or:

xe_wakeref_t ref;

ref = xe_force_wake_get(fw, ALL);
if (xe_wakeref_has_domain(ref, d1))
	// ... critical work1
if (xe_wakeref_has_domain(ref, d2))
	// ... critical work2
xe_force_wake_put(fw, ref);


so the above is very similar to what you have, but IMO having explicit
types will help connect all the functions into a proper use-case flow
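
To make the proposed flow concrete, here is a minimal userspace sketch of the
opaque-wakeref idea. Everything in it is an assumption made for illustration:
the fw_* names, the three-domain layout, and the fail_mask knob that simulates
an ack timeout are hypothetical and are not xe driver code. It only models the
contract described above: get returns 0 when nothing was woken, put is safe to
call with 0, and the helpers test a single explicit domain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of this is the xe driver code. */
enum fw_domain {
	FW_GT     = 1u << 0,
	FW_RENDER = 1u << 1,
	FW_MEDIA  = 1u << 2,
	FW_ALL    = FW_GT | FW_RENDER | FW_MEDIA,
};
#define FW_DOMAIN_COUNT 3

/* Opaque to callers; zero means no domain reference was taken. */
typedef unsigned long fw_wakeref_t;

struct fw_state {
	unsigned int ref[FW_DOMAIN_COUNT];	/* per-domain refcount */
	unsigned int awake;			/* mask of awake domains */
	unsigned int fail_mask;			/* test knob: acks that time out */
};

/* Take a reference on each requested domain that wakes successfully. */
static fw_wakeref_t fw_get(struct fw_state *fw, unsigned int domains)
{
	fw_wakeref_t ref = 0;

	for (int i = 0; i < FW_DOMAIN_COUNT; i++) {
		unsigned int bit = 1u << i;

		if (!(domains & bit))
			continue;
		/* Simulated ack timeout on first wake: take no reference. */
		if (fw->ref[i] == 0 && (fw->fail_mask & bit))
			continue;
		fw->ref[i]++;
		fw->awake |= bit;
		ref |= bit;
	}
	return ref;
}

/* Drop exactly the references recorded in ref; ref == 0 is a no-op. */
static void fw_put(struct fw_state *fw, fw_wakeref_t ref)
{
	for (int i = 0; i < FW_DOMAIN_COUNT; i++) {
		unsigned int bit = 1u << i;

		if (!(ref & bit))
			continue;
		assert(fw->ref[i] > 0);
		if (--fw->ref[i] == 0)
			fw->awake &= ~bit;
	}
}

/* For code that holds the ref; d must be a single domain, not FW_ALL. */
static bool fw_wakeref_has_domain(fw_wakeref_t ref, enum fw_domain d)
{
	return (ref & d) != 0;
}

/* For code without access to the ref; d must be a single domain. */
static bool fw_is_awake(const struct fw_state *fw, enum fw_domain d)
{
	return (fw->awake & d) != 0;
}
```

With fail_mask set to FW_MEDIA, fw_get(&fw, FW_ALL) returns a non-zero ref for
which fw_wakeref_has_domain() is true for FW_GT and FW_RENDER only, and a
single fw_put(&fw, ref) releases exactly those two domains.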

> 
> 
>>
>> // for single domain where ret=0 is success, ret<0 is error
> 
> This means the caller may call xe_force_wake_put only if the get
> succeeded. So a caller that continues after a failure will need to
> ensure the put is not called.
> 
> for example:
> int ret;
> 
> ret = xe_force_wake_get(fw, DOMAIN_GT);
> XE_WARN_ON(ret);
> if (!ret)
>     xe_force_wake_put(fw, DOMAIN_GT);
> 
>> int xe_force_wake_get(fw, enum xe_force_wake_domain_id id);
>> void xe_force_wake_put(fw, enum xe_force_wake_domain_id id);
>>
>> and
>>
>> // for all domain where ret=0 is success, ret<0 is error
>> int xe_force_wake_get_all(fw);
>> void xe_force_wake_put_all(fw);
> 
> In case of xe_force_wake_get_all(fw) failure, how will the caller know
> which domains were woken and which failed?
> 
> ret = xe_force_wake_get_all(fw);
> if (ret)
>    No way to put awake domains to sleep

in case of failure, it would be the responsibility of the
xe_force_wake_get_all() to put all partial awakes immediately, since it
failed to awake all requested domains (same as in single domain case)

but let's drop this idea

> 
>>
>> and
>>
>> // input: mask of domains, return: mask of domain
>> unsigned int xe_force_wake_get_mask(fw, mask);
>> void xe_force_wake_put_mask(fw, mask);
>>
>> this last one can be just main implementation (static or public if we
>> really want to continue with random set of enabled domains)
>>
>>> + */
>>>   int xe_force_wake_get(struct xe_force_wake *fw,
>>>                 enum xe_force_wake_domains domains)
>>>   {
>>>       struct xe_gt *gt = fw->gt;
>>>       struct xe_force_wake_domain *domain;
>>> -    enum xe_force_wake_domains tmp, woken = 0;
>>> +    enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>
>> it looks like you're abusing the enum variables even more by treating
>> them as plain integers
> 
> Miss at my end. Will address them in next version.
> 
>>
>>>       unsigned long flags;
>>> -    int ret = 0;
>>> +    int ret = domains;
>>>         spin_lock_irqsave(&fw->lock, flags);
>>>       for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>           if (!domain->ref++) {
>>> -            woken |= BIT(domain->id);
>>> +            awake_rqst |= BIT(domain->id);
>>>               domain_wake(gt, domain);
>>>           }
>>>       }
>>> -    for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>> -        ret |= domain_wake_wait(gt, domain);
>>> +    for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>> +        if (domain_wake_wait(gt, domain) == 0) {
>>> +            awake_ack |= BIT(domain->id);
>>> +        } else {
>>> +            ret &= ~BIT(domain->id);
>>> +            --domain->ref;
>>> +        }
>>>       }
>>> -    fw->awake_domains |= woken;
>>> +
>>> +    fw->awake_domains |= awake_ack;
>>>       spin_unlock_irqrestore(&fw->lock, flags);
>>>         return ret;
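
For reference, the two-pass logic of the hunk above can be modelled in plain
userspace C as below. This is only a sketch under stated assumptions: struct
fw_model, model_wake_wait() and the ack_fails knob are hypothetical stand-ins
(for struct xe_force_wake, domain_wake_wait() and a hardware ack timeout), and
locking is left out. It uses unsigned int for the masks, as agreed earlier in
the thread.

```c
#include <assert.h>

#define NUM_DOMAINS 3
#define BIT(n) (1u << (n))

/* Hypothetical model of the force-wake state; not the driver structure. */
struct fw_model {
	unsigned int ref[NUM_DOMAINS];	/* per-domain refcount */
	unsigned int awake_domains;	/* mask of acknowledged domains */
	unsigned int ack_fails;		/* test knob: domains that time out */
};

/* Stand-in for domain_wake_wait(): 0 on ack, -1 on simulated timeout. */
static int model_wake_wait(const struct fw_model *fw, int id)
{
	return (fw->ack_fails & BIT(id)) ? -1 : 0;
}

/* Returns the mask of domains whose refcount stays incremented. */
static unsigned int model_force_wake_get(struct fw_model *fw,
					 unsigned int domains)
{
	unsigned int awake_rqst = 0, awake_ack = 0, ret;
	int id;

	domains &= BIT(NUM_DOMAINS) - 1;	/* ignore unknown domain bits */
	ret = domains;

	/* Pass 1: take references, request a wake where the count was 0. */
	for (id = 0; id < NUM_DOMAINS; id++) {
		if (!(domains & BIT(id)))
			continue;
		if (!fw->ref[id]++)
			awake_rqst |= BIT(id);
	}

	/* Pass 2: wait for acks; on timeout roll the reference back. */
	for (id = 0; id < NUM_DOMAINS; id++) {
		if (!(awake_rqst & BIT(id)))
			continue;
		if (model_wake_wait(fw, id) == 0) {
			awake_ack |= BIT(id);
		} else {
			ret &= ~BIT(id);
			--fw->ref[id];
		}
	}

	fw->awake_domains |= awake_ack;
	return ret;
}
```

A caller would compare the returned mask with the requested domains and pass
only the returned mask to the matching put, as in the flow quoted earlier.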

