[PATCH v2 01/23] drm/xe: Error handling in xe_force_wake_get()

Nilawar, Badal badal.nilawar at intel.com
Thu Sep 19 11:07:57 UTC 2024



On 18-09-2024 12:49, Jani Nikula wrote:
> On Wed, 18 Sep 2024, "Ghimiray, Himal Prasad" <himal.prasad.ghimiray at intel.com> wrote:
>> On 18-09-2024 00:20, Matthew Brost wrote:
>>> On Tue, Sep 17, 2024 at 11:18:47AM +0530, Nilawar, Badal wrote:
>>>>
>>>>
>>>> On 13-09-2024 18:47, Ghimiray, Himal Prasad wrote:
>>>>>
>>>>>
>>>>> On 13-09-2024 16:56, Michal Wajdeczko wrote:
>>>>>>
>>>>>>
>>>>>> On 13.09.2024 05:59, Ghimiray, Himal Prasad wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 13-09-2024 03:01, Michal Wajdeczko wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12.09.2024 21:15, Himal Prasad Ghimiray wrote:
>>>>>>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>>>>>>> increment the reference count for the domain. This ensures that
>>>>>>>>> subsequent _get calls do not incorrectly assume the
>>>>>>>>> domain is awake. The
>>>>>>>>> return value is a mask of domains whose reference counts were
>>>>>>>>> incremented, and these domains need to be released using
>>>>>>>>> xe_force_wake_put.
>>>>>>>>>
>>>>>>>>> The caller needs to compare the return value with the input domains to
>>>>>>>>> determine the success or failure of the operation and
>>>>>>>>> decide whether to
>>>>>>>>> continue or return accordingly.
>>>>>>>>>
>>>>>>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>>>>>>
>>>>>>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/xe/xe_force_wake.c | 35
>>>>>>>>> ++++++++++++++++++++++++ +-----
>>>>>>>>>      1 file changed, 29 insertions(+), 6 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>>> b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>>> index a64c14757c84..fa42d652d23f 100644
>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>>>> @@ -150,26 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>>>>>>                           (ffs(tmp__) - 1))) && \
>>>>>>>>>                           domain__->reg_ctl.addr)
>>>>>>>>>      +/**
>>>>>>>>> + * xe_force_wake_get : Increase the domain refcount; if it was 0
>>>>>>>>> initially, wake the domain
>>>>>>>>
>>>>>>>> while likely this is still recognized by the kernel-doc tool, this is
>>>>>>>> not correct notation for the function() documentation
>>>>>>>
>>>>>>>
>>>>>>> I assume you are suggesting %s/xe_force_wake_get/xe_force_wake_get()
>>>>>>> will fix it.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.kernel.org/doc-guide/kernel-doc.html#function-
>>>>>>>> documentation
>>>>>>>>
>>>>>>>>> + * @fw: struct xe_force_wake
>>>>>>>>> + * @domains: forcewake domains to get refcount on
>>>>>>>>> + *
>>>>>>>>> + * Increment refcount for the force-wake domain. If the domain is
>>>>>>>>> + * asleep, awaken it and wait for acknowledgment within the specified
>>>>>>>>> + * timeout. If a timeout occurs, decrement the refcount.
>>>>>>>>
>>>>>>>> not sure if doc shall be 1:1 of low level implementation details
>>>>>>>
>>>>>>> Does this sound okay ?
>>>>>>> This function takes references for the input @domains and wakes them if
>>>>>>> they are asleep.
>>>>>>>
>>>>>>>>
>>>>>>>>> + * The caller should compare the return value with the @domains to
>>>>>>>>> + * determine the success or failure of the operation.
>>>>>>>>> + *
>>>>>>>>> + * Return: mask of refcount increased domains.
>>>>>>>>
>>>>>>>> if we return a 'mask' then maybe it should be of 'unsigned int' type?
>>>>>>>
>>>>>>> Agreed. Will fix in next version.
>>>>>>>
>>>>>>>>
>>>>>>>>> If the return value is
>>>>>>>>> + * equal to the input parameter @domains, the operation is considered
>>>>>>>>> + * successful. Otherwise, the operation is considered a failure, and
>>>>>>>>> + * the caller should handle the failure case, potentially returning
>>>>>>>>> + * -ETIMEDOUT.
>>>>>>>>
>>>>>>>> it looks that all problems with the nice API is due to the
>>>>>>>> XE_FORCEWAKE_ALL that is not a single domain ID and requires extra care
>>>>>>>>
>>>>>>>> maybe there should be different pair of functions:
>>>>>>>
>>>>>>> I am not convinced with different pair of functions:
>>>>>>>
>>>>>>> In current implementation:
>>>>>>>
>>>>>>> int mask = xe_force_wake_get(fw, domains)
>>>>>>> if (mask != domains) {
>>>>>>>        Non critical path continue with warning;
>>>>>>>         or
>>>>>>>        critical path:
>>>>>>>            xe_force_wake_put(fw, mask);
>>>>>>>            return -ETIMEDOUT;
>>>>>>> }
>>>>>>>
>>>>>>> do_ops;
>>>>>>> xe_force_wake_put(fw, mask);
>>>>>>> return err;
>>>>>>>
>>>>>>> Above flow remains intact irrespective of individual domains or
>>>>>>> FORCEWAKE_ALL.
>>>>>>>
>>>>>>> In case of individual domains if (mask != domains) can be replaced with
>>>>>>> (!mask) and user can avoid xe_force_wake_put(fw, mask) in failure path
>>>>>>> since mask is 0;
>>>>>>
>>>>>> so maybe we should have (by reinventing i915?):
>>>>>>
>>>>>> // opaque, but zero means failure/no domains are awake
>>>>>> typedef unsigned long xe_wakeref_t;
>>>>>>
>>>>>>
>>>>>> // caller should test for ref != 0
>>>>>> // but shall call put if ref != 0
>>>>>> xe_wakeref_t xe_force_wake_get(fw, enum xe_force_wake_domains d)
>>>>>>
>>>>>> // safe to call with ref == 0
>>>>>> void xe_force_wake_put(fw, xe_wakeref_t ref)
>>>>>>
>>>>>>
>>>>>> // helpers for critical work that must be sure about domain
>>>>>>
>>>>>> // compares opaque ref with explicit domain != ALL
>>>>>> // can be used by the code that obtained the ref
>>>>>> bool xe_wakeref_has_domain(xe_wakeref_t, enum xe_force_wake_domains d)
>>>>>>
>>>>>> // compares fw with explicit domain != ALL
>>>>>> // can be used by the code that does not have direct access to the ref
>>>>>> bool xe_force_wake_is_awake(fw, enum xe_force_wake_domains d)
>>>>>>
>>>>>>
>>>>>> // helpers for checking correctness
>>>>>> void xe_force_wake_assert_held(fw, enum xe_force_wake_domains d)
>>>>>>
>>>>>>
>>>>>> then usage would be:
>>>>>>
>>>>>> xe_wakeref_t ref;
>>>>>>
>>>>>> ref = xe_force_wake_get(fw, d);
>>>>>> if (ref) {
>>>>>>       // ...
>>>>>>       xe_force_wake_put(fw, ref);
>>>>>> }
>>>>>>
>>>>>> or:
>>>>>>
>>>>>> xe_wakeref_t ref;
>>>>>>
>>>>>> ref = xe_force_wake_get(fw, ALL);
>>>>>> if (xe_wakeref_has_domain(ref, d1))
>>>>>>       // ... critical work1
>>>>>> if (xe_wakeref_has_domain(ref, d2))
>>>>>>       // ... critical work2
>>>>>> xe_force_wake_put(fw, ref);
>>>>>>
>>>>>>
>>>>>> so above will be very similar to what you have but by having explicit
>>>>>> types IMO it will help connect all functions into proper use-case flow
>>>>>
>>>>>
>>>>> Agreed implementation/usage will be same, will use explicit type for
>>>>> clarity.
>>>>> IMO typedef unsigned int xe_wakeref_t is sufficient instead of
>>>>> typedef unsigned long xe_wakeref_t;
>>>>
>>>> I agree with this.
>>>>
>>>
>>> What? Really? I thought it was pretty clear rule in kernel programing
>>> not use typedefs [1]. Reading through conditions acceptable and I don't
>>> use anything applies to this series, maybe a) applies but not really
>>> convinced. The example in a) is a pte_t which can likely change based on
>>> platform target whereas here we only have one target and see no reason
>>> this needs to be opaque.
>>>
>>> Matt
>>>
>>> [1] https://www.kernel.org/doc/html/v4.14/process/coding-style.html#typedefs
>>
>>
>> While running checkpatch on my changes, patchwork had also issued a
>> WARNING: NEW_TYPEDEFS: do not add new typedefs. I reviewed the usage in
>> the Linux kernel tree and found it used in many places, which led me to
>> assume it was safe. I now realize that I should have been more careful
>> in understanding the context of its usage and referred to the kernel
>> coding guidelines. This was an oversight on my part.
>>
>> Since this doesn’t impact the CI or runtime, I will avoid reverting to
>> unsigned int immediately and will hold off until I receive the other
>> review comments. I will incorporate the changes to revert it in
>> subsequent versions while also addressing the other review comments.
>> Thank you for bringing this to the attention.
> 
> If you end up replicating intel_wakeref_t from i915, and go as deep as
> the rabbit hole goes, you'll realize intel_wakeref_t is a pointer
> disguised as an unsigned long. It's a struct ref_tracker * when you have
> certain configs enabled.
> 
> You could just use struct ref_tracker * everywhere. It's an opaque type
> to start with.

The original idea of using typedef for the fw return mask was for the 
sake of clarity. However, Matt B pointed that the use of typedef in this 
instance is not in accordance with the Linux kernel coding standards. 
Additionally, I agree with Matt B that there is no need for the fw 
return mask to be opaque; therefore, it is preferable to maintain the 
use of unsigned int.

Regards,
Badal
> 
> BR,
> Jani.
> 
> 
>>
>> BR
>> Himal
>>
>>>
>>>> Regards,
>>>> Badal
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> // for single domain where ret=0 is success, ret<0 is error
>>>>>>>
>>>>>>> This leads to caller only calling xe_force_wake_put incase of get
>>>>>>> success. so in case of caller continuing with failure, he will need to
>>>>>>> ensure the put is not called.
>>>>>>>
>>>>>>> for example:
>>>>>>> int ret;
>>>>>>>
>>>>>>> ret = xe_force_wake_get(fw, DOMAIN_GT);
>>>>>>> XE_WARN_ON(ret)
>>>>>>> if(!ret)
>>>>>>>        xe_force_wake_put(fw, DOMAIN_GT);
>>>>>>>
>>>>>>>> int xe_force_wake_get(fw, enum xe_force_wake_domain_id id);
>>>>>>>> void xe_force_wake_put(fw, enum xe_force_wake_domain_id id);
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> // for all domain where ret=0 is success, ret<0 is error
>>>>>>>> int int xe_force_wake_get_all(fw);
>>>>>>>> void xe_force_wake_put_all(fw);
>>>>>>>
>>>>>>> In case of xe_force_wake_get_all(fw) failure, how the caller will know
>>>>>>> which domains got awake and which failed ?
>>>>>>>
>>>>>>> ret = xe_force_wake_get_all(fw);
>>>>>>> if(!ret)
>>>>>>>       No way to put awake domains to sleep
>>>>>>
>>>>>> in case of failure, it would be the responsibility of the
>>>>>> xe_force_wake_get_all() to put all partial awakes immediately, since it
>>>>>> failed to awake all requested domains (same as in single domain case)
>>>>>>
>>>>>> but let's drop this idea
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> // input: mask of domains, return: mask of domain
>>>>>>>> unsigned int xe_force_wake_get_mask(fw, mask);
>>>>>>>> void xe_force_wake_put_mask(fw, mask);
>>>>>>>>
>>>>>>>> this last one can be just main implementation (static or public if we
>>>>>>>> really want to continue with random set of enabled domains)
>>>>>>>>
>>>>>>>>> + */
>>>>>>>>>      int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>>>>>                    enum xe_force_wake_domains domains)
>>>>>>>>>      {
>>>>>>>>>          struct xe_gt *gt = fw->gt;
>>>>>>>>>          struct xe_force_wake_domain *domain;
>>>>>>>>> -    enum xe_force_wake_domains tmp, woken = 0;
>>>>>>>>> +    enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>>>>>>>
>>>>>>>> it looks that you're abusing even more all enum variables by treating
>>>>>>>> them as plain integers
>>>>>>>
>>>>>>> Miss at my end. Will address them in next version.
>>>>>>>
>>>>>>>>
>>>>>>>>>          unsigned long flags;
>>>>>>>>> -    int ret = 0;
>>>>>>>>> +    int ret = domains;
>>>>>>>>>            spin_lock_irqsave(&fw->lock, flags);
>>>>>>>>>          for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>>>>>>              if (!domain->ref++) {
>>>>>>>>> -            woken |= BIT(domain->id);
>>>>>>>>> +            awake_rqst |= BIT(domain->id);
>>>>>>>>>                  domain_wake(gt, domain);
>>>>>>>>>              }
>>>>>>>>>          }
>>>>>>>>> -    for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>>>>>>> -        ret |= domain_wake_wait(gt, domain);
>>>>>>>>> +    for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>>>>>>> +        if (domain_wake_wait(gt, domain) == 0) {
>>>>>>>>> +            awake_ack |= BIT(domain->id);
>>>>>>>>> +        } else {
>>>>>>>>> +            ret &= ~BIT(domain->id);
>>>>>>>>> +            --domain->ref;
>>>>>>>>> +        }
>>>>>>>>>          }
>>>>>>>>> -    fw->awake_domains |= woken;
>>>>>>>>> +
>>>>>>>>> +    fw->awake_domains |= awake_ack;
>>>>>>>>>          spin_unlock_irqrestore(&fw->lock, flags);
>>>>>>>>>            return ret;
>>>>
> 



More information about the Intel-xe mailing list