[PATCH v2 01/23] drm/xe: Error handling in xe_force_wake_get()
Ghimiray, Himal Prasad
himal.prasad.ghimiray at intel.com
Wed Sep 18 06:32:14 UTC 2024
On 18-09-2024 00:20, Matthew Brost wrote:
> On Tue, Sep 17, 2024 at 11:18:47AM +0530, Nilawar, Badal wrote:
>>
>>
>> On 13-09-2024 18:47, Ghimiray, Himal Prasad wrote:
>>>
>>>
>>> On 13-09-2024 16:56, Michal Wajdeczko wrote:
>>>>
>>>>
>>>> On 13.09.2024 05:59, Ghimiray, Himal Prasad wrote:
>>>>>
>>>>>
>>>>> On 13-09-2024 03:01, Michal Wajdeczko wrote:
>>>>>>
>>>>>>
>>>>>> On 12.09.2024 21:15, Himal Prasad Ghimiray wrote:
>>>>>>> If an acknowledgment timeout occurs for a domain awake request, do not
>>>>>>> increment the reference count for the domain. This ensures that
>>>>>>> subsequent _get calls do not incorrectly assume the
>>>>>>> domain is awake. The
>>>>>>> return value is a mask of domains whose reference counts were
>>>>>>> incremented, and these domains need to be released using
>>>>>>> xe_force_wake_put.
>>>>>>>
>>>>>>> The caller needs to compare the return value with the input domains to
>>>>>>> determine the success or failure of the operation and
>>>>>>> decide whether to
>>>>>>> continue or return accordingly.
>>>>>>>
>>>>>>> While at it, add simple kernel-doc for xe_force_wake_get()
>>>>>>>
>>>>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/xe/xe_force_wake.c | 35
>>>>>>> ++++++++++++++++++++++++ +-----
>>>>>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>> b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>> index a64c14757c84..fa42d652d23f 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>>>>> @@ -150,26 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>>>>> (ffs(tmp__) - 1))) && \
>>>>>>> domain__->reg_ctl.addr)
>>>>>>> +/**
>>>>>>> + * xe_force_wake_get : Increase the domain refcount; if it was 0
>>>>>>> initially, wake the domain
>>>>>>
>>>>>> while likely this is still recognized by the kernel-doc tool, this is
>>>>>> not correct notation for the function() documentation
>>>>>
>>>>>
>>>>> I assume you are suggesting %s/xe_force_wake_get/xe_force_wake_get()
>>>>> will fix it.
>>>>>
>>>>>
>>>>>>
>>>>>> [1]
>>>>>> https://docs.kernel.org/doc-guide/kernel-doc.html#function-
>>>>>> documentation
>>>>>>
>>>>>>> + * @fw: struct xe_force_wake
>>>>>>> + * @domains: forcewake domains to get refcount on
>>>>>>> + *
>>>>>>> + * Increment refcount for the force-wake domain. If the domain is
>>>>>>> + * asleep, awaken it and wait for acknowledgment within the specified
>>>>>>> + * timeout. If a timeout occurs, decrement the refcount.
>>>>>>
>>>>>> not sure if doc shall be 1:1 of low level implementation details
>>>>>
>>>>> Does this sound okay ?
>>>>> This function takes references for the input @domains and wakes them if
>>>>> they are asleep.
>>>>>
>>>>>>
>>>>>>> + * The caller should compare the return value with the @domains to
>>>>>>> + * determine the success or failure of the operation.
>>>>>>> + *
>>>>>>> + * Return: mask of refcount increased domains.
>>>>>>
>>>>>> if we return a 'mask' then maybe it should be of 'unsigned int' type?
>>>>>
>>>>> Agreed. Will fix in next version.
>>>>>
>>>>>>
>>>>>>> If the return value is
>>>>>>> + * equal to the input parameter @domains, the operation is considered
>>>>>>> + * successful. Otherwise, the operation is considered a failure, and
>>>>>>> + * the caller should handle the failure case, potentially returning
>>>>>>> + * -ETIMEDOUT.
>>>>>>
>>>>>> it looks that all problems with the nice API is due to the
>>>>>> XE_FORCEWAKE_ALL that is not a single domain ID and requires extra care
>>>>>>
>>>>>> maybe there should be different pair of functions:
>>>>>
>>>>> I am not convinced with different pair of functions:
>>>>>
>>>>> In current implementation:
>>>>>
>>>>> int mask = xe_force_wake_get(fw, domains)
>>>>> if (mask != domains) {
>>>>> Non critical path continue with warning;
>>>>> or
>>>>> critical path:
>>>>> xe_force_wake_put(fw, mask);
>>>>> return -ETIMEDOUT;
>>>>> }
>>>>>
>>>>> do_ops;
>>>>> xe_force_wake_put(fw, mask);
>>>>> return err;
>>>>>
>>>>> Above flow remains intact irrespective of individual domains or
>>>>> FORCEWAKE_ALL.
>>>>>
>>>>> In case of individual domains if (mask != domains) can be replaced with
>>>>> (!mask) and user can avoid xe_force_wake_put(fw, mask) in failure path
>>>>> since mask is 0;
>>>>
>>>> so maybe we should have (by reinventing i915?):
>>>>
>>>> // opaque, but zero means failure/no domains are awake
>>>> typedef unsigned long xe_wakeref_t;
>>>>
>>>>
>>>> // caller should test for ref != 0
>>>> // but shall call put if ref != 0
>>>> xe_wakeref_t xe_force_wake_get(fw, enum xe_force_wake_domains d)
>>>>
>>>> // safe to call with ref == 0
>>>> void xe_force_wake_put(fw, xe_wakeref_t ref)
>>>>
>>>>
>>>> // helpers for critical work that must be sure about domain
>>>>
>>>> // compares opaque ref with explicit domain != ALL
>>>> // can be used by the code that obtained the ref
>>>> bool xe_wakeref_has_domain(xe_wakeref_t, enum xe_force_wake_domains d)
>>>>
>>>> // compares fw with explicit domain != ALL
>>>> // can be used by the code that does not have direct access to the ref
>>>> bool xe_force_wake_is_awake(fw, enum xe_force_wake_domains d)
>>>>
>>>>
>>>> // helpers for checking correctness
>>>> void xe_force_wake_assert_held(fw, enum xe_force_wake_domains d)
>>>>
>>>>
>>>> then usage would be:
>>>>
>>>> xe_wakeref_t ref;
>>>>
>>>> ref = xe_force_wake_get(fw, d);
>>>> if (ref) {
>>>> // ...
>>>> xe_force_wake_put(fw, ref);
>>>> }
>>>>
>>>> or:
>>>>
>>>> xe_wakeref_t ref;
>>>>
>>>> ref = xe_force_wake_get(fw, ALL);
>>>> if (xe_wakeref_has_domain(ref, d1))
>>>> // ... critical work1
>>>> if (xe_wakeref_has_domain(ref, d2))
>>>> // ... critical work2
>>>> xe_force_wake_put(fw, ref);
>>>>
>>>>
>>>> so above will be very similar to what you have but by having explicit
>>>> types IMO it will help connect all functions into proper use-case flow
>>>
>>>
>>> Agreed implementation/usage will be same, will use explicit type for
>>> clarity.
>>> IMO typedef unsigned int xe_wakeref_t is sufficient instead of
>>> typedef unsigned long xe_wakeref_t;
>>
>> I agree with this.
>>
>
> What? Really? I thought it was pretty clear rule in kernel programing
> not use typedefs [1]. Reading through conditions acceptable and I don't
> use anything applies to this series, maybe a) applies but not really
> convinced. The example in a) is a pte_t which can likely change based on
> platform target whereas here we only have one target and see no reason
> this needs to be opaque.
>
> Matt
>
> [1] https://www.kernel.org/doc/html/v4.14/process/coding-style.html#typedefs
While running checkpatch on my changes, patchwork had also issued a
WARNING: NEW_TYPEDEFS: do not add new typedefs. I reviewed the usage in
the Linux kernel tree and found it used in many places, which led me to
assume it was safe. I now realize that I should have been more careful
in understanding the context of its usage and referred to the kernel
coding guidelines. This was an oversight on my part.
Since this doesn’t impact the CI or runtime, I will avoid reverting to
unsigned int immediately and will hold off until I receive the other
review comments. I will incorporate the changes to revert it in
subsequent versions while also addressing the other review comments.
Thank you for bringing this to the attention.
BR
Himal
>
>> Regards,
>> Badal
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> // for single domain where ret=0 is success, ret<0 is error
>>>>>
>>>>> This leads to caller only calling xe_force_wake_put incase of get
>>>>> success. so in case of caller continuing with failure, he will need to
>>>>> ensure the put is not called.
>>>>>
>>>>> for example:
>>>>> int ret;
>>>>>
>>>>> ret = xe_force_wake_get(fw, DOMAIN_GT);
>>>>> XE_WARN_ON(ret)
>>>>> if(!ret)
>>>>> xe_force_wake_put(fw, DOMAIN_GT);
>>>>>
>>>>>> int xe_force_wake_get(fw, enum xe_force_wake_domain_id id);
>>>>>> void xe_force_wake_put(fw, enum xe_force_wake_domain_id id);
>>>>>>
>>>>>> and
>>>>>>
>>>>>> // for all domain where ret=0 is success, ret<0 is error
>>>>>> int int xe_force_wake_get_all(fw);
>>>>>> void xe_force_wake_put_all(fw);
>>>>>
>>>>> In case of xe_force_wake_get_all(fw) failure, how the caller will know
>>>>> which domains got awake and which failed ?
>>>>>
>>>>> ret = xe_force_wake_get_all(fw);
>>>>> if(!ret)
>>>>> No way to put awake domains to sleep
>>>>
>>>> in case of failure, it would be the responsibility of the
>>>> xe_force_wake_get_all() to put all partial awakes immediately, since it
>>>> failed to awake all requested domains (same as in single domain case)
>>>>
>>>> but let's drop this idea
>>>>
>>>>>
>>>>>>
>>>>>> and
>>>>>>
>>>>>> // input: mask of domains, return: mask of domain
>>>>>> unsigned int xe_force_wake_get_mask(fw, mask);
>>>>>> void xe_force_wake_put_mask(fw, mask);
>>>>>>
>>>>>> this last one can be just main implementation (static or public if we
>>>>>> really want to continue with random set of enabled domains)
>>>>>>
>>>>>>> + */
>>>>>>> int xe_force_wake_get(struct xe_force_wake *fw,
>>>>>>> enum xe_force_wake_domains domains)
>>>>>>> {
>>>>>>> struct xe_gt *gt = fw->gt;
>>>>>>> struct xe_force_wake_domain *domain;
>>>>>>> - enum xe_force_wake_domains tmp, woken = 0;
>>>>>>> + enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>>>>>
>>>>>> it looks that you're abusing even more all enum variables by treating
>>>>>> them as plain integers
>>>>>
>>>>> Miss at my end. Will address them in next version.
>>>>>
>>>>>>
>>>>>>> unsigned long flags;
>>>>>>> - int ret = 0;
>>>>>>> + int ret = domains;
>>>>>>> spin_lock_irqsave(&fw->lock, flags);
>>>>>>> for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>>>>> if (!domain->ref++) {
>>>>>>> - woken |= BIT(domain->id);
>>>>>>> + awake_rqst |= BIT(domain->id);
>>>>>>> domain_wake(gt, domain);
>>>>>>> }
>>>>>>> }
>>>>>>> - for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>>>>> - ret |= domain_wake_wait(gt, domain);
>>>>>>> + for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>>>>> + if (domain_wake_wait(gt, domain) == 0) {
>>>>>>> + awake_ack |= BIT(domain->id);
>>>>>>> + } else {
>>>>>>> + ret &= ~BIT(domain->id);
>>>>>>> + --domain->ref;
>>>>>>> + }
>>>>>>> }
>>>>>>> - fw->awake_domains |= woken;
>>>>>>> +
>>>>>>> + fw->awake_domains |= awake_ack;
>>>>>>> spin_unlock_irqrestore(&fw->lock, flags);
>>>>>>> return ret;
>>
More information about the Intel-xe
mailing list