[PATCH v2 01/23] drm/xe: Error handling in xe_force_wake_get()

Matthew Brost matthew.brost at intel.com
Tue Sep 17 18:50:40 UTC 2024


On Tue, Sep 17, 2024 at 11:18:47AM +0530, Nilawar, Badal wrote:
> 
> 
> On 13-09-2024 18:47, Ghimiray, Himal Prasad wrote:
> > 
> > 
> > On 13-09-2024 16:56, Michal Wajdeczko wrote:
> > > 
> > > 
> > > On 13.09.2024 05:59, Ghimiray, Himal Prasad wrote:
> > > > 
> > > > 
> > > > On 13-09-2024 03:01, Michal Wajdeczko wrote:
> > > > > 
> > > > > 
> > > > > On 12.09.2024 21:15, Himal Prasad Ghimiray wrote:
> > > > > > If an acknowledgment timeout occurs for a domain awake request, do not
> > > > > > increment the reference count for the domain. This ensures that
> > > > > > subsequent _get calls do not incorrectly assume the
> > > > > > domain is awake. The
> > > > > > return value is a mask of domains whose reference counts were
> > > > > > incremented, and these domains need to be released using
> > > > > > xe_force_wake_put.
> > > > > > 
> > > > > > The caller needs to compare the return value with the input domains to
> > > > > > determine the success or failure of the operation and
> > > > > > decide whether to
> > > > > > continue or return accordingly.
> > > > > > 
> > > > > > While at it, add simple kernel-doc for xe_force_wake_get()
> > > > > > 
> > > > > > Cc: Badal Nilawar <badal.nilawar at intel.com>
> > > > > > Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > > > > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> > > > > > Cc: Nirmoy Das <nirmoy.das at intel.com>
> > > > > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
> > > > > > ---
> > > > > >    drivers/gpu/drm/xe/xe_force_wake.c | 35
> > > > > > ++++++++++++++++++++++++ +-----
> > > > > >    1 file changed, 29 insertions(+), 6 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
> > > > > > b/drivers/gpu/drm/xe/xe_force_wake.c
> > > > > > index a64c14757c84..fa42d652d23f 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_force_wake.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_force_wake.c
> > > > > > @@ -150,26 +150,49 @@ static int domain_sleep_wait(struct xe_gt *gt,
> > > > > >                         (ffs(tmp__) - 1))) && \
> > > > > >                         domain__->reg_ctl.addr)
> > > > > >    +/**
> > > > > > + * xe_force_wake_get : Increase the domain refcount; if it was 0
> > > > > > initially, wake the domain
> > > > > 
> > > > > while likely this is still recognized by the kernel-doc tool, this is
> > > > > not correct notation for the function() documentation
> > > > 
> > > > 
> > > > I assume you are suggesting %s/xe_force_wake_get/xe_force_wake_get()
> > > > will fix it.
> > > > 
> > > > 
> > > > > 
> > > > > [1]
> > > > > https://docs.kernel.org/doc-guide/kernel-doc.html#function-
> > > > > documentation
> > > > > 
> > > > > > + * @fw: struct xe_force_wake
> > > > > > + * @domains: forcewake domains to get refcount on
> > > > > > + *
> > > > > > + * Increment refcount for the force-wake domain. If the domain is
> > > > > > + * asleep, awaken it and wait for acknowledgment within the specified
> > > > > > + * timeout. If a timeout occurs, decrement the refcount.
> > > > > 
> > > > > not sure if doc shall be 1:1 of low level implementation details
> > > > 
> > > > Does this sound okay ?
> > > > This function takes references for the input @domains and wakes them if
> > > > they are asleep.
> > > > 
> > > > > 
> > > > > > + * The caller should compare the return value with the @domains to
> > > > > > + * determine the success or failure of the operation.
> > > > > > + *
> > > > > > + * Return: mask of refcount increased domains.
> > > > > 
> > > > > if we return a 'mask' then maybe it should be of 'unsigned int' type?
> > > > 
> > > > Agreed. Will fix in next version.
> > > > 
> > > > > 
> > > > > > If the return value is
> > > > > > + * equal to the input parameter @domains, the operation is considered
> > > > > > + * successful. Otherwise, the operation is considered a failure, and
> > > > > > + * the caller should handle the failure case, potentially returning
> > > > > > + * -ETIMEDOUT.
> > > > > 
> > > > > it looks that all problems with the nice API is due to the
> > > > > XE_FORCEWAKE_ALL that is not a single domain ID and requires extra care
> > > > > 
> > > > > maybe there should be different pair of functions:
> > > > 
> > > > I am not convinced with different pair of functions:
> > > > 
> > > > In current implementation:
> > > > 
> > > > int mask = xe_force_wake_get(fw, domains)
> > > > if (mask != domains) {
> > > >      Non critical path continue with warning;
> > > >       or
> > > >      critical path:
> > > >          xe_force_wake_put(fw, mask);
> > > >          return -ETIMEDOUT;
> > > > }
> > > > 
> > > > do_ops;
> > > > xe_force_wake_put(fw, mask);
> > > > return err;
> > > > 
> > > > Above flow remains intact irrespective of individual domains or
> > > > FORCEWAKE_ALL.
> > > > 
> > > > In case of individual domains if (mask != domains) can be replaced with
> > > > (!mask) and user can avoid xe_force_wake_put(fw, mask) in failure path
> > > > since mask is 0;
> > > 
> > > so maybe we should have (by reinventing i915?):
> > > 
> > > // opaque, but zero means failure/no domains are awake
> > > typedef unsigned long xe_wakeref_t;
> > > 
> > > 
> > > // caller should test for ref != 0
> > > // but shall call put if ref != 0
> > > xe_wakeref_t xe_force_wake_get(fw, enum xe_force_wake_domains d)
> > > 
> > > // safe to call with ref == 0
> > > void xe_force_wake_put(fw, xe_wakeref_t ref)
> > > 
> > > 
> > > // helpers for critical work that must be sure about domain
> > > 
> > > // compares opaque ref with explicit domain != ALL
> > > // can be used by the code that obtained the ref
> > > bool xe_wakeref_has_domain(xe_wakeref_t, enum xe_force_wake_domains d)
> > > 
> > > // compares fw with explicit domain != ALL
> > > // can be used by the code that does not have direct access to the ref
> > > bool xe_force_wake_is_awake(fw, enum xe_force_wake_domains d)
> > > 
> > > 
> > > // helpers for checking correctness
> > > void xe_force_wake_assert_held(fw, enum xe_force_wake_domains d)
> > > 
> > > 
> > > then usage would be:
> > > 
> > > xe_wakeref_t ref;
> > > 
> > > ref = xe_force_wake_get(fw, d);
> > > if (ref) {
> > >     // ...
> > >     xe_force_wake_put(fw, ref);
> > > }
> > > 
> > > or:
> > > 
> > > xe_wakeref_t ref;
> > > 
> > > ref = xe_force_wake_get(fw, ALL);
> > > if (xe_wakeref_has_domain(ref, d1))
> > >     // ... critical work1
> > > if (xe_wakeref_has_domain(ref, d2))
> > >     // ... critical work2
> > > xe_force_wake_put(fw, ref);
> > > 
> > > 
> > > so above will be very similar to what you have but by having explicit
> > > types IMO it will help connect all functions into proper use-case flow
> > 
> > 
> > Agreed implementation/usage will be same, will use explicit type for
> > clarity.
> > IMO typedef unsigned int xe_wakeref_t is sufficient instead of
> > typedef unsigned long xe_wakeref_t;
> 
> I agree with this.
> 

What? Really? I thought it was pretty clear rule in kernel programing
not use typedefs [1]. Reading through conditions acceptable and I don't
use anything applies to this series, maybe a) applies but not really
convinced. The example in a) is a pte_t which can likely change based on
platform target whereas here we only have one target and see no reason
this needs to be opaque.

Matt

[1] https://www.kernel.org/doc/html/v4.14/process/coding-style.html#typedefs 

> Regards,
> Badal
> > 
> > 
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > // for single domain where ret=0 is success, ret<0 is error
> > > > 
> > > > This leads to caller only calling xe_force_wake_put incase of get
> > > > success. so in case of caller continuing with failure, he will need to
> > > > ensure the put is not called.
> > > > 
> > > > for example:
> > > > int ret;
> > > > 
> > > > ret = xe_force_wake_get(fw, DOMAIN_GT);
> > > > XE_WARN_ON(ret)
> > > > if(!ret)
> > > >      xe_force_wake_put(fw, DOMAIN_GT);
> > > > 
> > > > > int xe_force_wake_get(fw, enum xe_force_wake_domain_id id);
> > > > > void xe_force_wake_put(fw, enum xe_force_wake_domain_id id);
> > > > > 
> > > > > and
> > > > > 
> > > > > // for all domain where ret=0 is success, ret<0 is error
> > > > > int int xe_force_wake_get_all(fw);
> > > > > void xe_force_wake_put_all(fw);
> > > > 
> > > > In case of xe_force_wake_get_all(fw) failure, how the caller will know
> > > > which domains got awake and which failed ?
> > > > 
> > > > ret = xe_force_wake_get_all(fw);
> > > > if(!ret)
> > > >     No way to put awake domains to sleep
> > > 
> > > in case of failure, it would be the responsibility of the
> > > xe_force_wake_get_all() to put all partial awakes immediately, since it
> > > failed to awake all requested domains (same as in single domain case)
> > > 
> > > but let's drop this idea
> > > 
> > > > 
> > > > > 
> > > > > and
> > > > > 
> > > > > // input: mask of domains, return: mask of domain
> > > > > unsigned int xe_force_wake_get_mask(fw, mask);
> > > > > void xe_force_wake_put_mask(fw, mask);
> > > > > 
> > > > > this last one can be just main implementation (static or public if we
> > > > > really want to continue with random set of enabled domains)
> > > > > 
> > > > > > + */
> > > > > >    int xe_force_wake_get(struct xe_force_wake *fw,
> > > > > >                  enum xe_force_wake_domains domains)
> > > > > >    {
> > > > > >        struct xe_gt *gt = fw->gt;
> > > > > >        struct xe_force_wake_domain *domain;
> > > > > > -    enum xe_force_wake_domains tmp, woken = 0;
> > > > > > +    enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
> > > > > 
> > > > > it looks that you're abusing even more all enum variables by treating
> > > > > them as plain integers
> > > > 
> > > > Miss at my end. Will address them in next version.
> > > > 
> > > > > 
> > > > > >        unsigned long flags;
> > > > > > -    int ret = 0;
> > > > > > +    int ret = domains;
> > > > > >          spin_lock_irqsave(&fw->lock, flags);
> > > > > >        for_each_fw_domain_masked(domain, domains, fw, tmp) {
> > > > > >            if (!domain->ref++) {
> > > > > > -            woken |= BIT(domain->id);
> > > > > > +            awake_rqst |= BIT(domain->id);
> > > > > >                domain_wake(gt, domain);
> > > > > >            }
> > > > > >        }
> > > > > > -    for_each_fw_domain_masked(domain, woken, fw, tmp) {
> > > > > > -        ret |= domain_wake_wait(gt, domain);
> > > > > > +    for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
> > > > > > +        if (domain_wake_wait(gt, domain) == 0) {
> > > > > > +            awake_ack |= BIT(domain->id);
> > > > > > +        } else {
> > > > > > +            ret &= ~BIT(domain->id);
> > > > > > +            --domain->ref;
> > > > > > +        }
> > > > > >        }
> > > > > > -    fw->awake_domains |= woken;
> > > > > > +
> > > > > > +    fw->awake_domains |= awake_ack;
> > > > > >        spin_unlock_irqrestore(&fw->lock, flags);
> > > > > >          return ret;
> 


More information about the Intel-xe mailing list