[RFC 1/9] drm/xe: Error handling in xe_force_wake_get()

Upadhyay, Tejas tejas.upadhyay at intel.com
Wed Sep 11 06:40:27 UTC 2024



> -----Original Message-----
> From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of
> Ghimiray, Himal Prasad
> Sent: Friday, September 6, 2024 1:33 AM
> To: Vivi, Rodrigo <rodrigo.vivi at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Nilawar, Badal <badal.nilawar at intel.com>;
> De Marchi, Lucas <lucas.demarchi at intel.com>; Das, Nirmoy
> <nirmoy.das at intel.com>
> Subject: Re: [RFC 1/9] drm/xe: Error handling in xe_force_wake_get()
> 
> 
> 
> On 06-09-2024 00:59, Rodrigo Vivi wrote:
> > On Fri, Aug 30, 2024 at 10:53:18AM +0530, Himal Prasad Ghimiray wrote:
> >> If an acknowledgment timeout occurs for a domain awake request, put
> >> to sleep all domains awakened by the caller and decrease the
> >> reference count for all requested domains. This prevents
> >> xe_force_wake_get() from leaving an unhandled reference count in case of
> >> failure.
> >> While at it, add simple kernel-doc for xe_force_wake_get() and
> >> xe_force_wake_put() functions.
> >>
> >> Cc: Badal Nilawar <badal.nilawar at intel.com>
> >> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> >> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
> >> Cc: Nirmoy Das <nirmoy.das at intel.com>
> >> Signed-off-by: Himal Prasad Ghimiray
> >> <himal.prasad.ghimiray at intel.com>
> >> ---
> >>   drivers/gpu/drm/xe/xe_force_wake.c | 52 +++++++++++++++++++++++++++---
> >>   1 file changed, 47 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
> >> b/drivers/gpu/drm/xe/xe_force_wake.c
> >> index b263fff15273..8aa8d9b41052 100644
> >> --- a/drivers/gpu/drm/xe/xe_force_wake.c
> >> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
> >> @@ -150,31 +150,73 @@ static int domain_sleep_wait(struct xe_gt *gt,
> >>   					 (ffs(tmp__) - 1))) && \
> >>   					 domain__->reg_ctl.addr)
> >>
> >> +/**
> >> + * xe_force_wake_get : Increase the domain refcount; if it was 0 initially, wake the domain
> >> + * @fw: struct xe_force_wake
> >> + * @domains: forcewake domains to get refcount on
> >> + *
> >> + * Increment refcount for the force-wake domain. If the domain is
> >> + * asleep, awaken it and wait for acknowledgment within the specified
> >> + * timeout. If a timeout occurs, decrement the refcount and put the
> >> + * caller awaken domains to sleep.
> >> + *
> >> + * Return: 0 on success or 1 on ack timeout from domains.
> >
> > * Returns 0 for success, negative error code otherwise.
> 
> Hi Rodrigo,
> 
> Sure. Will fix in next version.
> 
> >
> >> + */
> >>   int xe_force_wake_get(struct xe_force_wake *fw,
> >>   		      enum xe_force_wake_domains domains)
> >>   {
> >>   	struct xe_gt *gt = fw->gt;
> >>   	struct xe_force_wake_domain *domain;
> >> -	enum xe_force_wake_domains tmp, woken = 0;
> >> +	enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
> >>   	unsigned long flags;
> >>   	int ret = 0;
> >>
> >>   	spin_lock_irqsave(&fw->lock, flags);
> >>   	for_each_fw_domain_masked(domain, domains, fw, tmp) {
> >>   		if (!domain->ref++) {
> >> -			woken |= BIT(domain->id);
> >> +			awake_rqst |= BIT(domain->id);
> >>   			domain_wake(gt, domain);
> >>   		}
> >>   	}
> >> -	for_each_fw_domain_masked(domain, woken, fw, tmp) {
> >> -		ret |= domain_wake_wait(gt, domain);
> >
> > now you suppress the mmio error code...
> > should be better to find a way to propagate that.
> 
> 
> AFAIU the only possible error code from domain_wake_wait is -ETIMEDOUT,
> was planning to assign same to ret below, which I missed in the RFC.
> 
> 
> >
> >> +	for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
> >> +		if (domain_wake_wait(gt, domain) == 0)
> >> +			awake_ack |= BIT(domain->id);
> >> +	}
> >> +
> >> +	ret = (awake_ack == awake_rqst) ? 0 : 1;
> >
> > s/1/-EIO/ ?
> 
> How about -ETIMEDOUT? Since this is the same error which will be propagated
> in case of domain_wake_wait failure?
> 
> >
> >> +
> >> +	/*
> >> +	 * If @domains is XE_FORCEWAKE_ALL and an acknowledgment times out
> >> +	 * for any domain, decrease the reference count and put the awake
> >> +	 * domains to sleep. For individual domains, just decrement the
> >> +	 * reference count.
> >> +	 */
> >> +	if (ret) {
> >> +		for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
> >> +			if (!--domain->ref && (awake_ack & BIT(domain->id)))
> >> +				domain_sleep(gt, domain);
> >
> > wonder if it would help to extract this in a separate function to be
> > used here and in the -put function.
> 
> Let me think around that.
> 
> >
> > But more than that, I have a question here...
> > Do we really need to sleep other domains if we are not getting ack from
> > certain domain?
> > Doesn't it generally means that we are busted anyway?
> 
> I have no strong opinion on this, main thing is refcount shouldn't be
> incremented.
> 
> >
> > But also, if we really need to sleep, then perhaps shouldn't we also
> > call the sleep function even from the guys who didn't ack? perhaps the ack
> > timedout, but it really woke-up? how sure we are that this is not possible?
> 
> I didn't want to change the hw state by calling sleep for the "ack
> failed" domain, so if necessary, Debug tools (PythonSV) can help us
> pinpoint the exact failure state of the HW registers.

If the domain does wake up after the ack timeout, SW and HW state will be misaligned, so putting it to sleep even when no ack was received looks OK here, to realign HW with SW.
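
To make the point concrete, below is a minimal userspace sketch of the error path with the sleep applied to every newly requested domain, whether or not it acked. It is only a model under my own assumptions, not the driver code: struct fw_model, the model_* helpers, NUM_DOMAINS and ack_fail_mask are hypothetical names, and "hardware" is reduced to a bitmask of asserted wake requests. It also assumes -ETIMEDOUT as the return value, as proposed above, and rolls back the refcount of every requested domain, as the commit message describes.

```c
#include <assert.h>
#include <errno.h>

#define NUM_DOMAINS 4
#define BIT(n) (1u << (n))

/* Hypothetical userspace model; these are not the xe driver's structures. */
struct fw_model {
	unsigned int ref[NUM_DOMAINS];	/* software refcount per domain */
	unsigned int awake_domains;	/* software view of awake domains */
	unsigned int hw_wake_req;	/* wake requests asserted in "hardware" */
	unsigned int ack_fail_mask;	/* domains whose wake ack times out */
};

static void model_domain_wake(struct fw_model *fw, int id)
{
	fw->hw_wake_req |= BIT(id);
}

static void model_domain_sleep(struct fw_model *fw, int id)
{
	fw->hw_wake_req &= ~BIT(id);
}

/* Stand-in for the ack wait: fails only for domains in ack_fail_mask. */
static int model_domain_wake_wait(const struct fw_model *fw, int id)
{
	return (fw->ack_fail_mask & BIT(id)) ? -ETIMEDOUT : 0;
}

static int model_force_wake_get(struct fw_model *fw, unsigned int domains)
{
	unsigned int awake_rqst = 0, awake_ack = 0;
	int id, ret = 0;

	/* Take a reference on each requested domain; wake it if it was at 0. */
	for (id = 0; id < NUM_DOMAINS; id++) {
		if ((domains & BIT(id)) && !fw->ref[id]++) {
			awake_rqst |= BIT(id);
			model_domain_wake(fw, id);
		}
	}

	/* Collect acks for the domains this call asked to wake. */
	for (id = 0; id < NUM_DOMAINS; id++) {
		if (!(awake_rqst & BIT(id)))
			continue;
		if (model_domain_wake_wait(fw, id) == 0)
			awake_ack |= BIT(id);
		else
			ret = -ETIMEDOUT;
	}

	/*
	 * On failure, drop the reference on every requested domain and
	 * withdraw the wake request for each one that falls back to 0,
	 * including the domain that never acked, so that the hardware
	 * request state matches the software state again.
	 */
	if (ret) {
		for (id = 0; id < NUM_DOMAINS; id++) {
			if (!(domains & BIT(id)))
				continue;
			assert(fw->ref[id] > 0);
			if (!--fw->ref[id])
				model_domain_sleep(fw, id);
		}
		awake_ack = 0;
	}

	fw->awake_domains |= awake_ack;
	return ret;
}
```

In this model, if the sleep were skipped for the non-acked domain, its bit would stay set in hw_wake_req while its refcount and awake_domains say it is asleep, which is the misalignment I am referring to. A domain that was already held by another caller keeps its reference and its wake request.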

Thanks,
Tejas
>  
> 
> >
> >> +		}
> >> +		awake_ack = 0;
> >>   	}
> >> -	fw->awake_domains |= woken;
> >> +
> >> +	fw->awake_domains |= awake_ack;
> >>   	spin_unlock_irqrestore(&fw->lock, flags);
> >>
> >>   	return ret;
> >>   }
> >>
> >> +/**
> >> + * xe_force_wake_put - Decrement the refcount and put domain to sleep if refcount becomes 0
> >> + * @fw: Pointer to the force wake structure
> >> + * @domains: forcewake domains to put reference
> >> + *
> >> + * This function reduces the reference counts for specified domains. If
> >> + * refcount for any of the specified domain reaches 0, it puts the domain to sleep
> >> + * and waits for acknowledgment for domain to sleep within specified timeout.
> >> + * Ensure this function is called only in case of successful xe_force_wake_get().
> >> + *
> >> + * Returns 0 in case of success or non-zero in case of timeout of ack
> >> + */
> >>   int xe_force_wake_put(struct xe_force_wake *fw,
> >>   		      enum xe_force_wake_domains domains)
> >>   {
> >> --
> >> 2.34.1
> >>

