[RFC 1/9] drm/xe: Error handling in xe_force_wake_get()

Ghimiray, Himal Prasad himal.prasad.ghimiray at intel.com
Thu Sep 5 20:02:38 UTC 2024



On 06-09-2024 00:59, Rodrigo Vivi wrote:
> On Fri, Aug 30, 2024 at 10:53:18AM +0530, Himal Prasad Ghimiray wrote:
>> If an acknowledgment timeout occurs for a domain awake request, put to
>> sleep all domains awakened by the caller and decrease the reference
>> count for all requested domains. This prevents xe_force_wake_get() from
>> leaving an unhandled reference count in case of failure.
>> While at it, add simple kernel-doc for xe_force_wake_get() and
>> xe_force_wake_put() functions.
>>
>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_force_wake.c | 52 +++++++++++++++++++++++++++---
>>   1 file changed, 47 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c b/drivers/gpu/drm/xe/xe_force_wake.c
>> index b263fff15273..8aa8d9b41052 100644
>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>> @@ -150,31 +150,73 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>   					 (ffs(tmp__) - 1))) && \
>>   					 domain__->reg_ctl.addr)
>>   
>> +/**
>> + * xe_force_wake_get : Increase the domain refcount; if it was 0 initially, wake the domain
>> + * @fw: struct xe_force_wake
>> + * @domains: forcewake domains to get refcount on
>> + *
>> + * Increment refcount for the force-wake domain. If the domain is
>> + * asleep, awaken it and wait for acknowledgment within the specified
>> + * timeout. If a timeout occurs, decrement the refcount and put the
>> + * caller awaken domains to sleep.
>> + *
>> + * Return: 0 on success or 1 on ack timeout from domains.
> 
> * Returns 0 for success, negative error code otherwise.

Hi Rodrigo,

Sure. Will fix in next version.

> 
>> + */
>>   int xe_force_wake_get(struct xe_force_wake *fw,
>>   		      enum xe_force_wake_domains domains)
>>   {
>>   	struct xe_gt *gt = fw->gt;
>>   	struct xe_force_wake_domain *domain;
>> -	enum xe_force_wake_domains tmp, woken = 0;
>> +	enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>   	unsigned long flags;
>>   	int ret = 0;
>>   
>>   	spin_lock_irqsave(&fw->lock, flags);
>>   	for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>   		if (!domain->ref++) {
>> -			woken |= BIT(domain->id);
>> +			awake_rqst |= BIT(domain->id);
>>   			domain_wake(gt, domain);
>>   		}
>>   	}
>> -	for_each_fw_domain_masked(domain, woken, fw, tmp) {
>> -		ret |= domain_wake_wait(gt, domain);
> 
> now you suppress the mmio error code...
> should be better to find a way to propagate that.


AFAIU the only possible error code from domain_wake_wait() is -ETIMEDOUT.
I was planning to assign the same to ret below, but missed that in the RFC.
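
To make the propagation concrete, here is a minimal userspace sketch, not the driver code. model_wake_wait() and the ack_ok mask are hypothetical stand-ins for domain_wake_wait() and the hardware; the loop keeps the first error it sees instead of OR-ing results together or collapsing them to 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for domain_wake_wait(): 0 on ack, -ETIMEDOUT otherwise. */
static int model_wake_wait(unsigned int ack_ok, unsigned int id)
{
	return (ack_ok & (1u << id)) ? 0 : -ETIMEDOUT;
}

/*
 * Wait on every domain set in @requested. Acked domains are recorded in
 * @acked. Returns 0 if all acked, otherwise the first error encountered.
 */
static int model_wait_all(unsigned int requested, unsigned int ack_ok,
			  unsigned int *acked)
{
	int ret = 0;
	unsigned int id;

	assert(acked != NULL);
	*acked = 0;

	for (id = 0; id < 32; id++) {
		int err;

		if (!(requested & (1u << id)))
			continue;

		err = model_wake_wait(ack_ok, id);
		if (!err)
			*acked |= 1u << id;
		else if (!ret)
			ret = err;	/* keep the first error code */
	}

	return ret;
}
```

With this shape the caller sees whatever the wait helper returned, so the error stays correct even if the helper later grows other error codes.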


> 
>> +	for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>> +		if (domain_wake_wait(gt, domain) == 0)
>> +			awake_ack |= BIT(domain->id);
>> +	}
>> +
>> +	ret = (awake_ack == awake_rqst) ? 0 : 1;
> 
> s/1/-EIO/ ?

How about -ETIMEDOUT? That is the same error that will be propagated
in case of domain_wake_wait() failure.

> 
>> +
>> +	/*
>> +	 * If @domains is XE_FORCEWAKE_ALL and an acknowledgment times out
>> +	 * for any domain, decrease the reference count and put the awake
>> +	 * domains to sleep. For individual domains, just decrement the
>> +	 * reference count.
>> +	 */
>> +	if (ret) {
>> +		for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>> +			if (!--domain->ref && (awake_ack & BIT(domain->id)))
>> +				domain_sleep(gt, domain);
> 
> wonder if it would help to extract this in a separate function to be
> used here and in the -put function.

Let me think about that.
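
One possible shape for such a helper, as a rough userspace sketch only. struct model_fw, MODEL_NUM_DOMAINS and model_fw_drop_refs() are hypothetical names for illustration, and the real helper would issue domain_sleep() where the sketch just clears a bit.

```c
#include <assert.h>

#define MODEL_NUM_DOMAINS 4	/* hypothetical domain count for the sketch */

struct model_fw {
	unsigned int ref[MODEL_NUM_DOMAINS];	/* per-domain reference counts */
	unsigned int awake;			/* domains currently awake */
};

/*
 * Drop one reference on each domain in @domains. A domain whose count
 * reaches zero is put to sleep only if it is also set in @sleep_mask.
 * Returns the mask of domains for which sleep was requested.
 */
static unsigned int model_fw_drop_refs(struct model_fw *fw,
				       unsigned int domains,
				       unsigned int sleep_mask)
{
	unsigned int slept = 0;
	unsigned int id;

	for (id = 0; id < MODEL_NUM_DOMAINS; id++) {
		unsigned int bit = 1u << id;

		if (!(domains & bit))
			continue;

		assert(fw->ref[id] > 0);	/* an unbalanced put would underflow */
		if (!--fw->ref[id] && (sleep_mask & bit)) {
			slept |= bit;
			fw->awake &= ~bit;	/* stands in for domain_sleep() */
		}
	}

	return slept;
}
```

In this sketch the put path would pass the same mask for both arguments, while the failure path in get would pass the acked mask as sleep_mask, so domains that never acked are left untouched.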

> 
> But more than that, I have a question here...
> Do we really need to sleep other domains if we are not getting ack from certain domain?
> Doesn't it generally means that we are busted anyway?

I have no strong opinion on this. The main thing is that the refcount
shouldn't stay incremented when the get fails.
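
As a rough illustration of that invariant, here is a small userspace model, not the driver function. struct fwm, fwm_get() and the ack_ok field are hypothetical, and undoing the increment on every domain in the request mask is an assumption of the sketch about how to meet the goal.

```c
#include <assert.h>
#include <errno.h>

#define FWM_NUM_DOMAINS 3	/* hypothetical domain count */

struct fwm {
	unsigned int ref[FWM_NUM_DOMAINS];	/* per-domain refcounts */
	unsigned int awake;			/* acked, awake domains */
	unsigned int ack_ok;			/* simulated HW: domains that will ack */
};

/* Model of a get that leaves refcounts unchanged on failure. */
static int fwm_get(struct fwm *fw, unsigned int domains)
{
	unsigned int rqst = 0, ack, id;

	/* Take a reference; remember the domains that went 0 -> 1. */
	for (id = 0; id < FWM_NUM_DOMAINS; id++)
		if ((domains & (1u << id)) && !fw->ref[id]++)
			rqst |= 1u << id;

	ack = rqst & fw->ack_ok;	/* stands in for the wait loop */
	if (ack == rqst) {
		fw->awake |= ack;
		return 0;
	}

	/* Failure: undo every reference taken by this call. */
	for (id = 0; id < FWM_NUM_DOMAINS; id++) {
		if (!(domains & (1u << id)))
			continue;
		assert(fw->ref[id] > 0);
		fw->ref[id]--;
	}

	return -ETIMEDOUT;
}
```

A caller of such a get only needs to call put after a successful return, since a failed call has already given its references back.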

> 
> But also, if we really need to sleep, then perhaps shouldn't we also
> call the sleep function even from the guys who didn't ack? perhaps the ack
> timedout, but it really woke-up? how sure we are that this is not possible?

I didn't want to change the HW state by calling sleep for the "ack
failed" domains, so that, if necessary, debug tools (PythonSV) can help us
pinpoint the exact failure state of the HW registers.


> 
>> +		}
>> +		awake_ack = 0;
>>   	}
>> -	fw->awake_domains |= woken;
>> +
>> +	fw->awake_domains |= awake_ack;
>>   	spin_unlock_irqrestore(&fw->lock, flags);
>>   
>>   	return ret;
>>   }
>>   
>> +/**
>> + * xe_force_wake_put - Decrement the refcount and put domain to sleep if refcount becomes 0
>> + * @fw: Pointer to the force wake structure
>> + * @domains: forcewake domains to put reference
>> + *
>> + * This function reduces the reference counts for specified domains. If
>> + * refcount for any of the specified domain reaches 0, it puts the domain to sleep
>> + * and waits for acknowledgment for domain to sleep within specified timeout.
>> + * Ensure this function is called only in case of successful xe_force_wake_get().
>> + *
>> + * Returns 0 in case of success or non-zero in case of timeout of ack
>> + */
>>   int xe_force_wake_put(struct xe_force_wake *fw,
>>   		      enum xe_force_wake_domains domains)
>>   {
>> -- 
>> 2.34.1
>>

