[RFC 1/9] drm/xe: Error handling in xe_force_wake_get()
Nilawar, Badal
badal.nilawar at intel.com
Tue Sep 10 18:27:56 UTC 2024
On 06-09-2024 21:48, Rodrigo Vivi wrote:
> On Fri, Sep 06, 2024 at 01:32:38AM +0530, Ghimiray, Himal Prasad wrote:
>>
>>
>> On 06-09-2024 00:59, Rodrigo Vivi wrote:
>>> On Fri, Aug 30, 2024 at 10:53:18AM +0530, Himal Prasad Ghimiray wrote:
>>>> If an acknowledgment timeout occurs for a domain awake request, put to
>>>> sleep all domains awakened by the caller and decrease the reference
>>>> count for all requested domains. This prevents xe_force_wake_get() from
>>>> leaving an unhandled reference count in case of failure.
>>>> While at it, add simple kernel-doc for xe_force_wake_get() and
>>>> xe_force_wake_put() functions.
>>>>
>>>> Cc: Badal Nilawar <badal.nilawar at intel.com>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>> Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
>>>> ---
>>>> drivers/gpu/drm/xe/xe_force_wake.c | 52 +++++++++++++++++++++++++++---
>>>> 1 file changed, 47 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_force_wake.c b/drivers/gpu/drm/xe/xe_force_wake.c
>>>> index b263fff15273..8aa8d9b41052 100644
>>>> --- a/drivers/gpu/drm/xe/xe_force_wake.c
>>>> +++ b/drivers/gpu/drm/xe/xe_force_wake.c
>>>> @@ -150,31 +150,73 @@ static int domain_sleep_wait(struct xe_gt *gt,
>>>> (ffs(tmp__) - 1))) && \
>>>> domain__->reg_ctl.addr)
>>>> +/**
>>>> + * xe_force_wake_get : Increase the domain refcount; if it was 0 initially, wake the domain
>>>> + * @fw: struct xe_force_wake
>>>> + * @domains: forcewake domains to get refcount on
>>>> + *
>>>> + * Increment refcount for the force-wake domain. If the domain is
>>>> + * asleep, awaken it and wait for acknowledgment within the specified
>>>> + * timeout. If a timeout occurs, decrement the refcount and put the
>>>> + * caller awaken domains to sleep.
>>>> + *
>>>> + * Return: 0 on success or 1 on ack timeout from domains.
>>>
>>> * Returns 0 for success, negative error code otherwise.
>>
>> Hi Rodrigo,
>>
>> Sure. Will fix in next version.
>>
>>>
>>>> + */
>>>> int xe_force_wake_get(struct xe_force_wake *fw,
>>>> enum xe_force_wake_domains domains)
>>>> {
>>>> struct xe_gt *gt = fw->gt;
>>>> struct xe_force_wake_domain *domain;
>>>> - enum xe_force_wake_domains tmp, woken = 0;
>>>> + enum xe_force_wake_domains tmp, awake_rqst = 0, awake_ack = 0;
>>>> unsigned long flags;
>>>> int ret = 0;
>>>> spin_lock_irqsave(&fw->lock, flags);
>>>> for_each_fw_domain_masked(domain, domains, fw, tmp) {
>>>> if (!domain->ref++) {
>>>> - woken |= BIT(domain->id);
>>>> + awake_rqst |= BIT(domain->id);
>>>> domain_wake(gt, domain);
>>>> }
>>>> }
>>>> - for_each_fw_domain_masked(domain, woken, fw, tmp) {
>>>> - ret |= domain_wake_wait(gt, domain);
>>>
>>> now you suppress the mmio error code...
>>> should be better to find a way to propagate that.
>>
>>
>> AFAIU the only possible error code from domain_wake_wait is -ETIMEDOUT, was
>> planning to assign same to ret below, which I missed in the RFC.
>>
>>
>>>
>>>> + for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>> + if (domain_wake_wait(gt, domain) == 0)
>>>> + awake_ack |= BIT(domain->id);
>>>> + }
>>>> +
>>>> + ret = (awake_ack == awake_rqst) ? 0 : 1;
>>>
>>> s/1/-EIO/ ?
>>
>> How about -ETIMEDOUT? It is the same error that will be propagated in
>> case of a domain_wake_wait failure.
>
> hmm, I guess it makes more sense indeed.
In the patch 9 discussion we are aligning on returning a mask of awake
domains. Make sure that whenever _get is required to return an error
code, -ETIMEDOUT is maintained. Maybe document this as a guideline.
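
A minimal user-space sketch of that convention, not the driver code. The
model_* names, the four-domain count and the ack_fails test knob are
hypothetical. It also assumes, as the commit message states, that on
failure the reference is dropped for every requested domain:

```c
#include <assert.h>
#include <errno.h>

#define MODEL_NUM_DOMAINS 4u	/* hypothetical domain count for this model */

struct model_fw {
	unsigned int ref[MODEL_NUM_DOMAINS];	/* per-domain reference count */
	unsigned int awake_domains;		/* domains that acked the wake request */
	unsigned int ack_fails;			/* test knob: domains whose ack times out */
};

/* Stand-in for domain_wake_wait(): 0 on ack, -ETIMEDOUT otherwise. */
static int model_wake_wait(const struct model_fw *fw, unsigned int id)
{
	return (fw->ack_fails & (1u << id)) ? -ETIMEDOUT : 0;
}

static int model_force_wake_get(struct model_fw *fw, unsigned int domains)
{
	unsigned int awake_rqst = 0, awake_ack = 0, id;

	/* Take a reference on every requested domain; remember first users. */
	for (id = 0; id < MODEL_NUM_DOMAINS; id++)
		if ((domains & (1u << id)) && !fw->ref[id]++)
			awake_rqst |= 1u << id;

	/* Collect acks only for the domains this call tried to wake. */
	for (id = 0; id < MODEL_NUM_DOMAINS; id++)
		if ((awake_rqst & (1u << id)) && model_wake_wait(fw, id) == 0)
			awake_ack |= 1u << id;

	/* An ack can only come from a domain that was requested. */
	assert((awake_ack & ~awake_rqst) == 0);

	if (awake_ack == awake_rqst) {
		fw->awake_domains |= awake_ack;
		return 0;
	}

	/* Failure (assumption): drop the reference on all requested domains. */
	for (id = 0; id < MODEL_NUM_DOMAINS; id++)
		if (domains & (1u << id))
			fw->ref[id]--;

	return -ETIMEDOUT;
}
```

With this shape a caller only ever sees 0 or -ETIMEDOUT from the get
path, and a failed get leaves the reference counts as they were before
the call.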
>
>>
>>>
>>>> +
>>>> + /*
>>>> + * If @domains is XE_FORCEWAKE_ALL and an acknowledgment times out
>>>> + * for any domain, decrease the reference count and put the awake
>>>> + * domains to sleep. For individual domains, just decrement the
>>>> + * reference count.
>>>> + */
>>>> + if (ret) {
>>>> + for_each_fw_domain_masked(domain, awake_rqst, fw, tmp) {
>>>> + if (!--domain->ref && (awake_ack & BIT(domain->id)))
>>>> + domain_sleep(gt, domain);
>>>
>>> wonder if it would help to extract this in a separate function to be
>>> used here and in the -put function.
>>
>> Let me think around that.
>>
>>>
>>> But more than that, I have a question here...
>>> Do we really need to sleep other domains if we are not getting an ack from a certain domain?
>>> Doesn't it generally mean that we are busted anyway?
>>
>> I have no strong opinion on this, main thing is refcount shouldn't be
>> incremented.
>>
>>>
>>> But also, if we really need to sleep, then shouldn't we also call the
>>> sleep function for the domains that didn't ack? Perhaps the ack
>>> timed out, but the domain really woke up? How sure are we that this is not possible?
>>
>> I didn't want to change the hw state by calling sleep for the "ack failed"
>> domain, so if necessary, Debug tools (PythonSV) can help us pinpoint the
>> exact failure state of the HW registers.
Agreed, let’s avoid putting a failed domain to sleep, as leaving it
untouched will aid in debugging. It’s possible that the acknowledgment
timed out but the domain still woke up. As discussed in patch 9,
subsequent forcewake get/put calls will put the domain to sleep. The
only concern is that if the device is idle and forcewake is triggered
via a sysfs/debugfs entry, the domain may remain awake until a
forcewake get/put call is made.
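
The split between the domains that get put back to sleep and the ones
left untouched for debugging can be sketched with two mask helpers. This
is only an illustration: both function names are hypothetical, and it
assumes awake_rqst and awake_ack carry the meaning they have in the
quoted hunk:

```c
#include <stdint.h>

/*
 * Domains this call woke and that acked: after a failed get they are
 * put back to sleep.
 */
static uint32_t model_domains_to_sleep(uint32_t awake_rqst, uint32_t awake_ack)
{
	return awake_rqst & awake_ack;
}

/*
 * Domains this call tried to wake but that never acked: their hardware
 * state is left as is so the failure can be inspected.
 */
static uint32_t model_domains_left_for_debug(uint32_t awake_rqst,
					     uint32_t awake_ack)
{
	return awake_rqst & ~awake_ack;
}
```

For example, with awake_rqst = 0x7 and awake_ack = 0x5, domains 0 and 2
are put to sleep and domain 1 keeps whatever state the hardware left it
in.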
Regards,
Badal
>>
>>
>>>
>>>> + }
>>>> + awake_ack = 0;
>>>> }
>>>> - fw->awake_domains |= woken;
>>>> +
>>>> + fw->awake_domains |= awake_ack;
>>>> spin_unlock_irqrestore(&fw->lock, flags);
>>>> return ret;
>>>> }
>>>> +/**
>>>> + * xe_force_wake_put - Decrement the refcount and put domain to sleep if refcount becomes 0
>>>> + * @fw: Pointer to the force wake structure
>>>> + * @domains: forcewake domains to put reference
>>>> + *
>>>> + * This function reduces the reference counts for specified domains. If
>>>> + * refcount for any of the specified domain reaches 0, it puts the domain to sleep
>>>> + * and waits for acknowledgment for domain to sleep within specified timeout.
>>>> + * Ensure this function is called only in case of successful xe_force_wake_get().
>>>> + *
>>>> + * Returns 0 in case of success or non-zero in case of timeout of ack
>>>> + */
>>>> int xe_force_wake_put(struct xe_force_wake *fw,
>>>> enum xe_force_wake_domains domains)
>>>> {
>>>> --
>>>> 2.34.1
>>>>