[PATCH] drm/amdgpu: support gpu recovery tests on compute rings

Christian König ckoenig.leichtzumerken at gmail.com
Fri Apr 26 08:33:27 UTC 2019


On 26.04.19 at 10:20, Quan, Evan wrote:
> My concern is there is already one module parameter "lockup_timeout".
> parm:           lockup_timeout:GPU lockup timeout in ms > 0 (default 10000) (int)
>
> Adding one more "timeout" seems redundant.
> And it would make the description of "lockup_timeout" (which reads as applying to all jobs) no longer match its real effect (it affects only non-compute jobs).
>
> A better way would be to rename "lockup_timeout" to "non-compute lockup_timeout". But I do not think we can change an existing module parameter. Right?

No, that's fine. Module parameters are not part of the API which needs 
to stay backward compatible.

Maybe use compute_lockup_timeout and other_lockup_timeout or something 
similar?
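
A rough sketch of how two such parameters could feed the per-ring timeout, written as plain userspace C rather than driver code. The names compute_lockup_timeout and other_lockup_timeout are only the suggestion above; every sketch_* identifier and the HZ value are hypothetical stand-ins, and falling back to the other timeout when the compute value is 0 under SRIOV is an assumption based on what the current code does.

```c
#include <limits.h>
#include <stdbool.h>

/* Stand-in for the kernel's MAX_SCHEDULE_TIMEOUT, which is LONG_MAX. */
#define SKETCH_MAX_SCHEDULE_TIMEOUT LONG_MAX
/* Hypothetical tick rate; the real HZ depends on the kernel config. */
#define SKETCH_HZ 250

/* Simplified stand-in for msecs_to_jiffies(): round up to whole ticks. */
static long sketch_msecs_to_jiffies(unsigned int ms)
{
	return (long)(((long long)ms * SKETCH_HZ + 999) / 1000);
}

/*
 * Pick the scheduler timeout for one ring.
 *
 * compute_lockup_timeout_ms == 0 means "no timeout" on compute rings,
 * except under SRIOV, where (by assumption) the other timeout applies.
 * Any non-zero value is taken as the timeout in milliseconds.
 */
static long sketch_ring_timeout(bool is_compute, bool is_sriov_vf,
				unsigned int compute_lockup_timeout_ms,
				unsigned int other_lockup_timeout_ms)
{
	if (!is_compute)
		return sketch_msecs_to_jiffies(other_lockup_timeout_ms);

	if (compute_lockup_timeout_ms != 0)
		return sketch_msecs_to_jiffies(compute_lockup_timeout_ms);

	if (is_sriov_vf)
		return sketch_msecs_to_jiffies(other_lockup_timeout_ms);

	return SKETCH_MAX_SCHEDULE_TIMEOUT;
}
```

With this shape, a GPU recovery test on a compute ring would only need a non-zero compute timeout, and the default behaviour on bare metal stays as it is today.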

Regards,
Christian.

>
> Regards,
> Evan
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>> Christian König
>> Sent: Friday, April 26, 2019 3:34 PM
>> To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Xu, Feifei <Feifei.Xu at amd.com>; Cui, Flora <Flora.Cui at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: support gpu recovery tests on compute
>> rings
>>
>> On 26.04.19 at 09:24, Evan Quan wrote:
>>> A new module parameter is added for determining whether or not to
>>> enforce timeout on compute jobs.
>> Can we rework that a bit and instead of a bool have a separate millisecond
>> timeout for compute?
>>
>> E.g. default is 0 and that means MAX_SCHEDULE_TIMEOUT unless we are
>> under SRIOV.
>> Any other value is just the timeout in milliseconds.
>>
>> Christian.
>>
>>> Change-Id: If14b75977312e42dac0431072456e5b69cf1bc2f
>>> Signed-off-by: Evan Quan <evan.quan at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h       | 1 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   | 8 ++++++++
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 3 ++-
>>>    3 files changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index e16dcee2bf75..ee624d993df7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -166,6 +166,7 @@ extern int amdgpu_si_support;
>>>    #ifdef CONFIG_DRM_AMDGPU_CIK
>>>    extern int amdgpu_cik_support;
>>>    #endif
>>> +extern bool amdgpu_compute_timeout_enforced;
>>>
>>>    #define AMDGPU_VM_MAX_NUM_CTX			4096
>>>    #define AMDGPU_SG_THRESHOLD			(256*1024*1024)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 13a68f62bcc8..91de3e90fae9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -140,6 +140,7 @@ struct amdgpu_mgpu_info mgpu_info = {
>>>    };
>>>    int amdgpu_ras_enable = -1;
>>>    uint amdgpu_ras_mask = 0xffffffff;
>>> +bool amdgpu_compute_timeout_enforced = false;
>>>
>>>    /**
>>>     * DOC: vramlimit (int)
>>> @@ -234,6 +235,13 @@ module_param_named(msi, amdgpu_msi, int, 0444);
>>>    MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms > 0 (default 10000)");
>>>    module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, 0444);
>>>
>>> +/**
>>> + * DOC: compute_timeout_enforced (bool)
>>> + * Whether or not to enforce timeout on compute jobs (1 = enable, 0 = disable). The default is 0.
>>> + */
>>> +MODULE_PARM_DESC(compute_timeout_enforced, "Enforce timeout on compute jobs (1 = enable, 0 = disable (default))");
>>> +module_param_named(compute_timeout_enforced, amdgpu_compute_timeout_enforced, bool, 0444);
>>> +
>>>    /**
>>>     * DOC: dpm (int)
>>>     * Override for dynamic power management setting (1 = enable, 0 = disable). The default is -1 (auto).
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 4dee2326b29c..4adffad04dbc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -453,7 +453,8 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>    	if (ring->funcs->type != AMDGPU_RING_TYPE_KIQ) {
>>>    		/* for non-sriov case, no timeout enforce on compute ring */
>>>    		if ((ring->funcs->type == AMDGPU_RING_TYPE_COMPUTE)
>>> -				&& !amdgpu_sriov_vf(ring->adev))
>>> +				&& !amdgpu_sriov_vf(ring->adev)
>>> +				&& !amdgpu_compute_timeout_enforced)
>>>    			timeout = MAX_SCHEDULE_TIMEOUT;
>>>    		else
>>>    			timeout = msecs_to_jiffies(amdgpu_lockup_timeout);
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx