[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.
Lazar, Lijo
lijo.lazar at amd.com
Fri May 6 06:02:40 UTC 2022
On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote:
>
> On 2022-05-05 15:49, Felix Kuehling wrote:
>>
>> Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-05-05 11:06, Christian König wrote:
>>>> Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 2022-05-05 09:23, Christian König wrote:
>>>>>> Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky:
>>>>>>> On 2022-05-05 06:09, Christian König wrote:
>>>>>>>
>>>>>>>> Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky:
>>>>>>>>> Problem:
>>>>>>>>> During a hive reset caused by a command timing out on a ring,
>>>>>>>>> extra resets are triggered by KFD, which is unable to
>>>>>>>>> access registers on the resetting ASIC.
>>>>>>>>>
>>>>>>>>> Fix: Rework GPU reset to use a list of pending reset jobs
>>>>>>>>> such that the first reset job that actually resets the entire
>>>>>>>>> reset domain will cancel all the pending redundant resets.
>>>>>>>>>
>>>>>>>>> This is in line with what we already do for redundant TDRs
>>>>>>>>> in scheduler code.
>>>>>>>>
>>>>>>>> Mhm, why exactly do you need the extra linked list then?
>>>>>>>>
>>>>>>>> Let's talk about that on our call today.
>>>>>>>
>>>>>>>
>>>>>>> Going to miss it as you know, and also this is the place to
>>>>>>> discuss technical questions anyway so -
>>>>>>
>>>>>> Good point.
>>>>>>
>>>>>>> It's needed because those other resets are not timeout handlers
>>>>>>> governed by the scheduler, but rather external resets triggered
>>>>>>> by such clients as KFD, RAS and sysfs. The scheduler has no
>>>>>>> knowledge of them (and should not have), but they are serialized
>>>>>>> into the same wq as the TO handlers from the scheduler. It just
>>>>>>> happens that a TO-triggered reset in turn causes another reset
>>>>>>> (from KFD in this case), and we want to prevent this second
>>>>>>> reset from taking place, just as we want to avoid multiple TO
>>>>>>> resets taking place in scheduler code.
>>>>>>
>>>>>> Yeah, but why do you need multiple workers?
>>>>>>
>>>>>> You have a single worker for the GPU reset not triggered by the
>>>>>> scheduler in your adev and cancel that at the end of the reset
>>>>>> procedure.
>>>>>>
>>>>>> If anybody thinks it needs to trigger another reset while in reset
>>>>>> (which is actually a small design bug separately) the reset will
>>>>>> just be canceled in the same way we cancel the scheduler resets.
>>>>>>
>>>>>> Christian.
>>>>>
>>>>>
>>>>> Had this in mind at first, but then I realized that each client
>>>>> (RAS, KFD and sysfs) will want to fill its own data for the work
>>>>> (see amdgpu_device_gpu_recover) - for an XGMI hive each will want
>>>>> to set its own adev (which is fine if you set a work per adev as
>>>>> you suggest), but each client might also want to set its own bad
>>>>> job value (they all put NULL there today, but in theory could set
>>>>> it in the future), and here you might have a collision.
>>>>
>>>> Yeah, but that is intentional. See when we have a job that needs to
>>>> be consumed by the reset handler and not overwritten or something.
>>>
>>>
>>> I am not sure why this is a requirement; multiple clients can decide
>>> concurrently to trigger a reset (possibly for independent reasons),
>>> hence they cannot share the same work struct to pass their data to
>>> it.
>>
>> If those concurrent clients could detect that a reset was already in
>> progress, you wouldn't need the complexity of multiple work structs
>> being scheduled. You could simply return without triggering another
>> reset.
>
>
> In my view the main problem with a single work struct, either at reset
> domain level or even adev level, is that in some cases we optimize
> resets and don't really perform an ASIC HW reset (see
> amdgpu_job_timedout with soft recovery, and skip_hw_reset in
> amdgpu_device_gpu_recover_imp for the case where the bad job gets
> signaled just before we start the HW reset and we just skip it). If
> many different reset sources share the same work struct, what can
> happen is that the first to obtain the lock you describe below might
> opt out of a full HW reset, because its bad job did signal, for
> example, or because its hung IP block was able to recover through a SW
> reset; but in the meantime another reset source that needed an actual
> HW reset just silently returned, and we end up with an unhandled reset
> request. True, today this happens only to job timeout reset sources
> that are handled from within the scheduler and won't use this single
> work struct, but nothing prevents a future case of this happening.
> Also, if we actually want to unify scheduler timeout handlers within
> the reset domain (which seems to me the right design approach), we
> won't be able to use just one work struct for this reason anyway.
>
Just to add to this point - a reset domain is a co-operative domain. In
addition to sharing a set of clients from various reset sources for one
device, it will also have a set of devices, as in an XGMI hive. A job
timeout on one device may not eventually result in a reset, but a RAS
error happening on another device at the same time would need one.
The second device's RAS error cannot just return on seeing that a reset
work has already started, nor ignore the reset work given that another
device has filled in the reset data.
When there is a reset domain, it should take care of the scheduled
work; keeping it at the device or any other level doesn't sound good.
Thanks,
Lijo
> Andrey
>
>
>>
>> I'd put the reset work struct into the reset_domain struct. That way
>> you'd have exactly one worker for the reset domain. You could
>> implement a lock-less scheme to decide whether you need to schedule a
>> reset, e.g. using an atomic counter in the shared work struct that
>> gets incremented when a client wants to trigger a reset
>> (atomic_add_return). If that counter is exactly 1 after incrementing,
>> you need to fill in the rest of the work struct and schedule the work.
>> Otherwise, it's already scheduled (or another client is in the process
>> of scheduling it) and you just return. When the worker finishes (after
>> confirming a successful reset), it resets the counter to 0, so the
>> next client requesting a reset will schedule the worker again.
>>
>> Regards,
>> Felix
>>
>>
>>>
>>>
>>>>
>>>>
>>>> In addition to that, keep in mind that you can't allocate any
>>>> memory before or during the GPU reset, nor wait for the reset to
>>>> complete (so you can't allocate anything on the stack either).
>>>
>>>
>>> There is no dynamic allocation here. Regarding stack allocations -
>>> we do them all the time when we call functions, even during GPU
>>> resets; how is the on-stack allocation of the work struct in
>>> amdgpu_device_gpu_recover different from any other local variable we
>>> allocate in any function we call?
>>>
>>> I am also not sure why it's not allowed to wait for the reset to
>>> complete. Also, see amdgpu_ras_do_recovery and gpu_recover_get
>>> (debugfs) - the caller expects the reset to complete before
>>> returning. I can probably work around it in RAS code by calling
>>> atomic_set(&ras->in_recovery, 0) from some callback within the
>>> actual reset function, but sysfs actually expects a result returned
>>> indicating whether the call was successful or not.
>>>
>>> Andrey
>>>
>>>
>>>>
>>>> I don't think the concept you're trying here will work.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Also, in general it seems to me a cleaner approach where this
>>>>> logic (the work items) is held and handled in the reset_domain,
>>>>> and not split across each adev or any other entity. We might even
>>>>> want to move the scheduler handling into the reset domain in the
>>>>> future, since the reset domain is supposed to be a generic thing
>>>>> and not only for AMD.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>>>> Tested-by: Bai Zoy <Zoy.Bai at amd.com>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 11 +---
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 +++--
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 3 +
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 73 +++++++++++++++++++++-
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 3 +-
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 7 ++-
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 7 ++-
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 7 ++-
>>>>>>>>> 8 files changed, 104 insertions(+), 24 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>> index 4264abc5604d..99efd8317547 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>> @@ -109,6 +109,7 @@
>>>>>>>>> #include "amdgpu_fdinfo.h"
>>>>>>>>> #include "amdgpu_mca.h"
>>>>>>>>> #include "amdgpu_ras.h"
>>>>>>>>> +#include "amdgpu_reset.h"
>>>>>>>>> #define MAX_GPU_INSTANCE 16
>>>>>>>>> @@ -509,16 +510,6 @@ struct amdgpu_allowed_register_entry {
>>>>>>>>> bool grbm_indexed;
>>>>>>>>> };
>>>>>>>>> -enum amd_reset_method {
>>>>>>>>> - AMD_RESET_METHOD_NONE = -1,
>>>>>>>>> - AMD_RESET_METHOD_LEGACY = 0,
>>>>>>>>> - AMD_RESET_METHOD_MODE0,
>>>>>>>>> - AMD_RESET_METHOD_MODE1,
>>>>>>>>> - AMD_RESET_METHOD_MODE2,
>>>>>>>>> - AMD_RESET_METHOD_BACO,
>>>>>>>>> - AMD_RESET_METHOD_PCI,
>>>>>>>>> -};
>>>>>>>>> -
>>>>>>>>> struct amdgpu_video_codec_info {
>>>>>>>>> u32 codec_type;
>>>>>>>>> u32 max_width;
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> index e582f1044c0f..7fa82269c30f 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>> @@ -5201,6 +5201,12 @@ int amdgpu_device_gpu_recover_imp(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>> }
>>>>>>>>> tmp_vram_lost_counter =
>>>>>>>>> atomic_read(&((adev)->vram_lost_counter));
>>>>>>>>> +
>>>>>>>>> + /* Drop all pending resets since we will reset now anyway */
>>>>>>>>> + tmp_adev = list_first_entry(device_list_handle, struct
>>>>>>>>> amdgpu_device,
>>>>>>>>> + reset_list);
>>>>>>>>> + amdgpu_reset_pending_list(tmp_adev->reset_domain);
>>>>>>>>> +
>>>>>>>>> /* Actual ASIC resets if needed.*/
>>>>>>>>> /* Host driver will handle XGMI hive reset for SRIOV */
>>>>>>>>> if (amdgpu_sriov_vf(adev)) {
>>>>>>>>> @@ -5296,7 +5302,7 @@ int amdgpu_device_gpu_recover_imp(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>> }
>>>>>>>>> struct amdgpu_recover_work_struct {
>>>>>>>>> - struct work_struct base;
>>>>>>>>> + struct amdgpu_reset_work_struct base;
>>>>>>>>> struct amdgpu_device *adev;
>>>>>>>>> struct amdgpu_job *job;
>>>>>>>>> int ret;
>>>>>>>>> @@ -5304,7 +5310,7 @@ struct amdgpu_recover_work_struct {
>>>>>>>>> static void amdgpu_device_queue_gpu_recover_work(struct
>>>>>>>>> work_struct *work)
>>>>>>>>> {
>>>>>>>>> - struct amdgpu_recover_work_struct *recover_work =
>>>>>>>>> container_of(work, struct amdgpu_recover_work_struct, base);
>>>>>>>>> + struct amdgpu_recover_work_struct *recover_work =
>>>>>>>>> container_of(work, struct amdgpu_recover_work_struct,
>>>>>>>>> base.base.work);
>>>>>>>>> recover_work->ret =
>>>>>>>>> amdgpu_device_gpu_recover_imp(recover_work->adev,
>>>>>>>>> recover_work->job);
>>>>>>>>> }
>>>>>>>>> @@ -5316,12 +5322,15 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>> {
>>>>>>>>> struct amdgpu_recover_work_struct work = {.adev = adev,
>>>>>>>>> .job = job};
>>>>>>>>> - INIT_WORK(&work.base,
>>>>>>>>> amdgpu_device_queue_gpu_recover_work);
>>>>>>>>> + INIT_DELAYED_WORK(&work.base.base,
>>>>>>>>> amdgpu_device_queue_gpu_recover_work);
>>>>>>>>> + INIT_LIST_HEAD(&work.base.node);
>>>>>>>>> if (!amdgpu_reset_domain_schedule(adev->reset_domain,
>>>>>>>>> &work.base))
>>>>>>>>> return -EAGAIN;
>>>>>>>>> - flush_work(&work.base);
>>>>>>>>> + flush_delayed_work(&work.base.base);
>>>>>>>>> +
>>>>>>>>> + amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>> &work.base);
>>>>>>>>> return work.ret;
>>>>>>>>> }
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>> index c80af0889773..ffddd419c351 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>> @@ -134,6 +134,9 @@ struct amdgpu_reset_domain
>>>>>>>>> *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
>>>>>>>>> atomic_set(&reset_domain->in_gpu_reset, 0);
>>>>>>>>> init_rwsem(&reset_domain->sem);
>>>>>>>>> + INIT_LIST_HEAD(&reset_domain->pending_works);
>>>>>>>>> + mutex_init(&reset_domain->reset_lock);
>>>>>>>>> +
>>>>>>>>> return reset_domain;
>>>>>>>>> }
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>> index 1949dbe28a86..863ec5720fc1 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>> @@ -24,7 +24,18 @@
>>>>>>>>> #ifndef __AMDGPU_RESET_H__
>>>>>>>>> #define __AMDGPU_RESET_H__
>>>>>>>>> -#include "amdgpu.h"
>>>>>>>>> +
>>>>>>>>> +#include <linux/atomic.h>
>>>>>>>>> +#include <linux/mutex.h>
>>>>>>>>> +#include <linux/list.h>
>>>>>>>>> +#include <linux/kref.h>
>>>>>>>>> +#include <linux/rwsem.h>
>>>>>>>>> +#include <linux/workqueue.h>
>>>>>>>>> +
>>>>>>>>> +struct amdgpu_device;
>>>>>>>>> +struct amdgpu_job;
>>>>>>>>> +struct amdgpu_hive_info;
>>>>>>>>> +
>>>>>>>>> enum AMDGPU_RESET_FLAGS {
>>>>>>>>> @@ -32,6 +43,17 @@ enum AMDGPU_RESET_FLAGS {
>>>>>>>>> AMDGPU_SKIP_HW_RESET = 1,
>>>>>>>>> };
>>>>>>>>> +
>>>>>>>>> +enum amd_reset_method {
>>>>>>>>> + AMD_RESET_METHOD_NONE = -1,
>>>>>>>>> + AMD_RESET_METHOD_LEGACY = 0,
>>>>>>>>> + AMD_RESET_METHOD_MODE0,
>>>>>>>>> + AMD_RESET_METHOD_MODE1,
>>>>>>>>> + AMD_RESET_METHOD_MODE2,
>>>>>>>>> + AMD_RESET_METHOD_BACO,
>>>>>>>>> + AMD_RESET_METHOD_PCI,
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> struct amdgpu_reset_context {
>>>>>>>>> enum amd_reset_method method;
>>>>>>>>> struct amdgpu_device *reset_req_dev;
>>>>>>>>> @@ -40,6 +62,8 @@ struct amdgpu_reset_context {
>>>>>>>>> unsigned long flags;
>>>>>>>>> };
>>>>>>>>> +struct amdgpu_reset_control;
>>>>>>>>> +
>>>>>>>>> struct amdgpu_reset_handler {
>>>>>>>>> enum amd_reset_method reset_method;
>>>>>>>>> struct list_head handler_list;
>>>>>>>>> @@ -76,12 +100,21 @@ enum amdgpu_reset_domain_type {
>>>>>>>>> XGMI_HIVE
>>>>>>>>> };
>>>>>>>>> +
>>>>>>>>> +struct amdgpu_reset_work_struct {
>>>>>>>>> + struct delayed_work base;
>>>>>>>>> + struct list_head node;
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> struct amdgpu_reset_domain {
>>>>>>>>> struct kref refcount;
>>>>>>>>> struct workqueue_struct *wq;
>>>>>>>>> enum amdgpu_reset_domain_type type;
>>>>>>>>> struct rw_semaphore sem;
>>>>>>>>> atomic_t in_gpu_reset;
>>>>>>>>> +
>>>>>>>>> + struct list_head pending_works;
>>>>>>>>> + struct mutex reset_lock;
>>>>>>>>> };
>>>>>>>>> @@ -113,9 +146,43 @@ static inline void
>>>>>>>>> amdgpu_reset_put_reset_domain(struct amdgpu_reset_domain *dom
>>>>>>>>> }
>>>>>>>>> static inline bool amdgpu_reset_domain_schedule(struct
>>>>>>>>> amdgpu_reset_domain *domain,
>>>>>>>>> - struct work_struct *work)
>>>>>>>>> + struct amdgpu_reset_work_struct *work)
>>>>>>>>> {
>>>>>>>>> - return queue_work(domain->wq, work);
>>>>>>>>> + mutex_lock(&domain->reset_lock);
>>>>>>>>> +
>>>>>>>>> + if (!queue_delayed_work(domain->wq, &work->base, 0)) {
>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>> + return false;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + list_add_tail(&work->node, &domain->pending_works);
>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>> +
>>>>>>>>> + return true;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static inline void
>>>>>>>>> amdgpu_reset_domain_del_pendning_work(struct
>>>>>>>>> amdgpu_reset_domain *domain,
>>>>>>>>> + struct amdgpu_reset_work_struct *work)
>>>>>>>>> +{
>>>>>>>>> + mutex_lock(&domain->reset_lock);
>>>>>>>>> + list_del_init(&work->node);
>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static inline void amdgpu_reset_pending_list(struct
>>>>>>>>> amdgpu_reset_domain *domain)
>>>>>>>>> +{
>>>>>>>>> + struct amdgpu_reset_work_struct *entry, *tmp;
>>>>>>>>> +
>>>>>>>>> + mutex_lock(&domain->reset_lock);
>>>>>>>>> + list_for_each_entry_safe(entry, tmp,
>>>>>>>>> &domain->pending_works, node) {
>>>>>>>>> +
>>>>>>>>> + list_del_init(&entry->node);
>>>>>>>>> +
>>>>>>>>> + /* Stop any other related pending resets */
>>>>>>>>> + cancel_delayed_work(&entry->base);
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>> }
>>>>>>>>> void amdgpu_device_lock_reset_domain(struct
>>>>>>>>> amdgpu_reset_domain *reset_domain);
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>> index 239f232f9c02..574e870d3064 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>> @@ -25,6 +25,7 @@
>>>>>>>>> #define AMDGPU_VIRT_H
>>>>>>>>> #include "amdgv_sriovmsg.h"
>>>>>>>>> +#include "amdgpu_reset.h"
>>>>>>>>> #define AMDGPU_SRIOV_CAPS_SRIOV_VBIOS (1 << 0) /* vBIOS is
>>>>>>>>> sr-iov ready */
>>>>>>>>> #define AMDGPU_SRIOV_CAPS_ENABLE_IOV (1 << 1) /* sr-iov is
>>>>>>>>> enabled on this GPU */
>>>>>>>>> @@ -230,7 +231,7 @@ struct amdgpu_virt {
>>>>>>>>> uint32_t reg_val_offs;
>>>>>>>>> struct amdgpu_irq_src ack_irq;
>>>>>>>>> struct amdgpu_irq_src rcv_irq;
>>>>>>>>> - struct work_struct flr_work;
>>>>>>>>> + struct amdgpu_reset_work_struct flr_work;
>>>>>>>>> struct amdgpu_mm_table mm_table;
>>>>>>>>> const struct amdgpu_virt_ops *ops;
>>>>>>>>> struct amdgpu_vf_error_buffer vf_errors;
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index b81acf59870c..f3d1c2be9292 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -251,7 +251,7 @@ static int
>>>>>>>>> xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
>>>>>>>>> static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>> {
>>>>>>>>> - struct amdgpu_virt *virt = container_of(work, struct
>>>>>>>>> amdgpu_virt, flr_work);
>>>>>>>>> + struct amdgpu_virt *virt = container_of(work, struct
>>>>>>>>> amdgpu_virt, flr_work.base.work);
>>>>>>>>> struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>> int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>> @@ -380,7 +380,8 @@ int xgpu_ai_mailbox_get_irq(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> return r;
>>>>>>>>> }
>>>>>>>>> - INIT_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
>>>>>>>>> + INIT_DELAYED_WORK(&adev->virt.flr_work.base,
>>>>>>>>> xgpu_ai_mailbox_flr_work);
>>>>>>>>> + INIT_LIST_HEAD(&adev->virt.flr_work.node);
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>> @@ -389,6 +390,8 @@ void xgpu_ai_mailbox_put_irq(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> {
>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
>>>>>>>>> +
>>>>>>>>> + amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>> &adev->virt.flr_work);
>>>>>>>>> }
>>>>>>>>> static int xgpu_ai_request_init_data(struct amdgpu_device
>>>>>>>>> *adev)
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index 22c10b97ea81..927b3d5bb1d0 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -275,7 +275,7 @@ static int
>>>>>>>>> xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
>>>>>>>>> static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>> {
>>>>>>>>> - struct amdgpu_virt *virt = container_of(work, struct
>>>>>>>>> amdgpu_virt, flr_work);
>>>>>>>>> + struct amdgpu_virt *virt = container_of(work, struct
>>>>>>>>> amdgpu_virt, flr_work.base.work);
>>>>>>>>> struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>> int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>> @@ -407,7 +407,8 @@ int xgpu_nv_mailbox_get_irq(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> return r;
>>>>>>>>> }
>>>>>>>>> - INIT_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
>>>>>>>>> + INIT_DELAYED_WORK(&adev->virt.flr_work.base,
>>>>>>>>> xgpu_nv_mailbox_flr_work);
>>>>>>>>> + INIT_LIST_HEAD(&adev->virt.flr_work.node);
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>> @@ -416,6 +417,8 @@ void xgpu_nv_mailbox_put_irq(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> {
>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
>>>>>>>>> +
>>>>>>>>> + amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>> &adev->virt.flr_work);
>>>>>>>>> }
>>>>>>>>> const struct amdgpu_virt_ops xgpu_nv_virt_ops = {
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>> index 7b63d30b9b79..1d4ef5c70730 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>> @@ -512,7 +512,7 @@ static int
>>>>>>>>> xgpu_vi_set_mailbox_ack_irq(struct amdgpu_device *adev,
>>>>>>>>> static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>>>>>>>>> {
>>>>>>>>> - struct amdgpu_virt *virt = container_of(work, struct
>>>>>>>>> amdgpu_virt, flr_work);
>>>>>>>>> + struct amdgpu_virt *virt = container_of(work, struct
>>>>>>>>> amdgpu_virt, flr_work.base.work);
>>>>>>>>> struct amdgpu_device *adev = container_of(virt, struct
>>>>>>>>> amdgpu_device, virt);
>>>>>>>>> /* wait until RCV_MSG become 3 */
>>>>>>>>> @@ -610,7 +610,8 @@ int xgpu_vi_mailbox_get_irq(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> return r;
>>>>>>>>> }
>>>>>>>>> - INIT_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
>>>>>>>>> + INIT_DELAYED_WORK(&adev->virt.flr_work.base,
>>>>>>>>> xgpu_vi_mailbox_flr_work);
>>>>>>>>> + INIT_LIST_HEAD(&adev->virt.flr_work.node);
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>> @@ -619,6 +620,8 @@ void xgpu_vi_mailbox_put_irq(struct
>>>>>>>>> amdgpu_device *adev)
>>>>>>>>> {
>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
>>>>>>>>> +
>>>>>>>>> + amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>> &adev->virt.flr_work);
>>>>>>>>> }
>>>>>>>>> const struct amdgpu_virt_ops xgpu_vi_virt_ops = {
>>>>>>>>
>>>>>>
>>>>