[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.
Lazar, Lijo
lijo.lazar at amd.com
Wed May 11 15:46:23 UTC 2022
On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote:
>
> On 2022-05-11 11:37, Lazar, Lijo wrote:
>>
>>
>> On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote:
>>>
>>> On 2022-05-11 11:20, Lazar, Lijo wrote:
>>>>
>>>>
>>>> On 5/11/2022 7:28 PM, Christian König wrote:
>>>>> Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky:
>>>>>> On 2022-05-11 03:38, Christian König wrote:
>>>>>>> Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky:
>>>>>>>> [SNIP]
>>>>>>>>> E.g. in the reset code (either before or after the reset,
>>>>>>>>> that's debatable) you do something like this:
>>>>>>>>>
>>>>>>>>> for (i = 0; i < num_ring; ++i)
>>>>>>>>> cancel_delayed_work(ring[i]->scheduler....)
>>>>>>>>> cancel_work(adev->ras_work);
>>>>>>>>> cancel_work(adev->iofault_work);
>>>>>>>>> cancel_work(adev->debugfs_work);
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> You don't really need to track which reset source has fired and
>>>>>>>>> which hasn't, because that would be racy again. Instead just
>>>>>>>>> bluntly reset all possible sources.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not saying we care whether it fired once or twice (I
>>>>>>>> need to add a fix to only insert a reset work item into the
>>>>>>>> pending reset list if it's not already there). The point of
>>>>>>>> using a list (or array) to which you add and from which you
>>>>>>>> remove is that this logic is encapsulated within the reset
>>>>>>>> domain. Your way requires us to know exactly who schedules
>>>>>>>> reset work and to cancel each item explicitly, which also
>>>>>>>> means that for any new source added in the future you will
>>>>>>>> need to remember to add it
>>>>>>>
>>>>>>> I don't think that this is a valid argument. In addition to
>>>>>>> the schedulers we probably need fewer than a handful of reset
>>>>>>> sources; most likely even one or two is enough.
>>>>>>>
>>>>>>> The only justification I can see of having additional separate
>>>>>>> reset sources would be if somebody wants to know if a specific
>>>>>>> source has been handled or not (e.g. call flush_work() or
>>>>>>> work_pending()). Like in the case of a reset triggered through
>>>>>>> debugfs.
>>>>>>
>>>>>>
>>>>>> This is indeed one reason. Another, as we said before, is
>>>>>> that if you share a 'reset source' (meaning a delayed work
>>>>>> item) with another client (i.e. RAS and KFD), you assume the
>>>>>> other client always proceeds with the reset exactly the way
>>>>>> you expect. Today we have this only for scheduler vs.
>>>>>> non-scheduler resets: non-scheduler reset clients assume the
>>>>>> reset is always fully executed in HW, while scheduler-based
>>>>>> reset takes shortcuts and does not always do a HW reset, hence
>>>>>> they cannot share a 'reset source' (delayed work). Yes, we can
>>>>>> always add this in the future if and when such a problem
>>>>>> arises, but no one will remember it then, and a new bug will
>>>>>> be introduced that will take time to find and resolve.
>>>>>
>>>>> Mhm, so your main concern is that we forget to correctly handle the
>>>>> new reset sources?
>>>>>
>>>>> How about we do it like this then:
>>>>>
>>>>> struct amdgpu_reset_domain {
>>>>> ....
>>>>> union {
>>>>> struct {
>>>>> struct work_item debugfs;
>>>>> struct work_item ras;
>>>>> ....
>>>>> };
>>>>> struct work_item array[];
>>>>> } reset_sources;
>>>>> }
>>>>>
>>>>
>>>> If it's only about a static array,
>>>>
>>>> enum amdgpu_reset_source {
>>>>
>>>> AMDGPU_RESET_SRC_RAS,
>>>> AMDGPU_RESET_SRC_ABC,
>>>> .....
>>>> AMDGPU_RESET_SRC_XYZ,
>>>> AMDGPU_RESET_SRC_MAX
>>>>
>>>> };
>>>>
>>>> struct work_struct reset_work[AMDGPU_RESET_SRC_MAX]; => An index for
>>>> each work item
>>>>
>>>>
>>>> Thanks,
>>>> Lijo
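
The indexed-source idea above can be modeled in plain userspace C
(all names here are illustrative; a bool stands in for the kernel's
work_struct, so this sketches only the bookkeeping, not real
workqueue semantics):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reset sources; one array slot per source lets the
 * reset domain cancel everything generically, without knowing who
 * scheduled what. */
enum amdgpu_reset_source {
	AMDGPU_RESET_SRC_RAS,
	AMDGPU_RESET_SRC_DEBUGFS,
	AMDGPU_RESET_SRC_IOFAULT,
	AMDGPU_RESET_SRC_MAX
};

/* Stand-in for struct work_struct: only tracks pending state. */
struct fake_work {
	bool pending;
};

static struct fake_work reset_work[AMDGPU_RESET_SRC_MAX];

static void schedule_reset(enum amdgpu_reset_source src)
{
	reset_work[src].pending = true;
}

/* The reset path walks every slot; new sources added to the enum
 * are picked up automatically. Returns how many were canceled. */
static int cancel_all_pending(void)
{
	int canceled = 0;

	for (int i = 0; i < AMDGPU_RESET_SRC_MAX; ++i) {
		if (reset_work[i].pending) {
			reset_work[i].pending = false;
			++canceled;
		}
	}
	return canceled;
}
```

The point of the sketch is that the cancel loop never names an
individual source, which is what addresses the "remember to add it
to the cancellation list" concern raised earlier in the thread.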
>>>
>>>
>>> It's possible, though it makes it harder to generalize
>>> reset_domain later for other drivers.
>>
>> The current reset domain queue design is not good for a
>> hierarchical reset within amdgpu itself :)
>>
>> Thanks,
>> Lijo
>
>
> Not sure what you mean?
>
It's tied to the TDR queue in the scheduler.
Hierarchical model - start with a reset of the lowest-level nodes
and, on failure, try a higher-level reset. This model doesn't suit
that.
Thanks,
Lijo
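
The hierarchical escalation described above could be sketched in
userspace C as follows (levels, names, and return conventions are
all made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative reset levels, least disruptive first. */
enum reset_level {
	RESET_SOFT,	/* per-IP soft reset */
	RESET_MODE2,	/* partial ASIC reset */
	RESET_MODE1,	/* full ASIC reset */
	RESET_LEVEL_MAX
};

typedef int (*reset_fn)(void);	/* returns 0 on success */

/* Sample handlers for the sketch. */
static int fail_reset(void) { return -1; }
static int ok_reset(void)   { return 0; }

/* Try the lowest level first and escalate on failure; returns the
 * level that succeeded, or -1 if every level failed. */
static int do_hierarchical_reset(const reset_fn handlers[RESET_LEVEL_MAX])
{
	for (int lvl = 0; lvl < RESET_LEVEL_MAX; ++lvl) {
		if (handlers[lvl] && handlers[lvl]() == 0)
			return lvl;
	}
	return -1;
}
```

A model like this wants one entry point per level rather than a
single TDR-style queue entry, which is the mismatch being pointed
out here.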
> Andrey
>
>
>>
>>> But still one caveat: look at amdgpu_recover_work_struct and its
>>> usage in amdgpu_device_gpu_recover and in gpu_recover_get.
>>> At least for debugfs I need to return the result of the GPU
>>> reset, so I cannot store actual work items in the array
>>> mentioned above but rather pointers to work items, because I
>>> need a way to get back the return value from gpu_recover like I
>>> do now in amdgpu_device_gpu_recover.
>>>
>>> Andrey
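
The return-value concern can be illustrated in userspace C:
embedding the work item in a larger struct and using container_of
lets the worker hand a result back to the caller that flushes it
(fake_work and the chosen result value are stand-ins, not kernel
types):

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct work_struct. */
struct fake_work {
	int unused;
};

/* Mirrors the shape of amdgpu_recover_work_struct: an embedded
 * work item plus a result slot owned by the caller. */
struct recover_work {
	struct fake_work base;
	int ret;
};

/* The worker only receives the embedded work pointer, yet can
 * still reach the result slot through container_of. */
static void recover_worker(struct fake_work *work)
{
	struct recover_work *rw =
		container_of(work, struct recover_work, base);

	rw->ret = 0;	/* stand-in for the actual reset result */
}
```

After (the equivalent of) flush_work() the caller reads work.ret,
which is why an array of per-source items would have to hold
pointers to caller-owned structs rather than the items themselves.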
>>>
>>>
>>>>
>>>>> Not 100% sure if that works, but something like that should do the
>>>>> trick.
>>>>>
>>>>> My main concern is that I don't want to allocate the work items on
>>>>> the stack and dynamic allocation (e.g. kmalloc) is usually not
>>>>> possible either.
>>>>>
>>>>> In addition, putting/removing work items from a list, array
>>>>> or other container is a very common source of race conditions.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>> to the cancellation list you showed above. In the current
>>>>>>>> approach all of this is done automatically within the
>>>>>>>> reset_domain code, which is agnostic to the specific driver
>>>>>>>> and its specific list of reset sources. Also, if we want to
>>>>>>>> generalize reset_domain to other GPU drivers (which was the
>>>>>>>> plan as far as I remember), explicitly naming each reset
>>>>>>>> work item for cancellation is again less suitable in my
>>>>>>>> opinion.
>>>>>>>
>>>>>>> Well, we could put the work item for the scheduler-independent
>>>>>>> reset source into the reset domain as well. But I'm not sure
>>>>>>> those additional reset sources should be part of any common
>>>>>>> handling; that is largely amdgpu-specific.
>>>>>>
>>>>>>
>>>>>> So it's for sure more than one source, for the reasons
>>>>>> described above. Also note that for the scheduler we already
>>>>>> cancel the delayed work in drm_sched_stop, so canceling it
>>>>>> again in amdgpu code is kind of superfluous.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Andrey
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The only difference is I chose to do the canceling right
>>>>>>>>>>>> BEFORE the HW reset and not AFTER. I did this because I
>>>>>>>>>>>> see a possible race where a new reset request is
>>>>>>>>>>>> generated exactly after we finished the HW reset but
>>>>>>>>>>>> before we canceled all pending resets - in such a case
>>>>>>>>>>>> you would not want to cancel this 'borderline new' reset
>>>>>>>>>>>> request.
>>>>>>>>>>>
>>>>>>>>>>> Why not? Any new reset request directly after a hardware
>>>>>>>>>>> reset is most likely just falsely generated by the reset itself.
>>>>>>>>>>>
>>>>>>>>>>> Ideally I would cancel all sources after the reset, but
>>>>>>>>>>> before starting any new work.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can see that if many different reset sources
>>>>>>>>>>>>>>> share the same work struct, the first to obtain the
>>>>>>>>>>>>>>> lock you describe below might opt out of a full HW
>>>>>>>>>>>>>>> reset, e.g. because its bad job signaled or because
>>>>>>>>>>>>>>> its hung IP block was able to recover through a SW
>>>>>>>>>>>>>>> reset, while in the meantime another reset source
>>>>>>>>>>>>>>> that needed an actual HW reset just silently
>>>>>>>>>>>>>>> returned, and we end up with an unhandled reset
>>>>>>>>>>>>>>> request. True, today this happens only to job-timeout
>>>>>>>>>>>>>>> reset sources, which are handled from within the
>>>>>>>>>>>>>>> scheduler and won't use this single work struct, but
>>>>>>>>>>>>>>> nothing prevents a future case of this happening.
>>>>>>>>>>>>>>> Also, if we actually want to unify the scheduler
>>>>>>>>>>>>>>> timeout handlers within the reset domain (which seems
>>>>>>>>>>>>>>> to me the right design approach), we won't be able to
>>>>>>>>>>>>>>> use just one work struct for this reason anyway.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to add to this point - a reset domain is a
>>>>>>>>>>>>>> co-operative domain. In addition to sharing a set of
>>>>>>>>>>>>>> clients from various reset sources for one device, it
>>>>>>>>>>>>>> will also have a set of devices, as in an XGMI hive.
>>>>>>>>>>>>>> A job timeout on one device may not eventually result
>>>>>>>>>>>>>> in a reset, but a RAS error happening on another
>>>>>>>>>>>>>> device at the same time would need one. The second
>>>>>>>>>>>>>> device's RAS error cannot return on seeing that a
>>>>>>>>>>>>>> reset work already started, nor ignore the reset work
>>>>>>>>>>>>>> given that another device has filled the reset data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When there is a reset domain, it should take care of
>>>>>>>>>>>>>> the scheduled work; keeping it at the device or any
>>>>>>>>>>>>>> other level doesn't sound good.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Lijo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd put the reset work struct into the reset_domain
>>>>>>>>>>>>>>>> struct. That way you'd have exactly one worker for the
>>>>>>>>>>>>>>>> reset domain. You could implement a lock-less scheme to
>>>>>>>>>>>>>>>> decide whether you need to schedule a reset, e.g. using
>>>>>>>>>>>>>>>> an atomic counter in the shared work struct that gets
>>>>>>>>>>>>>>>> incremented when a client wants to trigger a reset
>>>>>>>>>>>>>>>> (atomic_add_return). If that counter is exactly 1 after
>>>>>>>>>>>>>>>> incrementing, you need to fill in the rest of the work
>>>>>>>>>>>>>>>> struct and schedule the work. Otherwise, it's already
>>>>>>>>>>>>>>>> scheduled (or another client is in the process of
>>>>>>>>>>>>>>>> scheduling it) and you just return. When the worker
>>>>>>>>>>>>>>>> finishes (after confirming a successful reset), it
>>>>>>>>>>>>>>>> resets the counter to 0, so the next client requesting a
>>>>>>>>>>>>>>>> reset will schedule the worker again.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Felix
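
The counter scheme Felix describes can be modeled with C11 atomics
(atomic_fetch_add returns the old value, so `old == 0` is the
userspace equivalent of the kernel's atomic_add_return(...) == 1;
this is a sketch, not driver code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int reset_requests;

/* Returns true only for the caller that must fill in the work
 * struct and schedule the shared reset worker; every other
 * concurrent caller just returns. */
static bool request_reset(void)
{
	return atomic_fetch_add(&reset_requests, 1) == 0;
}

/* Called by the worker after a confirmed successful reset, so the
 * next client requesting a reset schedules the worker again. */
static void reset_worker_done(void)
{
	atomic_store(&reset_requests, 0);
}
```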
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In addition, keep in mind that you can't allocate
>>>>>>>>>>>>>>>>>> any memory before or during the GPU reset, nor
>>>>>>>>>>>>>>>>>> wait for the reset to complete (so you can't
>>>>>>>>>>>>>>>>>> allocate anything on the stack either).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is no dynamic allocation here. Regarding
>>>>>>>>>>>>>>>>> stack allocations - we do them all the time when we
>>>>>>>>>>>>>>>>> call functions, even during GPU resets. How is the
>>>>>>>>>>>>>>>>> on-stack allocation of the work struct in
>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover different from any other
>>>>>>>>>>>>>>>>> local variable we allocate in any function we call?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also not sure why it's not allowed to wait for
>>>>>>>>>>>>>>>>> the reset to complete. Also, see
>>>>>>>>>>>>>>>>> amdgpu_ras_do_recovery and gpu_recover_get
>>>>>>>>>>>>>>>>> (debugfs) - the caller expects the reset to
>>>>>>>>>>>>>>>>> complete before returning. I can probably work
>>>>>>>>>>>>>>>>> around it in the RAS code by calling
>>>>>>>>>>>>>>>>> atomic_set(&ras->in_recovery, 0) from some callback
>>>>>>>>>>>>>>>>> within the actual reset function, but sysfs
>>>>>>>>>>>>>>>>> actually expects a returned result indicating
>>>>>>>>>>>>>>>>> whether the call was successful or not.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I don't think the concept you're trying here will work.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, in general it seems to me a cleaner
>>>>>>>>>>>>>>>>>>> approach where this logic (the work items) is
>>>>>>>>>>>>>>>>>>> held and handled in the reset_domain and not
>>>>>>>>>>>>>>>>>>> split across each adev or any other entity. In
>>>>>>>>>>>>>>>>>>> the future we might even want to move the
>>>>>>>>>>>>>>>>>>> scheduler handling into the reset domain, since
>>>>>>>>>>>>>>>>>>> the reset domain is supposed to be a generic
>>>>>>>>>>>>>>>>>>> thing and not AMD-only.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> Christian.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky
>>>>>>>>>>>>>>>>>>>>>>> <andrey.grodzovsky at amd.com>
>>>>>>>>>>>>>>>>>>>>>>> Tested-by: Bai Zoy <Zoy.Bai at amd.com>
>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 11 +---
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17
>>>>>>>>>>>>>>>>>>>>>>> +++--
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 3 +
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 73
>>>>>>>>>>>>>>>>>>>>>>> +++++++++++++++++++++-
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 3 +-
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 7 ++-
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 7 ++-
>>>>>>>>>>>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 7 ++-
>>>>>>>>>>>>>>>>>>>>>>> 8 files changed, 104 insertions(+), 24
>>>>>>>>>>>>>>>>>>>>>>> deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>>>>>>>>>>>>>>>> index 4264abc5604d..99efd8317547 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>>>>>>>>>>>>>>>>> @@ -109,6 +109,7 @@
>>>>>>>>>>>>>>>>>>>>>>> #include "amdgpu_fdinfo.h"
>>>>>>>>>>>>>>>>>>>>>>> #include "amdgpu_mca.h"
>>>>>>>>>>>>>>>>>>>>>>> #include "amdgpu_ras.h"
>>>>>>>>>>>>>>>>>>>>>>> +#include "amdgpu_reset.h"
>>>>>>>>>>>>>>>>>>>>>>> #define MAX_GPU_INSTANCE 16
>>>>>>>>>>>>>>>>>>>>>>> @@ -509,16 +510,6 @@ struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_allowed_register_entry {
>>>>>>>>>>>>>>>>>>>>>>> bool grbm_indexed;
>>>>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>>>> -enum amd_reset_method {
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_NONE = -1,
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_LEGACY = 0,
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_MODE0,
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_MODE1,
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_MODE2,
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_BACO,
>>>>>>>>>>>>>>>>>>>>>>> - AMD_RESET_METHOD_PCI,
>>>>>>>>>>>>>>>>>>>>>>> -};
>>>>>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_video_codec_info {
>>>>>>>>>>>>>>>>>>>>>>> u32 codec_type;
>>>>>>>>>>>>>>>>>>>>>>> u32 max_width;
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> index e582f1044c0f..7fa82269c30f 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -5201,6 +5201,12 @@ int
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover_imp(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> tmp_vram_lost_counter =
>>>>>>>>>>>>>>>>>>>>>>> atomic_read(&((adev)->vram_lost_counter));
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + /* Drop all pending resets since we will
>>>>>>>>>>>>>>>>>>>>>>> reset now anyway */
>>>>>>>>>>>>>>>>>>>>>>> + tmp_adev =
>>>>>>>>>>>>>>>>>>>>>>> list_first_entry(device_list_handle, struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device,
>>>>>>>>>>>>>>>>>>>>>>> + reset_list);
>>>>>>>>>>>>>>>>>>>>>>> + amdgpu_reset_pending_list(tmp_adev->reset_domain);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> /* Actual ASIC resets if needed.*/
>>>>>>>>>>>>>>>>>>>>>>> /* Host driver will handle XGMI hive reset
>>>>>>>>>>>>>>>>>>>>>>> for SRIOV */
>>>>>>>>>>>>>>>>>>>>>>> if (amdgpu_sriov_vf(adev)) {
>>>>>>>>>>>>>>>>>>>>>>> @@ -5296,7 +5302,7 @@ int
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover_imp(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_recover_work_struct {
>>>>>>>>>>>>>>>>>>>>>>> - struct work_struct base;
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_reset_work_struct base;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_device *adev;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_job *job;
>>>>>>>>>>>>>>>>>>>>>>> int ret;
>>>>>>>>>>>>>>>>>>>>>>> @@ -5304,7 +5310,7 @@ struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_recover_work_struct {
>>>>>>>>>>>>>>>>>>>>>>> static void
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_queue_gpu_recover_work(struct
>>>>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> - struct amdgpu_recover_work_struct
>>>>>>>>>>>>>>>>>>>>>>> *recover_work = container_of(work, struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_recover_work_struct, base);
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_recover_work_struct
>>>>>>>>>>>>>>>>>>>>>>> *recover_work = container_of(work, struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_recover_work_struct, base.base.work);
>>>>>>>>>>>>>>>>>>>>>>> recover_work->ret =
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover_imp(recover_work->adev,
>>>>>>>>>>>>>>>>>>>>>>> recover_work->job);
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> @@ -5316,12 +5322,15 @@ int
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device
>>>>>>>>>>>>>>>>>>>>>>> *adev,
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_recover_work_struct work =
>>>>>>>>>>>>>>>>>>>>>>> {.adev = adev, .job = job};
>>>>>>>>>>>>>>>>>>>>>>> - INIT_WORK(&work.base,
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_queue_gpu_recover_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_DELAYED_WORK(&work.base.base,
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device_queue_gpu_recover_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_LIST_HEAD(&work.base.node);
>>>>>>>>>>>>>>>>>>>>>>> if
>>>>>>>>>>>>>>>>>>>>>>> (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> return -EAGAIN;
>>>>>>>>>>>>>>>>>>>>>>> - flush_work(&work.base);
>>>>>>>>>>>>>>>>>>>>>>> + flush_delayed_work(&work.base.base);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>>>>>>>>>>>>>>>> &work.base);
>>>>>>>>>>>>>>>>>>>>>>> return work.ret;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>>>>>>>>>>>>>>>> index c80af0889773..ffddd419c351 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -134,6 +134,9 @@ struct amdgpu_reset_domain
>>>>>>>>>>>>>>>>>>>>>>> *amdgpu_reset_create_reset_domain(enum
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_d
>>>>>>>>>>>>>>>>>>>>>>> atomic_set(&reset_domain->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>>>>>>>> init_rwsem(&reset_domain->sem);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_LIST_HEAD(&reset_domain->pending_works);
>>>>>>>>>>>>>>>>>>>>>>> + mutex_init(&reset_domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> return reset_domain;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>>>>>>>>>>>>>>>> index 1949dbe28a86..863ec5720fc1 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>>>>>>>>>>>>>>>>>>>>>> @@ -24,7 +24,18 @@
>>>>>>>>>>>>>>>>>>>>>>> #ifndef __AMDGPU_RESET_H__
>>>>>>>>>>>>>>>>>>>>>>> #define __AMDGPU_RESET_H__
>>>>>>>>>>>>>>>>>>>>>>> -#include "amdgpu.h"
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +#include <linux/atomic.h>
>>>>>>>>>>>>>>>>>>>>>>> +#include <linux/mutex.h>
>>>>>>>>>>>>>>>>>>>>>>> +#include <linux/list.h>
>>>>>>>>>>>>>>>>>>>>>>> +#include <linux/kref.h>
>>>>>>>>>>>>>>>>>>>>>>> +#include <linux/rwsem.h>
>>>>>>>>>>>>>>>>>>>>>>> +#include <linux/workqueue.h>
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +struct amdgpu_device;
>>>>>>>>>>>>>>>>>>>>>>> +struct amdgpu_job;
>>>>>>>>>>>>>>>>>>>>>>> +struct amdgpu_hive_info;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> enum AMDGPU_RESET_FLAGS {
>>>>>>>>>>>>>>>>>>>>>>> @@ -32,6 +43,17 @@ enum AMDGPU_RESET_FLAGS {
>>>>>>>>>>>>>>>>>>>>>>> AMDGPU_SKIP_HW_RESET = 1,
>>>>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +enum amd_reset_method {
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_NONE = -1,
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_LEGACY = 0,
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_MODE0,
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_MODE1,
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_MODE2,
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_BACO,
>>>>>>>>>>>>>>>>>>>>>>> + AMD_RESET_METHOD_PCI,
>>>>>>>>>>>>>>>>>>>>>>> +};
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_reset_context {
>>>>>>>>>>>>>>>>>>>>>>> enum amd_reset_method method;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_device *reset_req_dev;
>>>>>>>>>>>>>>>>>>>>>>> @@ -40,6 +62,8 @@ struct amdgpu_reset_context {
>>>>>>>>>>>>>>>>>>>>>>> unsigned long flags;
>>>>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>>>> +struct amdgpu_reset_control;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_reset_handler {
>>>>>>>>>>>>>>>>>>>>>>> enum amd_reset_method reset_method;
>>>>>>>>>>>>>>>>>>>>>>> struct list_head handler_list;
>>>>>>>>>>>>>>>>>>>>>>> @@ -76,12 +100,21 @@ enum amdgpu_reset_domain_type {
>>>>>>>>>>>>>>>>>>>>>>> XGMI_HIVE
>>>>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +struct amdgpu_reset_work_struct {
>>>>>>>>>>>>>>>>>>>>>>> + struct delayed_work base;
>>>>>>>>>>>>>>>>>>>>>>> + struct list_head node;
>>>>>>>>>>>>>>>>>>>>>>> +};
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_reset_domain {
>>>>>>>>>>>>>>>>>>>>>>> struct kref refcount;
>>>>>>>>>>>>>>>>>>>>>>> struct workqueue_struct *wq;
>>>>>>>>>>>>>>>>>>>>>>> enum amdgpu_reset_domain_type type;
>>>>>>>>>>>>>>>>>>>>>>> struct rw_semaphore sem;
>>>>>>>>>>>>>>>>>>>>>>> atomic_t in_gpu_reset;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + struct list_head pending_works;
>>>>>>>>>>>>>>>>>>>>>>> + struct mutex reset_lock;
>>>>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>>>> @@ -113,9 +146,43 @@ static inline void
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_put_reset_domain(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain *dom
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> static inline bool
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain_schedule(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain *domain,
>>>>>>>>>>>>>>>>>>>>>>> - struct work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_reset_work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> - return queue_work(domain->wq, work);
>>>>>>>>>>>>>>>>>>>>>>> + mutex_lock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + if (!queue_delayed_work(domain->wq,
>>>>>>>>>>>>>>>>>>>>>>> &work->base, 0)) {
>>>>>>>>>>>>>>>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> + return false;
>>>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + list_add_tail(&work->node,
>>>>>>>>>>>>>>>>>>>>>>> &domain->pending_works);
>>>>>>>>>>>>>>>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + return true;
>>>>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +static inline void
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain_del_pendning_work(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain *domain,
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_reset_work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>>> + mutex_lock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> + list_del_init(&work->node);
>>>>>>>>>>>>>>>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +static inline void
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_pending_list(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain *domain)
>>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_reset_work_struct *entry, *tmp;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + mutex_lock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> + list_for_each_entry_safe(entry, tmp,
>>>>>>>>>>>>>>>>>>>>>>> &domain->pending_works, node) {
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + list_del_init(&entry->node);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + /* Stop any other related pending resets */
>>>>>>>>>>>>>>>>>>>>>>> + cancel_delayed_work(&entry->base);
>>>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> + mutex_unlock(&domain->reset_lock);
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> void amdgpu_device_lock_reset_domain(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain *reset_domain);
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>>>>>>>>>>>>>>>> index 239f232f9c02..574e870d3064 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>>>>>>>>>>>>>>>>>>>>>>> @@ -25,6 +25,7 @@
>>>>>>>>>>>>>>>>>>>>>>> #define AMDGPU_VIRT_H
>>>>>>>>>>>>>>>>>>>>>>> #include "amdgv_sriovmsg.h"
>>>>>>>>>>>>>>>>>>>>>>> +#include "amdgpu_reset.h"
>>>>>>>>>>>>>>>>>>>>>>> #define AMDGPU_SRIOV_CAPS_SRIOV_VBIOS (1 <<
>>>>>>>>>>>>>>>>>>>>>>> 0) /* vBIOS is sr-iov ready */
>>>>>>>>>>>>>>>>>>>>>>> #define AMDGPU_SRIOV_CAPS_ENABLE_IOV (1 << 1)
>>>>>>>>>>>>>>>>>>>>>>> /* sr-iov is enabled on this GPU */
>>>>>>>>>>>>>>>>>>>>>>> @@ -230,7 +231,7 @@ struct amdgpu_virt {
>>>>>>>>>>>>>>>>>>>>>>> uint32_t reg_val_offs;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_irq_src ack_irq;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_irq_src rcv_irq;
>>>>>>>>>>>>>>>>>>>>>>> - struct work_struct flr_work;
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_reset_work_struct flr_work;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_mm_table mm_table;
>>>>>>>>>>>>>>>>>>>>>>> const struct amdgpu_virt_ops *ops;
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_vf_error_buffer vf_errors;
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>>>>>> index b81acf59870c..f3d1c2be9292 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -251,7 +251,7 @@ static int
>>>>>>>>>>>>>>>>>>>>>>> xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device
>>>>>>>>>>>>>>>>>>>>>>> *adev,
>>>>>>>>>>>>>>>>>>>>>>> static void xgpu_ai_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> - struct amdgpu_virt *virt =
>>>>>>>>>>>>>>>>>>>>>>> container_of(work, struct amdgpu_virt, flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_virt *virt =
>>>>>>>>>>>>>>>>>>>>>>> container_of(work, struct amdgpu_virt,
>>>>>>>>>>>>>>>>>>>>>>> flr_work.base.work);
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_device *adev =
>>>>>>>>>>>>>>>>>>>>>>> container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>>>>> int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>>>>>> @@ -380,7 +380,8 @@ int
>>>>>>>>>>>>>>>>>>>>>>> xgpu_ai_mailbox_get_irq(struct amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> return r;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> - INIT_WORK(&adev->virt.flr_work,
>>>>>>>>>>>>>>>>>>>>>>> xgpu_ai_mailbox_flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_DELAYED_WORK(&adev->virt.flr_work.base,
>>>>>>>>>>>>>>>>>>>>>>> xgpu_ai_mailbox_flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_LIST_HEAD(&adev->virt.flr_work.node);
>>>>>>>>>>>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> @@ -389,6 +390,8 @@ void
>>>>>>>>>>>>>>>>>>>>>>> xgpu_ai_mailbox_put_irq(struct amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>>>>>>>>>>>>>>>> &adev->virt.flr_work);
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> static int xgpu_ai_request_init_data(struct
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>>>>>> index 22c10b97ea81..927b3d5bb1d0 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -275,7 +275,7 @@ static int
>>>>>>>>>>>>>>>>>>>>>>> xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device
>>>>>>>>>>>>>>>>>>>>>>> *adev,
>>>>>>>>>>>>>>>>>>>>>>> static void xgpu_nv_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> - struct amdgpu_virt *virt =
>>>>>>>>>>>>>>>>>>>>>>> container_of(work, struct amdgpu_virt, flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_virt *virt =
>>>>>>>>>>>>>>>>>>>>>>> container_of(work, struct amdgpu_virt,
>>>>>>>>>>>>>>>>>>>>>>> flr_work.base.work);
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_device *adev =
>>>>>>>>>>>>>>>>>>>>>>> container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>>>>> int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>>>>>>> @@ -407,7 +407,8 @@ int
>>>>>>>>>>>>>>>>>>>>>>> xgpu_nv_mailbox_get_irq(struct amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> return r;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> - INIT_WORK(&adev->virt.flr_work,
>>>>>>>>>>>>>>>>>>>>>>> xgpu_nv_mailbox_flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_DELAYED_WORK(&adev->virt.flr_work.base,
>>>>>>>>>>>>>>>>>>>>>>> xgpu_nv_mailbox_flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_LIST_HEAD(&adev->virt.flr_work.node);
>>>>>>>>>>>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> @@ -416,6 +417,8 @@ void
>>>>>>>>>>>>>>>>>>>>>>> xgpu_nv_mailbox_put_irq(struct amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>>>>>>>>>>>>>>>> &adev->virt.flr_work);
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> const struct amdgpu_virt_ops xgpu_nv_virt_ops
>>>>>>>>>>>>>>>>>>>>>>> = {
>>>>>>>>>>>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>>>>>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>>>>>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>>>>>>>>>>>>>>>> index 7b63d30b9b79..1d4ef5c70730 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>>>>>>>>>>>>>>>>>>>>>>> @@ -512,7 +512,7 @@ static int
>>>>>>>>>>>>>>>>>>>>>>> xgpu_vi_set_mailbox_ack_irq(struct amdgpu_device
>>>>>>>>>>>>>>>>>>>>>>> *adev,
>>>>>>>>>>>>>>>>>>>>>>> static void xgpu_vi_mailbox_flr_work(struct
>>>>>>>>>>>>>>>>>>>>>>> work_struct *work)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> - struct amdgpu_virt *virt =
>>>>>>>>>>>>>>>>>>>>>>> container_of(work, struct amdgpu_virt, flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + struct amdgpu_virt *virt =
>>>>>>>>>>>>>>>>>>>>>>> container_of(work, struct amdgpu_virt,
>>>>>>>>>>>>>>>>>>>>>>> flr_work.base.work);
>>>>>>>>>>>>>>>>>>>>>>> struct amdgpu_device *adev =
>>>>>>>>>>>>>>>>>>>>>>> container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>>>>>>> /* wait until RCV_MSG become 3 */
>>>>>>>>>>>>>>>>>>>>>>> @@ -610,7 +610,8 @@ int
>>>>>>>>>>>>>>>>>>>>>>> xgpu_vi_mailbox_get_irq(struct amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> return r;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> - INIT_WORK(&adev->virt.flr_work,
>>>>>>>>>>>>>>>>>>>>>>> xgpu_vi_mailbox_flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_DELAYED_WORK(&adev->virt.flr_work.base,
>>>>>>>>>>>>>>>>>>>>>>> xgpu_vi_mailbox_flr_work);
>>>>>>>>>>>>>>>>>>>>>>> + INIT_LIST_HEAD(&adev->virt.flr_work.node);
>>>>>>>>>>>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> @@ -619,6 +620,8 @@ void
>>>>>>>>>>>>>>>>>>>>>>> xgpu_vi_mailbox_put_irq(struct amdgpu_device *adev)
>>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.ack_irq, 0);
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_irq_put(adev, &adev->virt.rcv_irq, 0);
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> amdgpu_reset_domain_del_pendning_work(adev->reset_domain,
>>>>>>>>>>>>>>>>>>>>>>> &adev->virt.flr_work);
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> const struct amdgpu_virt_ops xgpu_vi_virt_ops
>>>>>>>>>>>>>>>>>>>>>>> = {
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>