[PATCH v3 07/12] drm/sched: Prevent any job recoveries after device is unplugged.

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Tue Nov 24 17:17:17 UTC 2020


On 11/24/20 12:11 PM, Luben Tuikov wrote:
> On 2020-11-24 2:50 a.m., Christian König wrote:
>> Am 24.11.20 um 02:12 schrieb Luben Tuikov:
>>> On 2020-11-23 3:06 a.m., Christian König wrote:
>>>> Am 23.11.20 um 06:37 schrieb Andrey Grodzovsky:
>>>>> On 11/22/20 6:57 AM, Christian König wrote:
>>>>>> Am 21.11.20 um 06:21 schrieb Andrey Grodzovsky:
>>>>>>> No point in trying recovery if the device is gone, it's meaningless.
>>>>>> I think that this should go into the device-specific recovery
>>>>>> function and not in the scheduler.
>>>>> The timeout timer is rearmed here, so this prevents any new recovery
>>>>> work from restarting from here
>>>>> after drm_dev_unplug was executed from amdgpu_pci_remove. It will not
>>>>> cover other places, like
>>>>> job cleanup or starting new jobs, but those should stop once the
>>>>> scheduler thread is stopped later.
>>>> Yeah, but this is rather unclean. We should probably return an error
>>>> code instead, indicating whether or not the timer should be rearmed.
>>> Christian, this is exactly the work I told you
>>> about last Wednesday in our weekly meeting, and
>>> which I wrote to you about in an email around
>>> this time last year.
>> Yeah, that's why I'm suggesting it here as well.
> It seems you're suggesting that Andrey do it, while
> you know all too well that I've been working on this
> for some time now.
>
> I wrote to you about this in an email around
> this time last year, and I discussed it
> at the Wednesday meeting.
>
> You could've mentioned that here the first time.


Luben, I actually strongly prefer that you do it and share your patch with me,
since I don't want to do unneeded refactoring which will conflict with your work.
Also, please use drm-misc for this, since it's not amdgpu-specific work and that
will be easier for me.
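
For reference, the core of this patch condenses to the following guard in
drm_sched_job_timedout(); the full diff is quoted below:

    static void drm_sched_job_timedout(struct work_struct *work)
    {
            struct drm_gpu_scheduler *sched;
            struct drm_sched_job *job;
            int idx;

            sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

            /* Bail out once the device is unplugged; because we return
             * before drm_sched_start_timeout(), the timeout timer is
             * never rearmed and no new recovery work can start. */
            if (!drm_dev_enter(sched->ddev, &idx)) {
                    DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
                             __func__, sched->name);
                    return;
            }

            /* ... the existing timeout/recovery handling ... */

            drm_dev_exit(idx);
    }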

Andrey
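
P.S. A rough sketch of how I read Christian's suggestion above: have the
handler report whether the timer should be rearmed instead of deciding that
implicitly. The status enum and helper below are hypothetical and only for
illustration; nothing like them exists in the scheduler today:

    /* Hypothetical status codes, not part of gpu_scheduler.h. */
    enum drm_sched_tdr_status {
            DRM_SCHED_TDR_NOMINAL,  /* recovery ran, rearm the timeout timer */
            DRM_SCHED_TDR_DEV_GONE, /* device is gone, do not rearm */
    };

    static void drm_sched_job_timedout(struct work_struct *work)
    {
            /* ... look up sched and job as before ... */
            enum drm_sched_tdr_status status;

            /* do_recovery() stands in for the existing recovery steps. */
            status = do_recovery(sched, job);

            /* Rearm the timer only when recovery actually completed. */
            if (status == DRM_SCHED_TDR_NOMINAL) {
                    spin_lock(&sched->job_list_lock);
                    drm_sched_start_timeout(sched);
                    spin_unlock(&sched->job_list_lock);
            }
    }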


>
>>> So what do we do now?
>> Split your patches into smaller parts and submit them chunk by chunk.
>>
>> E.g. renames first, and then functional changes grouped by the area they change.
> I have, but my final patch, a tiny one which implements
> the core reason for the change, seems buggy, and I'm looking
> for a way to debug it.
>
> Regards,
> Luben
>
>
>> Regards,
>> Christian.
>>
>>> I can submit those changes without the last part,
>>> which builds on this change.
>>>
>>> I'm still testing the last part and was hoping
>>> to submit it all in one sequence of patches,
>>> after my testing.
>>>
>>> Regards,
>>> Luben
>>>
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
>>>>>>>     drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
>>>>>>>     drivers/gpu/drm/lima/lima_sched.c         |  3 ++-
>>>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c   |  2 +-
>>>>>>>     drivers/gpu/drm/scheduler/sched_main.c    | 15 ++++++++++++++-
>>>>>>>     drivers/gpu/drm/v3d/v3d_sched.c           | 15 ++++++++++-----
>>>>>>>     include/drm/gpu_scheduler.h               |  6 +++++-
>>>>>>>     7 files changed, 35 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index d56f402..d0b0021 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -487,7 +487,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>>>>>>>          r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>                             num_hw_submission, amdgpu_job_hang_limit,
>>>>>>> -                          timeout, ring->name);
>>>>>>> +                          timeout, ring->name, &adev->ddev);
>>>>>>>          if (r) {
>>>>>>>                  DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>                            ring->name);
>>>>>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>>>>>> index cd46c88..7678287 100644
>>>>>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>>>>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>>>>>> @@ -185,7 +185,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>>>>>>>          ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
>>>>>>>                               etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>>>>>>> -                            msecs_to_jiffies(500), dev_name(gpu->dev));
>>>>>>> +                            msecs_to_jiffies(500), dev_name(gpu->dev),
>>>>>>> +                            gpu->drm);
>>>>>>>          if (ret)
>>>>>>>                  return ret;
>>>>>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>>>>>>> index dc6df9e..8a7e5d7ca 100644
>>>>>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>>>>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>>>>>> @@ -505,7 +505,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
>>>>>>>          return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
>>>>>>>                                lima_job_hang_limit, msecs_to_jiffies(timeout),
>>>>>>> -                             name);
>>>>>>> +                             name,
>>>>>>> +                             pipe->ldev->ddev);
>>>>>>>  }
>>>>>>>
>>>>>>>  void lima_sched_pipe_fini(struct lima_sched_pipe *pipe)
>>>>>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>>>>>>> index 30e7b71..37b03b01 100644
>>>>>>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>>>>>>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>>>>>>> @@ -520,7 +520,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>>>>>>>          ret = drm_sched_init(&js->queue[j].sched,
>>>>>>>                               &panfrost_sched_ops,
>>>>>>>                               1, 0, msecs_to_jiffies(500),
>>>>>>> -                            "pan_js");
>>>>>>> +                            "pan_js", pfdev->ddev);
>>>>>>>          if (ret) {
>>>>>>>                  dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
>>>>>>>                  goto err_sched;
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index c3f0bd0..95db8c6 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -53,6 +53,7 @@
>>>>>>>  #include <drm/drm_print.h>
>>>>>>>  #include <drm/gpu_scheduler.h>
>>>>>>>  #include <drm/spsc_queue.h>
>>>>>>> +#include <drm/drm_drv.h>
>>>>>>>
>>>>>>>  #define CREATE_TRACE_POINTS
>>>>>>>  #include "gpu_scheduler_trace.h"
>>>>>>> @@ -283,8 +284,16 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>          struct drm_gpu_scheduler *sched;
>>>>>>>          struct drm_sched_job *job;
>>>>>>> +        int idx;
>>>>>>> +
>>>>>>>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>
>>>>>>> +        if (!drm_dev_enter(sched->ddev, &idx)) {
>>>>>>> +                DRM_INFO("%s - device unplugged skipping recovery on scheduler:%s",
>>>>>>> +                         __func__, sched->name);
>>>>>>> +                return;
>>>>>>> +        }
>>>>>>> +
>>>>>>>          /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>>>>>>>          spin_lock(&sched->job_list_lock);
>>>>>>>          job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>>> @@ -316,6 +325,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>          spin_lock(&sched->job_list_lock);
>>>>>>>          drm_sched_start_timeout(sched);
>>>>>>>          spin_unlock(&sched->job_list_lock);
>>>>>>> +
>>>>>>> +        drm_dev_exit(idx);
>>>>>>>  }
>>>>>>>
>>>>>>>  /**
>>>>>>> @@ -845,7 +856,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>>>                     unsigned hw_submission,
>>>>>>>                     unsigned hang_limit,
>>>>>>>                     long timeout,
>>>>>>> -                  const char *name)
>>>>>>> +                  const char *name,
>>>>>>> +                  struct drm_device *ddev)
>>>>>>>  {
>>>>>>>          int i, ret;
>>>>>>>          sched->ops = ops;
>>>>>>> @@ -853,6 +865,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>>>          sched->name = name;
>>>>>>>          sched->timeout = timeout;
>>>>>>>          sched->hang_limit = hang_limit;
>>>>>>> +        sched->ddev = ddev;
>>>>>>>          for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
>>>>>>>                  drm_sched_rq_init(sched, &sched->sched_rq[i]);
>>>>>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>>> index 0747614..f5076e5 100644
>>>>>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>>> @@ -401,7 +401,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>>>>>>>                               &v3d_bin_sched_ops,
>>>>>>>                               hw_jobs_limit, job_hang_limit,
>>>>>>>                               msecs_to_jiffies(hang_limit_ms),
>>>>>>> -                            "v3d_bin");
>>>>>>> +                            "v3d_bin",
>>>>>>> +                            &v3d->drm);
>>>>>>>          if (ret) {
>>>>>>>                  dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
>>>>>>>                  return ret;
>>>>>>> @@ -411,7 +412,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>>>>>>>                               &v3d_render_sched_ops,
>>>>>>>                               hw_jobs_limit, job_hang_limit,
>>>>>>>                               msecs_to_jiffies(hang_limit_ms),
>>>>>>> -                            "v3d_render");
>>>>>>> +                            "v3d_render",
>>>>>>> +                            &v3d->drm);
>>>>>>>          if (ret) {
>>>>>>>                  dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
>>>>>>>                          ret);
>>>>>>> @@ -423,7 +425,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>>>>>>>                               &v3d_tfu_sched_ops,
>>>>>>>                               hw_jobs_limit, job_hang_limit,
>>>>>>>                               msecs_to_jiffies(hang_limit_ms),
>>>>>>> -                            "v3d_tfu");
>>>>>>> +                            "v3d_tfu",
>>>>>>> +                            &v3d->drm);
>>>>>>>          if (ret) {
>>>>>>>                  dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
>>>>>>>                          ret);
>>>>>>> @@ -436,7 +439,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>>>>>>>                                       &v3d_csd_sched_ops,
>>>>>>>                                       hw_jobs_limit, job_hang_limit,
>>>>>>>                                       msecs_to_jiffies(hang_limit_ms),
>>>>>>> -                                    "v3d_csd");
>>>>>>> +                                    "v3d_csd",
>>>>>>> +                                    &v3d->drm);
>>>>>>>                  if (ret) {
>>>>>>>                          dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
>>>>>>>                                  ret);
>>>>>>> @@ -448,7 +452,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>>>>>>>                                       &v3d_cache_clean_sched_ops,
>>>>>>>                                       hw_jobs_limit, job_hang_limit,
>>>>>>>                                       msecs_to_jiffies(hang_limit_ms),
>>>>>>> -                                    "v3d_cache_clean");
>>>>>>> +                                    "v3d_cache_clean",
>>>>>>> +                                    &v3d->drm);
>>>>>>>                  if (ret) {
>>>>>>>                          dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
>>>>>>>                                  ret);
>>>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>>>> index 9243655..a980709 100644
>>>>>>> --- a/include/drm/gpu_scheduler.h
>>>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>>>> @@ -32,6 +32,7 @@
>>>>>>>  struct drm_gpu_scheduler;
>>>>>>>  struct drm_sched_rq;
>>>>>>> +struct drm_device;
>>>>>>>
>>>>>>>  /* These are often used as an (initial) index
>>>>>>>   * to an array, and as such should start at 0.
>>>>>>> @@ -267,6 +268,7 @@ struct drm_sched_backend_ops {
>>>>>>>   * @score: score to help loadbalancer pick a idle sched
>>>>>>>   * @ready: marks if the underlying HW is ready to work
>>>>>>>   * @free_guilty: A hit to time out handler to free the guilty job.
>>>>>>> + * @ddev: Pointer to drm device of this scheduler.
>>>>>>>   *
>>>>>>>   * One scheduler is implemented for each hardware ring.
>>>>>>>   */
>>>>>>> @@ -288,12 +290,14 @@ struct drm_gpu_scheduler {
>>>>>>>          atomic_t                        score;
>>>>>>>          bool                            ready;
>>>>>>>          bool                            free_guilty;
>>>>>>> +        struct drm_device               *ddev;
>>>>>>>  };
>>>>>>>
>>>>>>>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>>>                     const struct drm_sched_backend_ops *ops,
>>>>>>>                     uint32_t hw_submission, unsigned hang_limit, long timeout,
>>>>>>> -                  const char *name);
>>>>>>> +                  const char *name,
>>>>>>> +                  struct drm_device *ddev);
>>>>>>>
>>>>>>>  void drm_sched_fini(struct drm_gpu_scheduler *sched);
>>>>>>>  int drm_sched_job_init(struct drm_sched_job *job,