[PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty job already signaled.

Christian König ckoenig.leichtzumerken at gmail.com
Mon Apr 29 19:03:52 UTC 2019


I would clean them up further, but that's only moving code around so 
feel free to add my rb to those.

Christian.

On 29.04.19 at 16:14, Grodzovsky, Andrey wrote:
>
> Thanks David, with that only patches 5 and 6 are left for the series 
> to be reviewed.
>
> Christian, any more comments on those patches ?
>
> Andrey
>
> On 4/27/19 10:56 PM, Zhou, David(ChunMing) wrote:
>>
>> Sorry, I can only put my Acked-by: Chunming Zhou 
>> <david1.zhou at amd.com> on patch #3.
>>
>> I cannot fully judge patches #4, #5 and #6.
>>
>> -David
>>
>> *From:*amd-gfx <amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of 
>> *Grodzovsky, Andrey
>> *Sent:* Friday, April 26, 2019 10:09 PM
>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Zhou, 
>> David(ChunMing) <David1.Zhou at amd.com>; 
>> dri-devel at lists.freedesktop.org; amd-gfx at lists.freedesktop.org; 
>> eric at anholt.net; etnaviv at lists.freedesktop.org
>> *Cc:* Kazlauskas, Nicholas <Nicholas.Kazlauskas at amd.com>; Liu, Monk 
>> <Monk.Liu at amd.com>
>> *Subject:* Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if guilty 
>> job already signaled.
>>
>> Ping (mostly David and Monk).
>>
>> Andrey
>>
>> On 4/24/19 3:09 AM, Christian König wrote:
>>
>>     On 24.04.19 at 05:02, Zhou, David(ChunMing) wrote:
>>
>>         >> - drm_sched_stop(&ring->sched, &job->base);
>>         >> -
>>         >>               /* after all hw jobs are reset, hw fence is
>>         meaningless, so force_completion */
>>         >> amdgpu_fence_driver_force_completion(ring);
>>         >>       }
>>
>>         HW fences are already force-completed, so I think we can just
>>         disable IRQ fence processing and ignore HW fence signals while
>>         we are trying to do a GPU reset. Otherwise the logic will
>>         become much more complex.
>>
>>         If this situation happens because of a long-running job, we
>>         can increase the timeout for reset detection.
>>
>>
>>     You are not thinking broadly enough: forcing the HW fence to
>>     complete can trigger others to start new activity in the system.
>>
>>     We first need to stop everything and make sure that we don't do
>>     any processing any more, and only then start our reset procedure,
>>     including forcing all HW fences to complete.
>>
>>     Christian.
>>
>>
>>         -David
>>
>>         *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> *On Behalf Of* Grodzovsky, Andrey
>>         *Sent:* Wednesday, April 24, 2019 12:00 AM
>>         *To:* Zhou, David(ChunMing) <David1.Zhou at amd.com>; dri-devel at lists.freedesktop.org; amd-gfx at lists.freedesktop.org; eric at anholt.net; etnaviv at lists.freedesktop.org; ckoenig.leichtzumerken at gmail.com
>>         *Cc:* Kazlauskas, Nicholas <Nicholas.Kazlauskas at amd.com>; Liu, Monk <Monk.Liu at amd.com>
>>         *Subject:* Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if
>>         guilty job already signaled.
>>
>>         No, I mean the actual HW fence, which signals when the job
>>         finishes execution on the HW.
>>
>>         Andrey
>>
>>         On 4/23/19 11:19 AM, Zhou, David(ChunMing) wrote:
>>
>>             Do you mean the fence timer? Why not stop it as well when
>>             stopping the scheduler for the HW reset?
>>
>>             -------- Original Message --------
>>             Subject: Re: [PATCH v5 6/6] drm/amdgpu: Avoid HW reset if
>>             guilty job already signaled.
>>             From: "Grodzovsky, Andrey"
>>             To: "Zhou, David(ChunMing)"; dri-devel at lists.freedesktop.org; amd-gfx at lists.freedesktop.org; eric at anholt.net; etnaviv at lists.freedesktop.org; ckoenig.leichtzumerken at gmail.com
>>             CC: "Kazlauskas, Nicholas"; "Liu, Monk"
>>
>>
>>             On 4/22/19 9:09 AM, Zhou, David(ChunMing) wrote:
>>             > +Monk.
>>             >
>>             > GPU reset is used widely in SRIOV, so a virtualization
>>             guy needs to take a look.
>>             >
>>             > But out of curiosity, why can the guilty job still
>>             signal if it is already marked guilty? Was it set wrongly?
>>             >
>>             >
>>             > -David
>>
>>
>>             It's possible that the job completes at a later time than
>>             when its timeout handler started processing, so in this
>>             patch we try to protect against that by rechecking the HW
>>             fence after stopping all SW schedulers. We do it BEFORE
>>             marking the job's sched_entity guilty, so at the point we
>>             check, the guilty flag is not yet set.
>>
>>             Andrey
>>
>>
>>             >
>>             > On 2019/4/18 at 23:00, Andrey Grodzovsky wrote:
>>             >> Also reject TDRs if another one is already running.
>>             >>
>>             >> v2:
>>             >> Stop all schedulers across the device and the entire XGMI hive before
>>             >> force-signaling HW fences.
>>             >> Avoid passing job_signaled to helper functions to keep all the decision
>>             >> making about skipping HW reset in one place.
>>             >>
>>             >> v3:
>>             >> Fix SW sched. hang after non-HW reset. sched.hw_rq_count has to be balanced
>>             >> against its decrement in drm_sched_stop in the non-HW reset case.
>>             >> v4: rebase
>>             >> v5: Revert v3 as we do it now in scheduler code.
>>             >>
>>             >> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>             >> ---
>>             >>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 143 +++++++++++++++++++----------
>>             >>  1 file changed, 95 insertions(+), 48 deletions(-)
>>             >>
>>             >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> index a0e165c..85f8792 100644
>>             >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>             >> @@ -3334,8 +3334,6 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>             >>                 if (!ring || !ring->sched.thread)
>>             >>                         continue;
>>             >>
>>             >> -               drm_sched_stop(&ring->sched, &job->base);
>>             >> -
>>             >>                 /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>             >>                 amdgpu_fence_driver_force_completion(ring);
>>             >>         }
>>             >> @@ -3343,6 +3341,7 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>             >>         if(job)
>>             >>                 drm_sched_increase_karma(&job->base);
>>             >>
>>             >> +       /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
>>             >>         if (!amdgpu_sriov_vf(adev)) {
>>             >>
>>             >>                 if (!need_full_reset)
>>             >> @@ -3480,37 +3479,21 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
>>             >>         return r;
>>             >>  }
>>             >>
>>             >> -static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev)
>>             >> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, bool trylock)
>>             >>  {
>>             >> -       int i;
>>             >> -
>>             >> -       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>             >> -               struct amdgpu_ring *ring = adev->rings[i];
>>             >> -
>>             >> -               if (!ring || !ring->sched.thread)
>>             >> -                       continue;
>>             >> -
>>             >> -               if (!adev->asic_reset_res)
>>             >> -                       drm_sched_resubmit_jobs(&ring->sched);
>>             >> +       if (trylock) {
>>             >> +               if (!mutex_trylock(&adev->lock_reset))
>>             >> +                       return false;
>>             >> +       } else
>>             >> +               mutex_lock(&adev->lock_reset);
>>             >>
>>             >> -               drm_sched_start(&ring->sched, !adev->asic_reset_res);
>>             >> -       }
>>             >> -
>>             >> -       if (!amdgpu_device_has_dc_support(adev)) {
>>             >> -               drm_helper_resume_force_mode(adev->ddev);
>>             >> -       }
>>             >> -
>>             >> -       adev->asic_reset_res = 0;
>>             >> -}
>>             >> -
>>             >> -static void amdgpu_device_lock_adev(struct amdgpu_device *adev)
>>             >> -{
>>             >> -       mutex_lock(&adev->lock_reset);
>>             >>         atomic_inc(&adev->gpu_reset_counter);
>>             >>         adev->in_gpu_reset = 1;
>>             >>         /* Block kfd: SRIOV would do it separately */
>>             >>         if (!amdgpu_sriov_vf(adev))
>>             >>                 amdgpu_amdkfd_pre_reset(adev);
>>             >> +
>>             >> +       return true;
>>             >>  }
>>             >>
>>             >>  static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
>>             >> @@ -3538,40 +3521,42 @@ static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
>>             >>  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>             >>                               struct amdgpu_job *job)
>>             >>  {
>>             >> -       int r;
>>             >> +       struct list_head device_list, *device_list_handle =  NULL;
>>             >> +       bool need_full_reset, job_signaled;
>>             >>         struct amdgpu_hive_info *hive = NULL;
>>             >> -       bool need_full_reset = false;
>>             >>         struct amdgpu_device *tmp_adev = NULL;
>>             >> -       struct list_head device_list, *device_list_handle =  NULL;
>>             >> +       int i, r = 0;
>>             >>
>>             >> +       need_full_reset = job_signaled = false;
>>             >>         INIT_LIST_HEAD(&device_list);
>>             >>
>>             >>         dev_info(adev->dev, "GPU reset begin!\n");
>>             >>
>>             >> +       hive = amdgpu_get_xgmi_hive(adev, false);
>>             >> +
>>             >>         /*
>>             >> -        * In case of XGMI hive disallow concurrent resets to be triggered
>>             >> -        * by different nodes. No point also since the one node already executing
>>             >> -        * reset will also reset all the other nodes in the hive.
>>             >> +        * Here we trylock to avoid a chain of resets executing, triggered
>>             >> +        * either by jobs on different adevs in an XGMI hive or by jobs on
>>             >> +        * different schedulers for the same device while this TO handler is running.
>>             >> +        * We always reset all schedulers for a device and all devices for an XGMI
>>             >> +        * hive, so that should take care of them too.
>>             >>          */
>>             >> -       hive = amdgpu_get_xgmi_hive(adev, 0);
>>             >> -       if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>             >> -           !mutex_trylock(&hive->reset_lock))
>>             >> +
>>             >> +       if (hive && !mutex_trylock(&hive->reset_lock)) {
>>             >> +               DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>>             >> +                        job->base.id, hive->hive_id);
>>             >>                 return 0;
>>             >> +       }
>>             >>
>>             >>         /* Start with adev pre asic reset first for soft reset check.*/
>>             >> -       amdgpu_device_lock_adev(adev);
>>             >> -       r = amdgpu_device_pre_asic_reset(adev,
>>             >> -                                        job,
>>             >> -                                        &need_full_reset);
>>             >> -       if (r) {
>>             >> -               /*TODO Should we stop ?*/
>>             >> -               DRM_ERROR("GPU pre asic reset failed with err, %d for drm dev, %s ",
>>             >> -                         r, adev->ddev->unique);
>>             >> -               adev->asic_reset_res = r;
>>             >> +       if (!amdgpu_device_lock_adev(adev, !hive)) {
>>             >> +               DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
>>             >> +                        job->base.id);
>>             >> +               return 0;
>>             >>         }
>>             >>
>>             >>         /* Build list of devices to reset */
>>             >> -       if  (need_full_reset && adev->gmc.xgmi.num_physical_nodes > 1) {
>>             >> +       if (adev->gmc.xgmi.num_physical_nodes > 1) {
>>             >>                 if (!hive) {
>>             >>                         amdgpu_device_unlock_adev(adev);
>>             >>                         return -ENODEV;
>>             >> @@ -3588,13 +3573,56 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>             >>                 device_list_handle = &device_list;
>>             >>         }
>>             >>
>>             >> +       /* block all schedulers and reset given job's ring */
>>             >> +       list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>>             >> +               for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>             >> +                       struct amdgpu_ring *ring = tmp_adev->rings[i];
>>             >> +
>>             >> +                       if (!ring || !ring->sched.thread)
>>             >> +                               continue;
>>             >> +
>>             >> +                       drm_sched_stop(&ring->sched, &job->base);
>>             >> +               }
>>             >> +       }
>>             >> +
>>             >> +
>>             >> +       /*
>>             >> +        * Must check guilty signal here since after this point all old
>>             >> +        * HW fences are force signaled.
>>             >> +        *
>>             >> +        * job->base holds a reference to parent fence
>>             >> +        */
>>             >> +       if (job && job->base.s_fence->parent &&
>>             >> +           dma_fence_is_signaled(job->base.s_fence->parent))
>>             >> +               job_signaled = true;
>>             >> +
>>             >> +       if (!amdgpu_device_ip_need_full_reset(adev))
>>             >> +               device_list_handle = &device_list;
>>             >> +
>>             >> +       if (job_signaled) {
>>             >> +               dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>>             >> +               goto skip_hw_reset;
>>             >> +       }
>>             >> +
>>             >> +
>>             >> +       /* Guilty job will be freed after this */
>>             >> +       r = amdgpu_device_pre_asic_reset(adev,
>>             >> +                                        job,
>>             >> +                                        &need_full_reset);
>>             >> +       if (r) {
>>             >> +               /*TODO Should we stop ?*/
>>             >> +               DRM_ERROR("GPU pre asic reset failed with err, %d for drm dev, %s ",
>>             >> +                         r, adev->ddev->unique);
>>             >> +               adev->asic_reset_res = r;
>>             >> +       }
>>             >> +
>>             >>  retry:  /* Rest of adevs pre asic reset from XGMI hive. */
>>             >>         list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>>             >>
>>             >>                 if (tmp_adev == adev)
>>             >>                         continue;
>>             >>
>>             >> -               amdgpu_device_lock_adev(tmp_adev);
>>             >> +               amdgpu_device_lock_adev(tmp_adev, false);
>>             >>                 r = amdgpu_device_pre_asic_reset(tmp_adev,
>>             >>                                                  NULL,
>>             >>                                                  &need_full_reset);
>>             >> @@ -3618,9 +3646,28 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>             >>                         goto retry;
>>             >>         }
>>             >>
>>             >> +skip_hw_reset:
>>             >> +
>>             >>         /* Post ASIC reset for all devs .*/
>>             >>         list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>>             >> +               for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>             >> +                       struct amdgpu_ring *ring = tmp_adev->rings[i];
>>             >> +
>>             >> +                       if (!ring || !ring->sched.thread)
>>             >> +                               continue;
>>             >> +
>>             >> +                       /* No point to resubmit jobs if we didn't HW reset */
>>             >> +                       if (!tmp_adev->asic_reset_res && !job_signaled)
>>             >> +                               drm_sched_resubmit_jobs(&ring->sched);
>>             >> +
>>             >> +                       drm_sched_start(&ring->sched, !tmp_adev->asic_reset_res);
>>             >> +               }
>>             >> +
>>             >> +               if (!amdgpu_device_has_dc_support(tmp_adev) && !job_signaled) {
>>             >> +                       drm_helper_resume_force_mode(tmp_adev->ddev);
>>             >> +               }
>>             >> +
>>             >> +               tmp_adev->asic_reset_res = 0;
>>             >>
>>             >>                 if (r) {
>>             >>                         /* bad news, how to tell it to userspace ? */
>>             >> @@ -3633,7 +3680,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>             >>                 amdgpu_device_unlock_adev(tmp_adev);
>>             >>         }
>>             >>
>>             >> -       if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
>>             >> +       if (hive)
>>             >>                 mutex_unlock(&hive->reset_lock);
>>             >>
>>             >>         if (r)
>>             _______________________________________________
>>             amd-gfx mailing list
>>             amd-gfx at lists.freedesktop.org
>>             https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
