<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Am 24.10.19 um 17:06 schrieb
      Grodzovsky, Andrey:<br>
    </div>
    <blockquote type="cite"
      cite="mid:f3be329d-d350-c821-00b7-d94858335796@amd.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <p><br>
      </p>
      <div class="moz-cite-prefix">On 10/24/19 7:01 AM, Christian König
        wrote:<br>
      </div>
      <blockquote type="cite"
        cite="mid:23ea615d-5ef4-d0b3-a0ec-6fae67b102f2@gmail.com">
        Am 24.10.19 um 12:58 schrieb S, Shirish: <br>
        <blockquote type="cite">[Why] <br>
          Upon GPU reset, kernel cleans up already submitted jobs <br>
          via drm_sched_cleanup_jobs. <br>
          This schedules ib's via drm_sched_main()->run_job, leading
          to <br>
          race condition of rings being ready or not, since during reset
          <br>
          rings may be suspended. <br>
        </blockquote>
        <br>
        NAK, exactly that's what should not happen. <br>
        <br>
        The scheduler should be suspend while a GPU reset is in
        progress. <br>
        <br>
        So you are running into a completely different race here. <br>
        <br>
        Please sync up with Andrey how this was able to happen. <br>
        <br>
        Regards, <br>
        Christian. <br>
      </blockquote>
      <p><br>
      </p>
      <p>Shirish - Christian makes a good point - note that in
        amdgpu_device_gpu_recover drm_sched_stop which stop all the
        scheduler threads is called way before we suspend the HW in
        amdgpu_device_pre_asic_reset->amdgpu_device_ip_suspend where
        SDMA suspension is happening and where the HW ring marked as not
        ready - please provide call stack for when you hit
        <span style="color: rgb(51, 51, 51); font-family: Arial,
          sans-serif; font-size: 14px; font-style: normal;
          font-variant-ligatures: normal; font-variant-caps: normal;
          font-weight: 400; letter-spacing: normal; orphans: 2;
          text-align: start; text-indent: 0px; text-transform: none;
          white-space: normal; widows: 2; word-spacing: 0px;
          -webkit-text-stroke-width: 0px; background-color: rgb(245,
          245, 245); text-decoration-style: initial;
          text-decoration-color: initial; display: inline !important;
          float: none;">
          [drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22) to
          identify the code path which tried to submit the SDMA IB<br>
        </span></p>
    </blockquote>
    <br>
    Well the most likely cause of this is that the hardware failed to
    resume after the reset.<br>
    <br>
    Christian.<br>
    <br>
    <blockquote type="cite"
      cite="mid:f3be329d-d350-c821-00b7-d94858335796@amd.com">
      <p><span style="color: rgb(51, 51, 51); font-family: Arial,
          sans-serif; font-size: 14px; font-style: normal;
          font-variant-ligatures: normal; font-variant-caps: normal;
          font-weight: 400; letter-spacing: normal; orphans: 2;
          text-align: start; text-indent: 0px; text-transform: none;
          white-space: normal; widows: 2; word-spacing: 0px;
          -webkit-text-stroke-width: 0px; background-color: rgb(245,
          245, 245); text-decoration-style: initial;
          text-decoration-color: initial; display: inline !important;
          float: none;">
        </span></p>
      <p><span style="color: rgb(51, 51, 51); font-family: Arial,
          sans-serif; font-size: 14px; font-style: normal;
          font-variant-ligatures: normal; font-variant-caps: normal;
          font-weight: 400; letter-spacing: normal; orphans: 2;
          text-align: start; text-indent: 0px; text-transform: none;
          white-space: normal; widows: 2; word-spacing: 0px;
          -webkit-text-stroke-width: 0px; background-color: rgb(245,
          245, 245); text-decoration-style: initial;
          text-decoration-color: initial; display: inline !important;
          float: none;">Andrey</span></p>
      <p><span style="color: rgb(51, 51, 51); font-family: Arial,
          sans-serif; font-size: 14px; font-style: normal;
          font-variant-ligatures: normal; font-variant-caps: normal;
          font-weight: 400; letter-spacing: normal; orphans: 2;
          text-align: start; text-indent: 0px; text-transform: none;
          white-space: normal; widows: 2; word-spacing: 0px;
          -webkit-text-stroke-width: 0px; background-color: rgb(245,
          245, 245); text-decoration-style: initial;
          text-decoration-color: initial; display: inline !important;
          float: none;"><br>
        </span></p>
      <blockquote type="cite"
        cite="mid:23ea615d-5ef4-d0b3-a0ec-6fae67b102f2@gmail.com">
        <br>
        <blockquote type="cite"><br>
          [How] <br>
          make GPU reset's amdgpu_device_ip_resume_phase2() & <br>
          amdgpu_ib_schedule() in amdgpu_job_run() mutually exclusive. <br>
          <br>
          Signed-off-by: Shirish S <a class="moz-txt-link-rfc2396E"
            href="mailto:shirish.s@amd.com" moz-do-not-send="true">
            <shirish.s@amd.com></a> <br>
          --- <br>
            drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 + <br>
            drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ <br>
            drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 ++ <br>
            3 files changed, 6 insertions(+) <br>
          <br>
          diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
          b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
          <br>
          index f4d9041..7b07a47b 100644 <br>
          --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
          +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
          @@ -973,6 +973,7 @@ struct amdgpu_device { <br>
                bool                            in_gpu_reset; <br>
                enum pp_mp1_state               mp1_state; <br>
                struct mutex  lock_reset; <br>
          +    struct mutex  lock_ib_sched; <br>
                struct amdgpu_doorbell_index doorbell_index; <br>
                  int asic_reset_res; <br>
          diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
          b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
          <br>
          index 676cad1..63cad74 100644 <br>
          --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
          +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
          @@ -2759,6 +2759,7 @@ int amdgpu_device_init(struct
          amdgpu_device *adev, <br>
                mutex_init(&adev->virt.vf_errors.lock); <br>
                hash_init(adev->mn_hash); <br>
                mutex_init(&adev->lock_reset); <br>
          +    mutex_init(&adev->lock_ib_sched); <br>
                mutex_init(&adev->virt.dpm_mutex); <br>
                mutex_init(&adev->psp.mutex); <br>
            @@ -3795,7 +3796,9 @@ static int amdgpu_do_asic_reset(struct
          amdgpu_hive_info *hive,
          <br>
                            if (r) <br>
                                return r; <br>
            +               
          mutex_lock(&tmp_adev->lock_ib_sched); <br>
                            r =
          amdgpu_device_ip_resume_phase2(tmp_adev); <br>
          +               
          mutex_unlock(&tmp_adev->lock_ib_sched); <br>
                            if (r) <br>
                                goto out; <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
          b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
          <br>
          index e1bad99..cd6082d 100644 <br>
          --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c <br>
          +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c <br>
          @@ -233,8 +233,10 @@ static struct dma_fence
          *amdgpu_job_run(struct drm_sched_job *sched_job)
          <br>
                if (finished->error < 0) { <br>
                    DRM_INFO("Skip scheduling IBs!\n"); <br>
                } else { <br>
          +        mutex_lock(&ring->adev->lock_ib_sched); <br>
                    r = amdgpu_ib_schedule(ring, job->num_ibs,
          job->ibs, job, <br>
                                   &fence); <br>
          +        mutex_unlock(&ring->adev->lock_ib_sched); <br>
                    if (r) <br>
                        DRM_ERROR("Error scheduling IBs (%d)\n", r); <br>
                } <br>
        </blockquote>
        <br>
      </blockquote>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <pre class="moz-quote-pre" wrap="">_______________________________________________
amd-gfx mailing list
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>
<a class="moz-txt-link-freetext" href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></pre>
    </blockquote>
    <br>
  </body>
</html>