[PATCH] drm/amdgpu: guard ib scheduling while in reset

Fri Oct 25 10:32:13 UTC 2019

Am 25.10.19 um 12:08 schrieb S, Shirish:


On 10/25/2019 2:56 PM, Koenig, Christian wrote:
Am 25.10.19 um 11:22 schrieb S, Shirish:


On 10/25/2019 2:23 PM, Koenig, Christian wrote:

amdgpu_do_asic_reset starting to resume blocks

...

amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)
...

amdgpu_device_ip_resume_phase2 resumed gfx_v9_0

amdgpu_device_ip_resume_phase2 resumed sdma_v4_0

amdgpu_device_ip_resume_phase2 resumed powerplay

This is what's the root of the problem.

The scheduler should never be resumed before we are done with bringing back the hardware into an usable state.

I dont see the scheduler being resumed when the ib is scheduled, its done way after the hardware is ready in reset code path.

Below are the logs:

amdgpu 0000:03:00.0: GPU reset begin!
amdgpu_device_gpu_recover calling drm_sched_stop             <==
...
amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)
...
amdgpu_device_ip_resume_phase2 resumed sdma_v4_0
amdgpu_device_ip_resume_phase2 resumed powerplay
amdgpu_device_ip_resume_phase2 resumed dm
...
[drm] recover vram bo from shadow done
amdgpu_device_gpu_recover calling  drm_sched_start         <==
...

As mentioned in the call trace, drm_sched_main() is responsible for this job_run which seems to be called during cleanup.

Then the scheduler isn't stopped for some reason and we need to investigate why.


The drm_sched_stop() as i mentioned only parks the thread

That should be sufficient for the thread to be halted and not submit more jobs to the hardware.


and cancels work and nothing else, not sure why you think it hasn't stopped or done what it is supposed to do.

Since it works 3/5 times.

We used to have another kthread_park()/unpark() in drm_sched_entity_fini(), maybe an application is crashing while we are trying to reset the GPU?

Would be rather unlikely, especially that would be rather hard to reproduce but currently my best bet what's going wrong here.

Its sometimes triggered from drm_sched_entity_fini(), as i can see in prints but not always.

I believe application crashing while GPU resets is anticipated, depending upon workloads and state of gfx renderer when reset has happened.

Since reset is something that is not a usual/routine/regular event, such anomalies are to be expected when it happens,

so we need to have failsafe methods like this patch and may be some more based on system behavior upon reset.

Sorry but your patch is complete nonsense from the design point of view.

It is absolutely mandatory that the scheduler is stopped and the thread correctly parked for the GPU reset to work properly.

See you not only need to prevent running new jobs, but also job preparation (e.g. grabbing VMIDs) or otherwise you will immediately run into the next GPU reset after the first one finished.

Regards,
Christian.


Regards,

Shirish S

Regards,
Christian.


Regards,

Shirish S

Regards,
Christian.

Am 25.10.19 um 10:50 schrieb S, Shirish:

Here is the call trace:

Call Trace:
 dump_stack+0x4d/0x63
 amdgpu_ib_schedule+0x86/0x4b7
 ? __mod_timer+0x21e/0x244
 amdgpu_job_run+0x108/0x178
 drm_sched_main+0x253/0x2fa
 ? remove_wait_queue+0x51/0x51
 ? drm_sched_cleanup_jobs.part.12+0xda/0xda
 kthread+0x14f/0x157
 ? kthread_park+0x86/0x86
 ret_from_fork+0x22/0x40
amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)

printed via below change:

@@ -151,6 +152,10 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
        }

        if (!ring->sched.ready) {
+              dump_stack();
                dev_err(adev->dev, "couldn't schedule ib on ring <%s>\n", ring->name);
                return -EINVAL;

On 10/24/2019 10:00 PM, Christian König wrote:
Am 24.10.19 um 17:06 schrieb Grodzovsky, Andrey:


On 10/24/19 7:01 AM, Christian König wrote:
Am 24.10.19 um 12:58 schrieb S, Shirish:
[Why]
Upon GPU reset, kernel cleans up already submitted jobs
via drm_sched_cleanup_jobs.
This schedules ib's via drm_sched_main()->run_job, leading to
race condition of rings being ready or not, since during reset
rings may be suspended.

NAK, exactly that's what should not happen.

The scheduler should be suspend while a GPU reset is in progress.

So you are running into a completely different race here.

Below is the series of events when the issue occurs.

(Note that as you & Andrey mentioned the scheduler has been suspended but the job is scheduled nonetheless.)

amdgpu 0000:03:00.0: GPU reset begin!

...

amdgpu_device_gpu_recover stopping ring sdma0 via drm_sched_stop

...

amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume

amdgpu_do_asic_reset starting to resume blocks

...

amdgpu 0000:03:00.0: couldn't schedule ib on ring <sdma0>
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)
...

amdgpu_device_ip_resume_phase2 resumed gfx_v9_0

amdgpu_device_ip_resume_phase2 resumed sdma_v4_0

amdgpu_device_ip_resume_phase2 resumed powerplay

...

FWIW, since the job is always NULL when "drm_sched_stop(&ring->sched, job ? &job->base : NULL);" when called during reset, all drm_sched_stop() does

is  cancel delayed work and park the sched->thread. There is no job list to be iterated to de-activate or remove or update fences.

Based on all this analysis, adding a mutex is more failsafe and less intrusive in the current code flow and lastly seems to be logical as well, hence I devised this approach


Please sync up with Andrey how this was able to happen.

Regards,
Christian.


Shirish - Christian makes a good point - note that in amdgpu_device_gpu_recover drm_sched_stop which stop all the scheduler threads is called way before we suspend the HW in amdgpu_device_pre_asic_reset->amdgpu_device_ip_suspend where SDMA suspension is happening and where the HW ring marked as not ready - please provide call stack for when you hit [drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22) to identify the code path which tried to submit the SDMA IB

Well the most likely cause of this is that the hardware failed to resume after the reset.

Infact hardware resume has not yet started, when the job is scheduled, which is the race am trying to address with this patch.

Regards,

Shirish S

Christian.


Andrey



[How]
make GPU reset's amdgpu_device_ip_resume_phase2() &
amdgpu_ib_schedule() in amdgpu_job_run() mutually exclusive.

Signed-off-by: Shirish S <shirish.s at amd.com><mailto:shirish.s at amd.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 ++
  3 files changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index f4d9041..7b07a47b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -973,6 +973,7 @@ struct amdgpu_device {
      bool                            in_gpu_reset;
      enum pp_mp1_state               mp1_state;
      struct mutex  lock_reset;
+    struct mutex  lock_ib_sched;
      struct amdgpu_doorbell_index doorbell_index;
        int asic_reset_res;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 676cad1..63cad74 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2759,6 +2759,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
      mutex_init(&adev->virt.vf_errors.lock);
      hash_init(adev->mn_hash);
      mutex_init(&adev->lock_reset);
+    mutex_init(&adev->lock_ib_sched);
      mutex_init(&adev->virt.dpm_mutex);
      mutex_init(&adev->psp.mutex);
  @@ -3795,7 +3796,9 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
                  if (r)
                      return r;
  +                mutex_lock(&tmp_adev->lock_ib_sched);
                  r = amdgpu_device_ip_resume_phase2(tmp_adev);
+                mutex_unlock(&tmp_adev->lock_ib_sched);
                  if (r)
                      goto out;
  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e1bad99..cd6082d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -233,8 +233,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
      if (finished->error < 0) {
          DRM_INFO("Skip scheduling IBs!\n");
      } else {
+        mutex_lock(&ring->adev->lock_ib_sched);
          r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
                         &fence);
+        mutex_unlock(&ring->adev->lock_ib_sched);
          if (r)
              DRM_ERROR("Error scheduling IBs (%d)\n", r);
      }




_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191025/5df39def/attachment-0001.html>