<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p><br>
</p>
<div class="moz-cite-prefix">On 10/24/19 7:01 AM, Christian König wrote:<br>
</div>
<blockquote type="cite" cite="mid:23ea615d-5ef4-d0b3-a0ec-6fae67b102f2@gmail.com">
Am 24.10.19 um 12:58 schrieb S, Shirish: <br>
<blockquote type="cite">[Why] <br>
Upon GPU reset, kernel cleans up already submitted jobs <br>
via drm_sched_cleanup_jobs. <br>
This schedules ib's via drm_sched_main()->run_job, leading to <br>
race condition of rings being ready or not, since during reset <br>
rings may be suspended. <br>
</blockquote>
<br>
NAK, exactly that's what should not happen. <br>
<br>
The scheduler should be suspend while a GPU reset is in progress. <br>
<br>
So you are running into a completely different race here. <br>
<br>
Please sync up with Andrey how this was able to happen. <br>
<br>
Regards, <br>
Christian. <br>
</blockquote>
<p><br>
</p>
<p>Shirish - Christian makes a good point - note that in amdgpu_device_gpu_recover drm_sched_stop which stop all the scheduler threads is called way before we suspend the HW in amdgpu_device_pre_asic_reset->amdgpu_device_ip_suspend where SDMA suspension is
happening and where the HW ring marked as not ready - please provide call stack for when you hit
<span style="color: rgb(51, 51, 51); font-family: Arial, sans-serif;
font-size: 14px; font-style: normal; font-variant-ligatures:
normal; font-variant-caps: normal; font-weight: 400;
letter-spacing: normal; orphans: 2; text-align: start;
text-indent: 0px; text-transform: none; white-space: normal;
widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(245, 245, 245); text-decoration-style:
initial; text-decoration-color: initial; display: inline
!important; float: none;">
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22) to identify the code path which tried to submit the SDMA IB<br>
</span></p>
<p><span style="color: rgb(51, 51, 51); font-family: Arial,
sans-serif; font-size: 14px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; orphans: 2;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; widows: 2; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color: rgb(245, 245,
245); text-decoration-style: initial; text-decoration-color:
initial; display: inline !important; float: none;">Andrey</span></p>
<p><span style="color: rgb(51, 51, 51); font-family: Arial,
sans-serif; font-size: 14px; font-style: normal;
font-variant-ligatures: normal; font-variant-caps: normal;
font-weight: 400; letter-spacing: normal; orphans: 2;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; widows: 2; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color: rgb(245, 245,
245); text-decoration-style: initial; text-decoration-color:
initial; display: inline !important; float: none;"><br>
</span></p>
<blockquote type="cite" cite="mid:23ea615d-5ef4-d0b3-a0ec-6fae67b102f2@gmail.com">
<br>
<blockquote type="cite"><br>
[How] <br>
make GPU reset's amdgpu_device_ip_resume_phase2() & <br>
amdgpu_ib_schedule() in amdgpu_job_run() mutually exclusive. <br>
<br>
Signed-off-by: Shirish S <a class="moz-txt-link-rfc2396E" href="mailto:shirish.s@amd.com">
<shirish.s@amd.com></a> <br>
--- <br>
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ <br>
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 ++ <br>
3 files changed, 6 insertions(+) <br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
<br>
index f4d9041..7b07a47b 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
@@ -973,6 +973,7 @@ struct amdgpu_device { <br>
bool in_gpu_reset; <br>
enum pp_mp1_state mp1_state; <br>
struct mutex lock_reset; <br>
+ struct mutex lock_ib_sched; <br>
struct amdgpu_doorbell_index doorbell_index; <br>
int asic_reset_res; <br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
<br>
index 676cad1..63cad74 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
@@ -2759,6 +2759,7 @@ int amdgpu_device_init(struct amdgpu_device *adev, <br>
mutex_init(&adev->virt.vf_errors.lock); <br>
hash_init(adev->mn_hash); <br>
mutex_init(&adev->lock_reset); <br>
+ mutex_init(&adev->lock_ib_sched); <br>
mutex_init(&adev->virt.dpm_mutex); <br>
mutex_init(&adev->psp.mutex); <br>
@@ -3795,7 +3796,9 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
<br>
if (r) <br>
return r; <br>
+ mutex_lock(&tmp_adev->lock_ib_sched); <br>
r = amdgpu_device_ip_resume_phase2(tmp_adev); <br>
+ mutex_unlock(&tmp_adev->lock_ib_sched); <br>
if (r) <br>
goto out; <br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
<br>
index e1bad99..cd6082d 100644 <br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c <br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c <br>
@@ -233,8 +233,10 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
<br>
if (finished->error < 0) { <br>
DRM_INFO("Skip scheduling IBs!\n"); <br>
} else { <br>
+ mutex_lock(&ring->adev->lock_ib_sched); <br>
r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job, <br>
&fence); <br>
+ mutex_unlock(&ring->adev->lock_ib_sched); <br>
if (r) <br>
DRM_ERROR("Error scheduling IBs (%d)\n", r); <br>
} <br>
</blockquote>
<br>
</blockquote>
</body>
</html>