[PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
Christian König
deathsimple at vodafone.de
Wed May 10 11:13:48 UTC 2017
> You still avoid my question: what’s the theoretical backend you that
> you think check once instead of twice or even more is good**before**
> hw_job_reset() ?
Because stopping all the scheduler threads takes a moment and it is
entirely possible that the job finishes within that time.
> 1) if you check the fence and found it not signaled, then you will
> call hw_job_reset(), but there is still chance that between your check
> and the hw_job_reset() the
>
> Sched fence could signaled , isn’t it ? you still cannot avoid such
> race condition
>
Crap, you're right. We would indeed need to check twice and that
wouldn't be consistent anyway.
Once after stopping the schedulers and before hw_job_reset() because
then we can just start the schedulers again and continue as if nothing
has happened.
And another time after calling hw_job_reset() if we want to set the
error code.
> Don’t forget your approach still have chance to hit the race
> condition, and to me I don’t think the race condition matters that’s
> why I don’t even consider it
>
Yeah, you convinced me. Please go ahead with the current approach, but
at least add a comment that we might want to improve that.
Regards,
Christian.
Am 10.05.2017 um 13:02 schrieb Liu, Monk:
>
> > Checking a second time is pointless since it can't signal any more
> after calling amd_sched_hw_job_reset().
>
> [ML] you seems not response to me … I of cause know fence cannot
> signal after hw_job_reset() ….
>
> My question is , before you call hw_job_reset(), why you want to check
> the fence ? why not check twice ?
>
> You still avoid my question: what’s the theoretical backend you that
> you think check once instead of twice or even more is good**before**
> hw_job_reset() ?
>
> >No, the timeout is pretty meaningless. It's just the trigger that we
> need to do something.
>
> 2) And even it signaled after entering gpu_reset(), it will
> automatically done like normal cases, that’s good. Why remove those
> callback instead ?
>
> > No, that's absolutely not good. We don't know if it's the hardware
> which results in the job being signaled or our reset code.
>
> [ML] you are wrong, for SR-IOV case, the timeout is all that matters,
> because one VF can only have such time slice within timeout, and I’m
> doing the TDR on SR-IOV so don’t
>
> Always looks things with bare-metal mind
>
> >Otherwise we have a race condition here where we can't determine if
> the reset finished the job or if it did just on it's own while we
> stopped the scheduler.
>
> [ML] you are wrong :
> 1) if you check the fence and found it not signaled, then you will
> call hw_job_reset(), but there is still chance that between your check
> and the hw_job_reset() the
>
> Sched fence could signaled , isn’t it ? you still cannot avoid such
> race condition
>
> 2) I don’t care if the fence is signaled due to its own finish or
> because we force the hw fence signaled, for SR-IOV case, as long as
> the job exceeds timeout, we consider
>
> It hang.
>
> 3) Even for bare-metal case, you still cannot sure if the fence is
> signaled due to its own or hw_fence force signal, reason is in #1)
>
> >No, you are insist on a vague rules not strict, like I said, what is the
> theoretic to backend your approach that only check once on the in
> question job ? why not check again if not signaled ?
>
> >Because we have removed the connection between the job and the
> hardware fence and because of this the job can never signal.
>
> [ML] like I said, before you call hw_job_reset(), you only check on
> the job once, why not check again and again ? I don’t see you have a
> reason to only check once, and
>
> If you don’t have such reason I think you should not check at all.
>
> If you have such reason, prove me that only check once is good and enough.
>
> Don’t forget your approach still have chance to hit the race
> condition, and to me I don’t think the race condition matters that’s
> why I don’t even consider it
>
> BR Monk
>
> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
> Behalf Of *Christian König
> *Sent:* Wednesday, May 10, 2017 6:26 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> Am 10.05.2017 um 12:05 schrieb Liu, Monk:
>
> [ML] yes, but we cannot guarantee the job is 100% really hang when
> entering gpu_reset(), we can only trust our amdgpu_job_timeout as
> a deadline for each job.
>
> You approach that check the fence first before charge it as
> guilty/hang is incorrect looks to me because why you not check it
> twice, triple, and even more loops ?
>
> Because the job can't signal any more after calling
> amd_sched_hw_job_reset().
>
> [ML] No … that’s where I think your approach is vague:
>
> 1) see that you check after scheduler stopped, see if job
> signaled, my question is if the job is not signaled (like most
> usual case)
>
> Why you not check it again and again ? maybe the second time you
> will find it signaled …
>
>
> Checking a second time is pointless since it can't signal any more
> after calling amd_sched_hw_job_reset().
>
>
> My point is the checking here is meaningless, we already have
> timedout for the guard.
>
> No, the timeout is pretty meaningless. It's just the trigger that we
> need to do something.
>
> But to determine what to do we first need to stop the scheduler,
> remove the hardware fence and THEN check the current status.
>
> Otherwise we have a race condition here where we can't determine if
> the reset finished the job or if it did just on it's own while we
> stopped the scheduler.
>
>
> 2) And even it signaled after entering gpu_reset(), it will
> automatically done like normal cases, that’s good. Why remove
> those callback instead ?
>
> No, that's absolutely not good. We don't know if it's the hardware
> which results in the job being signaled or our reset code.
>
>
>
>
> So I refuse to check if @job is just signaled in gpu_reset,
> because this action is vague (and no one can guarantee the job
> won’t signal during gpu_reset, we should not argue on this
> event …), I prefer clean and restrict rules.
>
> Yeah, completely agree that we need to have struct rules for that.
> That's why I insists on doing this :)
>
> No, you are insist on a vague rules not strict, like I said, what
> is the theoretic to backend your approach that only check once on
> the in question job ? why not check again if not signaled ?
>
> Because we have removed the connection between the job and the
> hardware fence and because of this the job can never signal.
>
> Regards,
> Christian.
>
>
> I don’t agree this approach is clean and strict. You are abuse
> timedout parameter.
>
>
>
> See I just want to avoid problems for the case that the job
> signaled while we stop the scheduler (because stopping the
> scheduler actually can take a moment).
>
> Because when this happened the scheduler could already have pushed
> the next job to the hardware and then we abort it with the GPU
> reset and might create more problems than we solve.
>
>
>
>
>
> [ML] I don’t see my approach will have chance to fence twice…
> on the contrast I think my approach is more clear: no matter
> the in question job finally signaled or not, I just kick it
> out from mirror-list
>
> Without remove the callback from hw fence, that way even it
> really signaled during the gpu_reset() period the logic is
> still perfect and its sched fence will act like usual …
>
> We want to set an error code on the job before signaling it don't
> we? So we need to be sure how and when the job is signaled as
> finished.
>
> I mean letting it signal when we force the hardware fence to
> complete will work as well, but I still think that this isn't as
> clean as signaling it manually.
>
> Please also see the helper function the Intel guys introduced
> drm_fence_set_error(), we will run into a BUG_ON if we can't
> guarantee the order of execution here.
>
> Regards,
> Christian.
>
> Am 10.05.2017 um 06:00 schrieb Liu, Monk:
>
> Christian,
>
> Looks like we need more discuss with it…
>
> Here is your approach:
>
> 1. Stop the scheduler from feeding more jobs to the hardware
> when a jobs completes. //this is where I agree with you
>
> 2. Then call hw_job_reset to remove the connection between job
> and hardware fence.
>
> 3. Test if job is now completed. It is entirely possible that
> the job completed while we started the work to reset the hardware.
>
> Removing the connection between the hardware fence and the job
> is the deadline, if the job completes after that it is lost.
>
> 4. Check the karma and kick out the job if we found it guilty.
>
> 5. Get the whole stuff working again, e.g. reset the hardware,
> restart the scheduler etc...
>
> */[ML]: One thing I agree to change with your way: in
> gpu_reset() we should first stop the in question ring’s
> scheduler (not the all) before kick out the guilty job./*
>
> > Indeed, but I still think that this is a bad approach cause
> we then reset the hardware without a good reason.
>
> [ML] yes, but we cannot guarantee the job is 100% really hang
> when entering gpu_reset(), we can only trust our
> amdgpu_job_timeout as a deadline for each job.
>
> You approach that check the fence first before charge it as
> guilty/hang is incorrect looks to me because why you not check
> it twice, triple, and even more loops ?
>
> You check it one time and you found it just signaled that’s
> great and lucky(really lucky…), But what if it didn’t signaled
> (like most usual case) , why not check it again and again ? do
> you have a theoretic to support on how much time you need to
> check before finally consider it hang ? No I don’t think you
> have so please just cut this unnecessary checking, we already
> use amdgpu_job_timeout to give the deadline of each job.
>
> So I refuse to check if @job is just signaled in gpu_reset,
> because this action is vague (and no one can guarantee the job
> won’t signal during gpu_reset, we should not argue on this
> event …), I prefer clean and restrict rules.
>
> >Force completion is not so much of the issue, but rather in
> which order you do things.
>
> >See the original code first stops the scheduler and removes
> the connection between hardware fence and job in an atomic
> manner. And THEN forces the hardware fence to complete.
>
> >This way we could be sure that nothing happened in parallel,
> e.g. that we don't try to signal the fence twice or something
> like that.
>
> [ML] I don’t see my approach will have chance to fence twice…
> on the contrast I think my approach is more clear: no matter
> the in question job finally signaled or not, I just kick it
> out from mirror-list
>
> Without remove the callback from hw fence, that way even it
> really signaled during the gpu_reset() period the logic is
> still perfect and its sched fence will act like usual …
>
> Please point out where or how my approach will go wrong like
> “e.g. that we don't try to signal the fence twice or something
> like that.”, otherwise I cannot be persuaded and fix my way …
>
> At last. I run the TDR test and it ends up with Hypervisor
> side error, the guest side is all perfect, here is what I ran:
>
> The vk_example and vulkan CTS test under MCBP case, we have
> all kinds of hang on compute ring, without TDR this test won’t
> suffer for 5 seconds, and with TDR although MCBP is buggy now
> but we can
>
> Finish this test (of cause test result is mismatch due to MCBP
> issue), and there are tongs of job_timed_out in dmesg, but
> guest driver didn’t have any error report. I was also
> surprised this behave really stable …
>
> The second test the using “watch” to trigger a gpu hang (bad
> command stream) every 2 seconds, with amdgpu_job_timeout also
> set to 2seconds, I can run it till hypervisor side hit VF FLR
> error, and guest side
>
> Still runs perfectly and nothing wrong happened, with hw fence
> seq number I can tell there are 3000 loops of TDR finished
> before Hypervisor hit error.
>
> BR Monk
>
> *From:*Koenig, Christian
> *Sent:* Tuesday, May 09, 2017 8:52 PM
> *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
> Christian König <deathsimple at vodafone.de>
> <mailto:deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty
> job TDR (V2)
>
> [ML] if the job complete, the job’s sched fence callback
> will take this spin_lock and remove itself from
> mirror_list, so we are still safe to call
> amd_sched_job_kickout(), and it will do nothing if so
>
> Indeed, but I still think that this is a bad approach cause we
> then reset the hardware without a good reason.
>
>
>
>
> Besides, original logic also force complete the hw fence,
> and it works well …
>
> Force completion is not so much of the issue, but rather in
> which order you do things.
>
> See the original code first stops the scheduler and removes
> the connection between hardware fence and job in an atomic
> manner. And THEN forces the hardware fence to complete.
>
> This way we could be sure that nothing happened in parallel,
> e.g. that we don't try to signal the fence twice or something
> like that.
>
>
>
>
> State like “You are missing that it is entirely possible
> that the job will complete while we are trying to kick it
> out.”
>
> Sorry I should have been more clear.
>
>
>
>
> Is not a good reason to reject my approach, because that
> is okay if the job just completed …
>
> We usually try to take a defensive approach, so stopping
> everything, removing the hardware fence connection and then
> explicitly kicking out the job in question sounds much better
> than doing it implicitly with the hardware fence completion.
>
> Even when this works (which I'm still not sure of) that is a
> really awkward and hard to understand approach.
>
> Regards,
> Christian.
>
> Am 09.05.2017 um 13:58 schrieb Liu, Monk:
>
> You are missing that it is entirely possible that the job
> will complete while we are trying to kick it out.
>
> [ML] if the job complete, the job’s sched fence callback
> will take this spin_lock and remove itself from
> mirror_list, so we are still safe to call
> amd_sched_job_kickout(), and it will do nothing if so
>
> Please go through the whole steps again,
>
> Besides, original logic also force complete the hw fence,
> and it works well …
>
> I don’t see the solid reason why you insist on your
> approach, please go through the steps again and give me
> the details about where is incorrect than I can fix it
>
> State like “You are missing that it is entirely possible
> that the job will complete while we are trying to kick it
> out.” Is not a good reason to reject my approach, because
> that is okay if the job just completed …
>
> BR Monk
>
>
>
>
> *From:*Koenig, Christian
> *Sent:* Tuesday, May 09, 2017 3:49 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>
> <mailto:Monk.Liu at amd.com>; Christian König
> <deathsimple at vodafone.de>
> <mailto:deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement
> guilty job TDR (V2)
>
> [ML] Really not necessary, we have spin_lock to
> protect the mirror-list, nothing will be messed up ...
>
> You are missing that it is entirely possible that the job
> will complete while we are trying to kick it out.
>
>
>
>
>
> [ML] why don't touch hardware fence at all ? the
> original/bare-metal gpu reset also signal all ring's
> hardware fence first, I just follow the original logic ...
> Scheduler fence will be auto signaled after hw fence
> signaled, any problem with that ? what's the concern ?
>
> The hardware runs async to the CPU which tries to reset
> it, so we need to be careful in which order things are done.
>
>
>
>
>
> [ML] No I don't think so, the kickout must be prior to
> the hw_job_reset, otherwise the scheduler fence
> callback will be removed and you need manually install
> it later , which is not correct:
> For the guity job, we just kick it out before job
> reset, in job_reset we only reset other innocent jobs(
> and unbind the scheduler fence callback for them),
> after hw fence
> Forcely set to drv_seq, all hw fence are signaled
> (this is the way of original logic, I didn't change
> that). When go to sched_recovery(), it will recover
> all innocent job and hook
> The scheduler fence with new hw fence. That way only
> the guilty job is dropped forever.
>
> Again same problem here.
>
> To be absolutely sure that everything works as expected we
> need to do it in the following order:
>
> 1. Stop the scheduler from feeding more jobs to the
> hardware when a jobs completes.
>
> 2. Then call hw_job_reset to remove the connection between
> job and hardware fence.
>
> 3. Test if job is now completed. It is entirely possible
> that the job completed while we started the work to reset
> the hardware.
>
> Removing the connection between the hardware fence and the
> job is the deadline, if the job completes after that it is
> lost.
>
> 4. Check the karma and kick out the job if we found it guilty.
>
> 5. Get the whole stuff working again, e.g. reset the
> hardware, restart the scheduler etc...
>
> Regards,
> Christian.
>
> Am 09.05.2017 um 04:45 schrieb Liu, Monk:
>
> >
> > - /* block scheduler */
> > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > - ring = adev->rings[i];
> > + /* we start from the ring trigger GPU hang */
> > + j = job ? job->ring->idx : 0;
> > +
> > + if (job)
> > + if
> (amd_sched_invalidate_job(&job->base,
> amdgpu_job_hang_limit))
> > + amd_sched_job_kickout(&job->base);
>
> Well that looks like the wrong order to me. We should
> probably stop the scheduler before trying to mess
> anything with the job.
>
> [ML] Really not necessary, we have spin_lock to
> protect the mirror-list, nothing will be messed up ...
>
>
> >
> > +void
> amdgpu_fence_driver_force_completion_ring(struct
> amdgpu_ring
> > +*ring) {
> > + if (ring)
> > + amdgpu_fence_write(ring,
> ring->fence_drv.sync_seq); }
> > +
>
> The coding style is completely off.
>
> [ML] I don't know why at email side it looks wrong
> coding style, but I'm sure it is correct at my side,
> check here:
> void amdgpu_fence_driver_force_completion_ring(struct
> amdgpu_ring *ring)
> {
> if (ring)
> amdgpu_fence_write(ring,
> ring->fence_drv.sync_seq);
> }
>
> Additional to that I don't think that this is a good
> idea. We should probably rather just signal all
> scheduler fences instead and don't touch the hardware
> fence at all.
>
> [ML] why don't touch hardware fence at all ? the
> original/bare-metal gpu reset also signal all ring's
> hardware fence first, I just follow the original logic ...
> Scheduler fence will be auto signaled after hw fence
> signaled, any problem with that ? what's the concern ?
>
>
> > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > - struct amdgpu_ring *ring = adev->rings[i];
> > + for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > + ring = adev->rings[i % AMDGPU_MAX_RINGS];
> > if (!ring || !ring->sched.thread)
> > continue;
> >
> > + if (job && j != i) {
> > + kthread_unpark(ring->sched.thread);
> > + continue;
> > + }
> > +
>
> Please split up that patch a bit further. E.g. first
> the handling to only hw_job_reset the ring in
> question, then the kickout handling.
>
> [ML] No I don't think so, the kickout must be prior to
> the hw_job_reset, otherwise the scheduler fence
> callback will be removed and you need manually install
> it later , which is not correct:
> For the guity job, we just kick it out before job
> reset, in job_reset we only reset other innocent jobs(
> and unbind the scheduler fence callback for them),
> after hw fence
> Forcely set to drv_seq, all hw fence are signaled
> (this is the way of original logic, I didn't change
> that). When go to sched_recovery(), it will recover
> all innocent job and hook
> The scheduler fence with new hw fence. That way only
> the guilty job is dropped forever.
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 08, 2017 9:12 PM
> To: Liu, Monk <Monk.Liu at amd.com>
> <mailto:Monk.Liu at amd.com>;
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> Cc: Koenig, Christian <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>
> Subject: Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement
> guilty job TDR (V2)
>
> Am 08.05.2017 um 09:01 schrieb Liu, Monk:
> > @Christian
> >
> > This one is changed to guilty job scheme accordingly
> with your
> > response
> >
> > BR Monk
> >
> > -----Original Message-----
> > From: Monk Liu [mailto:Monk.Liu at amd.com]
> > Sent: Monday, May 08, 2017 3:00 PM
> > To: amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> > Cc: Liu, Monk <Monk.Liu at amd.com>
> <mailto:Monk.Liu at amd.com>
> > Subject: [PATCH] drm/amdgpu/SRIOV:implement guilty
> job TDR (V2)
> >
> > 1,TDR will kickout guilty job if it hang exceed the
> threshold of the given one from kernel paramter
> "job_hang_limit", that way a bad command stream will
> not infinitly cause GPU hang.
> >
> > by default this threshold is 1 so a job will be
> kicked out after it hang.
> >
> > 2,if a job timeout TDR routine will not reset all
> sched/ring, instead if will only reset on the givn one
> which is indicated by @job of amdgpu_sriov_gpu_reset,
> that way we don't need to reset and recover each
> sched/ring if we already know which job cause GPU hang.
> >
> > 3,unblock sriov_gpu_reset for AI family.
> > 4,don't init entity for KIQ
> >
> > TODO:
> > when a job is considered as guilty, we should mark
> some flag in its fence status flag, and let UMD side
> aware that this fence signaling is not due to job
> complete but job hang.
> >
> > Change-Id: I7b89c19a3de93249db570d0a80522176b1525a09
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> <mailto:Monk.Liu at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 4 +++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 36
> ++++++++++++++++++++-------
> > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 +++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 6 +++++
> > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
> > drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 11
> +++++++- drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> | 7 ++++++
> > 8 files changed, 60 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 90a69bf..93bcea2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -111,6 +111,7 @@ extern int
> amdgpu_prim_buf_per_se; extern int
> > amdgpu_pos_buf_per_se; extern int
> amdgpu_cntl_sb_buf_per_se; extern
> > int amdgpu_param_buf_per_se;
> > +extern int amdgpu_job_hang_limit;
> >
> > #define AMDGPU_DEFAULT_GTT_SIZE_MB 3072ULL /* 3GB
> by default */
> > #define
> AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index b4bbbb3..23afc58 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -52,6 +52,10 @@ static int amdgpu_ctx_init(struct
> amdgpu_device *adev, struct amdgpu_ctx *ctx)
> > struct amd_sched_rq *rq;
> >
> > rq =
> &ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL];
> > +
> > + if (ring == &adev->gfx.kiq.ring)
> > + continue;
> > +
>
> That looks like a bug fix and should probably go into
> a separate patch.
>
> > r =
> amd_sched_entity_init(&ring->sched, &ctx->rings[i].entity,
> > rq,
> amdgpu_sched_jobs);
> > if (r)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 0e5f314..f3990fe 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2537,7 +2537,7 @@ static int
> amdgpu_recover_vram_from_shadow(struct amdgpu_device
> *adev,
> > */
> > int amdgpu_sriov_gpu_reset(struct amdgpu_device
> *adev, struct amdgpu_job *job) {
> > - int i, r = 0;
> > + int i, j, r = 0;
> > int resched;
> > struct amdgpu_bo *bo, *tmp;
> > struct amdgpu_ring *ring;
> > @@ -2550,19 +2550,30 @@ int
> amdgpu_sriov_gpu_reset(struct amdgpu_device *adev,
> struct amdgpu_job *job)
> > /* block TTM */
> > resched =
> ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> >
> > - /* block scheduler */
> > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > - ring = adev->rings[i];
> > + /* we start from the ring trigger GPU hang */
> > + j = job ? job->ring->idx : 0;
> > +
> > + if (job)
> > + if
> (amd_sched_invalidate_job(&job->base,
> amdgpu_job_hang_limit))
> > + amd_sched_job_kickout(&job->base);
>
> Well that looks like the wrong order to me. We should
> probably stop the scheduler before trying to mess
> anything with the job.
>
> >
> > + /* block scheduler */
> > + for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > + ring = adev->rings[i % AMDGPU_MAX_RINGS];
> > if (!ring || !ring->sched.thread)
> > continue;
> >
> > kthread_park(ring->sched.thread);
> > +
> > + if (job && j != i)
> > + continue;
> > +
> > + /* only do job_reset on the hang ring
> if @job not NULL */
> > amd_sched_hw_job_reset(&ring->sched);
> > - }
> >
> > - /* after all hw jobs are reset, hw fence is
> meaningless, so force_completion */
> > - amdgpu_fence_driver_force_completion(adev);
> > + /* after all hw jobs are reset, hw
> fence is meaningless, so force_completion */
> > + amdgpu_fence_driver_force_completion_ring(ring);
> > + }
> >
> > /* request to take full control of GPU before
> re-initialization */
> > if (job)
> > @@ -2615,11 +2626,16 @@ int
> amdgpu_sriov_gpu_reset(struct amdgpu_device *adev,
> struct amdgpu_job *job)
> > }
> > fence_put(fence);
> >
> > - for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > - struct amdgpu_ring *ring = adev->rings[i];
> > + for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > + ring = adev->rings[i % AMDGPU_MAX_RINGS];
> > if (!ring || !ring->sched.thread)
> > continue;
> >
> > + if (job && j != i) {
> > + kthread_unpark(ring->sched.thread);
> > + continue;
> > + }
> > +
>
> Please split up that patch a bit further. E.g. first
> the handling to only hw_job_reset the ring in
> question, then the kickout handling.
>
> > amd_sched_job_recovery(&ring->sched);
> > kthread_unpark(ring->sched.thread);
> > }
> > @@ -2629,6 +2645,8 @@ int
> amdgpu_sriov_gpu_reset(struct amdgpu_device *adev,
> struct amdgpu_job *job)
> > if (r) {
> > /* bad news, how to tell it to
> userspace ? */
> > dev_info(adev->dev, "GPU reset
> failed\n");
> > + } else {
> > + dev_info(adev->dev, "GPU reset
> successed!\n");
> > }
> >
> > adev->gfx.in_reset = false;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 416908a..fd3691a8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -112,6 +112,7 @@ int amdgpu_prim_buf_per_se = 0;
> int
> > amdgpu_pos_buf_per_se = 0; int
> amdgpu_cntl_sb_buf_per_se = 0; int
> > amdgpu_param_buf_per_se = 0;
> > +int amdgpu_job_hang_limit = 0;
> >
> > MODULE_PARM_DESC(vramlimit, "Restrict VRAM for
> testing, in
> > megabytes"); module_param_named(vramlimit,
> amdgpu_vram_limit, int,
> > 0600); @@ -237,6 +238,9 @@
> module_param_named(cntl_sb_buf_per_se,
> > amdgpu_cntl_sb_buf_per_se, int, 0444);
> > MODULE_PARM_DESC(param_buf_per_se, "the size of
> Off-Chip Pramater
> > Cache per Shader Engine (default depending on gfx)");
> > module_param_named(param_buf_per_se,
> amdgpu_param_buf_per_se, int,
> > 0444);
> >
> > +MODULE_PARM_DESC(job_hang_limit, "how much time
> allow a job hang and
> > +not drop it (default 0)");
> module_param_named(job_hang_limit,
> > +amdgpu_job_hang_limit, int ,0444);
> > +
> >
> > static const struct pci_device_id pciidlist[] = {
> #ifdef
> > CONFIG_DRM_AMDGPU_SI diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index d7523d1..8de3bd3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -541,6 +541,12 @@ void
> amdgpu_fence_driver_force_completion(struct
> amdgpu_device *adev)
> > }
> > }
> >
> > +void
> amdgpu_fence_driver_force_completion_ring(struct
> amdgpu_ring
> > +*ring) {
> > + if (ring)
> > + amdgpu_fence_write(ring,
> ring->fence_drv.sync_seq); }
> > +
>
> The coding style is completely off.
>
> Additional to that I don't think that this is a good
> idea. We should probably rather just signal all
> scheduler fences instead and don't touch the hardware
> fence at all.
>
> > /*
> > * Common fence implementation
> > */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index 981ef08..03e88c6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > @@ -76,6 +76,7 @@ struct amdgpu_fence_driver { int
> > amdgpu_fence_driver_init(struct amdgpu_device
> *adev); void
> > amdgpu_fence_driver_fini(struct amdgpu_device
> *adev); void
> > amdgpu_fence_driver_force_completion(struct
> amdgpu_device *adev);
> > +void
> amdgpu_fence_driver_force_completion_ring(struct
> amdgpu_ring
> > +*ring);
> >
> > int amdgpu_fence_driver_init_ring(struct
> amdgpu_ring *ring,
> > unsigned
> num_hw_submission);
> > diff --git
> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > index 6f4e31f..4e97e6d 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > @@ -390,9 +390,18 @@ void
> amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
> > &s_job->s_fence->cb)) {
> > fence_put(s_job->s_fence->parent);
> > s_job->s_fence->parent = NULL;
> > + atomic_dec(&sched->hw_rq_count);
> > }
> > }
> > - atomic_set(&sched->hw_rq_count, 0);
> > + spin_unlock(&sched->job_list_lock);
> > +}
> > +
> > +void amd_sched_job_kickout(struct amd_sched_job
> *s_job) {
> > + struct amd_gpu_scheduler *sched = s_job->sched;
> > +
> > + spin_lock(&sched->job_list_lock);
> > + list_del_init(&s_job->node);
> > spin_unlock(&sched->job_list_lock);
> > }
> >
> > diff --git
> a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > index 8cb41d3..59694f3 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > @@ -81,6 +81,7 @@ struct amd_sched_job {
> > struct list_head node;
> > struct delayed_work work_tdr;
> > uint64_t id;
> > + atomic_t karma;
> > };
> >
> > extern const struct fence_ops
> amd_sched_fence_ops_scheduled; @@ -96,6 +97,11 @@
> static inline struct amd_sched_fence
> *to_amd_sched_fence(struct fence *f)
> > return NULL;
> > }
> >
> > +static inline bool amd_sched_invalidate_job(struct
> amd_sched_job
> > +*s_job, int threshold) {
> > + return (s_job &&
> atomic_inc_return(&s_job->karma) > threshold); }
> > +
>
> Again coding style is completely off.
>
> Christian.
>
> > /**
> > * Define the backend operations called by the
> scheduler,
> > * these functions should be implemented in driver
> side @@ -158,4 +164,5 @@ int amd_sched_job_init(struct
> amd_sched_job *job,
> > void *owner);
> > void amd_sched_hw_job_reset(struct
> amd_gpu_scheduler *sched); void
> > amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
> > +void amd_sched_job_kickout(struct amd_sched_job
> *s_job);
> > #endif
> > --
> > 2.7.4
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
>
>
>
>
>
> _______________________________________________
>
> amd-gfx mailing list
>
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
>
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170510/512c3ddc/attachment-0001.html>
More information about the amd-gfx
mailing list