[PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
Christian König
deathsimple at vodafone.de
Wed May 10 12:35:24 UTC 2017
On 10.05.2017 at 13:50, Liu, Monk wrote:
>
> >Because stopping all the scheduler threads takes a moment and it is
> entirely possible that the job finishes within that time.
>
> Sounds reasonable, but actually it isn't, because if that were the case,
> why not just increase the timed_out value by 0.5 sec? That should cover
> the time needed to stop all the schedulers.
>
That won't work; you see, you need to stop the scheduler to figure out if
the job is actually hung.
> So the point is that we should rely on and trust the timeout parameter,
> since it is the only way to make the kernel side aware that the GPU may hang …
>
> If there were a way to detect a GPU hang on the kernel driver side
> (actually we have such a method, implemented on the RLCV side), that
> would convince me
>
> to check the fence at least one time, like:
>
> void gpu_reset()
> {
>     bool hang = check_gpu_hang_or_busy();
>
>     if (!hang) {
>         /* looks like the job is still running, we can let it run … */
>         return;
>     } else {
>         bool signaled = fence_signaled(job);
>
>         if (signaled)
>             return; /* do nothing */
>     }
>     …
> }
>
Well what we actually need to do is to stop the hardware from doing any
processing, but that's easier said than done.
> I can say that we may still have a chance to implement this hang
> detection on the kernel driver side, but it is not easy, so currently
> I only trust the timeout event.
>
I still don't think that this is a good idea. At least on bare metal we
need more checks than that to determine if we really should reset the
hardware or not.
Regards,
Christian.
> BR Monk
>
> *From:*Christian König [mailto:deathsimple at vodafone.de]
> *Sent:* Wednesday, May 10, 2017 7:14 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> You still avoid my question: what is the theoretical backing for your
> view that checking once, instead of twice or even more, is good
> *before* hw_job_reset()?
>
> Because stopping all the scheduler threads takes a moment and it is
> entirely possible that the job finishes within that time.
>
>
> 1) If you check the fence and find it not signaled, then you will call
> hw_job_reset(), but there is still a chance that the sched fence could
> signal between your check and the hw_job_reset(), isn't it? You still
> cannot avoid such a race condition.
>
> Crap, you're right. We would indeed need to check twice and that
> wouldn't be consistent anyway.
>
> Once after stopping the schedulers and before hw_job_reset() because
> then we can just start the schedulers again and continue as if nothing
> has happened.
>
> And another time after calling hw_job_reset() if we want to set the
> error code.
>
>
> Don't forget that your approach still has a chance of hitting the race
> condition, and to me the race condition doesn't matter; that's why I
> don't even consider it.
>
> Yeah, you convinced me. Please go ahead with the current approach, but
> at least add a comment that we might want to improve that.
>
> Regards,
> Christian.
>
> On 10.05.2017 at 13:02, Liu, Monk wrote:
>
> > Checking a second time is pointless since it can't signal any
> more after calling amd_sched_hw_job_reset().
>
> [ML] You don't seem to be responding to me … I of course know the
> fence cannot signal after hw_job_reset() …
>
> My question is: before you call hw_job_reset(), why do you want to
> check the fence? Why not check twice?
>
> You still avoid my question: what is the theoretical backing for your
> view that checking once, instead of twice or even more, is good
> *before* hw_job_reset()?
>
> >No, the timeout is pretty meaningless. It's just the trigger that
> we need to do something.
>
> 2) And even if it signals after entering gpu_reset(), it will
> automatically complete like the normal case, and that's good. Why
> remove those callbacks instead?
>
> > No, that's absolutely not good. We don't know if it's the
> hardware which results in the job being signaled or our reset code.
>
> [ML] You are wrong; for the SR-IOV case the timeout is all that
> matters, because one VF only gets its time slice within the timeout,
> and I'm doing the TDR on SR-IOV, so don't always look at things with
> a bare-metal mindset.
>
> >Otherwise we have a race condition here where we can't determine
> if the reset finished the job or if it just finished on its own while
> we stopped the scheduler.
>
> [ML] You are wrong:
> 1) If you check the fence and find it not signaled, then you will call
> hw_job_reset(), but there is still a chance that the sched fence could
> signal between your check and the hw_job_reset(), isn't it? You still
> cannot avoid such a race condition.
>
> 2) I don't care whether the fence signaled because it finished on its
> own or because we force-signaled the hw fence; for the SR-IOV case, as
> long as the job exceeds the timeout, we consider it hung.
>
> 3) Even in the bare-metal case, you still cannot be sure whether the
> fence signaled on its own or due to the hw fence force signal; the
> reason is in #1).
>
> >No, you are insisting on vague rules, not strict ones. Like I said,
> what is the theory backing your approach of checking only once on the
> job in question? Why not check again if it is not signaled?
>
> >Because we have removed the connection between the job and the
> hardware fence and because of this the job can never signal.
>
> [ML] Like I said, before you call hw_job_reset() you only check the
> job once; why not check again and again? I don't see that you have a
> reason to check only once, and
>
> if you don't have such a reason I think you should not check at all.
>
> If you do have such a reason, prove to me that checking only once is
> good enough.
>
> Don't forget that your approach still has a chance of hitting the race
> condition, and to me the race condition doesn't matter; that's why I
> don't even consider it.
>
> BR Monk
>
> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
> Behalf Of *Christian König
> *Sent:* Wednesday, May 10, 2017 6:26 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement guilty job
> TDR (V2)
>
> On 10.05.2017 at 12:05, Liu, Monk wrote:
>
> [ML] Yes, but we cannot guarantee the job is really 100% hung
> when entering gpu_reset(); we can only trust our
> amdgpu_job_timeout as a deadline for each job.
>
> Your approach of checking the fence first before charging it as
> guilty/hung looks incorrect to me, because why would you not check
> it twice, three times, or even more?
>
> Because the job can't signal any more after calling
> amd_sched_hw_job_reset().
>
> [ML] No … that's where I think your approach is vague:
>
> 1) I see that you check, after the scheduler is stopped, whether the
> job has signaled; my question is, if the job has not signaled (as in
> the most usual case),
>
> why do you not check it again and again? Maybe the second time
> you will find it signaled …
>
>
> Checking a second time is pointless since it can't signal any more
> after calling amd_sched_hw_job_reset().
>
>
>
> My point is that the check here is meaningless; we already have the
> timeout as the guard.
>
> No, the timeout is pretty meaningless. It's just the trigger that
> we need to do something.
>
> But to determine what to do we first need to stop the scheduler,
> remove the hardware fence and THEN check the current status.
>
> Otherwise we have a race condition here where we can't determine
> if the reset finished the job or if it just finished on its own while
> we stopped the scheduler.
>
>
>
> 2) And even if it signals after entering gpu_reset(), it will
> automatically complete like the normal case, and that's good. Why
> remove those callbacks instead?
>
> No, that's absolutely not good. We don't know if it's the hardware
> which results in the job being signaled or our reset code.
>
> So I refuse to check whether @job just signaled in
> gpu_reset, because this action is vague (and no one can
> guarantee the job won't signal during gpu_reset; we should
> not argue about this event …). I prefer clean and strict rules.
>
> Yeah, I completely agree that we need to have strict rules for
> that. That's why I insist on doing this :)
>
> No, you are insisting on vague rules, not strict ones. Like I
> said, what is the theory backing your approach of checking only
> once on the job in question? Why not check again if it is not
> signaled?
>
> Because we have removed the connection between the job and the
> hardware fence and because of this the job can never signal.
>
> Regards,
> Christian.
>
>
>
> I don't agree that this approach is clean and strict. You are
> abusing the timeout parameter.
>
>
>
> See I just want to avoid problems for the case that the job
> signaled while we stop the scheduler (because stopping the
> scheduler actually can take a moment).
>
> Because when this happened the scheduler could already have
> pushed the next job to the hardware and then we abort it with
> the GPU reset and might create more problems than we solve.
>
> [ML] I don't see how my approach would have a chance of signaling
> the fence twice … On the contrary, I think my approach is clearer:
> no matter whether the job in question finally signaled or not, I
> just kick it out of the mirror list
>
> without removing the callback from the hw fence; that way, even if
> it really signaled during the gpu_reset() period, the logic is
> still correct and its sched fence will act as usual …
>
> We want to set an error code on the job before signaling it, don't
> we? So we need to be sure how and when the job is signaled as
> finished.
>
> I mean letting it signal when we force the hardware fence to
> complete will work as well, but I still think that this isn't
> as clean as signaling it manually.
>
> Please also see the helper function the Intel guys introduced
> drm_fence_set_error(), we will run into a BUG_ON if we can't
> guarantee the order of execution here.
>
> Regards,
> Christian.
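As a side note on the helper mentioned above: in mainline it is dma_fence_set_error(), and the constraint it enforces is simply "record the error, then signal the fence"; calling it after the fence has already signaled trips its sanity check. A minimal sketch of that ordering, with an illustrative wrapper name and error code that are not part of the patch:

#include <linux/dma-fence.h>
#include <linux/errno.h>

/*
 * Minimal sketch only: mark the scheduler fence of a dropped job with an
 * error before completing it, so waiters can see why the job died.
 */
static void sketch_fail_job_fence(struct dma_fence *finished)
{
	dma_fence_set_error(finished, -ECANCELED);	/* record why the job failed */
	dma_fence_signal(finished);			/* now wake up the waiters */
}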
>
> On 10.05.2017 at 06:00, Liu, Monk wrote:
>
> Christian,
>
> Looks like we need to discuss this more …
>
> Here is your approach:
>
> 1. Stop the scheduler from feeding more jobs to the
> hardware when a job completes. //this is where I agree
> with you
>
> 2. Then call hw_job_reset to remove the connection between
> job and hardware fence.
>
> 3. Test if job is now completed. It is entirely possible
> that the job completed while we started the work to reset
> the hardware.
>
> Removing the connection between the hardware fence and the
> job is the deadline, if the job completes after that it is
> lost.
>
> 4. Check the karma and kick out the job if we found it guilty.
>
> 5. Get the whole stuff working again, e.g. reset the
> hardware, restart the scheduler etc...
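To make that ordering concrete, here is a rough sketch in code. The helper names follow the patch under review, while the fence check in step 3 and the glue around the actual reset are assumptions for illustration rather than the final implementation:

/*
 * Rough sketch of steps 1-5 above for a single hung ring; error
 * handling, VRAM recovery and the actual ASIC/VF reset are omitted.
 * The fence accessor in step 3 is an assumption for illustration.
 */
static void sketch_sriov_tdr(struct amdgpu_device *adev, struct amdgpu_job *job)
{
	struct amdgpu_ring *ring = job->ring;
	bool completed;

	/* 1. Stop the scheduler from feeding more jobs to the hardware. */
	kthread_park(ring->sched.thread);

	/* 2. Cut the link between the queued jobs and their hardware fences;
	 *    this is the deadline, anything completing later is lost. */
	amd_sched_hw_job_reset(&ring->sched);

	/* 3. Test whether the job completed while we were stopping things. */
	completed = fence_is_signaled(&job->base.s_fence->finished);

	/* 4. Account the hang and kick the job out once it exceeds the limit. */
	if (!completed &&
	    amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
		amd_sched_job_kickout(&job->base);

	/* 5. Force-complete the stale hardware fence, reset the hardware,
	 *    then resubmit the remaining jobs and restart the scheduler. */
	amdgpu_fence_driver_force_completion_ring(ring);
	/* ... ASIC/VF reset and VRAM restore would go here ... */
	amd_sched_job_recovery(&ring->sched);
	kthread_unpark(ring->sched.thread);
}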
>
> [ML]: One thing I agree to change to your way: in
> gpu_reset() we should first stop the scheduler of the ring in
> question (not all of them) before kicking out the guilty job.
>
> > Indeed, but I still think that this is a bad approach
> because we then reset the hardware without a good reason.
>
> [ML] Yes, but we cannot guarantee the job is really 100%
> hung when entering gpu_reset(); we can only trust our
> amdgpu_job_timeout as a deadline for each job.
>
> Your approach of checking the fence first before charging it
> as guilty/hung looks incorrect to me, because why would you
> not check it twice, three times, or even more?
>
> You check it one time and you find it just signaled;
> that's great and lucky (really lucky …). But what if it
> didn't signal (as in the most usual case)? Why not check it
> again and again? Do you have a theory to support how many
> times you need to check before finally considering it
> hung? No, I don't think you do, so please just cut this
> unnecessary checking; we already use amdgpu_job_timeout to
> give the deadline for each job.
>
> So I refuse to check whether @job just signaled in
> gpu_reset, because this action is vague (and no one can
> guarantee the job won't signal during gpu_reset; we should
> not argue about this event …). I prefer clean and strict rules.
>
> >Force completion is not so much of the issue, but rather
> in which order you do things.
>
> >See the original code first stops the scheduler and
> removes the connection between hardware fence and job in
> an atomic manner. And THEN forces the hardware fence to
> complete.
>
> >This way we could be sure that nothing happened in
> parallel, e.g. that we don't try to signal the fence twice
> or something like that.
>
> [ML] I don't see how my approach would have a chance of
> signaling the fence twice … On the contrary, I think my
> approach is clearer: no matter whether the job in question
> finally signaled or not, I just kick it out of the mirror list
>
> without removing the callback from the hw fence; that way,
> even if it really signaled during the gpu_reset() period, the
> logic is still correct and its sched fence will act as usual …
>
> Please point out where or how my approach will go wrong,
> like "e.g. that we don't try to signal the fence twice or
> something like that"; otherwise I cannot be persuaded to
> fix my way …
>
> Lastly, I ran the TDR test and it ended up with a hypervisor-side
> error; the guest side was all perfect. Here is what I ran:
>
> The vk_example and Vulkan CTS tests under the MCBP case. We
> get all kinds of hangs on the compute ring; without TDR this
> test won't survive for 5 seconds, and with TDR, although
> MCBP is buggy now, we can
>
> finish this test (of course the test results mismatch due to
> the MCBP issue), and there are tons of job_timed_out messages
> in dmesg, but the guest driver didn't report any error. I
> was also surprised this behaves really stably …
>
> The second test used "watch" to trigger a GPU hang
> (bad command stream) every 2 seconds, with
> amdgpu_job_timeout also set to 2 seconds. I can run it until
> the hypervisor side hits a VF FLR error, and the guest side
>
> still runs perfectly; nothing wrong happened. From the hw
> fence sequence numbers I can tell that 3000 loops of TDR
> finished before the hypervisor hit the error.
>
> BR Monk
>
> *From:*Koenig, Christian
> *Sent:* Tuesday, May 09, 2017 8:52 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Christian König
> <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement
> guilty job TDR (V2)
>
> [ML] If the job completes, the job's sched fence
> callback will take this spin_lock and remove itself
> from the mirror_list, so we are still safe to call
> amd_sched_job_kickout(), and it will do nothing in that case.
>
> Indeed, but I still think that this is a bad approach
> because we then reset the hardware without a good reason.
>
>
>
>
>
> Besides, the original logic also force-completes the hw
> fence, and it works well …
>
> Force completion is not so much of the issue, but rather
> in which order you do things.
>
> See the original code first stops the scheduler and
> removes the connection between hardware fence and job in
> an atomic manner. And THEN forces the hardware fence to
> complete.
>
> This way we could be sure that nothing happened in
> parallel, e.g. that we don't try to signal the fence twice
> or something like that.
>
>
>
>
>
> A statement like "You are missing that it is entirely
> possible that the job will complete while we are
> trying to kick it out."
>
> Sorry I should have been more clear.
>
>
>
>
>
> is not a good reason to reject my approach, because
> it is okay if the job just completed …
>
> We usually try to take a defensive approach, so stopping
> everything, removing the hardware fence connection and
> then explicitly kicking out the job in question sounds
> much better than doing it implicitly with the hardware
> fence completion.
>
> Even when this works (which I'm still not sure of) that is
> a really awkward and hard to understand approach.
>
> Regards,
> Christian.
>
> On 09.05.2017 at 13:58, Liu, Monk wrote:
>
> You are missing that it is entirely possible that the
> job will complete while we are trying to kick it out.
>
> [ML] If the job completes, the job's sched fence
> callback will take this spin_lock and remove itself
> from the mirror_list, so we are still safe to call
> amd_sched_job_kickout(), and it will do nothing in that case.
>
> Please go through all the steps again.
>
> Besides, the original logic also force-completes the hw
> fence, and it works well …
>
> I don't see a solid reason why you insist on your
> approach; please go through the steps again and give
> me the details about what is incorrect, then I can fix it.
>
> A statement like "You are missing that it is entirely
> possible that the job will complete while we are
> trying to kick it out." is not a good reason to reject
> my approach, because it is okay if the job just
> completed …
>
> BR Monk
>
>
>
>
>
> *From:*Koenig, Christian
> *Sent:* Tuesday, May 09, 2017 3:49 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Christian König
> <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 4/4] drm/amdgpu/SRIOV:implement
> guilty job TDR (V2)
>
> [ML] Really not necessary, we have spin_lock to
> protect the mirror-list, nothing will be messed up ...
>
> You are missing that it is entirely possible that the
> job will complete while we are trying to kick it out.
>
>
> [ML] Why not touch the hardware fence at all? The
> original/bare-metal GPU reset also signals all
> rings' hardware fences first; I just follow the
> original logic ...
> The scheduler fence will be auto-signaled after the hw
> fence signals. Any problem with that? What's the
> concern?
>
> The hardware runs async to the CPU which tries to
> reset it, so we need to be careful in which order
> things are done.
>
>
> [ML] No, I don't think so; the kickout must happen
> prior to the hw_job_reset, otherwise the scheduler
> fence callback will be removed and you would need to
> manually install it later, which is not correct:
> for the guilty job, we just kick it out before the job
> reset; in job_reset we only reset the other, innocent
> jobs (and unbind the scheduler fence callback for
> them). After the hw fence is
> forcibly set to drv_seq, all hw fences are signaled
> (this is the way of the original logic, I didn't
> change that). When we go to sched_recovery(), it
> will recover all innocent jobs and hook
> the scheduler fences up with new hw fences. That way
> only the guilty job is dropped forever.
>
> Again same problem here.
>
> To be absolutely sure that everything works as
> expected we need to do it in the following order:
>
> 1. Stop the scheduler from feeding more jobs to the
> hardware when a job completes.
>
> 2. Then call hw_job_reset to remove the connection
> between job and hardware fence.
>
> 3. Test if job is now completed. It is entirely
> possible that the job completed while we started the
> work to reset the hardware.
>
> Removing the connection between the hardware fence and
> the job is the deadline, if the job completes after
> that it is lost.
>
> 4. Check the karma and kick out the job if we found it
> guilty.
>
> 5. Get the whole stuff working again, e.g. reset the
> hardware, restart the scheduler etc...
>
> Regards,
> Christian.
>
> On 09.05.2017 at 04:45, Liu, Monk wrote:
>
> >
> > -	/* block scheduler */
> > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > -		ring = adev->rings[i];
> > +	/* we start from the ring trigger GPU hang */
> > +	j = job ? job->ring->idx : 0;
> > +
> > +	if (job)
> > +		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
> > +			amd_sched_job_kickout(&job->base);
>
> Well that looks like the wrong order to me. We
> should probably stop the scheduler before trying
> to mess anything with the job.
>
> [ML] Really not necessary, we have spin_lock to
> protect the mirror-list, nothing will be messed up ...
>
>
> >
> > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
> > +{
> > +	if (ring)
> > +		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> > +}
> > +
>
> The coding style is completely off.
>
> [ML] I don't know why it looks like wrong coding style on the
> email side, but I'm sure it is correct on my side; check here:
> void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
> {
> 	if (ring)
> 		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> }
>
> Additional to that I don't think that this is a
> good idea. We should probably rather just signal
> all scheduler fences instead and don't touch the
> hardware fence at all.
>
> [ML] Why not touch the hardware fence at all? The
> original/bare-metal GPU reset also signals all
> rings' hardware fences first; I just follow the
> original logic ...
> The scheduler fence will be auto-signaled after the hw
> fence signals. Any problem with that? What's the
> concern?
>
>
> > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > -		struct amdgpu_ring *ring = adev->rings[i];
> > +	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > +		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> >  		if (!ring || !ring->sched.thread)
> >  			continue;
> >
> > +		if (job && j != i) {
> > +			kthread_unpark(ring->sched.thread);
> > +			continue;
> > +		}
> > +
>
> Please split up that patch a bit further. E.g.
> first the handling to only hw_job_reset the ring
> in question, then the kickout handling.
>
> [ML] No, I don't think so; the kickout must happen
> prior to the hw_job_reset, otherwise the scheduler
> fence callback will be removed and you would need to
> manually install it later, which is not correct:
> for the guilty job, we just kick it out before the job
> reset; in job_reset we only reset the other, innocent
> jobs (and unbind the scheduler fence callback for
> them). After the hw fence is
> forcibly set to drv_seq, all hw fences are signaled
> (this is the way of the original logic, I didn't
> change that). When we go to sched_recovery(), it
> will recover all innocent jobs and hook
> the scheduler fences up with new hw fences. That way
> only the guilty job is dropped forever.
>
> -----Original Message-----
> From: Christian König
> [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 08, 2017 9:12 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Koenig, Christian <Christian.Koenig at amd.com>
> Subject: Re: [PATCH 4/4]
> drm/amdgpu/SRIOV:implement guilty job TDR (V2)
>
> On 08.05.2017 at 09:01, Liu, Monk wrote:
> > @Christian
> >
> > This one is changed to the guilty-job scheme according to your
> > response
> >
> > BR Monk
> >
> > -----Original Message-----
> > From: Monk Liu [mailto:Monk.Liu at amd.com]
> > Sent: Monday, May 08, 2017 3:00 PM
> > To: amd-gfx at lists.freedesktop.org
> > Cc: Liu, Monk <Monk.Liu at amd.com>
> > Subject: [PATCH] drm/amdgpu/SRIOV:implement
> guilty job TDR (V2)
> >
> > 1. TDR will kick out the guilty job if its hang count exceeds the
> > threshold given by the kernel parameter "job_hang_limit"; that way a
> > bad command stream will not cause GPU hangs indefinitely.
> >
> > By default this threshold is 1, so a job will be kicked out after it
> > hangs once.
> >
> > 2. If a job times out, the TDR routine will not reset all
> > scheds/rings; instead it will only reset the given one, indicated by
> > @job of amdgpu_sriov_gpu_reset. That way we don't need to reset and
> > recover each sched/ring if we already know which job caused the GPU
> > hang.
> >
> > 3. Unblock sriov_gpu_reset for the AI family.
> > 4. Don't init an entity for the KIQ.
> >
> > TODO:
> > When a job is considered guilty, we should mark some flag in its
> > fence status flags and let the UMD side be aware that this fence
> > signaled not because the job completed but because the job hung.
> >
> > Change-Id: I7b89c19a3de93249db570d0a80522176b1525a09
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       |  4 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 36 ++++++++++++++++++++-------
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |  4 +++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  6 +++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h      |  1 +
> >  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 11 +++++++-
> >  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  7 ++++++
> >  8 files changed, 60 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 90a69bf..93bcea2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -111,6 +111,7 @@ extern int amdgpu_prim_buf_per_se;
> >  extern int amdgpu_pos_buf_per_se;
> >  extern int amdgpu_cntl_sb_buf_per_se;
> >  extern int amdgpu_param_buf_per_se;
> > +extern int amdgpu_job_hang_limit;
> >  
> >  #define AMDGPU_DEFAULT_GTT_SIZE_MB		3072ULL /* 3GB by default */
> >  #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS		3000
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index b4bbbb3..23afc58 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -52,6 +52,10 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> >  		struct amd_sched_rq *rq;
> >  
> >  		rq = &ring->sched.sched_rq[AMD_SCHED_PRIORITY_NORMAL];
> > +
> > +		if (ring == &adev->gfx.kiq.ring)
> > +			continue;
> > +
>
> That looks like a bug fix and should probably go
> into a separate patch.
>
> >  		r = amd_sched_entity_init(&ring->sched, &ctx->rings[i].entity,
> >  					  rq, amdgpu_sched_jobs);
> >  		if (r)
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 0e5f314..f3990fe 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2537,7 +2537,7 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
> >   */
> >  int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> >  {
> > -	int i, r = 0;
> > +	int i, j, r = 0;
> >  	int resched;
> >  	struct amdgpu_bo *bo, *tmp;
> >  	struct amdgpu_ring *ring;
> > @@ -2550,19 +2550,30 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> >  	/* block TTM */
> >  	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> >  
> > -	/* block scheduler */
> > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > -		ring = adev->rings[i];
> > +	/* we start from the ring trigger GPU hang */
> > +	j = job ? job->ring->idx : 0;
> > +
> > +	if (job)
> > +		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
> > +			amd_sched_job_kickout(&job->base);
>
> Well that looks like the wrong order to me. We
> should probably stop the scheduler before trying
> to mess anything with the job.
>
> >  
> > +	/* block scheduler */
> > +	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > +		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> >  		if (!ring || !ring->sched.thread)
> >  			continue;
> >  
> >  		kthread_park(ring->sched.thread);
> > +
> > +		if (job && j != i)
> > +			continue;
> > +
> > +		/* only do job_reset on the hang ring if @job not NULL */
> >  		amd_sched_hw_job_reset(&ring->sched);
> > -	}
> >  
> > -	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> > -	amdgpu_fence_driver_force_completion(adev);
> > +		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> > +		amdgpu_fence_driver_force_completion_ring(ring);
> > +	}
> >  
> >  	/* request to take full control of GPU before re-initialization */
> >  	if (job)
> > @@ -2615,11 +2626,16 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> >  	}
> >  	fence_put(fence);
> >  
> > -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > -		struct amdgpu_ring *ring = adev->rings[i];
> > +	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> > +		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> >  		if (!ring || !ring->sched.thread)
> >  			continue;
> >  
> > +		if (job && j != i) {
> > +			kthread_unpark(ring->sched.thread);
> > +			continue;
> > +		}
> > +
>
> Please split up that patch a bit further. E.g.
> first the handling to only hw_job_reset the ring
> in question, then the kickout handling.
>
> >  		amd_sched_job_recovery(&ring->sched);
> >  		kthread_unpark(ring->sched.thread);
> >  	}
> > @@ -2629,6 +2645,8 @@ int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> >  	if (r) {
> >  		/* bad news, how to tell it to userspace ? */
> >  		dev_info(adev->dev, "GPU reset failed\n");
> > +	} else {
> > +		dev_info(adev->dev, "GPU reset successed!\n");
> >  	}
> >  
> >  	adev->gfx.in_reset = false;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 416908a..fd3691a8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -112,6 +112,7 @@ int amdgpu_prim_buf_per_se = 0;
> >  int amdgpu_pos_buf_per_se = 0;
> >  int amdgpu_cntl_sb_buf_per_se = 0;
> >  int amdgpu_param_buf_per_se = 0;
> > +int amdgpu_job_hang_limit = 0;
> >  
> >  MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in megabytes");
> >  module_param_named(vramlimit, amdgpu_vram_limit, int, 0600);
> > @@ -237,6 +238,9 @@ module_param_named(cntl_sb_buf_per_se, amdgpu_cntl_sb_buf_per_se, int, 0444);
> >  MODULE_PARM_DESC(param_buf_per_se, "the size of Off-Chip Pramater Cache per Shader Engine (default depending on gfx)");
> >  module_param_named(param_buf_per_se, amdgpu_param_buf_per_se, int, 0444);
> >  
> > +MODULE_PARM_DESC(job_hang_limit, "how much time allow a job hang and not drop it (default 0)");
> > +module_param_named(job_hang_limit, amdgpu_job_hang_limit, int ,0444);
> > +
> >  
> >  static const struct pci_device_id pciidlist[] = {
> >  #ifdef CONFIG_DRM_AMDGPU_SI
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index d7523d1..8de3bd3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -541,6 +541,12 @@ void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev)
> >  	}
> >  }
> >  
> > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring)
> > +{
> > +	if (ring)
> > +		amdgpu_fence_write(ring, ring->fence_drv.sync_seq);
> > +}
> > +
>
> The coding style is completely off.
>
> Additional to that I don't think that this is a
> good idea. We should probably rather just signal
> all scheduler fences instead and don't touch the
> hardware fence at all.
>
> > /*
> > * Common fence implementation
> > */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index 981ef08..03e88c6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > @@ -76,6 +76,7 @@ struct amdgpu_fence_driver {
> >  int amdgpu_fence_driver_init(struct amdgpu_device *adev);
> >  void amdgpu_fence_driver_fini(struct amdgpu_device *adev);
> >  void amdgpu_fence_driver_force_completion(struct amdgpu_device *adev);
> > +void amdgpu_fence_driver_force_completion_ring(struct amdgpu_ring *ring);
> >  
> >  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
> >  				  unsigned num_hw_submission);
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > index 6f4e31f..4e97e6d 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > @@ -390,9 +390,18 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched)
> >  					  &s_job->s_fence->cb)) {
> >  			fence_put(s_job->s_fence->parent);
> >  			s_job->s_fence->parent = NULL;
> > +			atomic_dec(&sched->hw_rq_count);
> >  		}
> >  	}
> > -	atomic_set(&sched->hw_rq_count, 0);
> > +	spin_unlock(&sched->job_list_lock);
> > +}
> > +
> > +void amd_sched_job_kickout(struct amd_sched_job *s_job)
> > +{
> > +	struct amd_gpu_scheduler *sched = s_job->sched;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	list_del_init(&s_job->node);
> >  	spin_unlock(&sched->job_list_lock);
> >  }
> >
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > index 8cb41d3..59694f3 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > @@ -81,6 +81,7 @@ struct amd_sched_job {
> >  	struct list_head		node;
> >  	struct delayed_work		work_tdr;
> >  	uint64_t			id;
> > +	atomic_t			karma;
> >  };
> >  
> >  extern const struct fence_ops amd_sched_fence_ops_scheduled;
> > @@ -96,6 +97,11 @@ static inline struct amd_sched_fence *to_amd_sched_fence(struct fence *f)
> >  	return NULL;
> >  }
> >  
> > +static inline bool amd_sched_invalidate_job(struct amd_sched_job *s_job, int threshold)
> > +{
> > +	return (s_job && atomic_inc_return(&s_job->karma) > threshold);
> > +}
> > +
>
> Again coding style is completely off.
>
> Christian.
>
> >  /**
> >   * Define the backend operations called by the scheduler,
> >   * these functions should be implemented in driver side
> > @@ -158,4 +164,5 @@ int amd_sched_job_init(struct amd_sched_job *job,
> >  		       void *owner);
> >  void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched);
> >  void amd_sched_job_recovery(struct amd_gpu_scheduler *sched);
> > +void amd_sched_job_kickout(struct amd_sched_job *s_job);
> >  #endif
> > --
> > 2.7.4
> >
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx