[Bug 107762] [Intel GFX CI] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137

bugzilla-daemon at freedesktop.org
Thu Sep 6 15:16:07 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=107762

Michel Dänzer <michel at daenzer.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ckoenig.leichtzumerken at gmail.com,
                   |                            |dev at lynxeye.de

--- Comment #2 from Michel Dänzer <michel at daenzer.net> ---
(In reply to Martin Peres from comment #0)
> [  358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> [  358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145

(In reply to Martin Peres from comment #1)
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147

Hmm, the signalled and emitted sequence numbers are the same in every message,
meaning all emitted work had already signalled and the hardware hasn't actually
timed out?
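
A minimal sketch of that comparison, in case it helps when going through
longer logs. It assumes the message format quoted above; the function names
and the SAMPLE string are only illustrative and not part of amdgpu or any
existing tool.

```python
import re

# Matches the amdgpu timeout message format quoted in this comment.
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)

def parse_timeouts(log_text):
    """Return (ring, signaled, emitted) for each timeout message found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in TIMEOUT_RE.finditer(log_text)
    ]

def pending_jobs(log_text):
    """Return only the entries where some emitted work had not signalled."""
    return [t for t in parse_timeouts(log_text) if t[1] != t[2]]

# The two lines from comment #0, used here as sample input.
SAMPLE = """\
[  358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
[  358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145
"""
```

An empty result from pending_jobs(SAMPLE) would mean every reported timeout
had signalled == emitted, i.e. no work was outstanding on the ring when the
message was printed.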

I can think of two possibilities:

* A GPU scheduler bug causing the job timeout handling to be triggered
spuriously. (Could something be stalling the system work queue, so the items
scheduled by drm_sched_job_finish_cb can't call drm_sched_job_finish in time?)

* A problem with the handling of the GPU's interrupts. Do the numbers on the
amdgpu line in /proc/interrupts still increase after these messages appear,
or at least in the ten seconds before they appear?
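
For the interrupt check, something along these lines could be used to sample
the counters. This is only a rough sketch: it assumes the usual
/proc/interrupts layout (a header row of CPU columns, then one row per IRQ
with per-CPU counts) and that the relevant rows contain the string "amdgpu".
The helper names irq_total and sample_delta are made up for illustration.

```python
import time

def irq_total(interrupts_text, name="amdgpu"):
    """Sum the per-CPU counts on every row that mentions `name`."""
    lines = interrupts_text.splitlines()
    if not lines:
        return 0
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        if name not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("42:"), followed by one count per CPU.
        total += sum(int(f) for f in fields[1:ncpus + 1] if f.isdigit())
    return total

def sample_delta(path="/proc/interrupts", interval=10.0, name="amdgpu"):
    """Read the counters twice, `interval` seconds apart; return the increase."""
    with open(path) as f:
        before = irq_total(f.read(), name)
    time.sleep(interval)
    with open(path) as f:
        after = irq_total(f.read(), name)
    return after - before
```

Calling sample_delta() on the affected machine shortly before and after the
timeout messages show up would tell whether the count is still going up; a
delta of 0 while the GPU is supposed to be busy would point towards the
interrupt handling.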
