[BUG 4.17] etnaviv-gpu f1840000.gpu: recover hung GPU!
Russell King - ARM Linux
linux at armlinux.org.uk
Wed Jul 11 20:13:44 UTC 2018
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> See attached patch (which I apparently forgot to send out). The DRM GPU
> scheduler has some tracepoints, which might be helpful. The attached
> patch adds a drm_sched_job_run tracepoint when a job is queued in the
> hardware ring. Together with the existing drm_sched_process_job, this
> should get you an idea how long a job takes to process. Note that at
> any time up to 4 jobs are allowed in the hardware queue, so you need to
> match up the end times.
Here's the trace:
Xorg 1009 [000] 276.593018: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=144, fence=0xeb849580, ring=ef16bd00, job count:0, hw job count:0
gnome-software 1384 [000] 276.673634: gpu_scheduler:drm_sched_process_job: fence=0xeb849580 signaled
Xorg 1009 [000] 278.756538: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=145, fence=0xecaaef40, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 279.130196: gpu_scheduler:drm_sched_process_job: fence=0xecaaef40 signaled
Xorg 1009 [000] 279.856129: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=146, fence=0xecbdd7c0, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 279.856542: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=147, fence=0xeb849700, ring=ef16bd00, job count:0, hw job count:1
kswapd0 26 [000] 281.532808: gpu_scheduler:drm_sched_process_job: fence=0xecbdd7c0 signaled
gnome-software 1384 [000] 281.560666: gpu_scheduler:drm_sched_process_job: fence=0xeb849700 signaled
Xorg 1009 [000] 282.554920: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=148, fence=0xecad3a00, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 283.956216: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=149, fence=0xeb849700, ring=ef16bd00, job count:0, hw job count:1
Xorg 1009 [000] 284.149557: gpu_scheduler:drm_sched_process_job: fence=0xecad3a00 signaled
Xorg 1009 [000] 285.465863: gpu_scheduler:drm_sched_process_job: fence=0xeb849700 signaled
Xorg 1009 [000] 286.796692: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=150, fence=0xecbdd340, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 288.549507: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=151, fence=0xecbdd280, ring=ef16bd00, job count:0, hw job count:1
Xorg 1009 [000] 288.856277: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=152, fence=0xecaaee80, ring=ef16bd00, job count:0, hw job count:2
Xorg 1009 [000] 288.856713: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=153, fence=0xecad3040, ring=ef16bd00, job count:0, hw job count:3
gnome-software 1384 [000] 288.958764: gpu_scheduler:drm_sched_process_job: fence=0xecbdd340 signaled
Xorg 1009 [000] 290.052842: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=154, fence=0xecbdde80, ring=ef16bd00, job count:0, hw job count:3
Xorg 1009 [000] 291.682222: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=155, fence=0xecaaeb80, ring=ef16bd00, job count:0, hw job count:4
gnome-software 1384 [000] 291.760295: gpu_scheduler:drm_sched_process_job: fence=0xecbdd280 signaled
gnome-software 1384 [000] 292.018437: gpu_scheduler:drm_sched_process_job: fence=0xecaaee80 signaled
Xorg 1009 [000] 292.046721: gpu_scheduler:drm_sched_process_job: fence=0xecad3040 signaled
Xorg 1009 [000] 292.819475: gpu_scheduler:drm_sched_process_job: fence=0xecbdde80 signaled
Xorg 1009 [000] 293.149829: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=156, fence=0xecad3f40, ring=ef16bd00, job count:0, hw job count:1
gnome-software 1384 [000] 293.190607: gpu_scheduler:drm_sched_process_job: fence=0xecaaeb80 signaled
gnome-software 1384 [000] 293.190720: gpu_scheduler:drm_sched_process_job: fence=0xecad3f40 signaled
Xorg 1009 [000] 293.463414: gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=157, fence=0xecbddd00, ring=ef16bd00, job count:0, hw job count:0
Xorg 1009 [000] 293.463629: gpu_scheduler:drm_sched_process_job: fence=0xecbddd00 signaled
So:
Job Start End Time on queue:
144 276.593018 276.673634 80ms
145 278.756538 279.130196 374ms
146 279.856129 281.532808 1.7s
147 279.856542 281.560666 1.7s (overlaps 146)
148 282.554920 284.149557 1.6s
149 283.956216 285.465863 1.5s (overlaps 148)
150 286.796692 288.958764 2.2s
151 288.549507 291.760295 3.2s (overlaps 150)
152 288.856277 292.018437 3.2s (overlaps 150-151)
153 288.856713 292.046721 3.2s (overlaps 150-152)
154 290.052842 292.819475 2.8s (overlaps 151-153)
155 291.682222 293.190607 1.5s (overlaps 151-154)
156 293.149829 293.190720 40ms (overlaps 155)
157 293.463414 293.463629 215us
The time for jobs 147, 149, 151-155 are misleading because they are
delayed by the preceeding job. Correcting for that,
Time executing:
147 28ms
149 1.3s
151 2.8s
152 250ms
153 28ms
154 780ms
155 370ms
So, sorted by time apparently executing:
151 2.8s
150 286.796692 288.958764 2.2s
146 279.856129 281.532808 1.7s
148 282.554920 284.149557 1.6s
149 1.3s
154 780ms
145 278.756538 279.130196 374ms
155 370ms
152 250ms
144 276.593018 276.673634 80ms
156 293.149829 293.190720 40ms
153 28ms
147 28ms
157 293.463414 293.463629 215us
Of course, from the perf dump, we don't know what the GPU was being
asked to do.
I've applied your forward-progress patch, and with the 5s timeout
reverted, it seems to "behave" without causing a timeout.
--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up
More information about the dri-devel
mailing list