[BUG 4.17] etnaviv-gpu f1840000.gpu: recover hung GPU!

Tue Jun 19 10:09:16 UTC 2018

Hi Russell,

On Tuesday, 19.06.2018 at 10:43 +0100, Russell King - ARM Linux wrote:
> It looks like a bug has crept in to etnaviv between 4.16 and 4.17,
> which causes etnaviv to misbehave with the GC600 GPU on Dove.  I
> don't think it's a GPU issue, I think it's a DRM issue.
> 
> I get multiple:
> 
> [  596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
> [  597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
> 
> while Xorg is starting up.  Ignore the "failed to reset", that
> just seems to be a property of the GC600, and of course is a
> subsequent issue after the primary problem.
> 
> Looking at the devcoredump:
> 
> 00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
> 
> So, all units on the GC600 were idle except for the front end.
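
For anyone reading along, the idle line above can be reproduced from the raw
register value. This is only a sketch: the unit order (FE in bit 0 up to TS in
bit 11) is inferred from the decoded dump line itself, and format_idle() is a
hypothetical helper, not something from the driver or the dump tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Unit names in bit order (bit 0 first), as printed in the dump line.
 * The ordering is an assumption taken from that line. */
static const char *const idle_units[] = {
	"FE", "DE", "PE", "SH", "PA", "SE", "RA", "TX", "VG", "IM", "FP", "TS",
};
#define NUM_IDLE_UNITS (sizeof(idle_units) / sizeof(idle_units[0]))

/*
 * Format an idle register value as "FE- DE+ ..." into buf.
 * '+' means the unit's idle bit is set, '-' means it is clear.
 * Returns the number of characters written, excluding the NUL.
 */
static size_t format_idle(uint32_t idle, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t i = 0; i < NUM_IDLE_UNITS; i++) {
		int n = snprintf(buf + used, len - used, "%s%s%c",
				 i ? " " : "", idle_units[i],
				 ((idle >> i) & 1) ? '+' : '-');
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* buffer too small: keep what fit */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

Feeding it 0x000000fe gives back the string from the dump: bits 1 to 7 are
set, bit 0 (FE) is clear.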
> 
> 00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
> 00000664 = 102d06d8 Command DMA address
> 00000668 = 380000c8 FE fetched word 0
> 0000066c = 0000001f FE fetched word 1
> 
> The front end was basically idle at this point, at a WAIT 200 command.
> Digging through the ring:
> 
> 00688: 08010e01 00000040  LDST 0x3804=0x00000040
> 00690: 40000002 102d06a0  LINK 0x102d06a0
> 00698: 40000002 102d0690  LINK 0x102d0690
> 006a0: 08010e04 0000001f  LDST 0x3810=0x0000001f
> 006a8: 40000025 102d3000  LINK 0x102d3000
> 006b0: 08010e03 00000008  LDST 0x380c=0x00000008 Flush PE2D
> 006b8: 08010e02 00000701  LDST 0x3808=0x00000701 SEM FE -> PE
> 006c0: 48000000 00000701  STALL FE -> PE
> 006c8: 08010e01 00000041  LDST 0x3804=0x00000041
> 006d0: 380000c8(0000001f) WAIT 200
> > 006d8: 40000002 102d06d0  LINK 0x102d06d0	 <===========
> 
> We've basically come to the end of the currently issued command stream
> and hit the wait-link loop.  Everything else in the devcoredump looks
> normal.
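
As a cross-check of the ring decode above, here is a rough sketch of how the
first word of each command splits up. The field layout (opcode in the top five
bits, LOAD_STATE count in bits 25:16, 16-bit offset/delay/prefetch in the low
half) is what the decoded lines in the dump imply; the enum and helper names
are made up for this example.

```c
#include <stdint.h>

/* Front-end opcodes as they appear in the ring dump (names are illustrative). */
enum fe_op {
	FE_OP_LOAD_STATE = 0x01,
	FE_OP_WAIT       = 0x07,
	FE_OP_LINK       = 0x08,
	FE_OP_STALL      = 0x09,
};

/* Opcode lives in the top five bits of the first command word. */
static inline unsigned int fe_opcode(uint32_t w)
{
	return w >> 27;
}

/* LOAD_STATE: number of state words that follow, bits 25:16. */
static inline unsigned int ldst_count(uint32_t w)
{
	return (w >> 16) & 0x3ff;
}

/* LOAD_STATE: low 16 bits are a word offset; shift by 2 for the byte address. */
static inline uint32_t ldst_addr(uint32_t w)
{
	return (w & 0xffff) << 2;
}

/* WAIT: delay count in the low 16 bits. */
static inline unsigned int wait_delay(uint32_t w)
{
	return w & 0xffff;
}

/* LINK: prefetch count in the low 16 bits; the target is the second word. */
static inline unsigned int link_prefetch(uint32_t w)
{
	return w & 0xffff;
}
```

With this, 0x380000c8 decodes as a WAIT with a delay of 200, 0x08010e01 as a
single-word LOAD_STATE to 0x3804, and 0x40000002 as a LINK with a prefetch of
2, matching the dump.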
> 
> So, I think etnaviv DRM has missed an event signalled from the GPU.

I don't see what would suddenly make us miss an event.

> This worked fine in 4.16, so seems to be a regression.

The only thing that comes to mind is that with the DRM scheduler we now
enforce a job timeout of 500ms. The previous logic, which allowed a job
to run indefinitely as long as it made progress, is gone, as that was a
serious QoS issue.

This might be what bites you here, if Xorg manages to submit a really
big job. The coredump might be delayed enough that it captures the GPU
state after the job has finally finished, even though the job timeout
had already been hit by then.
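
To illustrate the difference between the two policies, here is a small
userspace sketch, not the driver code. hung_fixed() models a plain 500ms
deadline; hung_progress() models a "no timeout while making progress" check.
Using the FE DMA address as the progress indicator, and all names below, are
assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define JOB_TIMEOUT_MS 500u

/* Fixed deadline: the job counts as hung once it has run for the full
 * timeout, no matter what the GPU is doing. */
static bool hung_fixed(uint32_t submitted_ms, uint32_t now_ms)
{
	return now_ms - submitted_ms >= JOB_TIMEOUT_MS;
}

/* State for the progress-based check (hypothetical). */
struct progress_watch {
	uint32_t last_dma_addr;		/* FE DMA address at the last check */
	uint32_t last_progress_ms;	/* time the address last changed */
};

/* Progress-based: the timeout restarts whenever the DMA address has moved,
 * so a long job is only declared hung after it stalls for the full timeout. */
static bool hung_progress(struct progress_watch *w, uint32_t dma_addr,
			  uint32_t now_ms)
{
	if (dma_addr != w->last_dma_addr) {
		w->last_dma_addr = dma_addr;
		w->last_progress_ms = now_ms;
		return false;
	}
	return now_ms - w->last_progress_ms >= JOB_TIMEOUT_MS;
}
```

A job that is still advancing at t=500ms is hung according to hung_fixed(),
but not according to hung_progress(), which only fires once the address has
stayed put for 500ms.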

Can you check whether changing the timeout value passed to
drm_sched_init() in etnaviv_sched.c to something large makes any
difference?

Regards,
Lucas