[BUG 4.17] etnaviv-gpu f1840000.gpu: recover hung GPU!

Tue Jun 19 11:00:21 UTC 2018

On Tue, Jun 19, 2018 at 12:09:16PM +0200, Lucas Stach wrote:
> Hi Russell,
> 
> On Tuesday, 19.06.2018 at 10:43 +0100, Russell King - ARM Linux wrote:
> > It looks like a bug has crept in to etnaviv between 4.16 and 4.17,
> > which causes etnaviv to misbehave with the GC600 GPU on Dove.  I
> > don't think it's a GPU issue, I think it's a DRM issue.
> > 
> > I get multiple:
> > 
> > [  596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
> > [  597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
> > 
> > while Xorg is starting up.  Ignore the "failed to reset", that
> > just seems to be a property of the GC600, and of course is a
> > subsequent issue after the primary problem.
> > 
> > Looking at the devcoredump:
> > 
> > 00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
> > 
> > So, all units on the GC600 were idle except for the front end.
> > 
> > 00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
> > 00000664 = 102d06d8 Command DMA address
> > 00000668 = 380000c8 FE fetched word 0
> > 0000066c = 0000001f FE fetched word 1
> > 
> > The front end was basically idle at this point, at a WAIT 200 command.
> > Digging through the ring:
> > 
> > 00688: 08010e01 00000040  LDST 0x3804=0x00000040
> > 00690: 40000002 102d06a0  LINK 0x102d06a0
> > 00698: 40000002 102d0690  LINK 0x102d0690
> > 006a0: 08010e04 0000001f  LDST 0x3810=0x0000001f
> > 006a8: 40000025 102d3000  LINK 0x102d3000
> > 006b0: 08010e03 00000008  LDST 0x380c=0x00000008 Flush PE2D
> > 006b8: 08010e02 00000701  LDST 0x3808=0x00000701 SEM FE -> PE
> > 006c0: 48000000 00000701  STALL FE -> PE
> > 006c8: 08010e01 00000041  LDST 0x3804=0x00000041
> > 006d0: 380000c8(0000001f) WAIT 200
> > > 006d8: 40000002 102d06d0  LINK 0x102d06d0	 <===========
> > 
> > We've basically come to the end of the currently issued command stream
> > and hit the wait-link loop.  Everything else in the devcoredump looks
> > normal.
> > 
> > So, I think etnaviv DRM has missed an event signalled from the GPU.
> 
> I don't see what would make us miss an event suddenly.
> 
> > This worked fine in 4.16, so seems to be a regression.
> 
> The only thing that comes to mind is that with the DRM scheduler we
> enforce a job timeout of 500ms, without the previous logic to allow a
> job to run indefinitely as long as it makes progress, as this is a
> serious QoS issue.

That is probably what's going on then - the GC600 is not particularly
fast when dealing with 1080p resolutions.

I think what your commit to use the DRM scheduler is missing is the
progress detection from the original scheme - we used to assume that if
the GPU FE DMA address had progressed, the GPU was not hung.  Now it
seems we decide this merely by checking for events.
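
As an illustration of the kind of progress check described above, here
is a minimal sketch in Python (for brevity, not kernel C).  It is not
the etnaviv code: the class, method and return values are hypothetical,
and the 16-byte window is an assumption, chosen so that the FE bouncing
between the WAIT and LINK of the wait-link loop (0x6d0/0x6d8 in the ring
dump) does not count as progress.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HangCheck:
    """Hypothetical watchdog state; not the etnaviv data structure."""
    window: int = 16                      # assumed size of the wait-link loop
    last_dma_addr: Optional[int] = None   # FE DMA address at the last check

    def on_timeout(self, dma_addr: int, job_done: bool) -> str:
        """Called when the job timeout fires.

        Returns "done" if the job finished, "extend" if the FE DMA
        address moved outside the window since the last check (the GPU
        is slow but alive), or "recover" if it did not move.
        """
        if job_done:
            return "done"
        if self.last_dma_addr is None:
            moved = True
        else:
            delta = dma_addr - self.last_dma_addr
            moved = delta < 0 or delta > self.window
        if moved:
            self.last_dma_addr = dma_addr
            return "extend"
        return "recover"


hc = HangCheck()
print(hc.on_timeout(0x102d0100, job_done=False))  # first check
print(hc.on_timeout(0x102d06d0, job_done=False))  # address advanced
print(hc.on_timeout(0x102d06d8, job_done=False))  # only WAIT -> LINK
```

With a check like this, a slow but advancing job gets its timeout
extended instead of triggering recovery; only an FE that has not moved
between two consecutive checks is treated as hung.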

> This might bite you at this point, if Xorg manages to submit a really
> big job. The coredump might be delayed enough that it captures the
> state of the GPU when it has managed to finish the job after the job
> timeout was hit.

No, it's not "a really big job" - it's just that the Dove GC600 is not
fast enough to complete _two_ 1080p sized GPU operations within 500ms.
The preceding job contained two blits - one of them a non-alphablend
copy of:

                00180000 04200780  0,24,1920,1056 -> 0,24,1920,1056

and one an alpha blended copy of:

                00000000 04380780  0,0,1920,1080 -> 0,0,1920,1080

This is (iirc) something I already fixed with the addition of the
progress detection back before etnaviv was merged into the mainline
kernel.
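
For a rough sense of scale, the two rectangles quoted above can be
unpacked and sized with a short Python sketch.  The packing (y in the
upper 16 bits, x in the lower 16) is inferred from the decoded values
shown next to the raw words, and treating the four numbers as corner
coordinates x1,y1,x2,y2 is an assumption; the function names are
hypothetical.

```python
def decode_rect(word0: int, word1: int) -> tuple:
    """Unpack two 32-bit words into (x1, y1, x2, y2).

    Assumes each word is (y << 16) | x, as the dump's decode suggests.
    """
    x1, y1 = word0 & 0xFFFF, word0 >> 16
    x2, y2 = word1 & 0xFFFF, word1 >> 16
    return x1, y1, x2, y2


def area(rect: tuple) -> int:
    """Pixel count, assuming the rect holds corner coordinates."""
    x1, y1, x2, y2 = rect
    return (x2 - x1) * (y2 - y1)


copy_rect = decode_rect(0x00180000, 0x04200780)   # non-alphablend copy
blend_rect = decode_rect(0x00000000, 0x04380780)  # alpha blended copy

total_pixels = area(copy_rect) + area(blend_rect)
timeout_s = 0.5  # the 500ms job timeout mentioned above

print(copy_rect, blend_rect)
print("pixels in job:", total_pixels)
print("pixels/s needed to beat the timeout:", total_pixels / timeout_s)
```

Under those assumptions the job touches roughly four million
destination pixels, so it has to sustain about eight million pixels per
second just to finish inside the timeout.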

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up