[BUG 4.17] etnaviv-gpu f1840000.gpu: recover hung GPU!
Lucas Stach
l.stach at pengutronix.de
Tue Jun 19 11:11:29 UTC 2018
On Tuesday, 19.06.2018, 12:00 +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 19, 2018 at 12:09:16PM +0200, Lucas Stach wrote:
> > Hi Russell,
> >
> > On Tuesday, 19.06.2018, 10:43 +0100, Russell King - ARM Linux wrote:
> > > It looks like a bug has crept in to etnaviv between 4.16 and 4.17,
> > > which causes etnaviv to misbehave with the GC600 GPU on Dove. I
> > > don't think it's a GPU issue, I think it's a DRM issue.
> > >
> > > I get multiple:
> > >
> > > [ 596.711482] etnaviv-gpu f1840000.gpu: recover hung GPU!
> > > [ 597.732852] etnaviv-gpu f1840000.gpu: GPU failed to reset: FE not idle, 3D not idle, 2D not idle
> > >
> > > while Xorg is starting up. Ignore the "failed to reset", that
> > > just seems to be a property of the GC600, and of course is a
> > > subsequent issue after the primary problem.
> > >
> > > Looking at the devcoredump:
> > >
> > > 00000004 = 000000fe Idle: FE- DE+ PE+ SH+ PA+ SE+ RA+ TX+ VG- IM- FP- TS-
> > >
> > > So, all units on the GC600 were idle except for the front end.
> > >
> > > 00000660 = 00000812 Cmd: [wait DMA: idle Fetch: valid] Req idle Cal idle
> > > 00000664 = 102d06d8 Command DMA address
> > > 00000668 = 380000c8 FE fetched word 0
> > > 0000066c = 0000001f FE fetched word 1
> > >
> > > The front end was basically idle at this point, at a WAIT 200 command.
> > > Digging through the ring:
> > >
> > > 00688: 08010e01 00000040 LDST 0x3804=0x00000040
> > > 00690: 40000002 102d06a0 LINK 0x102d06a0
> > > 00698: 40000002 102d0690 LINK 0x102d0690
> > > 006a0: 08010e04 0000001f LDST 0x3810=0x0000001f
> > > 006a8: 40000025 102d3000 LINK 0x102d3000
> > > 006b0: 08010e03 00000008 LDST 0x380c=0x00000008 Flush PE2D
> > > 006b8: 08010e02 00000701 LDST 0x3808=0x00000701 SEM FE -> PE
> > > 006c0: 48000000 00000701 STALL FE -> PE
> > > 006c8: 08010e01 00000041 LDST 0x3804=0x00000041
> > > 006d0: 380000c8(0000001f) WAIT 200
> > > > 006d8: 40000002 102d06d0 LINK 0x102d06d0 <===========
> > >
> > > We've basically come to the end of the currently issued command stream
> > > and hit the wait-link loop. Everything else in the devcoredump looks
> > > normal.
> > >
> > > So, I think etnaviv DRM has missed an event signalled from the GPU.
> >
> > I don't see what would suddenly make us miss an event.
> >
> > > This worked fine in 4.16, so seems to be a regression.
> >
> > The only thing that comes to mind is that with the DRM scheduler we
> > now enforce a job timeout of 500ms, without the previous logic that
> > allowed a job to run indefinitely as long as it made progress, since
> > allowing that is a serious QoS issue.
>
> That is probably what's going on then - the GC600 is not particularly
> fast when dealing with 1080p resolutions.
>
> I think what your commit to use the DRM scheduler is missing is the
> progress detection in the original scheme - we used to assume that if
> the GPU FE DMA address had progressed, the GPU was not hung.
> Now it seems we merely do this by checking for events.
It was a deliberate decision to remove this, as it's a potential DoS
vector: a rogue client can basically starve all other clients of GPU
access by submitting a job that runs for a very long time, as long as
it keeps making some progress.
> > This might bite you at this point, if Xorg manages to submit a really
> > big job. The coredump might be delayed enough that it captures the
> > state of the GPU when it has managed to finish the job after the job
> > timeout was hit.
>
> No, it's not "a really big job" - it's just that the Dove GC600 is not
> fast enough to complete _two_ 1080p-sized GPU operations within 500ms.
> The preceding job contained two blits - one of them a non-alphablend
> copy of:
>
> 00180000 04200780 0,24,1920,1056 -> 0,24,1920,1056
>
> and one an alpha blended copy of:
>
> 00000000 04380780 0,0,1920,1080 -> 0,0,1920,1080
>
> This is (iirc) something I already fixed with the addition of the
> progress detection back before etnaviv was merged into the mainline
> kernel.
I hadn't expected it to be this slow. I see that we might need to bring
back the progress detection to fix the userspace regression, but I'm
not fond of this, as it might lead to really bad QoS.
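If we do bring it back, I'd expect it to look roughly like the sketch
below in the scheduler timeout path. Completely untested: gpu_read(),
VIVS_FE_DMA_ADDRESS and the recovery calls exist in the driver today,
but the cached hangcheck_dma_addr field and the re-arming of the
scheduler timeout are hand-waved here.

/*
 * Rough sketch: check for FE progress before declaring the job hung.
 * gpu->hangcheck_dma_addr is assumed here just to cache the last
 * observed FE DMA address between timeout invocations.
 */
static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
{
        struct etnaviv_gem_submit *submit = to_etnaviv_submit(sched_job);
        struct etnaviv_gpu *gpu = submit->gpu;
        u32 dma_addr;

        /*
         * If the FE DMA address moved since we last looked, the GPU is
         * still chewing on the job - treat that as progress and don't
         * recover (re-arming the timeout is elided in this sketch).
         */
        dma_addr = gpu_read(gpu, VIVS_FE_DMA_ADDRESS);
        if (dma_addr != gpu->hangcheck_dma_addr) {
                gpu->hangcheck_dma_addr = dma_addr;
                return;
        }

        /* no progress: existing recovery path */
        kthread_park(gpu->sched.thread);
        drm_sched_hw_job_reset(&gpu->sched, sched_job);

        etnaviv_core_dump(gpu);
        etnaviv_gpu_recover_hang(gpu);

        drm_sched_job_recovery(&gpu->sched);
        kthread_unpark(gpu->sched.thread);
}

The obvious downside is the DoS angle from above: a client could keep
the GPU to itself for as long as it manages to trickle out forward
progress.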
I would prefer userspace to track the size of the blits and flush the
cmdstream at an appropriate time, so we don't end up with really
long-running jobs, but I'm not sure if this would be acceptable to
you...
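
Very roughly, something like the following on the userspace side. This
is only an illustrative sketch: the emit helper, the pixel counter and
the threshold are made up, and etna_cmd_stream_flush() is assumed to be
the libdrm etnaviv flush entry point.

/*
 * Keep a running count of pixels touched by queued blits and kick off
 * the command stream before a single submit grows past what the GPU
 * can retire comfortably within the kernel's 500ms job timeout.
 */
#include <etnaviv_drmif.h>

#define MAX_PIXELS_PER_SUBMIT   (1920 * 1080)   /* needs tuning per GPU */

static unsigned long queued_pixels;

static void queue_blit(struct etna_cmd_stream *stream,
                       unsigned int width, unsigned int height)
{
        emit_blit(stream, width, height);       /* hypothetical emit helper */

        queued_pixels += (unsigned long)width * height;
        if (queued_pixels >= MAX_PIXELS_PER_SUBMIT) {
                etna_cmd_stream_flush(stream);  /* submit what we have */
                queued_pixels = 0;
        }
}

That would keep individual jobs bounded without the kernel having to
guess whether a long-running job is legitimate.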
Regards,
Lucas