[REGRESSION] drm/etnaviv: command buffer outside valid memory window

Lucas Stach l.stach at pengutronix.de
Thu Jun 27 14:49:30 UTC 2019


On Thursday, 27.06.2019 at 15:32 +0100, Russell King - ARM Linux admin wrote:
> On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin wrote:
> > On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > > While updating my various systems for the TCP SACK issue, I notice
> > > > that while most platforms are happy, the Cubox-i4 is not.  During
> > > > boot, we get:
> > > > 
> > > > [    0.000000] cma: Reserved 256 MiB at 0x30000000
> > > > ...
> > > > [    0.000000] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> > > > [    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> > > > [    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> > > > [    0.000000] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> > > > ...
> > > > [   13.101098] etnaviv-gpu 130000.gpu: command buffer outside valid memory window
> > > > [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window
> > > 
> > > Yes, that's a regression due to different default CMA area placement
> > > and etnaviv not being smart enough to move the linear window to the
> > > right offset.
> > 
> > As it's a user visible regression, it needs fixing, either by reverting
> > the changes that caused it or by some other issue.  In the kernel, the
> > policy is "if a bug fix causes a regression, the bug fix was itself
> > wrong".  We don't fix one person's bug if it causes a regression for
> > someone else.
> > 
> > Please resolve the acknowledged regression.

The regression is caused by a different CMA placement, which is
outside the control of etnaviv. If you can point to the commit that
changed the placement, we can work with the authors/maintainers of
that code to get rid of the regression. Currently I don't have the
bandwidth to pinpoint the offending code change.
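
To illustrate why the CMA placement matters, here is a minimal sketch of
the kind of range check behind the "command buffer outside valid memory
window" message. It is not the driver's actual code: the function name,
the 2 GiB window size and the window base addresses used in the usage
note are assumptions made for illustration. Only the CMA address
0x30000000 is taken from the boot log above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the GPU's linear window covers 2 GiB starting at window_base. */
#define LINEAR_WINDOW_SIZE UINT64_C(0x80000000)

/*
 * Hypothetical helper: return true if the buffer
 * [buf_start, buf_start + buf_size) lies entirely inside
 * [window_base, window_base + LINEAR_WINDOW_SIZE).
 */
bool cmdbuf_in_linear_window(uint64_t buf_start, uint64_t buf_size,
                             uint64_t window_base)
{
    uint64_t offset;

    /* A buffer that starts below the window can never be reached. */
    if (buf_start < window_base)
        return false;

    /* Offset of the buffer start relative to the window base. */
    offset = buf_start - window_base;

    /* Written this way to avoid overflow in offset + buf_size. */
    return buf_size <= LINEAR_WINDOW_SIZE &&
           offset <= LINEAR_WINDOW_SIZE - buf_size;
}
```

With a buffer allocated from CMA at 0x30000000 and an assumed window
base of 0x80000000, the check fails because the buffer sits below the
window. With an assumed window base of 0x10000000 the same buffer would
be inside it.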

> > > > and shortly after the login prompt appears, the entire SoC appears to
> > > > lock up - it becomes unresponsive on the network, or via serial console
> > > > to sysrq requests.
> > > > 
> > > > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > > > as a result of the above two etnaviv errors when Xorg attempts to start
> > > > using the GPU.
> > > 
> > > This should not be possible. The driver notices that the command buffer
> > > isn't accessible to the GPU, which aborts the GPU init. While the
> > > etnaviv DRM device is still accessible, it will not expose any
> > > enumerable GPU cores to userspace. So there is no way for userspace to
> > > actually submit GPU commands.
> > 
> > Yep, I came to that conclusion.  Nevertheless, if I allow Xorg to start
> > with 5.1, the system totally hangs shortly thereafter.  I need to try
> > without etnaviv loaded at all.
> 
> Well, it seems to get worse.  I just tried to unload etnaviv, and was
> greeted by this oops.  It's another regression; etnaviv used to unload
> perfectly fine.  Please can you add module unload testing to your
> workflow?

As you can see from the patch I've just sent, this is a missing error
cleanup. So it's really the same regression. A module unload after
successful init of all GPU cores doesn't show this crash. The issue is
only unmasked due to the CMA placement regression.
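
As a generic illustration of this class of bug (not the actual patch,
and not the etnaviv data structures), the sketch below shows an init
path that releases what it already allocated when a later step fails,
so that a later teardown finds a consistent state. All names here are
hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-core state; a stand-in, not the etnaviv structures. */
struct core {
    void *resource;    /* something allocated early during init */
    bool initialized;  /* set only once init has fully succeeded */
};

/* fail_late simulates a failure in a later init step. */
int core_init(struct core *c, bool fail_late)
{
    c->initialized = false;
    c->resource = malloc(64);
    if (!c->resource)
        return -1;

    if (fail_late)
        goto err_free;

    c->initialized = true;
    return 0;

err_free:
    /* The error path undoes the earlier allocation. */
    free(c->resource);
    c->resource = NULL;
    return -1;
}

/* Safe to call after a failed or a successful init. */
void core_teardown(struct core *c)
{
    free(c->resource);  /* free(NULL) is a no-op */
    c->resource = NULL;
    c->initialized = false;
}
```

Without the err_free path, a failed init would leave the resource
behind, and the problem would only show up once teardown runs, for
example on module unload.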

Regards,
Lucas

