[PATCH 02/10] drm/etnaviv: mmuv2: don't map zero page

Mon Jan 7 09:13:24 UTC 2019

Hi,
On Mon, Jan 07, 2019 at 09:50:52AM +0100, Lucas Stach wrote:
> Hi Guido,
> 
> On Sunday, 30.12.2018 at 16:49 +0100, Guido Günther wrote:
> > Hi Lucas,
> > On Wed, Dec 19, 2018 at 03:45:38PM +0100, Lucas Stach wrote:
> > > Keep the page at address 0 as faulting to catch any potential state
> > > setup issues early.
> > 
> > This is a nice idea! But applying this and making mesa hit that page
> > leads to the process hanging in D state over here on GC7000:
> > 
> > [  242.726192] INFO: task kworker/u8:2:37 blocked for more than 120 seconds.
> > [  242.733010]       Not tainted 4.18.0-00129-gce2b21074b41 #504
> > [  242.738795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  242.746638] kworker/u8:2    D    0    37      2 0x00000028
> > [  242.752144] Workqueue: events_unbound commit_work
> > [  242.756860] Call trace:
> > [  242.759318]  __switch_to+0x94/0xd0
> > [  242.762741]  __schedule+0x1c0/0x6b8
> > [  242.766239]  schedule+0x40/0xa8
> > [  242.769380]  schedule_timeout+0x2f0/0x428
> > [  242.773410]  dma_fence_default_wait+0x1cc/0x2b8
> > [  242.777951]  dma_fence_wait_timeout+0x44/0x1b0
> > [  242.782403]  drm_atomic_helper_wait_for_fences+0x48/0x108
> > [  242.787819]  commit_tail+0x30/0x80
> > [  242.791229]  commit_work+0x20/0x30
> > [  242.794642]  process_one_work+0x1ec/0x458
> > [  242.798659]  worker_thread+0x48/0x430
> > [  242.802331]  kthread+0x130/0x138
> > [  242.805557]  ret_from_fork+0x10/0x1c
> > 
> > This is in dmesg showing that we hit the first page:
> > 
> >     [   65.907388] etnaviv-gpu 38000000.gpu: MMU fault status 0x00000002
> >     [   65.913497] etnaviv-gpu 38000000.gpu: MMU 0 fault addr 0x00000e40
> > 
> > Without that patch it's sampling random data from that page but does not hang.
> 
> GPU hangs after an MMU fault are expected. Or, more accurately, we
> actively request the GPU to stop by setting the exception bit in the
> page table.

Yeah. I put that in to show that this is the cause of the trouble above.
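
For reference, a minimal sketch of why the fault address in the log falls
into the first page. The 4 KiB page size (PAGE_SHIFT = 12) is an assumption
for illustration, it is not stated in the log, and the helper names are
hypothetical:

```python
# Sketch: find which page a faulting GPU address falls into.
# PAGE_SHIFT = 12 (4 KiB pages) is an assumption, not taken from the log.
PAGE_SHIFT = 12


def page_index(addr: int, page_shift: int = PAGE_SHIFT) -> int:
    """Return the index of the page that contains addr."""
    return addr >> page_shift


def page_offset(addr: int, page_shift: int = PAGE_SHIFT) -> int:
    """Return the byte offset of addr within its page."""
    return addr & ((1 << page_shift) - 1)


# Value from the "MMU 0 fault addr" line in dmesg.
fault_addr = 0x00000E40
print(page_index(fault_addr), hex(page_offset(fault_addr)))
```

Under that assumption the address lands in page 0 at offset 0xe40, which
matches the page the patch leaves unmapped.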

> 
> A hanging GPU should trigger the scheduler timeout handler, which then
> makes sure to get the GPU back into a working state. So if things don't
> progress after the fault for you, either the timeout handler is buggy on
> GC7000, or the fence signaling is broken somehow. I'll take a look at
> this.

This tree isn't based on current linux-next yet, so if you're not seeing
this, let me forward-port our stuff to that and report back.

Cheers,
 -- Guido