[Bug 204407] New: Bad page state in process Xorg

Fri Aug 2 20:33:44 UTC 2019

On Fri, Aug 02, 2019 at 01:23:06PM -0700, Andrew Morton wrote:
> > [259701.387365] BUG: Bad page state in process Xorg  pfn:2a300
> > [259701.393593] page:ffffea0000a8c000 refcount:0 mapcount:-128
> > mapping:0000000000000000 index:0x0

mapcount -128 is PAGE_MAPCOUNT_RESERVE, aka PageBuddy.  I think somebody
called put_page() once more than they should have.  The one before this
caused it to be freed to the page allocator, which set PageBuddy.  Then
this one happened and we got a complaint.

> > [259701.402832] flags: 0x2000000000000000()
> > [259701.407426] raw: 2000000000000000 ffffffff822ab778 ffffea0000a8f208
> > 0000000000000000
> > [259701.415900] raw: 0000000000000000 0000000000000003 00000000ffffff7f
> > 0000000000000000
> > [259701.424373] page dumped because: nonzero mapcount

It occurs to me that when a page is freed, we could record some useful bits
of information in the page from the stack trace to help debug double-free 
situations.  Even just stashing __builtin_return_address in page->mapping
would be helpful, I think.

> > [259701.549382] Call Trace:
> > [259701.549382]  dump_stack+0x46/0x60
> > [259701.549382]  bad_page.cold.28+0x81/0xb4
> > [259701.549382]  __free_pages_ok+0x236/0x240
> > [259701.549382]  __ttm_dma_free_page+0x2f/0x40
> > [259701.549382]  ttm_dma_unpopulate+0x29b/0x370
> > [259701.549382]  ttm_tt_destroy.part.6+0x44/0x50
> > [259701.549382]  ttm_bo_cleanup_memtype_use+0x29/0x70
> > [259701.549382]  ttm_bo_put+0x225/0x280
> > [259701.549382]  ttm_bo_vm_close+0x10/0x20
> > [259701.549382]  remove_vma+0x20/0x40
> > [259701.549382]  __do_munmap+0x2da/0x420
> > [259701.549382]  __vm_munmap+0x66/0xc0
> > [259701.549382]  __x64_sys_munmap+0x22/0x30
> > [259701.549382]  do_syscall_64+0x5e/0x1a0
> > [259701.549382]  ? prepare_exit_to_usermode+0x75/0xa0
> > [259701.549382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [259701.549382] RIP: 0033:0x7f504d0ec1d7
> > [259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c 00 f7
> > d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48> 3d 01
> > f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
> > [259701.549382] RSP: 002b:00007ffe529db138 EFLAGS: 00000206 ORIG_RAX:
> > 000000000000000b
> > [259701.549382] RAX: ffffffffffffffda RBX: 0000564a5eabce70 RCX:
> > 00007f504d0ec1d7
> > [259701.549382] RDX: 00007ffe529db140 RSI: 0000000000400000 RDI:
> > 00007f5044b65000
> > [259701.549382] RBP: 0000564a5eafe460 R08: 000000000000000b R09:
> > 000000010283e000
> > [259701.549382] R10: 0000000000000001 R11: 0000000000000206 R12:
> > 0000564a5e475b08
> > [259701.549382] R13: 0000564a5e475c80 R14: 00007ffe529db190 R15:
> > 0000000000000c80
> > [259701.707238] Disabling lock debugging due to kernel taint
> 
> I assume the above is misbehaviour in the DRM code?