[Intel-gfx] 5.3-rc3: Frozen graphics with kcompactd migrating i915 pages

Hillf Danton hdanton at sina.com
Fri Aug 9 14:31:27 UTC 2019


[off topic: plain text mail please]

On Fri, 9 Aug 2019 12:41:42 +0000 Martin Wilck wrote:
> 
> This happened to me today, running kernel 5.3.0-rc3-1.g571863b-default
> (5.3-rc3 with just a few patches on top), after starting a KVM virtual
> machine. The X screen was frozen. Remote login via ssh was still
> possible, thus I was able to retrieve basic logs.

Thanks for the report.
> 
> sysrq-w showed two blocked processes (kcompactd0 and KVM). After a
> minute, the same two processes were still blocked. KVM seems to try to
> acquire a lock that kcompactd is holding. kcompactd is waiting for IO
> to complete on pages owned by the i915 driver.
> 
> kcompactd stack:
> 
> Aug 09 12:12:48 apollon.suse.de kernel: sysrq: Show Blocked State
> Aug 09 12:12:48 apollon.suse.de kernel: task                        PC stack   pid father
> Aug 09 12:12:48 apollon.suse.de kernel: kcompactd0      D    0    43      2 0x80004000
> Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
> Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
> Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
> Aug 09 12:12:48 apollon.suse.de kernel:  io_schedule+0x12/0x40
> Aug 09 12:12:48 apollon.suse.de kernel:  __lock_page+0x123/0x200
> Aug 09 12:12:48 apollon.suse.de kernel:  ? gen8_ppgtt_clear_pdp+0xc0/0x140 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  ? file_fdatawait_range+0x20/0x20
> Aug 09 12:12:48 apollon.suse.de kernel:  set_page_dirty_lock+0x49/0x50
> Aug 09 12:12:48 apollon.suse.de kernel:  i915_gem_userptr_put_pages+0x13f/0x1c0 [i915]

The two lines above show that commit aa56a292ce62 ("drm/i915/userptr:
Acquire the page lock around set_page_dirty()") is the culprit.

> Aug 09 12:12:48 apollon.suse.de kernel:  __i915_gem_object_put_pages+0x5e/0xa0 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  userptr_mn_invalidate_range_start+0x1ff/0x220 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? __mod_lruvec_state+0x3f/0xf0
> Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0

The page is locked before try_to_unmap() is called, and a dirty page
table entry is handled in try_to_unmap_one(), so the locking added in
aa56a292ce62 is excessive in this call trace: set_page_dirty_lock() ends
up waiting in __lock_page(). A bigger pain is that the commit cannot
simply be reverted, because it carries a Fixes tag.

> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
> Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? fast_isolate_freepages+0x6b0/0x6b0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
> Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
> Aug 09 12:12:48 apollon.suse.de kernel:  ? entry_SYSCALL_64_after_hwframe+0xb8/0xbe
> Aug 09 12:12:48 apollon.suse.de kernel:  kcompactd_do_work+0x120/0x290
> 
> KVM stack:
> 
> Aug 09 12:12:48 apollon.suse.de kernel: CPU 0/KVM       D    0 25189      1 0x00000320
> Aug 09 12:12:48 apollon.suse.de kernel: Call Trace:
> Aug 09 12:12:48 apollon.suse.de kernel:  ? __schedule+0x2af/0x6a0
> Aug 09 12:12:48 apollon.suse.de kernel:  schedule+0x33/0x90
> Aug 09 12:12:48 apollon.suse.de kernel:  schedule_preempt_disabled+0xa/0x10
> Aug 09 12:12:48 apollon.suse.de kernel:  __mutex_lock.isra.0+0x172/0x4d0
> Aug 09 12:12:48 apollon.suse.de kernel:  userptr_mn_invalidate_range_start+0x1bf/0x220 [i915]
> Aug 09 12:12:48 apollon.suse.de kernel:  __mmu_notifier_invalidate_range_start+0x57/0xa0
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap_one+0xa0b/0xae0
> Aug 09 12:12:48 apollon.suse.de kernel:  rmap_walk_file+0xf2/0x250
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_unmap+0xa6/0xe0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_remove_rmap+0x290/0x290
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_not_mapped+0x20/0x20
> Aug 09 12:12:48 apollon.suse.de kernel:  ? page_get_anon_vma+0x80/0x80
> Aug 09 12:12:48 apollon.suse.de kernel:  migrate_pages+0x8cd/0xbc0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? fast_isolate_freepages+0x6b0/0x6b0
> Aug 09 12:12:48 apollon.suse.de kernel:  ? move_freelist_tail+0xb0/0xb0
> Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone+0x669/0xc80
> Aug 09 12:12:48 apollon.suse.de kernel:  compact_zone_order+0xc6/0xf0
> Aug 09 12:12:48 apollon.suse.de kernel:  try_to_compact_pages+0xcc/0x2a0
> Aug 09 12:12:48 apollon.suse.de kernel:  __alloc_pages_direct_compact+0x7c/0x150
> Aug 09 12:12:48 apollon.suse.de kernel:  __alloc_pages_slowpath+0x1ee/0xd00
> Aug 09 12:12:48 apollon.suse.de kernel:  ? vmx_vcpu_load+0x100/0x120 [kvm_intel]
> 
> Full logs can be found under https://pastebin.com/KJ6tccj4
> I haven't yet tried if this is reproducible.

Set the page dirty unless someone else holds the page lock and is taking
care of it.

--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -663,7 +663,7 @@ i915_gem_userptr_put_pages(struct drm_i9
 	i915_gem_gtt_finish_pages(obj, pages);
 
 	for_each_sgt_page(page, sgt_iter, pages) {
-		if (obj->mm.dirty)
+		if (obj->mm.dirty) {
 			/*
 			 * As this may not be anonymous memory (e.g. shmem)
 			 * but exist on a real mapping, we have to lock
@@ -672,8 +672,15 @@ i915_gem_userptr_put_pages(struct drm_i9
 			 * prevent the inode from being truncated.
 			 * Play safe and take the lock.
 			 */
-			set_page_dirty_lock(page);
-
+			if (trylock_page(page)) {
+				set_page_dirty(page);
+				unlock_page(page);
+			}
+			/*
+			 * else someone else is taking care of page and
+			 * we can do nothing about it to avoid deadlock
+			 */
+		}
 		mark_page_accessed(page);
 		put_page(page);
 	}
--
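
The control flow of the patch above can be tried out in user space with a
small model. This is only a sketch under stated assumptions, not kernel
code: struct model_page and the model_*() helpers are hypothetical names
invented for illustration, and a C11 atomic flag stands in for the page
lock. It shows that a non-blocking trylock marks the page dirty when the
lock is free and skips it, instead of sleeping, when a caller further up
the stack already holds the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: one lock bit and one dirty bit. */
struct model_page {
	atomic_bool locked;	/* models the page lock */
	bool dirty;		/* models the dirty state */
};

static void model_page_init(struct model_page *p)
{
	atomic_init(&p->locked, false);
	p->dirty = false;
}

/* Non-blocking lock attempt: returns true only if we took the lock. */
static bool model_trylock_page(struct model_page *p)
{
	return !atomic_exchange_explicit(&p->locked, true,
					 memory_order_acquire);
}

static void model_unlock_page(struct model_page *p)
{
	assert(atomic_load(&p->locked));	/* must be locked here */
	atomic_store_explicit(&p->locked, false, memory_order_release);
}

/*
 * Mirrors the patched loop body: dirty the page only if the lock can be
 * taken without waiting. Returns true if the page was marked dirty here.
 */
static bool model_put_page_dirty(struct model_page *p)
{
	if (model_trylock_page(p)) {
		p->dirty = true;
		model_unlock_page(p);
		return true;
	}
	/* Lock is held elsewhere: do nothing rather than block. */
	return false;
}
```

In this model, calling model_put_page_dirty() while the caller holds the
lock returns false immediately, whereas a blocking lock at that point
would wait for a lock that is never released on this path.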


