[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Thu Oct 12 11:00:08 UTC 2017

On 12/10/17 10:05 AM, Christian König wrote:
> Am 11.10.2017 um 18:30 schrieb Michel Dänzer:
>> On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>
>>> Highly concurrent Piglit runs can trigger a race condition where a pending
>>> SDMA job on a buffer object is never executed because the corresponding
>>> process is killed (perhaps due to a crash). Since the job's fences were
>>> never signaled, the buffer object was effectively leaked. Worse, the
>>> buffer was stuck wherever it happened to be at the time, possibly in VRAM.
>>>
>>> The symptom was user space processes stuck in interruptible waits with
>>> kernel stacks like:
>>>
>>>      [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>      [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>      [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>      [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>      [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>      [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>      [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>      [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>      [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>      [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>      [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>      [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>      [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>      [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>      [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>      [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>      [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>      [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>> Acked-by: Christian König <christian.koenig at amd.com>
>> Since Christian's commit which introduced the problem (6af0883ed977
>> "drm/amdgpu: discard commands of killed processes") is in 4.14, we need
>> a solution for that. Should we backport Nicolai's five commits fixing
>> the problem, or revert 6af0883ed977?

BTW, any preference on this, Christian or Nicolai?
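
For anyone weighing the two options, the failure mode from the quoted commit message can be modelled in user space roughly as follows. This is only a sketch under stated assumptions, not the code from Nicolai's series: every toy_* name is hypothetical, and the -ESRCH error value is chosen purely for illustration. It shows a teardown path that signals and releases the fence of every job still queued, so that a waiter holding its own reference sees the fence complete instead of blocking forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a fence: the thing a waiter blocks on. */
struct toy_fence {
    bool signaled;
    int error;      /* 0, or a negative errno if the job never ran */
    int refcount;
};

/* Hypothetical queued job; it owns one reference to its fence. */
struct toy_job {
    struct toy_fence *fence;
    struct toy_job *next;
};

/* Hypothetical scheduler entity: a FIFO of jobs that have not run yet. */
struct toy_entity {
    struct toy_job *head;
    struct toy_job *tail;
};

static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
    f->refcount++;
    return f;
}

static void toy_fence_put(struct toy_fence *f)
{
    assert(f->refcount > 0);
    if (--f->refcount == 0)
        free(f);
}

/* Queue a job and return a second fence reference owned by the caller. */
static struct toy_fence *toy_entity_push(struct toy_entity *e)
{
    struct toy_fence *f = calloc(1, sizeof(*f));
    struct toy_job *j = calloc(1, sizeof(*j));

    if (!f || !j) {
        free(f);
        free(j);
        return NULL;
    }
    f->refcount = 1;            /* the job's reference */
    j->fence = f;
    if (e->tail)
        e->tail->next = j;
    else
        e->head = j;
    e->tail = j;
    return toy_fence_get(f);    /* the waiter's reference */
}

/*
 * Teardown: mark every still-queued job's fence as failed, signal it,
 * drop the job's reference and free the job. Returns the number of
 * jobs drained. Without this loop the fences would stay unsignaled.
 */
static int toy_entity_fini(struct toy_entity *e)
{
    struct toy_job *j;
    int drained = 0;

    while ((j = e->head) != NULL) {
        e->head = j->next;
        j->fence->error = -ESRCH;   /* illustrative error code */
        j->fence->signaled = true;
        toy_fence_put(j->fence);
        free(j);
        drained++;
    }
    e->tail = NULL;
    return drained;
}
```

In this model, a caller that pushed two jobs and then calls toy_entity_fini() finds both fences signaled with an error, and can drop its own references with toy_fence_put(). Leaving out the drain loop reproduces the leak: the fences are never signaled and anything waiting on them stays stuck.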

>> While looking into this, I noticed that the following commits by
>> Christian in 4.14 each also cause hangs for me when running the piglit
>> gpu profile on Tonga:
>>
>> 457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
>> 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"
>>
>> Are there fixes for these that can be backported to 4.14, or do they
>> need to be reverted there?
> Well, I'm not aware that either of those two can cause problems.
> 
> For "drm/amdgpu: remove the GART copy hack" I also don't have the
> slightest idea how that could be an issue. It just removes an unused
> code path.

I also thought it was weird, and indeed I can no longer reproduce a hang
with only 457e0fee04b0; but I still can with only 1d00402b4da2. I guess
one of my bisections went wrong and incorrectly identified 457e0fee04b0
instead of 1d00402b4da2.

> Is amd-staging-drm-next stable for you?

It seemed stable before the changes you pushed this morning. :) As of
cfb6dee86711 "drm/ttm: add transparent huge page support for cached
allocations v2", I get a flood of

 [TTM] Erroneous page count. Leaking pages.

in dmesg while running piglit, and it eventually hangs[0].

Anyway, unless anyone knows which commits from amd-staging-drm-next are
needed to make 1d00402b4da2 stable in 4.14, the safe course of action
seems to be reverting it (and ac7afe6b3cf3, which depends on it)?

[0] I also got this, but I don't know yet if it's related:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000220
 IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
 PGD 0 
 P4D 0 

 Oops: 0000 [#1] SMP
 Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd soundcore rng_core sg wmi parport_pc parport i2c_designware_platform i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache
  jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd scsi_mod usbcore shpchp gpio_amdpt gpio_generic
 CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: G        W  O    4.13.0-rc5+ #28
 Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.80 09/13/2017
 task: ffff9d2982c75a00 task.stack: ffffb2744e9bc000
 RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
 RSP: 0018:ffffb2744e9bf6e8 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff9d2848642820 RCX: ffff9d28c77fdae0
 RDX: 0000000000000001 RSI: ffff9d28c77fd800 RDI: ffff9d288f286008
 RBP: ffffb2744e9bf728 R08: 000000ffffffffff R09: 0000000000000000
 R10: 0000000000000078 R11: ffff9d298ba170a0 R12: ffff9d28c77fd800
 R13: 0000000000000001 R14: ffff9d288f286000 R15: ffff9d2848642800
 FS:  00007f809fc5c300(0000) GS:ffff9d298e940000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000220 CR3: 000000030e05a000 CR4: 00000000003406e0
 Call Trace:
  amdgpu_bo_move_notify+0x42/0xd0 [amdgpu]
  ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm]
  ? ttm_bo_mem_space+0x391/0x580 [ttm]
  ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm]
  ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm]
  ttm_bo_mem_space+0x306/0x580 [ttm]
  ttm_bo_validate+0xd4/0x150 [ttm]
  ttm_bo_init_reserved+0x22e/0x440 [ttm]
  amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu]
  ? amdgpu_fill_buffer+0x300/0x420 [amdgpu]
  amdgpu_bo_create+0x50/0x2b0 [amdgpu]
  amdgpu_gem_object_create+0x9f/0x110 [amdgpu]
  amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu]
  ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
  drm_ioctl_kernel+0x5d/0xf0 [drm]
  drm_ioctl+0x32a/0x630 [drm]
  ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
  ? lru_cache_add_active_or_unevictable+0x36/0xb0
  ? __handle_mm_fault+0x90d/0xff0
  amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu]
  do_vfs_ioctl+0xa5/0x600
  ? handle_mm_fault+0xd8/0x230
  ? __do_page_fault+0x267/0x4c0
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x1e/0xa9
 RIP: 0033:0x7f809c8f3dc7
 RSP: 002b:00007ffcc8c485f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00007f809cbaab00 RCX: 00007f809c8f3dc7
 RDX: 00007ffcc8c48640 RSI: 00000000c0206440 RDI: 0000000000000006
 RBP: 0000000040000010 R08: 00007f809cbaabe8 R09: 0000000000000060
 R10: 0000000000000004 R11: 0000000000000246 R12: 0000000040001000
 R13: 00007f809cbaab58 R14: 0000000000001000 R15: 00007f809cbaab00
 Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 84 24 20 02 00 00 0f 84 ab 00 00 00 
 RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: ffffb2744e9bf6e8
 CR2: 0000000000000220

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer