[Intel-gfx] [PATCH 1/2] drm/i915/gem: Replace reloc chain with terminator on error unwind

Pavel Machek pavel at ucw.cz
Wed Aug 19 19:33:26 UTC 2020


Hi!

> > > If we hit an error during construction of the reloc chain, we need to
> > > replace the chain into the next batch with the terminator so that upon
> > > flushing the relocations so far, we do not execute a hanging batch.
> > 
> > Thanks for the patches. I assume this should fix problem from
> > "5.9-rc1: graphics regression moved from -next to mainline" thread.
> > 
> > I have applied them over current -next, and my machine seems to be
> > working so far (but uptime is less than 30 minutes).
> > 
> > If the machine still works tommorow, I'll assume problem is solved.
> 
> Aye, best wait until we have to start competing with Chromium for
> memory... The suspicion is that it was the resource allocation failure
> path.

Yep, my machines are low on memory.

But ... test did not work that well. I have dead X and blinking
screen. Machine still works reasonably well over ssh, so I guess
that's an improvement.

Best regards,
							Pavel

[ 5604.909393] ACPI: EC: event unblocked
[ 5604.913590] usb usb2: root hub lost power or was reset
[ 5604.913812] usb usb3: root hub lost power or was reset
[ 5604.914046] usb usb4: root hub lost power or was reset
[ 5604.918812] ata6: port disabled--ignoring
[ 5604.925353] sd 0:0:0:0: [sda] Starting disk
[ 5605.150042] thinkpad_acpi: ACPI backlight control delay disabled
[ 5605.204955] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 5605.205931] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 5605.205941] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 5605.205949] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 5605.207748] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 5605.207757] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 5605.207765] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 5605.208227] ata1.00: configured for UDMA/133
[ 5605.281913] usb 5-2: reset full-speed USB device number 3 using uhci_hcd
[ 5605.569752] usb 5-1: reset full-speed USB device number 2 using uhci_hcd
[ 5609.082771] PM: resume devices took 4.192 seconds
[ 5609.083380] OOM killer enabled.
[ 5609.083387] Restarting tasks ... done.
[ 5609.103164] video LNXVIDEO:00: Restoring backlight state
[ 5609.150144] PM: suspend exit
[ 5609.190535] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5609.239495] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5609.287144] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5609.344497] sdhci-pci 0000:15:00.2: Will use DMA mode even though HW doesn't fully claim to support it.
[ 5611.426855] wlan0: authenticate with 5c:f4:ab:10:d2:bb
[ 5611.430609] wlan0: send auth to 5c:f4:ab:10:d2:bb (try 1/3)
[ 5611.432552] wlan0: authenticated
[ 5611.433705] wlan0: associate with 5c:f4:ab:10:d2:bb (try 1/3)
[ 5611.436440] wlan0: RX AssocResp from 5c:f4:ab:10:d2:bb (capab=0x411 status=0 aid=1)
[ 5611.439083] wlan0: associated
[ 7744.718473] BUG: unable to handle page fault for address: f8c00000
[ 7744.718484] #PF: supervisor write access in kernel mode
[ 7744.718487] #PF: error_code(0x0002) - not-present page
[ 7744.718491] *pdpt = 0000000031b0b001 *pde = 0000000000000000 
[ 7744.718500] Oops: 0002 [#1] PREEMPT SMP PTI
[ 7744.718506] CPU: 0 PID: 3004 Comm: Xorg Not tainted 5.9.0-rc1-next-20200819+ #134
[ 7744.718509] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
[ 7744.718518] EIP: eb_relocate_vma+0xdbf/0xf20
[ 7744.718523] Code: 48 74 8b 41 08 89 41 0c 8b 85 a4 fd ff ff 89 95 a0 fd ff ff e8 c2 12 6c 00 8b 95 a0 fd ff ff e9 03 fc ff ff 8b 85 d0 fd ff ff <c7> 03 01 00 40 10 89 43 04 8b 85 dc fd ff ff 89 43 08 e9 4a f6 ff
[ 7744.718527] EAX: 01397010 EBX: f8c00000 ECX: 01247000 EDX: 00000000
[ 7744.718531] ESI: f519cd80 EDI: f1ac1cd4 EBP: f1ac1c6c ESP: f1ac1a04
[ 7744.718535] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 7744.718539] CR0: 80050033 CR2: f8c00000 CR3: 31ac2000 CR4: 000006b0
[ 7744.718543] Call Trace:
[ 7744.718553]  ? shmem_read_mapping_page_gfp+0x32/0x70
[ 7744.718560]  ? eb_lookup_vmas+0x272/0x9f0
[ 7744.718565]  i915_gem_do_execbuffer+0xa7b/0x2730
[ 7744.718573]  ? intel_runtime_pm_put_unchecked+0xd/0x10
[ 7744.718578]  ? i915_gem_gtt_pwrite_fast+0xf6/0x520
[ 7744.718586]  ? __lock_acquire.isra.0+0x223/0x500
[ 7744.718592]  ? cache_alloc_debugcheck_after+0x151/0x180
[ 7744.718596]  ? kvmalloc_node+0x69/0x80
[ 7744.718600]  ? __kmalloc+0x92/0x120
[ 7744.718604]  ? kvmalloc_node+0x69/0x80
[ 7744.718608]  i915_gem_execbuffer2_ioctl+0xdd/0x350
[ 7744.718613]  ? i915_gem_execbuffer_ioctl+0x2a0/0x2a0
[ 7744.718619]  drm_ioctl_kernel+0x91/0xe0
[ 7744.718623]  ? i915_gem_execbuffer_ioctl+0x2a0/0x2a0
[ 7744.718627]  drm_ioctl+0x1fd/0x371
[ 7744.718631]  ? i915_gem_execbuffer_ioctl+0x2a0/0x2a0
[ 7744.718639]  ? posix_get_monotonic_timespec+0x1d/0x80
[ 7744.718645]  ? __sys_recvmsg+0x37/0x80
[ 7744.718649]  ? drm_ioctl_kernel+0xe0/0xe0
[ 7744.718654]  __ia32_sys_ioctl+0x14b/0x7c6
[ 7744.718661]  ? exit_to_user_mode_prepare+0x53/0x100
[ 7744.718667]  do_int80_syscall_32+0x2c/0x40
[ 7744.718674]  entry_INT80_32+0x111/0x111
[ 7744.718678] EIP: 0xb7fd3092
[ 7744.718683] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 7744.718687] EAX: ffffffda EBX: 0000000a ECX: c0406469 EDX: bfe67abc
[ 7744.718691] ESI: b73c1000 EDI: c0406469 EBP: 0000000a ESP: bfe67a34
[ 7744.718695] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 7744.718700]  ? asm_exc_nmi+0xcc/0x2bc
[ 7744.718703] Modules linked in:
[ 7744.718709] CR2: 00000000f8c00000
[ 7744.718714] ---[ end trace 121f748dd4d0d6ec ]---
[ 7744.718719] EIP: eb_relocate_vma+0xdbf/0xf20
[ 7744.718723] Code: 48 74 8b 41 08 89 41 0c 8b 85 a4 fd ff ff 89 95 a0 fd ff ff e8 c2 12 6c 00 8b 95 a0 fd ff ff e9 03 fc ff ff 8b 85 d0 fd ff ff <c7> 03 01 00 40 10 89 43 04 8b 85 dc fd ff ff 89 43 08 e9 4a f6 ff
[ 7744.718727] EAX: 01397010 EBX: f8c00000 ECX: 01247000 EDX: 00000000
[ 7744.718731] ESI: f519cd80 EDI: f1ac1cd4 EBP: f1ac1c6c ESP: f1ac1a04
[ 7744.718735] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 7744.718739] CR0: 80050033 CR2: f8c00000 CR3: 31ac2000 CR4: 000006b0
[ 7744.723687] BUG: unable to handle page fault for address: f8c02038
[ 7744.723695] #PF: supervisor write access in kernel mode
[ 7744.723699] #PF: error_code(0x0002) - not-present page
[ 7744.723702] *pdpt = 0000000031866001 *pde = 0000000000000000 
[ 7744.723711] Oops: 0002 [#2] PREEMPT SMP PTI
[ 7744.723717] CPU: 1 PID: 3004 Comm: Xorg Tainted: G      D           5.9.0-rc1-next-20200819+ #134
[ 7744.723720] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
[ 7744.723728] EIP: n_tty_open+0x26/0x80
[ 7744.723733] Code: 00 00 00 90 55 89 e5 56 53 89 c3 b8 f0 22 00 00 e8 4f 39 cb ff 85 c0 74 62 89 c6 a1 00 2d 27 c5 b9 e8 2a 77 c5 ba 85 83 12 c5 <89> 46 38 8d 86 58 22 00 00 e8 8c 12 c0 ff 8d 86 a4 22 00 00 b9 e0
[ 7744.723738] EAX: 001c65c0 EBX: f2339000 ECX: c5772ae8 EDX: c5128385
[ 7744.723741] ESI: f8c02000 EDI: 00000000 EBP: f1ac1ee4 ESP: f1ac1edc
[ 7744.723745] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210286
[ 7744.723751] CR0: 80050033 CR2: f8c02038 CR3: 31864000 CR4: 000006b0
[ 7744.723755] Call Trace:
[ 7744.723763]  tty_ldisc_open.isra.0+0x23/0x40
[ 7744.723768]  tty_ldisc_reinit+0x99/0xe0
[ 7744.723772]  tty_ldisc_hangup+0xc4/0x1e0
[ 7744.723776]  __tty_hangup.part.0+0x13f/0x250
[ 7744.723781]  tty_vhangup_session+0x11/0x20
[ 7744.723786]  disassociate_ctty.part.0+0x34/0x230
[ 7744.723790]  disassociate_ctty+0x28/0x30
[ 7744.723797]  do_exit+0x456/0x960
[ 7744.723803]  ? exit_to_user_mode_prepare+0x53/0x100
[ 7744.723808]  rewind_stack_do_exit+0x11/0x13
[ 7744.723812] EIP: 0xb7fd3092
[ 7744.723815] Code: Bad RIP value.
[ 7744.723819] EAX: ffffffda EBX: 0000000a ECX: c0406469 EDX: bfe67abc
[ 7744.723823] ESI: b73c1000 EDI: c0406469 EBP: 0000000a ESP: bfe67a34
[ 7744.723827] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 7744.723837]  ? asm_exc_nmi+0xcc/0x2bc
[ 7744.723839] Modules linked in:
[ 7744.723845] CR2: 00000000f8c02038
[ 7744.723851] ---[ end trace 121f748dd4d0d6ed ]---
[ 7744.723857] EIP: eb_relocate_vma+0xdbf/0xf20
[ 7744.723861] Code: 48 74 8b 41 08 89 41 0c 8b 85 a4 fd ff ff 89 95 a0 fd ff ff e8 c2 12 6c 00 8b 95 a0 fd ff ff e9 03 fc ff ff 8b 85 d0 fd ff ff <c7> 03 01 00 40 10 89 43 04 8b 85 dc fd ff ff 89 43 08 e9 4a f6 ff
[ 7744.723865] EAX: 01397010 EBX: f8c00000 ECX: 01247000 EDX: 00000000
[ 7744.723869] ESI: f519cd80 EDI: f1ac1cd4 EBP: f1ac1c6c ESP: f1ac1a04
[ 7744.723873] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 7744.723877] CR0: 80050033 CR2: f8c02038 CR3: 31864000 CR4: 000006b0
[ 7744.723880] Fixing recursive fault but reboot is needed!
[ 7749.589011] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7749.589024] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 7749.589030] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
[ 7749.589036] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[ 7749.589041] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 7749.589047] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 7749.589053] GPU crash dump saved to /sys/class/drm/card0/error
[ 7749.909841] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7756.504232] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7756.817879] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7763.672921] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7763.985882] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7770.580999] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7770.897884] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7777.497036] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000
[ 7777.825882] i915 0000:00:02.0: [drm] Resetting chip for no heartbeat on rcs0
[ 7784.664999] i915 0000:00:02.0: [drm] GPU HANG: ecode 3:0:00000000


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 195 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/intel-gfx/attachments/20200819/fbc40320/attachment.sig>


More information about the Intel-gfx mailing list