<div dir="ltr"><p dir="ltr">On Nov 22, 2014 11:45 AM, "Michael Marineau" <<a href="mailto:mike@marineau.org" target="_blank">mike@marineau.org</a>> wrote:<br>
><br>
><br>
> On Nov 22, 2014 8:56 AM, "Maarten Lankhorst" <<a href="mailto:maarten.lankhorst@canonical.com" target="_blank">maarten.lankhorst@canonical.com</a>> wrote:<br>
> ><br>
> > Hey,<br>
> ><br>
> > On 22-11-14 at 01:19, Michael Marineau wrote:<br>
> > > On Thu, Nov 20, 2014 at 12:53 AM, Maarten Lankhorst<br>
> > > <<a href="mailto:maarten.lankhorst@canonical.com" target="_blank">maarten.lankhorst@canonical.com</a>> wrote:<br>
> > >> On 20-11-14 at 05:06, Michael Marineau wrote:<br>
> > >>> On Wed, Nov 19, 2014 at 12:10 AM, Maarten Lankhorst<br>
> > >>> <<a href="mailto:maarten.lankhorst@canonical.com" target="_blank">maarten.lankhorst@canonical.com</a>> wrote:<br>
> > >>>> Hey,<br>
> > >>>><br>
> > >>>> On 19-11-14 07:43, Michael Marineau wrote:<br>
> > >>>>> On 3.18-rc kernels I have been intermittently experiencing GPU<br>
> > >>>>> lockups shortly after startup, accompanied by one or both of the<br>
> > >>>>> following errors:<br>
> > >>>>><br>
> > >>>>> nouveau E[ PFIFO][0000:01:00.0] read fault at 0x000734a000 [PTE]<br>
> > >>>>> from PBDMA0/HOST_CPU on channel 0x007faa3000 [unknown]<br>
> > >>>>> nouveau E[ DRM] GPU lockup - switching to software fbcon<br>
> > >>>>><br>
> > >>>>> I was able to trace the issue with bisect to commit<br>
> > >>>>> 809e9447b92ffe1346b2d6ec390e212d5307f61c "drm/nouveau: use shared<br>
> > >>>>> fences for readable objects". The lockups appear to have cleared up<br>
> > >>>>> since reverting that and a few related followup commits:<br>
> > >>>>><br>
> > >>>>> 809e9447: "drm/nouveau: use shared fences for readable objects"<br>
> > >>>>> 055dffdf: "drm/nouveau: bump driver patchlevel to 1.2.1"<br>
> > >>>>> e3be4c23: "drm/nouveau: specify if interruptible wait is desired in<br>
> > >>>>> nouveau_fence_sync"<br>
> > >>>>> 15a996bb: "drm/nouveau: assign fence_chan->name correctly"<br>
> > >>>> Weird. I'm not sure yet what causes it.<br>
> > >>>><br>
> > >>>> <a href="http://cgit.freedesktop.org/~mlankhorst/linux/commit/?h=fixed-fences-for-bisect&id=86be4f216bbb9ea3339843a5658d4c21162c7ee2" target="_blank">http://cgit.freedesktop.org/~mlankhorst/linux/commit/?h=fixed-fences-for-bisect&id=86be4f216bbb9ea3339843a5658d4c21162c7ee2</a><br>
> > >>> Building a kernel from that commit gives me an entirely new behavior:<br>
> > >>> X hangs for at least 10-20 seconds at a time with brief moments of<br>
> > >>> responsiveness before hanging again while gitk on the kernel repo<br>
> > >>> loads. Otherwise the system is responsive. The head of that<br>
> > >>> fixed-fences-for-bisect branch (1c6aafb5) which is the "use shared<br>
> > >>> fences for readable objects" commit I originally bisected to does<br>
> > >>> feature the complete lockups I was seeing before.<br>
> > >> OK, for the sake of argument let's just assume they're separate bugs, and we should look at Xorg<br>
> > >> hanging first.<br>
> > >><br>
> > >> Is there anything in the dmesg when the hanging happens?<br>
> > >><br>
> > >> And it's probably 15 seconds, if it's called through nouveau_fence_wait.<br>
> > >><br>
> > >> Try changing else if (!ret) to else if (WARN_ON(!ret)) in that function, and see if you get some dmesg spam. :)<br>
> > > Adding the WARN_ON to 86be4f21 reports the following:<br>
> > ><br>
> > > [ 1188.676073] ------------[ cut here ]------------<br>
> > > [ 1188.676161] WARNING: CPU: 1 PID: 474 at<br>
> > > drivers/gpu/drm/nouveau/nouveau_fence.c:359<br>
> > > nouveau_fence_wait.part.9+0x33/0x40 [nouveau]()<br>
> > > [ 1188.676166] Modules linked in: rndis_host cdc_ether usbnet mii bnep<br>
> > > ecb btusb bluetooth rfkill bridge stp llc hid_generic usb_storage<br>
> > > joydev mousedev hid_apple usbhid bcm5974 nls_iso8859_1 nls_cp437 vfat<br>
> > > fat nouveau snd_hda_codec_hdmi coretemp x86_pkg_temp_thermal<br>
> > > intel_powerclamp kvm_intel kvm iTCO_wdt crct10dif_pclmul<br>
> > > iTCO_vendor_support crc32c_intel evdev aesni_intel mac_hid aes_x86_64<br>
> > > lrw glue_helper ablk_helper applesmc snd_hda_codec_cirrus cryptd<br>
> > > input_polldev snd_hda_codec_generic mxm_wmi led_class wmi microcode<br>
> > > hwmon snd_hda_intel ttm snd_hda_controller lpc_ich i2c_i801 mfd_core<br>
> > > snd_hda_codec i2c_algo_bit snd_hwdep drm_kms_helper snd_pcm sbs drm<br>
> > > apple_gmux i2ccore snd_timer snd agpgart mei_me soundcore sbshc mei<br>
> > > video xhci_hcd usbcore usb_common apple_bl button battery ac efivars<br>
> > > autofs4<br>
> > > [ 1188.676300] efivarfs<br>
> > > [ 1188.676308] CPU: 1 PID: 474 Comm: Xorg Tainted: G W<br>
> > > 3.17.0-rc2-nvtest+ #147<br>
> > > [ 1188.676313] Hardware name: Apple Inc.<br>
> > > MacBookPro11,3/Mac-2BD1B31983FE1663, BIOS<br>
> > > MBP112.88Z.0138.B11.1408291503 08/29/2014<br>
> > > [ 1188.676316] 0000000000000009 ffff88045daebce8 ffffffff814f0c09<br>
> > > 0000000000000000<br>
> > > [ 1188.676325] ffff88045daebd20 ffffffff8104ea5d ffff88006a6c1468<br>
> > > 00000000fffffff0<br>
> > > [ 1188.676333] 0000000000000000 0000000000000000 ffff88006a6c1000<br>
> > > ffff88045daebd30<br>
> > > [ 1188.676341] Call Trace:<br>
> > > [ 1188.676356] [<ffffffff814f0c09>] dump_stack+0x4d/0x66<br>
> > > [ 1188.676369] [<ffffffff8104ea5d>] warn_slowpath_common+0x7d/0xa0<br>
> > > [ 1188.676377] [<ffffffff8104eb3a>] warn_slowpath_null+0x1a/0x20<br>
> > > [ 1188.676439] [<ffffffffc04dd523>]<br>
> > > nouveau_fence_wait.part.9+0x33/0x40 [nouveau]<br>
> > > [ 1188.676496] [<ffffffffc04ddfe6>] nouveau_fence_wait+0x16/0x30 [nouveau]<br>
> > > [ 1188.676552] [<ffffffffc04e598f>]<br>
> > > nouveau_gem_ioctl_cpu_prep+0xef/0x1f0 [nouveau]<br>
> > > [ 1188.676578] [<ffffffffc01c2f4c>] drm_ioctl+0x1ec/0x660 [drm]<br>
> > > [ 1188.676590] [<ffffffff814f9026>] ? _raw_spin_unlock_irqrestore+0x36/0x70<br>
> > > [ 1188.676600] [<ffffffff81094f6d>] ? trace_hardirqs_on+0xd/0x10<br>
> > > [ 1188.676655] [<ffffffffc04da5b4>] nouveau_drm_ioctl+0x54/0xc0 [nouveau]<br>
> > > [ 1188.676663] [<ffffffff811a8950>] do_vfs_ioctl+0x300/0x520<br>
> > > [ 1188.676671] [<ffffffff814f9e55>] ? sysret_check+0x22/0x5d<br>
> > > [ 1188.676677] [<ffffffff811a8bb1>] SyS_ioctl+0x41/0x80<br>
> > > [ 1188.676683] [<ffffffff814f9e29>] system_call_fastpath+0x16/0x1b<br>
> > > [ 1188.676688] ---[ end trace 6f7a510865b4674f ]---<br>
> > ><br>
> > > Here are the fence events that fired during that particular fence_wait:<br>
> > > Xorg 474 [004] 1173.667645: fence:fence_wait_start:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56910<br>
> > > Xorg 474 [004] 1173.667647: fence:fence_enable_signal:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56910<br>
> > > swapper 0 [007] 1173.667688: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56900<br>
> > > swapper 0 [007] 1173.667692: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56900<br>
> > > swapper 0 [007] 1173.667839: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56901<br>
> > > swapper 0 [007] 1173.667842: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56901<br>
> > > swapper 0 [007] 1173.668021: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56902<br>
> > > swapper 0 [007] 1173.668482: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56903<br>
> > > swapper 0 [007] 1173.668485: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56903<br>
> > > swapper 0 [007] 1173.668489: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56904<br>
> > > swapper 0 [007] 1173.668496: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56905<br>
> > > swapper 0 [007] 1173.668499: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56905<br>
> > > swapper 0 [007] 1173.668502: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56906<br>
> > > swapper 0 [007] 1173.668505: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56907<br>
> > > swapper 0 [007] 1173.668508: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56907<br>
> > > swapper 0 [007] 1173.668511: fence:fence_signaled:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56908<br>
> > > swapper 0 [007] 1173.668513: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56908<br>
> > > kworker/4:1 80 [004] 1173.676265: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56896<br>
> > > kworker/4:1 80 [004] 1173.676273: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56898<br>
> > > kworker/4:1 80 [004] 1173.676277: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56902<br>
> > > kworker/4:1 80 [004] 1173.676280: fence:fence_destroy:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56904<br>
> > > Xorg 474 [001] 1188.676067: fence:fence_wait_end:<br>
> > > driver=nouveau timeline=Xorg[474] context=2 seqno=56910<br>
> > ><br>
> > > I assume that excludes the context you really want so the full fence<br>
> > > event log and corresponding dmesg output are attached.<br>
> ><br>
> > Yep, the trace events are useful. The fence is emitted and presumably no event is fired after emission.<br>
> ><br>
> > Let's find out whether the nvif crap is buggy or it's a result of some other issue. What happens when you change:<br>
> > .wait = fence_default_wait,<br>
> > to<br>
> > .wait = nouveau_fence_wait_legacy,<br>
> > in nouveau_fence.c?<br>
><br>
> That change works just fine.</p>
<p dir="ltr">The Xorg hangs also appear to be resolved by db1cf46 "drm/nouveau: use rcu in nouveau_gem_ioctl_cpu_prep"<br>
</p>
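<p dir="ltr">For anyone retracing the fence events quoted above, here is a rough sketch of how the "waited on but never signaled" check could be done mechanically. It is only an illustration: the function name, the regular expression, and the embedded sample are my own assumptions, and it expects the trace text without the mail quote markers, with each event possibly wrapped over two lines as in the log above.</p>

```python
import re

# One fence trace event in the format quoted in this thread, e.g.
#   fence:fence_signaled: driver=nouveau timeline=Xorg[474] context=2 seqno=56900
# \s+ lets a match span the line wrap between the event name and its fields.
EVENT = re.compile(
    r"fence:(fence_\w+):\s+driver=\S+\s+timeline=\S+\s+context=(\d+)\s+seqno=(\d+)"
)


def unsignaled_waits(log_text):
    """Return (context, seqno) pairs with a fence_wait_start but no fence_signaled."""
    waited = []
    signaled = set()
    for name, context, seqno in EVENT.findall(log_text):
        key = (int(context), int(seqno))
        if name == "fence_wait_start":
            waited.append(key)
        elif name == "fence_signaled":
            signaled.add(key)
    return [key for key in waited if key not in signaled]


# Shortened sample taken from the events quoted above (hypothetical input file
# contents; the full attached log is not reproduced here).
SAMPLE = """Xorg 474 [004] 1173.667645: fence:fence_wait_start:
driver=nouveau timeline=Xorg[474] context=2 seqno=56910
swapper 0 [007] 1173.667688: fence:fence_signaled:
driver=nouveau timeline=Xorg[474] context=2 seqno=56900
Xorg 474 [001] 1188.676067: fence:fence_wait_end:
driver=nouveau timeline=Xorg[474] context=2 seqno=56910"""

print(unsignaled_waits(SAMPLE))  # [(2, 56910)]
```

<p dir="ltr">On the sample this reports seqno 56910 on context 2, which matches the roughly 15 second gap between fence_wait_start and fence_wait_end in the log. It only looks at trace events, so it cannot tell whether the signal was never raised or simply never delivered.</p>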
</div>