[Intel-gfx] [PATCH 1/6] drm/i915: Use the full hammer when shutting down the rcu tasks
Chris Wilson
chris at chris-wilson.co.uk
Mon Oct 31 21:05:51 UTC 2016
On Mon, Oct 31, 2016 at 05:15:45PM +0000, Tvrtko Ursulin wrote:
>
> On 31/10/2016 10:26, Chris Wilson wrote:
> >To flush all call_rcu() tasks (here from i915_gem_free_object()) we need
> >to call rcu_barrier() (not synchronize_rcu()). If we don't then we may
> >still have objects being freed as we continue to teardown the driver -
> >in particular, the recently released rings may race with the memory
> >manager shutdown resulting in sporadic:
> >
> >[ 142.217186] WARNING: CPU: 7 PID: 6185 at drivers/gpu/drm/drm_mm.c:932 drm_mm_takedown+0x2e/0x40
> >[ 142.217187] Memory manager not clean during takedown.
> >[ 142.217187] Modules linked in: i915(-) x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel lpc_ich snd_hda_codec_realtek snd_hda_codec_generic mei_me mei snd_hda_codec_hdmi snd_hda_codec snd_hwdep snd_hda_core snd_pcm e1000e ptp pps_core [last unloaded: snd_hda_intel]
> >[ 142.217199] CPU: 7 PID: 6185 Comm: rmmod Not tainted 4.9.0-rc2-CI-Trybot_242+ #1
> >[ 142.217199] Hardware name: LENOVO 10AGS00601/SHARKBAY, BIOS FBKT34AUS 04/24/2013
> >[ 142.217200] ffffc90002ecfce0 ffffffff8142dd65 ffffc90002ecfd30 0000000000000000
> >[ 142.217202] ffffc90002ecfd20 ffffffff8107e4e6 000003a40778c2a8 ffff880401355c48
> >[ 142.217204] ffff88040778c2a8 ffffffffa040f3c0 ffffffffa040f4a0 00005621fbf8b1f0
> >[ 142.217206] Call Trace:
> >[ 142.217209] [<ffffffff8142dd65>] dump_stack+0x67/0x92
> >[ 142.217211] [<ffffffff8107e4e6>] __warn+0xc6/0xe0
> >[ 142.217213] [<ffffffff8107e54a>] warn_slowpath_fmt+0x4a/0x50
> >[ 142.217214] [<ffffffff81559e3e>] drm_mm_takedown+0x2e/0x40
> >[ 142.217236] [<ffffffffa035c02a>] i915_gem_cleanup_stolen+0x1a/0x20 [i915]
> >[ 142.217246] [<ffffffffa034c581>] i915_ggtt_cleanup_hw+0x31/0xb0 [i915]
> >[ 142.217253] [<ffffffffa0310311>] i915_driver_cleanup_hw+0x31/0x40 [i915]
> >[ 142.217260] [<ffffffffa0312001>] i915_driver_unload+0x141/0x1a0 [i915]
> >[ 142.217268] [<ffffffffa031c2c4>] i915_pci_remove+0x14/0x20 [i915]
> >[ 142.217269] [<ffffffff8147d214>] pci_device_remove+0x34/0xb0
> >[ 142.217271] [<ffffffff8157b14c>] __device_release_driver+0x9c/0x150
> >[ 142.217272] [<ffffffff8157bcc6>] driver_detach+0xb6/0xc0
> >[ 142.217273] [<ffffffff8157abe3>] bus_remove_driver+0x53/0xd0
> >[ 142.217274] [<ffffffff8157c787>] driver_unregister+0x27/0x50
> >[ 142.217276] [<ffffffff8147c265>] pci_unregister_driver+0x25/0x70
> >[ 142.217287] [<ffffffffa03d764c>] i915_exit+0x1a/0x71 [i915]
> >[ 142.217289] [<ffffffff811136b3>] SyS_delete_module+0x193/0x1e0
> >[ 142.217291] [<ffffffff818174ae>] entry_SYSCALL_64_fastpath+0x1c/0xb1
> >[ 142.217292] ---[ end trace 6fd164859c154772 ]---
> >[ 142.217505] [drm:show_leaks] *ERROR* node [6b6b6b6b6b6b6b6b + 6b6b6b6b6b6b6b6b]: inserted at
> > [<ffffffff81559ff3>] save_stack.isra.1+0x53/0xa0
> > [<ffffffff8155a98d>] drm_mm_insert_node_in_range_generic+0x2ad/0x360
> > [<ffffffffa035bf23>] i915_gem_stolen_insert_node_in_range+0x93/0xe0 [i915]
> > [<ffffffffa035c855>] i915_gem_object_create_stolen+0x75/0xb0 [i915]
> > [<ffffffffa036a51a>] intel_engine_create_ring+0x9a/0x140 [i915]
> > [<ffffffffa036a921>] intel_init_ring_buffer+0xf1/0x440 [i915]
> > [<ffffffffa036be1b>] intel_init_render_ring_buffer+0xab/0x1b0 [i915]
> > [<ffffffffa0363d08>] intel_engines_init+0xc8/0x210 [i915]
> > [<ffffffffa0355d7c>] i915_gem_init+0xac/0xf0 [i915]
> > [<ffffffffa0311454>] i915_driver_load+0x9c4/0x1430 [i915]
> > [<ffffffffa031c2f8>] i915_pci_probe+0x28/0x40 [i915]
> > [<ffffffff8147d315>] pci_device_probe+0x85/0xf0
> > [<ffffffff8157b7ff>] driver_probe_device+0x21f/0x430
> > [<ffffffff8157baee>] __driver_attach+0xde/0xe0
> >
> >In particular note that the node was being poisoned as we inspected the
> >list, a clear indication that the object is being freed as we make the
> >assertion.
> >
> >Fixes: fbbd37b36fa5 ("drm/i915: Move object release to a freelist + worker")
> >Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> >---
> > drivers/gpu/drm/i915/i915_drv.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >index 839ce2ae38fa..ed01421e3be7 100644
> >--- a/drivers/gpu/drm/i915/i915_drv.c
> >+++ b/drivers/gpu/drm/i915/i915_drv.c
> >@@ -544,8 +544,10 @@ static void i915_gem_fini(struct drm_i915_private *dev_priv)
> > i915_gem_context_fini(&dev_priv->drm);
> > mutex_unlock(&dev_priv->drm.struct_mutex);
> >
> >- synchronize_rcu();
> >- flush_work(&dev_priv->mm.free_work);
> >+ do {
> >+ rcu_barrier();
> >+ flush_work(&dev_priv->mm.free_work);
> >+ } while (!llist_empty(&dev_priv->mm.free_list));
> >
> > WARN_ON(!list_empty(&dev_priv->context_list));
> > }
> >
>
> Not sure that the loop is required - after the rcu_barrier all the
> queued up additions have been added to the list and workers queued.
> So flush_work afterwards handles those and we are done, no?
We don't need the loop; it was an overreaction. A subsequent WARN would
be good enough to detect further issues, and that WARN would be better
placed just in front of freeing the object kmem_cache.
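
For illustration, a sketch of the shape being suggested here — single
barrier plus flush, no loop, with the sanity check pulled forward. This is
not the committed patch; only the lines visible in the quoted diff are
known, the rest is elided, and the exact WARN placement relative to the
kmem_cache teardown would be settled in the follow-up:

```c
static void i915_gem_fini(struct drm_i915_private *dev_priv)
{
	...
	i915_gem_context_fini(&dev_priv->drm);
	mutex_unlock(&dev_priv->drm.struct_mutex);

	/* rcu_barrier() (unlike synchronize_rcu()) waits for all pending
	 * call_rcu() callbacks to run, so every object queued for freeing
	 * has been pushed onto mm.free_list and its worker scheduled;
	 * flush_work() then drains that worker. No loop needed.
	 */
	rcu_barrier();
	flush_work(&dev_priv->mm.free_work);

	/* Anything still on the freelist here is a bug; warning before the
	 * object kmem_cache is destroyed gives a clear report instead of a
	 * poisoned-node splat later in drm_mm_takedown().
	 */
	WARN_ON(!llist_empty(&dev_priv->mm.free_list));
	WARN_ON(!list_empty(&dev_priv->context_list));
}
```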
-Chris
--
Chris Wilson, Intel Open Source Technology Centre