[Intel-gfx] [PATCH v2] drm/i915/vma: Fix potential UAF on multi-tile platforms

Andi Shyti andi.shyti at linux.intel.com
Thu Nov 9 14:47:45 UTC 2023


Hi Janusz,

On Wed, Nov 08, 2023 at 05:29:06PM +0100, Janusz Krzysztofik wrote:
> Object debugging tools were sporadically reporting illegal attempts to
> free a still active i915 VMA object when parking a GPU tile believed
> to be idle.
> 
> [161.359441] ODEBUG: free active (active state 0) object: ffff88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915]
> [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0
> ...
> [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1
> [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
> [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915]
> [161.360592] RIP: 0010:debug_print_object+0x80/0xb0
> ...
> [161.361347] debug_object_free+0xeb/0x110
> [161.361362] i915_active_fini+0x14/0x130 [i915]
> [161.361866] release_references+0xfe/0x1f0 [i915]
> [161.362543] i915_vma_parked+0x1db/0x380 [i915]
> [161.363129] __gt_park+0x121/0x230 [i915]
> [161.363515] ____intel_wakeref_put_last+0x1f/0x70 [i915]
> 
> That has been tracked down to be happening when another thread was
> deactivating the VMA inside __active_retire() helper, after the VMA's
> active counter was already decremented to 0, but before deactivation of
> the VMA's object was reported to the object debugging tools.  Root cause
> has been identified as premature release of last wakeref for the GPU tile
> to which the active VMA belonged.
> 
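
The window described above can be modelled in a few lines. This is only a user-space sketch of the ordering problem, not the i915 or debugobjects code; the struct, the enum and every function name below are hypothetical and exist just to show the two-step retire and the check done on the parking path.

```c
#include <stdbool.h>

/* Hypothetical model: what the debug tracker believes about the object. */
enum dbg_state { DBG_INACTIVE, DBG_ACTIVE };

struct tracked {
	int count;           /* stands in for the VMA's active counter */
	enum dbg_state dbg;  /* state last reported to the debug tracker */
};

/* Retire modelled as two separate steps, as in the description above. */
static void retire_drop_count(struct tracked *o)
{
	o->count--;              /* step 1: counter may reach 0 here */
}

static void retire_report_inactive(struct tracked *o)
{
	o->dbg = DBG_INACTIVE;   /* step 2: tracker learns about it */
}

/*
 * Parking path: treats the object as idle once the counter is 0.
 * Returns true if freeing now would trip a "free active" report.
 */
static bool park_free_would_warn(const struct tracked *o)
{
	return o->count == 0 && o->dbg == DBG_ACTIVE;
}
```

If the parking path runs between the two retire steps, `park_free_would_warn()` returns true; once both steps have completed it returns false.
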
> In case of single-tile platforms, an engine associated with a request that
> uses the VMA is usually keeping the tile's wakeref long enough for that
> VMA to be deactivated on time, before it is going to be freed on last put
> of that wakeref.  However, on multi-tile platforms, a request may use a
> VMA from a tile other than the one that hosts the request's engine, in
> which case the VMA is not protected by the engine's wakeref.
> 
> Get an extra wakeref for the VMA's tile when activating it, and put that
> wakeref only after the VMA is deactivated.  However, exclude GGTT from
> that processing path, otherwise the GPU never goes idle.  Since
> __i915_vma_retire() may be called from atomic contexts, use async variant
> of wakeref put.
> 
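
For anyone following along, the idea in the paragraph above can be sketched as a small user-space model. It is not the patch itself: `struct tile`, `struct vma`, and all helpers below are hypothetical stand-ins, and the put is shown as a plain synchronous decrement, whereas the patch uses the async variant because the retire path may run in atomic context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not the i915 structures. */
struct tile {
	int wakeref;   /* outstanding wakerefs on this tile */
	bool parked;   /* true once the last wakeref is dropped */
};

struct vma {
	struct tile *tile;   /* tile the VMA belongs to */
	bool is_ggtt;        /* GGTT VMAs are excluded from the extra wakeref */
	int active;          /* active counter */
	bool holds_wakeref;  /* extra wakeref taken on activation */
};

static void tile_get(struct tile *t)
{
	t->wakeref++;
	t->parked = false;
}

/* Synchronous here for simplicity; the real put is deferred (async). */
static void tile_put(struct tile *t)
{
	assert(t->wakeref > 0);
	if (--t->wakeref == 0)
		t->parked = true;
}

/* First activation takes an extra wakeref on the VMA's own tile. */
static void vma_activate(struct vma *v)
{
	if (v->active++ == 0 && !v->is_ggtt) {
		tile_get(v->tile);
		v->holds_wakeref = true;
	}
}

/* The extra wakeref is dropped only after the VMA is fully deactivated. */
static void vma_retire(struct vma *v)
{
	assert(v->active > 0);
	if (--v->active == 0 && v->holds_wakeref) {
		v->holds_wakeref = false;
		tile_put(v->tile);
	}
}

/* Engine on tile 0, VMA on tile 1: tile 1 must stay awake until retire. */
static int demo_cross_tile(void)
{
	struct tile t0 = { 0, true }, t1 = { 0, true };
	struct vma v = { &t1, false, 0, false };
	int ok = 1;

	tile_get(&t0);        /* the engine's wakeref covers tile 0 only */
	vma_activate(&v);
	ok &= !t1.parked;     /* tile 1 is kept awake by the VMA */
	tile_put(&t0);        /* engine goes idle first */
	ok &= !t1.parked;     /* still held by the VMA's wakeref */
	vma_retire(&v);
	ok &= t1.parked;      /* tile 1 parks only after deactivation */
	return ok;
}
```

In this model a GGTT VMA never takes the extra reference, so activating it does not keep its tile awake, which mirrors the exclusion mentioned above.
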
> CI reports indicate that single-tile platforms also suffer sporadically
> from the same race.  However, unlike in the multi-tile case, the exact
> scenario in which that happens hasn't been discovered yet.  So, while I
> submit this patch as a fix for multi-tile cases, in the hope that it also
> addresses single-tile, I'm not able to blame any particular commit for
> that issue.
> However, I'm going to ask i915 maintainers to include this fix, if
> accepted, in the current rc cycle (6.7-rc) as important for the first
> supported multi-tile platform -- Meteor Lake.
> 
> v2: Get the wakeref before vm mutex to avoid circular locking dependency,
>   - drop questionable Fixes: tag.
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>

Can it be:

Fixes: 12c255b5dad1 ("drm/i915: Provide an i915_active.acquire callback")
Cc: Chris Wilson <chris at chris-wilson.co.uk>
Cc: <stable at vger.kernel.org> # v5.4+

It's not exactly the commit we need the fix for, as this patch is
fixing something on its own that is not explicitly stated (maybe it
was a precautionary measure).

But if the fix has to be applied, it has to date back to that
period, I guess.

Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>

Andi

