[Intel-gfx] [PATCH 02/42] drm/i915: Stop the machine whilst capturing the GPU crash dump

Joonas Lahtinen joonas.lahtinen at linux.intel.com
Fri Oct 7 10:11:54 UTC 2016


On pe, 2016-10-07 at 10:45 +0100, Chris Wilson wrote:
> The error state is purposefully racy as we expect it to be called at any
> time and so have avoided any locking whilst capturing the crash dump.
> However, with multi-engine GPUs and multiple CPUs, those races can
> manifest into OOPSes as we attempt to chase dangling pointers freed on
> other CPUs. Under discussion are lots of ways to slow down normal
> operation in order to protect the post-mortem error capture, but what it
> we take the opposite approach and freeze the machine whilst the error
> capture runs (note the GPU may still running, but as long as we don't
> process any of the results the driver's bookkeeping will be static).
> 
> Note that by of itself, this is not a complete fix. It also depends on
> the compiler barriers in list_add/list_del to prevent traversing the
> lists into the void. We also depend that we only require state from
> carefully controlled sources - i.e. all the state we require for
> post-mortem debugging should be reachable from the request itself so
> that we only have to worry about retrieving the request carefully. Once
> we have the request, we know that all pointers from it are intact.
> 
> v2: Avoid drm_clflush_pages() inside stop_machine() as it may use
> stop_machine() itself for its wbinvd fallback.
> 
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>

Reviewed-by: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>

CC'in Daniel to add A-b.

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation


More information about the Intel-gfx mailing list