[Intel-gfx] [RFC PATCH 160/162] drm/i915/dg1: Fix GPU hang due to shmemfs page drop

Chris Wilson chris at chris-wilson.co.uk
Fri Nov 27 14:44:02 UTC 2020


Quoting Matthew Auld (2020-11-27 12:07:16)
> From: Venkata Ramana Nayana <venkata.ramana.nayana at intel.com>
> 
> This is to fix a bug in upstream
> commit a6326a4f8ffb ("drm/i915/gt: Keep a no-frills swappable copy of the default context state")
> 
> We allocate context state obj ce->state from lmem, so in __engines_record_defaults(),
> we call shmem_create_from_object(). Because it is lmem object, this call will
> create a new shmemfs file, copy the contents into it, and return the file
> pointer and assign to engine->default_state. Of course ce->state lmem object
> is freed at the end of function __engines_record_redefaults().
> 
> Because a new shmemfs file is create for engine->default_state,
> and more importantly, we DON'T mark the pages dirty after we write into it,
> the OS page cache eviction will drop these pages.
> 
> Now with the test move forward, it will create new request/context, and will
> copy the saved engine->default_state into ce->state. If the default_state
> pages are dropped during page cache eviction, the copying will get new pages,
> and copy garbage from the new pages. Next, ce->state will have wrong
> instruction and causes GPU to hang.
> 
> The fixing is very simple, we just mark the shmemfs pages to be dirty when
> writing into it, and also mark the pages to accessed when read/write to them.
> 
> Fixes: a6326a4f8ffb("drm/i915/gt: Keep a no-frills swappable copy of the default context state")

A bug fix, send it. But please write a concise changelog first.

I missed setting the dirty bit, and so the contents were not being saved
on swap out as expected. Impact is severe; any context created after
resume may be gibberish.
-Chris


More information about the Intel-gfx mailing list