On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter <daniel@ffwll.ch> wrote:
On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
Hi,
On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
On Mon, Mar 21, 2022 at 2:30 AM Christian König <christian.koenig@amd.com> wrote:
Well you can, it just means that their contexts are lost as well.
Which is rather inconvenient when deqp-egl reset tests, for example, take down your compositor ;-)
Yeah. Or anything WebGL.
System-wide collateral damage is definitely a non-starter. If that means that the userspace driver has to do what iris does and ensure everything's recreated and resubmitted, that works too, just as long as the response to 'my adblocker didn't detect a crypto miner ad' is something better than 'shoot the entire user session'.
Not sure where that idea came from, I thought at least I made it clear that legacy gl _has_ to recover. It's only vk and arb_robustness gl which should die without recovery attempt.
The entire discussion here is about who should be responsible for replay, and at least if you can decide the uapi, punting that entirely to userspace is a good approach.
Ofc it'd be nice if the collateral damage is limited, i.e. requests that are not currently on the gpu, or that are on different engines and all that, shouldn't be nuked if possible.
Also ofc since msm uapi is that the kernel tries to recover there's not much we can do there, contexts cannot be shot. But still trying to replay them as much as possible feels a bit like overkill.
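To make the policy above concrete, here is a minimal sketch under stated assumptions. It is not taken from any driver: `Api`, `Context`, and `handle_hang` are hypothetical names, and the model only captures the rules stated in the thread (legacy GL is recreated and resubmitted by the userspace driver, Vulkan and robust GL contexts are reported lost, and contexts that are idle or on other engines are left alone).

```python
from dataclasses import dataclass
from enum import Enum, auto


class Api(Enum):
    LEGACY_GL = auto()   # app has no way to learn about a reset
    ROBUST_GL = auto()   # app opted in to reset notification
    VULKAN = auto()      # app is told the device was lost


@dataclass
class Context:
    owner: str
    api: Api
    engine: str
    active: bool          # a request from this context was on the GPU at hang time
    lost: bool = False    # reported to the app as lost, no recovery attempted
    recreated: int = 0    # times userspace rebuilt the context and resubmitted


def handle_hang(contexts, hung_engine):
    """Apply the per-context policy after a hang on hung_engine (hypothetical)."""
    for ctx in contexts:
        # Limit collateral damage: other engines and idle contexts are untouched.
        if ctx.engine != hung_engine or not ctx.active:
            continue
        if ctx.api is Api.LEGACY_GL:
            # The userspace driver recreates the context and resubmits its work.
            ctx.recreated += 1
        else:
            # Vulkan and robust GL: report the loss and leave recovery to the app.
            ctx.lost = True
    return contexts
```

In this model a hang caused by a Vulkan app on one engine never touches a legacy GL compositor that was idle or running on a different engine.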
It would perhaps have been nice if older gens, which don't (yet) have per-process pgtables, had gone with userspace replay (although that would require a lot more tracking in userspace than what is done currently). But fortunately those older gens don't use "state objects" which could potentially be corrupted; they instead re-emit state in the cmdstream, so there is a lot less possibility for bad collateral damage. (On all the gens we also use gpu read-only buffers whenever the gpu does not need to be able to write them.)
For newer stuff, the process isolation works pretty well. In fact we recently changed MSM_PARAM_FAULTS to only report faults/hangs in the same address space, so the compositor is not even aware (and doesn't need to be aware).
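A toy model of that per-address-space accounting, purely for illustration: `FaultTracker`, `record_fault`, `query_faults`, and `reset_detected` are hypothetical names and this is not the msm kernel code. It only shows why a process that polls a fault counter scoped to its own address space never notices faults caused by someone else.

```python
from collections import defaultdict


class FaultTracker:
    """Toy per-address-space fault counter (hypothetical, not the msm code)."""

    def __init__(self):
        # address-space id -> number of faults/hangs attributed to it
        self._faults = defaultdict(int)

    def record_fault(self, aspace_id):
        self._faults[aspace_id] += 1

    def query_faults(self, aspace_id):
        # Only faults from the caller's own address space are visible.
        return self._faults[aspace_id]


def reset_detected(tracker, aspace_id, last_seen):
    """Userspace-style check: did the counter move since the last query?"""
    return tracker.query_faults(aspace_id) != last_seen
```

With this scoping, a fault recorded against the browser's address space changes what the browser sees on its next query, while the compositor's view of the counter stays where it was.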
BR, -R
-Daniel
Cheers, Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch