GPU-side memory protection landscape
Michel Dänzer
michel at daenzer.net
Tue Dec 1 11:03:25 UTC 2020
On 2020-11-30 3:07 p.m., Alexander Monakov wrote:
>
> My other concern is how easy it is to cause system instability or hangs
> by out-of-bounds writes from the GPU (via compute shaders or copy
> commands). In my experience of several years doing GPU computing with
> NVIDIA tech, I don't recall needing to lose time rebooting my PC after
> running a buggy CUDA "kernel". Heck, I could run the GCC C testsuite on
> the GPU without worrying about locking myself and others from the
> server. But now when I develop on a laptop with AMD's latest mobile SoC,
> every time I make a mistake in my GLSL code it more often than not forces
> a reboot. I hope you understand what a huge pain it is.
>
> What are the existing GPU hardware capabilities for memory protection
> (both in terms of preventing random accesses to system memory like with
> an IOMMU, and in terms of isolating different process contexts from each
other), and to what extent Linux DRM drivers are taking advantage of
> them?
Modern (or rather non-ancient at this point, basically anything which
came out within the last decade) AMD GPUs have nearly perfect protection
between different execution contexts (i.e. normally different processes,
though it's not always a 1:1 mapping). Each context has its own virtual
GPU address space and cannot access any memory which isn't mapped into
it. The kernel driver only maps memory belonging to a buffer object
which the context has permission to access and has explicitly asked to
have mapped into its address space.
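To make the isolation model above concrete, here is a toy Python sketch. It is not how amdgpu is implemented. The names `GpuContext`, `BufferObject`, `map_bo` and `GpuFault` are all hypothetical, a simple range list stands in for the GPU's page tables, and a Python exception stands in for the VM fault that real hardware would report. It shows the three properties described: an out-of-bounds access faults instead of landing in someone else's memory, the "driver" refuses to map a buffer a context isn't allowed to access, and the same virtual address means nothing in another context's address space.

```python
# Toy model of per-context GPU virtual address spaces.
# All names here are hypothetical; this is not amdgpu's real API.


class GpuFault(Exception):
    """Stands in for a GPU VM fault on an unmapped virtual address."""


class BufferObject:
    def __init__(self, size, allowed_contexts):
        self.data = bytearray(size)
        # Contexts that may map this BO (stand-in for the kernel's checks).
        self.allowed = set(allowed_contexts)


class GpuContext:
    def __init__(self, ctx_id):
        self.ctx_id = ctx_id
        # This context's own address space: list of (start_va, bo) pairs.
        self.mappings = []

    def _translate(self, va):
        # Only addresses inside a mapped BO resolve; everything else faults.
        for start, bo in self.mappings:
            if start <= va < start + len(bo.data):
                return bo, va - start
        raise GpuFault(f"context {self.ctx_id}: unmapped VA {va:#x}")

    def write(self, va, value):
        bo, offset = self._translate(va)
        bo.data[offset] = value

    def read(self, va):
        bo, offset = self._translate(va)
        return bo.data[offset]


def map_bo(ctx, bo, va):
    """'Kernel driver' path: map bo at va only if ctx is permitted to
    access it; the context must call this explicitly to get a mapping."""
    if ctx.ctx_id not in bo.allowed:
        raise PermissionError(f"context {ctx.ctx_id} may not map this BO")
    for start, other in ctx.mappings:
        if va < start + len(other.data) and start < va + len(bo.data):
            raise ValueError("VA range overlaps an existing mapping")
    ctx.mappings.append((va, bo))


# Usage: two contexts, one buffer that only context 1 may access.
ctx1, ctx2 = GpuContext(1), GpuContext(2)
bo1 = BufferObject(4096, allowed_contexts={1})
map_bo(ctx1, bo1, 0x100000)
ctx1.write(0x100000, 42)

outcomes = []
try:
    ctx1.write(0x100000 + 4096, 7)   # one byte past the end of bo1
except GpuFault:
    outcomes.append("oob write faulted")
try:
    map_bo(ctx2, bo1, 0x100000)      # context 2 lacks permission
except PermissionError:
    outcomes.append("foreign map refused")
try:
    ctx2.read(0x100000)              # same VA, but nothing mapped in ctx2
except GpuFault:
    outcomes.append("ctx2 read faulted")
print(outcomes)
# prints ['oob write faulted', 'foreign map refused', 'ctx2 read faulted']
```

The sketch deliberately models only isolation. It says nothing about hang detection or recovery, which the reply identifies as the more likely source of the instability.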
The instability you're seeing likely isn't due to a lack of memory
protection. It is more likely caused by one of the many other ways a GPU
can end up in a hung state, combined with the drivers and the wider
ecosystem not yet being very good at recovering from that.
--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and X developer