[PATCH 1/1] drm/amdkfd: Add IPC API

Tue Jul 14 14:37:00 UTC 2020

On Tue, Jul 14, 2020 at 01:36:41PM +0200, Christian König wrote:
> Am 14.07.20 um 11:53 schrieb Daniel Vetter:
> > On Tue, Jul 14, 2020 at 11:28:12AM +0200, Christian König wrote:
> > > Am 14.07.20 um 10:58 schrieb Daniel Vetter:
> > > > On Tue, Jul 14, 2020 at 02:26:36PM +1000, Dave Airlie wrote:
> > > > > On Tue, 14 Jul 2020 at 14:09, Felix Kuehling <felix.kuehling at amd.com> wrote:
> > > > > > Am 2020-07-13 um 11:28 p.m. schrieb Dave Airlie:
> > > > > > > On Tue, 14 Jul 2020 at 13:14, Felix Kuehling <Felix.Kuehling at amd.com> wrote:
> > > > > > > > This allows exporting and importing buffers. The API generates handles
> > > > > > > > that can be used with the HIP IPC API, i.e. big numbers rather than
> > > > > > > > file descriptors.
> > > > > > > First up why? I get the how.
> > > > > > The "why" is compatibility with HIP code ported from CUDA. The
> > > > > > equivalent CUDA IPC API works with handles that can be communicated
> > > > > > through e.g. a pipe or shared memory. You can't do that with file
> > > > > > descriptors.
> > > > > Okay that sort of useful information should definitely be in the patch
> > > > > description.
> > > > > 
> > > > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.nvidia.com%2Fcuda%2Fcuda-runtime-api%2Fgroup__CUDART__DEVICE.html%23group__CUDART__DEVICE_1g8a37f7dfafaca652391d0758b3667539&data=02%7C01%7Cchristian.koenig%40amd.com%7C2e0b8a6d2aac49e0f6c908d827dbcb46%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637303172250574336&sdata=mDWVEDsP%2BKTvvhqYp%2BSstczPtEV9l7n%2B%2BuNj30de0sQ%3D&reserved=0
> > > > > > 
> > > > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.nvidia.com%2Fcuda%2Fcuda-runtime-api%2Fgroup__CUDART__DEVICE.html%23group__CUDART__DEVICE_1g01050a29fefde385b1042081ada4cde9&data=02%7C01%7Cchristian.koenig%40amd.com%7C2e0b8a6d2aac49e0f6c908d827dbcb46%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637303172250574336&sdata=yb80LA1csqkctuF1rAXBpYgJT%2BkS0Nmfilnd8%2BjQSW4%3D&reserved=0
> > > > > > 
> > > > > > > > + * @share_handle is a 128 bit random number generated with
> > > > > > > > + * @get_random_bytes. This number should be very hard to guess.
> > > > > > > > + * Knowledge of the @share_handle implies authorization to access
> > > > > > > > + * the shared memory. User mode should treat it like a secret key.
> > > > > > > > + * It can be used to import this BO in a different process context
> > > > > > > > + * for IPC buffer sharing. The handle will be valid as long as the
> > > > > > > > + * underlying BO exists. If the same BO is exported multiple times,
> > > > > > > Do we have any examples of any APIs in the kernel that operate like
> > > > > > > this? That don't at least layer some sort of file permissions  and
> > > > > > > access control on top?
> > > > > > SystemV shared memory APIs (shmget, shmat) work similarly. There are
> > > > > > some permissions that can be specified by the exporter in shmget.
> > > > > > However, the handles are just numbers and much easier to guess (they are
> > > > > > 32-bit integers). The most restrictive permissions would allow only the
> > > > > > exporting UID to attach to the shared memory segment.
> > > > > > 
> > > > > > I think DRM flink works similarly as well, with a global name IDR used
> > > > > > for looking up GEM objects using global object names.
> > > > > flink is why I asked, because flink was a mistake and not one I'd care
> > > > > to make again.
> > > > > shm is horrible also, but at least has some permissions on what users
> > > > > can attack it.
> > > > Yeah this smells way too much like flink. I had the same reaction, and
> > > > kinda sad that we have to do this because nvidia defines how this works
> > > > with 0 input from anyone else. Oh well, the world sucks.
> > > > 
> > > > > > > The reason fd's are good is that combined with unix sockets, you can't
> > > > > > > sniff it, you can't ptrace a process and find it, you can't write it
> > > > > > > out in a coredump and have someone access it later.
> > > > > > Arguably ptrace and core dumps give you access to all the memory
> > > > > > contents already. So you don't need the shared memory handle to access
> > > > > > memory in that case.
> > > > > core dumps might not dump this memory though, but yeah ptrace would
> > > > > likely already mean you have access.
> > > > > 
> > > > > > > Maybe someone who knows security can ack merging this sort of uAPI
> > > > > > > design, I'm not confident in what it's doing is in any ways a good
> > > > > > > idea. I might have to ask some people to take a closer look.
> > > > > > Please do. We have tried to make this API as secure as possible within
> > > > > > the constraints of the user mode API we needed to implement.
> > > > > I'll see if I hear back, but also if danvet has any input like I
> > > > > suppose it's UUID based buffer access, so maybe 128-bit is enough and
> > > > > you have enough entropy not to create anything insanely predictable.
> > > > So one idea that crossed my mind is if we don't want to do this as a
> > > > generic dma-buf handle converter.
> > > > 
> > > > Something like /dev/dri/cuda_is_nasty (maybe slightly nicer name) which
> > > > provides a generic dma-buf <-> cuda uuid converter. With separate access
> > > > restrictions, so admins can decide whether they want to allow this
> > > > silliness, or not. Anyone else who wants to reimplement cuda will need
> > > > this too, so that's another reason for splitting this out.
> > > > 
> > > > Wrt security: I think assuming that there's none and the lookup has a
> > > > side-channel you can use to efficiently scan the entire range is probably
> > > > the safe approach here. This is way out of my league, but I think people
> > > > who know how to do this won't have a much harder time scanning this than
> > > > the flink space.
> > > > 
> > > > Also, if we have one common uuid->dma-buf converter, we might actually
> > > > have a chance to proof the "it's not secure" assumption wrong. Also, we
> > > > might be able to tie this into cgroups or namespaces or similar that way.
> > > > 
> > > > Just some thoughts to give my initial "eek, why this" reaction a bit more
> > > > substance :-) No idea whether this would work or make more sense.
> > > Yeah, my initial reaction was the same. On the pro side is that we use more
> > > than the 32bits flink did as identifier.
> > flink started at 0, so in practice it was trivial to enumerate. Not even
> > randomized.
> > 
> > But the thing is if your uuid lookup isn't guaranteed to be const and
> > side-channel free, then you can use that to guess where ids are. Doesn't
> > need to be much until you can brute-force the remaining bits. Engineering
> > an implementation (not just the theory) that relies on the assumption that
> > full brute-force is the fastes option is very hard engineering. And I
> > think here in the gpu world we just don't have any of that expertise.
> 
> Well being able to look up up ids is not necessary a problem. See that root
> can see all buffers for debugging purposes is most likely not even a bad
> idea.
> 
> When we manage it as an fs we can just add/remove the executable bit from
> the directory to control enumeration.
> 
> And we can still have normal r/w permissions on the individual buffers as
> well.

Yeah id look-up isn't a problem, it's that the uuid comes with nothing
else like a fd or inode attached which would allows people to add some
security checks and namespacing to it. Which means that the entire
security boils down to "you can't guess the uuid", which means the
hashtable lookup must be sidechannel proof. Which we're not going to be
able to pull off I think :-)

> > > What we could maybe do to improve this is to link DMA-buf file descriptors
> > > into the file system from userspace. And then we could just do something
> > > like:
> > > 
> > > open("/tmp/dma-buf-0x0123-4567-89AB-CDEF-0123-4567-89AB-CDEF", "rw");
> > > 
> > > But to be honest I don't really know the fs code that well enough to judge
> > > if this is possible or not.
> > > 
> > > 
> > > Or we let DMA-bufs appear under some directory of /sys by their name so that
> > > applications can open and use them.
> > Yeah dmabuffs might be another option, but not sure how that would work,
> > since we want a creat() that takes in a dma-buf fd and then freezes those
> > two together forever. Maybe something like devpts, but I think that's also
> > somewhat considered a design mistake (because namespace or something else,
> > I dunno). Since if we link all dma-buf by default into the fs, we again
> > have the flink "free for everyone" security issues.
> > 
> > Hm I guess one option could be such a dmabufs, but still with an explicit
> > export step. And the export amounts to a creat() in dmabufs, using the
> > uid/gid and entire security context of the process that has done the
> > dmabuf2uuid export.
> > 
> > That would also give us namespacing for free, using fs namespaces. All
> > we'd need is multiple instances of this dmabuffs. Plus I guess we'd need a
> > canonical mount point for this thing, and then a special ioctl on the root
> > node for creating a file from a dma-buf fd.
> > 
> > Feels mildly overengineered, but I think this would at least make the cuda
> > uuid stuff fit reasonably well into the overall linux architecture we
> > have. Only bikeshed left would be to figure out where to mount this fs.
> > Maybe /dev/dma-buf-uuids or something like that as the canonical thing.
> > /sys feels misplaced, and we alread have /dev/pts for similar stuff.
> 
> Yeah a dmabuffs sounds overengineered to me as well.
> 
> It's a pity that we can't just open anonymous inodes under
> /proc/$pid/fd/$fd.
> 
> This way we would just need to encode the pid and fd number in the id
> transmitted to the other process and could solve it this way.
> 
> That an application changes the permissions of its own file descriptors
> because it needs to implement some questionable API design is not a problem
> of the kernel.

Hm could we fix this by making the inode non-anon? Maybe also quite a pile
of overkill ...

Another classic issue is lifetime of these names, I guess these are
supposed to be weak references, i.e. the uuid disappears with the last
buffer reference (like flink) and doesn't hold a reference of its own
(like dma-buf fd). Implementing that in a dmabuffs would be quite a pile
of fun.

So maybe it's a lot more nasty in the implementation, but dmabuffs feels a
pile cleaner for all this. Also gives us a clear thing to throw at the
next cuda reimplementation.
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Wrt typing up an entire fs, I thought with all the libfs work that
> > shouldn't be too hard to pull off.
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch