[systemd-devel] deny access to GPU devices

Markus Koeberl markus.koeberl at tugraz.at
Mon Nov 14 12:13:21 UTC 2016


On Friday 11 November 2016 21:09:14 Lennart Poettering wrote:
> On Mon, 07.11.16 16:15, Markus Koeberl (markus.koeberl at tugraz.at) wrote:
> 
> > hi!
> > 
> > I am using slurm to manage GPU resources. On a host with several
> > GPUs installed a user gets only access to the GPUs he asks slurm
> > for. This is implemented by using the devices cgroup controller. For
> > each job slurm starts, all devices which are not allowed get denied
> > using cgroup devices.deny.  But by default users get access to all
> > GPUs at login. As my users have ssh access to the host they can
> > bypass slurm and access all GPUs directly. Therefore I would like to
> > deny access to GPU devices for all user logins.
> 
> I have no idea what "slurm" is, but do note that the "devices" cgroup
> controller has no future, it is unlikely to ever become available in
> cgroupsv2.

That is bad news. Is there a place where I can read about the future of cgroups?
Slurm is a workload manager running on about 60% of the TOP500 supercomputers. 

> Device access to local users is normally managed through ACLs on the
> device node, via udev/logind's "uaccess" logic. Using the "devices"
> cgroup controller for this appears pretty misguided...

Using the devices cgroup controller makes it possible to extend this:
one process of a user can be granted access to a device
while, at the same time, another process of the same user is denied access to that same device.

In the case of a batch system that is supposed to manage all resources, this is a very welcome feature:
For example, I manage hosts with 6 Nvidia Tesla K40 GPUs, 2 Intel Xeon 14-core CPUs and 256GB RAM.
To make sure all resources are well utilized over time, several users are allowed to use the same host at the same time, since it is unlikely that a single user could utilize it alone.
Therefore all users get permission to access the GPUs (or any other resource), but a resource management system (for example slurm) is used which knows the resource requirements of each individual process. Traditionally such a system monitors the resources used by all processes and terminates any process that violates the limits it asked for, to keep the system stable.
Using cgroups makes this monitoring job much easier for the resource management system and at the same time makes things easier for users, because many more of their mistakes can be handled safely without interfering with other users.
For example (a rough sketch of these controls follows below):
cpuset cgroup controller: pin a process and all its sub-processes to the same CPUs.
memory cgroup controller: a much tighter and more reliable limit than polling every 30 seconds and terminating processes.
devices cgroup controller: deny access to a device so that it cannot be used by accident.
It also provides a very easy way to collect accounting information.
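To illustrate, the per-job controls I mean look roughly like this on cgroup v1 (a simplified, hand-written sketch; in reality the resource manager creates a job-specific directory such as /sys/fs/cgroup/devices/slurm/uid_1044/job_555359/step_0 in each hierarchy and moves the job's tasks into it):

# simplified per-job sketch; CG is the job-specific cgroup directory
CG=slurm/uid_1044/job_555359/step_0

# cpuset: pin the job to two cores and one memory node
echo 0-1 > /sys/fs/cgroup/cpuset/$CG/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/$CG/cpuset.mems

# memory: a hard limit instead of polling and killing
echo 8G > /sys/fs/cgroup/memory/$CG/memory.limit_in_bytes

# devices: deny a GPU the job did not ask for (195 = nvidia character devices)
echo "c 195:1 rwm" > /sys/fs/cgroup/devices/$CG/devices.deny

# accounting comes almost for free
cat /sys/fs/cgroup/cpuacct/$CG/cpuacct.usage
cat /sys/fs/cgroup/memory/$CG/memory.max_usage_in_bytes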

To me, using the devices cgroup controller sounded like a very promising and clean solution.
I am only a system administrator with no insight into the development process of the Linux kernel.
Given the information you provided I will stop spending time on this and forward it to the slurm mailing list, in case the developers are not aware of it yet.


> > Basically what I want is for all user logins: 
> > echo "c 195:* rwm" > /sys/fs/cgroup/devices/... /devices.deny
> > Which should deny access to all Nvidia GPUs (this is what slurm does
> > in his own hierarchy which looks like
> > /sys/fs/cgroup/devices/slurm/uid_1044/job_555359/step_0).
> 
> Well, this is just broken. If you use systemd, then the cgroup tree in
> the hierarchies it manages are property of systemd, and if you want to
> make cgroups, you can do so only in subhierarchies of systemd's own
> tree, by setting on the Delegate=yes setting. The path above however
> indicates that this is not done here. hence you are really on your
> own, sorry.

I saw a posting about this earlier on this mailing list, so I hope the slurm developers already know about it.
But there seems to be no bug report about it in the slurm bug tracker, so I will create one to be sure.
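In case it helps the bug report: as far as I understand the fix Lennart describes, it would come down to something like this (untested sketch, assuming the slurm daemon runs as a systemd service called slurmd.service):

# give slurmd its own delegated cgroup subtree instead of letting it write
# into /sys/fs/cgroup/devices/slurm/... outside of systemd's tree
mkdir -p /etc/systemd/system/slurmd.service.d
cat > /etc/systemd/system/slurmd.service.d/delegate.conf <<'EOF'
[Service]
Delegate=yes
EOF
systemctl daemon-reload
systemctl restart slurmd.service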


> Also, where does 195 come from? is that a hardcoded major of the
> closed-source nvidia driver? Yuck, code really shouldn't hardcode
> major/minor numbers these days... And sec

I do not know what is going on in the closed-source nvidia driver. In the slurm source code I checked, the major number is not hardcoded.
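For what it is worth, the major number can also be looked up at runtime instead of being hardcoded, for example (assuming the nvidia module is loaded and the device nodes exist):

# character-device major as registered by the driver
grep -i nvidia /proc/devices
# or read major:minor (in hex) from an existing device node
stat -c '%t:%T %n' /dev/nvidia0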

> > I did not find anything in the documentation how to implement
> > this. It seems to me that there is no way at the moment to configure
> > systemd to alter the cgroup device config when creating the session
> > for the user.  It would be nice if somebody could give me some hints
> > how to implement this or a link to an implementation or the right
> > documentation.
> 
> You can alter the DevicesAllow= property of the "user-1000.slice"
> (where 1000 is the uid of your user) unit. But do note that the whole
> "devices" cgroup controller is going away (as mentioned above), so
> this is not future proof. And in general ACL-based device access
> management is usually the better idea.

I had the impression that I need the opposite, because using this to deny access won't work, but I have to admit I did not test it.
Would "DeviceAllow=/dev/nvidia?" (with rwm omitted) remove the r, w and m permissions from /dev/nvidia[0-9]?

I also did not see a way to specify this for all users, so it would mean maintaining the configuration for each individual user on every host, which I do not like. Although I have a small number of users and hosts, this sounds complicated to maintain, especially since my environment is highly inhomogeneous.
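If I test it at all, it would probably be with a per-user drop-in along these lines (untested sketch; as far as I understand, DevicePolicy=closed turns the slice into a whitelist, so additional DeviceAllow= lines for pseudo terminals and the like would very likely be needed, and uid 1044 is just one example user):

mkdir -p /etc/systemd/system/user-1044.slice.d
cat > /etc/systemd/system/user-1044.slice.d/deny-gpu.conf <<'EOF'
[Slice]
# only the devices listed below (plus the standard pseudo devices) stay accessible
DevicePolicy=closed
# whatever interactive sessions still need, e.g. pseudo terminals
DeviceAllow=char-pts rw
EOF
systemctl daemon-reload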


Thank you very much for all the information!


regards
Markus Köberl
-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeberl at tugraz.at

