[Intel-gfx] DRM cgroups integration (Was: Re: [PATCH v4 0/8] cgroup private data and DRM/i915 integration)

Thu Apr 5 17:32:16 UTC 2018

Adding Kenny, who recently started looking into cgroup support from the
AMD GPU computing side.

Regards,
  Felix

On 2018-04-05 11:48 AM, Matt Roper wrote:
> Just realized Joonas had some other comments down at the bottom that I
> missed originally...
>
> On Thu, Apr 05, 2018 at 08:06:23AM -0700, Matt Roper wrote:
>> On Thu, Apr 05, 2018 at 07:49:44AM -0700, Matt Roper wrote:
>>> On Thu, Apr 05, 2018 at 05:15:13PM +0300, Joonas Lahtinen wrote:
>>>> + Some more Cc's based on IRC discussion
>>>>
>>>> Quoting Joonas Lahtinen (2018-04-05 16:46:51)
>>>>> + Dave for commenting from DRM subsystem perspective. I strongly believe
>>>>> there would be benefit from agreeing on some foundation of DRM subsystem
>>>>> level program GPU niceness [-20,19] and memory limit [0,N] pages.
>>> +Chris Wilson
>> +more Cc's based on IRC discussion.
>>
>>
>> Matt
>>
>>> If we ignore backward compatibility and ABI issues for now and assume
>>> all drivers can move to [-20, 19] for application-accessible priority
>>> ranges (i.e., self opt-in via existing context parameter ioctls and
>>> such), I think we still need a much larger range for the priority offset
>>> assigned via cgroup.  One of the goals with the cgroup work is to give
>>> the system integrator the ability to define classes of applications with
>>> non-overlapping ranges (i.e., in some systems an app in the "very
>>> important" cgroup that self-lowers itself as far as possible should
>>> still be prioritized higher than an app in the "best effort" cgroup that
>>> self-boosts itself as far as possible).  Chris Wilson and I discussed
>>> that a bit on this thread if you want to see the context:
>>>
>>> https://lists.freedesktop.org/archives/intel-gfx/2018-March/158204.html
>>>
>>> That discussion was based on i915's current application-accessible range
>>> of [-1023,1023].  If we shift to a much smaller [-20,19] range for
>>> applications to use directly, then we might want to allow cgroup offset
>>> to be something like [-1000,1000] to ensure the system integrator has
>>> enough flexibility?
>>>
>>> Also note that "display boost" (i.e., a priority boost for contexts that
>>> are blocking a flip) may vary depending on system setup, so setting
>>> being able to define the display boost's effective priority via cgroup
>>> is important as well.
>>>
>>>
>>> Matt
>>>
>>>
>>>
>>>>> Quoting Matt Roper (2018-03-30 03:43:13)
>>>>>> On Mon, Mar 26, 2018 at 10:30:23AM +0300, Joonas Lahtinen wrote:
>>>>>>> Quoting Matt Roper (2018-03-23 17:46:16)
>>>>>>>> On Fri, Mar 23, 2018 at 02:15:38PM +0200, Joonas Lahtinen wrote:
>>>>>>>>> Quoting Matt Roper (2018-03-17 02:08:57)
>>>>>>>>>> This is the fourth iteration of the work previously posted here:
>>>>>>>>>>   (v1) https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
>>>>>>>>>>   (v2) https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg208170.html
>>>>>>>>>>   (v3) https://lists.freedesktop.org/archives/intel-gfx/2018-March/157928.html
>>>>>>>>>>
>>>>>>>>>> The high level goal of this work is to allow non-cgroup-controller parts
>>>>>>>>>> of the kernel (e.g., device drivers) to register their own private
>>>>>>>>>> policy data for specific cgroups.  That mechanism is then made use of in
>>>>>>>>>> the i915 graphics driver to allow GPU priority to be assigned according
>>>>>>>>>> to the cgroup membership of the owning process.  Please see the v1 cover
>>>>>>>>>> letter linked above for a more in-depth explanation and justification.
>>>>>>>>> Hi Matt,
>>>>>>>>>
>>>>>>>>> For cross-subsystem changes such as this, it makes sense to Cc all
>>>>>>>>> relevant maintainers, especially if there have been previous comments to
>>>>>>>>> earlier revisions.
>>>>>>>>>
>>>>>>>>> Please, do include and keep a reference to the userspace portion of the
>>>>>>>>> changes when you suggest new uAPI to be added. At least I have some trouble
>>>>>>>>> trying to track down the relevant interface consumer here.
>>>>>>>>>
>>>>>>>>> I'm unsure how much sense it makes to commence with detailed i915 review
>>>>>>>>> if we will be blocked by lack of userspace after that? I'm assuming
>>>>>>>>> you've read through [1] already.
>>>>>>>> Hi Joonas,
>>>>>>>>
>>>>>>>> I've sent the userspace code out a few times, but it looks like I forgot
>>>>>>>> to include a copy with v4/v4.5.  Here's the version I provided with v3:
>>>>>>>>   https://lists.freedesktop.org/archives/intel-gfx/2018-March/157935.html
>>>>>>> Thanks. Keeping that in the relevant commit message of the patch that
>>>>>>> introduces the new uAPI will make it harder to forget and easiest for
>>>>>>> git blame, too.
>>>>>>>
>>>>>>>> Usually we don't consider things like i-g-t to be sufficient userspace
>>>>>>>> consumers because we need a real-world consumer rather than a "toy"
>>>>>>>> userspace.  However in this case, the i-g-t tool, although very simple,
>>>>>>>> is really the only userspace consumer I expect there to ever be.
>>>>>>>> Ultimately the true consumer of this cgroups work are bash scripts, sysv
>>>>>>>> init scripts, systemd recipes, etc.  that just need a very simple tool
>>>>>>>> to assign the specific values that make sense on a given system.
>>>>>>>> There's no expectation that graphics clients or display servers would
>>>>>>>> ever need to make use of these interfaces.
>>>>>>> I was under the impression that a bit more generic GPU cgroups support
>>>>>>> was receiving a lot of support in the early discussion? A dedicated
>>>>>>> intel_cgroup sounds underwhelming, when comparing to idea of "gpu_nice",
>>>>>>> for user adoption :)
>>>>>> I'm open to moving the cgroup_priv registration/lookup to the DRM core
>>>>>> if other drivers are interested in using this mechanism and if we can
>>>>>> come to an agreement on a standard priority offset range to support, how
>>>>>> display boost should work for all drivers, etc.  There might be some
>>>>>> challenges mapping a DRM-defined priority range down to a different
>>>>>> range that makes sense for individual driver schedulers, especially
>>>>>> since some drivers already expose a different priority scheme to
>>>>>> userspace via other interfaces like i915 does with GEM context priority.
>>>>>>
>>>>>> So far I haven't really heard any interest outside the Intel camp, but
>>>>>> hopefully other driver teams can speak up if they're for/against this.
>>>>>> I don't want to try to artificially standardize this if other drivers
>>>>>> want to go a different direction with priority/scheduling that's too
>>>>>> different from the current Intel-driven design.
>>>>> I don't think there are that many directions to go about GPU context
>>>>> priority, considering we have the EGL_IMG_context_priority extension, so
>>>>> it'll only be about granularity of the scale.
>>>>>
>>>>> I would suggest to go with the nice like scale for easy user adoption,
>>>>> then just apply that as the N most significant bits.
>>>>>
>>>>> The contexts could then of course further adjust their priority from what
>>>>> is set by the "gpu_nice" application with the remaining bits.
>>>>>
>>>>> I'm strongly feeling this should be a DRM level "gpu_nice". And the
>>>>> binding to cgroups should come through DRM core. If it doesn't, limiting
>>>>> the amount of memory used becomes awkward as the allocation is
>>>>> centralized to DRM core.
> My earliest iterations of this work did add the interfaces to set
> cgroup-based GPU priority at the DRM core.  I didn't really get the
> sense at the time that anyone outside of Intel was interested in the
> cgroups work, so I moved the interfaces back into i915 for subequent
> updates, but I'm happy to move it back.  Note that even in that case
> this will still end up as a DRM core ioctl rather than a true cgroups
> controller; see below.
>
>>>>>>> Also, I might not be up-to-date about all things cgroups, but the way
>>>>>>> intel_cgroup works, feels bit forced. We create a userspace context just
>>>>>>> to communicate with the driver and the IOCTL will still have global
>>>>>>> effects. I can't but think that i915 reading from the cgroups subsystem
>>>>>>> for the current process would feel more intuitive to me.
>>>>>> I think you're referring to the earlier discussion about exposing
>>>>>> priority directly via the cgroups filesystem?  That would certainly be
>>>>>> simpler from a userspace perspective, but it's not the direction that
>>>>>> the cgroups maintainer wants to see things go.  Adding files directly to
>>>>>> the cgroups filesystem is supposed to be something that's reserved for
>>>>>> official cgroups controllers.  The GPU priority concept we're trying to
>>>>>> add here doesn't align with the requirements for creating a controller,
>>>>>> so the preferred approach is to create a custom interface (syscall or
>>>>>> ioctl) that simply takes a cgroup as a parameter.  There's precendent
>>>>>> with similar interfaces in areas like BPF (where the bpf() system call
>>>>>> can accept a cgroup as a parameter and then perform its own private
>>>>>> policy changes as it sees fit).
>>>>>>
>>>>>> Using a true cgroups controller and exposing settings via the filesystem
>>>>>> is likely still the way we'll want to go for some other types of
>>>>>> cgroups-based policy in the future (e.g., limiting GPU memory usage); it
>>>>>> just isn't the appropriate direction for priority.
>>>>> Might be just me but feels bit crazy to be setting GPU memory usage
>>>>> through another interface and then doing i915 specific IOCTLs to control
>>>>> the priority of that same cgroup.
>>>>>
>>>>> I don't feel comfortable adding custom cgroups dependent IOCTLs to i915
>>>>> where cgroups is only working as the variable carrier in background. We
>>>>> should really just be consuming a variable from cgroups and it should be
>>>>> set outside of of the i915 IOCTL interface.
>>>>>
>>>>> I'm still seeing that we should have a DRM cgroups controller and a DRM
>>>>> subsystem wide application to control the priority and memory usage
>>>>> to be fed to the drivers.
> A cgroups controller could be used for GPU memory management, but not
> for priority adjustment the way we're doing it right now.  Although
> there's some preexisting legacy stuff that doesn't quite follow the
> rules, the cgroups subsystem will only accept new cgroup controllers if
> they take the entire structure of the cgroups hierarchy into account
> (rather than just looking at values at a leaf node) and are based on one
> of the four resource distribution models described in the
> Documentation/cgroup-v2.txt (weights, limits, protections, allocations).
> GPU priority is somewhat close to a "weight" but not quite --- a true
> weight wouldn't allow unfair scheduling or intentional starvation which
> are very important to the use cases I'm focusing on.
>
> The cgroups maintainer has suggested that a separate interface (syscall,
> ioctl, etc.) be used for items like this that don't meet the criteria
> for a full cgroups controller, and there are other examples of this kind
> of design elsewhere in the kernel too.
>
>>>>> If we end up just supporting i915 apps, we could as well use LD_PRELOAD
>>>>> wrapper and alter the context priority at creation time for exactly the
>>>>> same effect and no extra interfaces to maintain.
> I think this is missing the point of using cgroups.  cgroups allow
> policy to be set independently by a privileged system integrator or
> system administrator, in a way that applications themselves don't have
> access to.  LD_PRELOAD would only have access to the interfaces
> available to the app itself and the app could simply undo whatever
> setting the LD_PRELOAD tried to use.
>
> The goal with cgroups isn't to provide a "gpu nice" where you
> self-classify yourself, but rather the ability for the person defining
> how the overall system works to classify and control the various types
> of applications running in a transparent and global way.
>
>>>>>>> Does the implementation mimic some existing cgroups tool or de-facto way
>>>>>>> of doing things in cgroups world?
>>>>>> The ioctl approach I took is similar to syscall approach that the BPF
>>>>>> guys use to attach BPF programs to a cgroup.  I'm not very familiar with
>>>>>> BPF or how it gets used from userspace, so I'm not sure whether the
>>>>>> interface is intended for one specific tool (like ours is), or whether
>>>>>> there's more variety for userspace consumers.
>>>>> Is the proposal to set the memory usage from similar interface, or is
>>>>> that still not implemented?
> GPU memory management is a very different resource model, so I'd expect
> memory to be managed by a real cgroups controller. I.e., the interface
> would be different.  I haven't personally done any work on cgroups-based
> memory management; I agree that would be nice to have, but none of the
> use cases my team is focused on have that as a requirement.
>
>>>>> I'm seeing a very close relation between time-slicing GPU time and
>>>>> allowed GPU buffer allocations, so having two completely different
>>>>> interfaces does just feel very hackish way of implementing this.
> If we were time slicing GPU time and giving out fractional shares of
> time to each client, that would be a very different situation than we
> have today; in that case I agree that a cgroups controller would be the
> way to go.  While there probably are a lot of cases where time slicing
> and fair scheduling would be beneficial (especially stuff like desktop
> or mobile settings), I'm mostly focused on embedded use cases that
> depend on today's unfair, strict priority scheduling.
>
>
> Matt
>
>>>>> Regards, Joonas
>>>>>
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> Regards, Joonas
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>>>>>>> the body of a message to majordomo at vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> -- 
>>>>>> Matt Roper
>>>>>> Graphics Software Engineer
>>>>>> IoTG Platform Enabling & Development
>>>>>> Intel Corporation
>>>>>> (916) 356-2795
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe cgroups" in
>>>> the body of a message to majordomo at vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> -- 
>>> Matt Roper
>>> Graphics Software Engineer
>>> IoTG Platform Enabling & Development
>>> Intel Corporation
>>> (916) 356-2795
>> -- 
>> Matt Roper
>> Graphics Software Engineer
>> IoTG Platform Enabling & Development
>> Intel Corporation
>> (916) 356-2795