GEM memory DOS (WAS Re: [PATCH 3/3] drm/ttm: under memory pressure minimize the size of memory pool)

Jesse Barnes jbarnes at virtuousgeek.org
Thu Aug 14 15:29:15 PDT 2014


On Wed, 13 Aug 2014 17:13:56 +0200
Thomas Hellstrom <thellstrom at vmware.com> wrote:

> On 08/13/2014 03:01 PM, Daniel Vetter wrote:
> > On Wed, Aug 13, 2014 at 02:35:52PM +0200, Thomas Hellstrom wrote:
> >> On 08/13/2014 12:42 PM, Daniel Vetter wrote:
> >>> On Wed, Aug 13, 2014 at 11:06:25AM +0200, Thomas Hellstrom wrote:
> >>>> On 08/13/2014 05:52 AM, Jérôme Glisse wrote:
> >>>>> From: Jérôme Glisse <jglisse at redhat.com>
> >>>>>
> >>>>> When experiencing memory pressure we want to minimize pool size so
> >>>>> that memory we just shrank is not immediately added back to the
> >>>>> pool.
> >>>>>
> >>>>> This divides the maximum pool size for each device by 2 each time
> >>>>> the pool has to shrink. The limit is bumped again if the next
> >>>>> allocation happens more than one second after the last shrink. The
> >>>>> one second delay is obviously an arbitrary choice.
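
[For illustration, a rough sketch of the sizing policy described in the
quoted commit message. All names here are made up and this is not the
actual patch; whether the cap is restored in one step or grown back
gradually is a policy detail the sketch glosses over:]

#include <time.h>

struct pool {
        unsigned long max_size;    /* current cap on pooled pages */
        unsigned long hard_max;    /* configured upper bound */
        time_t last_shrink;        /* when the shrinker last ran */
};

/* Shrinker path: halve the cap so pages we just reclaimed are not
 * immediately cached again by the next allocation. */
static void pool_on_shrink(struct pool *p)
{
        p->max_size /= 2;
        p->last_shrink = time(NULL);
}

/* Allocation path: once the (arbitrary) one-second delay since the
 * last shrink has passed, restore the cap. */
static void pool_on_alloc(struct pool *p)
{
        if (time(NULL) - p->last_shrink >= 1)
                p->max_size = p->hard_max;
}
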
> >>>> Jérôme,
> >>>>
> >>>> I don't like this patch. It adds extra complexity and its usefulness is
> >>>> highly questionable.
> >>>> There are a number of caches in the system, and if all of them added
> >>>> some sort of voluntary shrink heuristics like this, we'd end up with
> >>>> impossible-to-debug unpredictable performance issues.
> >>>>
> >>>> We should let the memory subsystem decide when to reclaim pages from
> >>>> caches and what caches to reclaim them from.
> >>> Yeah, artificially preventing your cache from growing when your
> >>> shrinker gets called will just break the equal memory pressure the
> >>> core mm uses to rebalance between all caches when the workload
> >>> changes. In i915 we let everything grow without artificial bounds and
> >>> rely solely on the shrinker callbacks to ensure we don't consume more
> >>> than our fair share of available memory overall.
> >>> -Daniel
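
[For context, the mechanism Daniel is referring to: instead of capping
the cache itself, a driver registers a shrinker with the core mm, which
calls back when the system needs memory. A sketch using the stock
kernel shrinker API of the time; the two my_cache_* helpers are
hypothetical:]

#include <linux/shrinker.h>

static unsigned long my_cache_count(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        /* Tell the mm how many objects we could free right now. */
        return my_cache_object_count();               /* hypothetical */
}

static unsigned long my_cache_scan(struct shrinker *shrink,
                                   struct shrink_control *sc)
{
        /* Free up to sc->nr_to_scan objects, report how many went. */
        return my_cache_free_objects(sc->nr_to_scan); /* hypothetical */
}

static struct shrinker my_cache_shrinker = {
        .count_objects = my_cache_count,
        .scan_objects  = my_cache_scan,
        .seeks         = DEFAULT_SEEKS,
};

/* at driver init: */
register_shrinker(&my_cache_shrinker);
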
> >> Now that you bring up i915 memory usage, Daniel,
> >> I can't refrain from bringing up the old issue of unreclaimable
> >> kernel memory allocated on behalf of user-space, for which gem open
> >> is a good example ;) Each time user-space opens a gem handle, some
> >> unreclaimable kernel memory is allocated for which there is no
> >> accounting, so in theory a user can render a system unusable this way.
> >>
> >> Typically there are various limits on unreclaimable objects like
> >> this, such as the cap on open file descriptors, and IIRC the kernel
> >> even has an internal limit on the number of struct files you can
> >> initialize, based on the available system memory, so dma-buf / prime
> >> should already have some sort of protection.
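
[A hypothetical sketch of the kind of accounting being asked for here:
charge the per-handle kernel memory against a cap derived from system
memory, analogous to the kernel's internal struct file limit (see
get_max_files()). None of these names exist in the DRM code:]

static atomic_long_t gem_handle_mem;  /* bytes of unreclaimable memory */
static long gem_handle_mem_max;       /* derived from total RAM at boot */

static int gem_account_handle(long bytes)
{
        if (atomic_long_add_return(bytes, &gem_handle_mem) >
            gem_handle_mem_max) {
                atomic_long_sub(bytes, &gem_handle_mem);
                return -ENOMEM;  /* fail the open, don't OOM the box */
        }
        return 0;
}
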
> > Oh yeah, we have no cgroup limits or similar for gem allocations, so
> > there's not really a sane way to isolate gpu memory usage for
> > specific processes. But there are also no limits on actual gpu usage
> > itself (timeslices or whatever), so I guess no one has asked for this
> > yet.
> 
> In its simplest form (as in TTM, if correctly implemented by drivers),
> this type of accounting stops non-privileged malicious GPU users from
> exhausting all system physical memory and causing grief for other
> kernel subsystems, but not from causing grief for other GPU users. I
> think that's the minimum level intended, for example, for the struct
> file accounting as well.
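
[For reference, the TTM accounting Thomas mentions works roughly like
this: persistent allocations are charged against a global limit derived
from available system memory before the real allocation is made. A
sketch assuming the ttm_mem_global_alloc()/ttm_mem_global_free()
interface of that era; error handling trimmed:]

int driver_alloc_persistent(struct ttm_mem_global *glob, uint64_t size)
{
        int ret;

        /* Fails once the global limit is reached, instead of letting
         * a malicious client exhaust physical memory. */
        ret = ttm_mem_global_alloc(glob, size, false, true);
        if (ret)
                return ret;

        /* ... do the real allocation; if it fails, undo the charge
         * with ttm_mem_global_free(glob, size); ... */
        return 0;
}
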
> 
> > My comment really was about balancing mm users under the assumption that
> > they're all unlimited.
> 
> Yeah, sorry for stealing the thread. I usually bring this up now and
> again but nowadays with an exponential backoff.

Yeah, I agree we're missing some good limits infrastructure in i915 and
DRM in general.  Chris started looking at this a while back, but I
haven't seen anything recently.  Tying into ulimits/rlimits might make
sense, and at the very least we need to account for things properly so
we can add new limits where needed.

-- 
Jesse Barnes, Intel Open Source Technology Center

