[Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

Fri Jun 22 16:42:43 UTC 2018

[Resnding with the CC list fixed]

On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > Hi,
> > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > what I need basically. It might be completely wrong but I would like
> > > > > to discuss what would be a better way if that is the case.
> > > > > 
> > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > it down. If you think I have forgot somebody, please let me know
> > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > >         mo->attached = false;
> > > > >  }
> > > > >  
> > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >                                                        struct mm_struct *mm,
> > > > >                                                        unsigned long start,
> > > > > -                                                      unsigned long end)
> > > > > +                                                      unsigned long end,
> > > > > +                                                      bool blockable)
> > > > >  {
> > > > >         struct i915_mmu_notifier *mn =
> > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >         LIST_HEAD(cancelled);
> > > > >  
> > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > -               return;
> > > > > +               return 0;
> > > > 
> > > > The principle wait here is for the HW (even after fixing all the locks
> > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > access).
> > > 
> > > Is this wait bound or it can take basically arbitrary amount of time?
> > 
> > Arbitrary amount of time but in desktop use case you can assume that
> > it should never go above 16ms for a 60frame per second rendering of
> > your desktop (in GPU compute case this kind of assumption does not
> > hold). Is the process exit_state already updated by the time this mmu
> > notifier callbacks happen ?
> 
> What do you mean? The process is killed (by SIGKILL) at the time but we
> do not know much more than that. The task might be stuck anywhere in the
> kernel before handling that signal.
> 
> > > > The first pass would be then to not do anything here if
> > > > !blockable.
> > > 
> > > something like this? (incremental diff)
> > 
> > What i wanted to do with HMM and mmu notifier is split the invalidation
> > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > depends on the range and invalidate internal driver states (like clear
> > buffer object pages array in case of GPU but not GPU page table). While
> > the second callback would do the actual wait on the GPU to be done and
> > update the GPU page table.
> 
> What can you do after the first phase? Can I unmap the range?
> 
> > Now in this scheme in case the task is already in some exit state and
> > that all CPU threads are frozen/kill then we can probably find a way to
> > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > share userptr bo hence a uptr bo should only ever be access through
> > ioctl submited by the process.
> > 
> > The second call can then be delayed and ping from time to time to see
> > if GPU jobs are done.
> > 
> > 
> > Note that what you propose might still be useful as in case there is
> > no buffer object for a range then OOM can make progress in freeing a
> > range of memory. It is very likely that significant virtual address
> > range of a process and backing memory can be reclaim that way. This
> > assume OOM reclaim vma by vma or in some form of granularity like
> > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > return range that are blocking that way OOM can reclaim around.
> 
> Exactly my point. What we have right now is all or nothing which is
> obviously too coarse to be useful.

-- 
Michal Hocko
SUSE Labs