[PATCH v2 00/25] AMDKFD kernel driver

Wed Jul 23 13:59:31 PDT 2014

On Mon, 21 Jul 2014 19:05:46 +0200
daniel at ffwll.ch (Daniel Vetter) wrote:

> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> > > On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
> > > > Am 21.07.2014 14:36, schrieb Oded Gabbay:
> > > > >On 20/07/14 20:46, Jerome Glisse wrote:

[snip!!]

> > > > 
> > > > The main questions here are whether pinning down the memory is avoidable,
> > > > and whether the memory is pinned at driver load, by request from
> > > > userspace, or by anything else.
> > > > 
> > > > As far as I can see only the "mqd per userspace queue" might be a bit
> > > > questionable, everything else sounds reasonable.
> > > 
> > > Aside, i915 perspective again (i.e. how we solved this): When scheduling
> > > away from contexts we unpin them and put them into the lru. And in the
> > > shrinker we have a last-ditch callback to switch to a default context
> > > (since you can't ever have no context once you've started) which means we
> > > can evict any context object if it's getting in the way.
> > 
> > So Intel hardware reports, through some interrupt or other channel, when it
> > is not using a context? I.e. the kernel side gets a notification when some
> > user context is done executing?
> 
> Yes, as long as we do the scheduling with the cpu we get interrupts for
> context switches. The mechanism is already published in the execlist
> patches currently floating around. We get a special context switch
> interrupt.
> 
> But we have this unpin logic already in the current code where we switch
> contexts through in-line cs commands from the kernel. There we obviously
> use the normal batch completion events.

Yeah and we can continue that going forward.  And of course if your hw
can do page faulting, you don't need to pin the normal data buffers.

Usually there are some special buffers that need to be pinned for
longer periods though, anytime the context could be active.  Sounds
like in this case the userland queues, which makes some sense.  But
maybe for smaller systems the size limit could be clamped to something
smaller than 128M.  Or tie it into the rlimit somehow, just like we do
for mlock() stuff.
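
To make the clamping idea concrete, here is a minimal userspace sketch,
not driver code. It assumes the pinned-queue budget is the smaller of the
128M figure above and the caller's RLIMIT_MEMLOCK. The names QUEUE_PIN_CAP,
pin_budget() and current_memlock_limit() are made up for illustration; only
getrlimit() and RLIMIT_MEMLOCK are real POSIX interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/resource.h>

/* Hypothetical hard cap, taken from the 128M figure discussed above. */
#define QUEUE_PIN_CAP (128ull << 20)

/*
 * How many more bytes may be pinned, given the memlock limit and what
 * is already pinned. Returns 0 once the budget is used up.
 */
static uint64_t pin_budget(uint64_t memlock_limit, uint64_t already_pinned)
{
    uint64_t cap = memlock_limit < QUEUE_PIN_CAP ? memlock_limit
                                                 : QUEUE_PIN_CAP;

    assert(cap <= QUEUE_PIN_CAP);
    return already_pinned >= cap ? 0 : cap - already_pinned;
}

/* Read the soft RLIMIT_MEMLOCK; "unlimited" maps to UINT64_MAX. */
static uint64_t current_memlock_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return 0; /* be conservative if the limit cannot be read */
    return rl.rlim_cur == RLIM_INFINITY ? UINT64_MAX
                                        : (uint64_t)rl.rlim_cur;
}
```

A caller would do something like pin_budget(current_memlock_limit(), pinned)
before pinning another queue. A driver would consult the task's rlimit
from kernel context instead, but the arithmetic would be the same.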

> > The issue with radeon hardware, AFAICT, is that the hardware does not report
> > anything about the userspace context that is running, i.e. you do not get a
> > notification when a context is not in use. Well, AFAICT. Maybe the hardware
> > does provide that.
> 
> I'm not sure whether we can do the same trick with the hw scheduler. But
> then unpinning hw contexts will drain the pipeline anyway, so I guess we
> can just stop feeding the hw scheduler until it runs dry. And then unpin
> and evict.

Yeah we should have an idea which contexts have been fed to the
scheduler, at least with kernel based submission.  With userspace
submission we'll be in a tougher spot...  but as you say we can always
idle things and unpin everything under pressure.  That's a really big
hammer to apply though.

> > Like the VMID is a limited resource, so you have to bind them dynamically.
> > Maybe we can allocate a pinned buffer only for each VMID, and then when
> > binding a PASID to a VMID it also copies the pinned buffer back to the
> > PASID's unpinned copy.
> 
> Yeah, pasid assignment will be fun. Not sure whether Jesse's patches will
> do this already. We _do_ already have fun with ctx id assignments though
> since we move them around (and the hw id is the ggtt address afaik). So we
> need to remap them already. Not sure on the details for pasid mapping,
> iirc it's a separate field somewhere in the context struct. Jesse knows
> the details.

The PASID space is a bit bigger, 20 bits iirc.  So we probably won't
run out quickly or often.  But when we do I thought we could apply the
same trick Linux uses for ASID management on SPARC and ia64 (iirc on
sparc anyway, maybe MIPS too): "allocate" a PASID every time you need
one, but don't tie it to the process at all, just use it as a counter
that lets you know when you need to do a full TLB flush, then start the
allocation process over.  This lets you minimize TLB flushing and
gracefully handles oversubscription.
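
Roughly, the generation-counter scheme looks like the sketch below. This is
only an illustration under stated assumptions, not what my patches do: all
names (pasid_allocator, ctx, pasid_get, flush_all_tlbs) are hypothetical,
PASID 0 is assumed to be reserved, the flush is just a counter, and there
is no locking or handling of contexts that are still running when the
space wraps.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_FIRST 1u /* assumption: PASID 0 is reserved */

struct pasid_allocator {
    uint64_t generation; /* bumped every time the PASID space wraps */
    uint32_t next;       /* next unused PASID in this generation */
    uint32_t max;        /* highest valid PASID, e.g. (1u << 20) - 1 */
    unsigned flushes;    /* stand-in for real TLB flushes */
};

struct ctx {
    uint64_t generation; /* generation the PASID below belongs to */
    uint32_t pasid;
};

static void pasid_allocator_init(struct pasid_allocator *a, uint32_t max)
{
    assert(max >= PASID_FIRST);
    a->generation = 1; /* a zeroed ctx (generation 0) never matches */
    a->next = PASID_FIRST;
    a->max = max;
    a->flushes = 0;
}

/* Placeholder: a driver would invalidate all device TLBs here. */
static void flush_all_tlbs(struct pasid_allocator *a)
{
    a->flushes++;
}

/* Return a PASID valid in the current generation for this context. */
static uint32_t pasid_get(struct pasid_allocator *a, struct ctx *c)
{
    if (c->generation == a->generation)
        return c->pasid; /* still valid, no flush needed */

    if (a->next > a->max) {
        /* Space exhausted: start over and flush once. */
        a->generation++;
        a->next = PASID_FIRST;
        flush_all_tlbs(a);
    }

    c->pasid = a->next++;
    c->generation = a->generation;
    return c->pasid;
}
```

The point of the design is that a PASID is never freed individually; a
stale context simply picks up a new one the next time it is scheduled, and
the only flush cost is one full flush per wrap of the PASID space.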

My current code doesn't bother though; context creation will fail if we
run out of PASIDs on a given device.

-- 
Jesse Barnes, Intel Open Source Technology Center