[PATCH v2 00/25] AMDKFD kernel driver

Thu Jul 24 08:44:45 PDT 2014

On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
> On 24/07/14 00:46, Bridgman, John wrote:
> > 
> >> -----Original Message----- From: dri-devel
> >> [mailto:dri-devel-bounces at lists.freedesktop.org] On Behalf Of Jesse
> >> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
> >> dri-devel at lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
> >> AMDKFD kernel driver
> >> 
> >> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
> >> Vetter) wrote:
> >> 
> >>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> >>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> >>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian K?nig
> >>>>> wrote:
> >>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >> 
> >> [snip!!]
> > My BlackBerry thumb thanks you ;)
> >> 
> >>>>>> 
> >>>>>> The main questions here are if it's avoid able to pin down
> >>>>>> the memory and if the memory is pinned down at driver load,
> >>>>>> by request from userspace or by anything else.
> >>>>>> 
> >>>>>> As far as I can see only the "mqd per userspace queue"
> >>>>>> might be a bit questionable, everything else sounds
> >>>>>> reasonable.
> >>>>> 
> >>>>> Aside, i915 perspective again (i.e. how we solved this):
> >>>>> When scheduling away from contexts we unpin them and put them
> >>>>> into the lru. And in the shrinker we have a last-ditch
> >>>>> callback to switch to a default context (since you can't ever
> >>>>> have no context once you've started) which means we can evict
> >>>>> any context object if it's
> >> getting in the way.
> >>>> 
> >>>> So Intel hardware report through some interrupt or some channel
> >>>> when it is not using a context ? ie kernel side get
> >>>> notification when some user context is done executing ?
> >>> 
> >>> Yes, as long as we do the scheduling with the cpu we get
> >>> interrupts for context switches. The mechanic is already
> >>> published in the execlist patches currently floating around. We
> >>> get a special context switch interrupt.
> >>> 
> >>> But we have this unpin logic already on the current code where
> >>> we switch contexts through in-line cs commands from the kernel.
> >>> There we obviously use the normal batch completion events.
> >> 
> >> Yeah and we can continue that going forward.  And of course if your
> >> hw can do page faulting, you don't need to pin the normal data
> >> buffers.
> >> 
> >> Usually there are some special buffers that need to be pinned for
> >> longer periods though, anytime the context could be active.  Sounds
> >> like in this case the userland queues, which makes some sense.  But
> >> maybe for smaller systems the size limit could be clamped to
> >> something smaller than 128M.  Or tie it into the rlimit somehow,
> >> just like we do for mlock() stuff.
> >> 
> > Yeah, even the queues are in pageable memory, it's just a ~256 byte
> > structure per queue (the Memory Queue Descriptor) that describes the
> > queue to hardware, plus a couple of pages for each process using HSA
> > to hold things like doorbells. Current thinking is to limit #
> > processes using HSA to ~256 and #queues per process to ~1024 by
> > default in the initial code, although my guess is that we could take
> > the #queues per process default limit even lower.
> > 
> 
> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
> on 256 boundary.
> I had in mind to reserve 64MB of gart by default, which translates to
> 512 queues per process, with 128 processes. Add 2 kernel module
> parameters, # of max-queues-per-process and # of max-processes (default
> is, as I said, 512 and 128) for better control of system admin.
> 

So as i said somewhere else in this thread, this should not be reserved
but use a special allocation. Any HSA GPU use virtual address space for
userspace so only issue is for kernel side GTT.

What i would like is seeing radeon pinned GTT allocation at bottom of
GTT space (ie all ring buffer and the ib pool buffer). Then have an
allocator that allocate new queue from top of GTT address space and
grow to the bottom.

It should not staticly reserved 64M or anything. When doing allocation
it should move any ttm buffer that are in the region it want to allocate
to a different location.

As this needs some work, i am not against reserving some small amount
(couple MB) as a first stage but anything more would need a proper solution
like the one i just described.

Cheers,
Jérôme