[PATCH v2 00/25] AMDKFD kernel driver

Thu Jul 24 10:35:53 PDT 2014

On Thu, Jul 24, 2014 at 11:44 AM, Jerome Glisse <j.glisse at gmail.com> wrote:
> On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
>> On 24/07/14 00:46, Bridgman, John wrote:
>> >
>> >> -----Original Message----- From: dri-devel
>> >> [mailto:dri-devel-bounces at lists.freedesktop.org] On Behalf Of Jesse
>> >> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
>> >> dri-devel at lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
>> >> AMDKFD kernel driver
>> >>
>> >> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
>> >> Vetter) wrote:
>> >>
>> >>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
>> >>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
>> >>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian K?nig
>> >>>>> wrote:
>> >>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>> >>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>> >>
>> >> [snip!!]
>> > My BlackBerry thumb thanks you ;)
>> >>
>> >>>>>>
>> >>>>>> The main questions here are if it's avoid able to pin down
>> >>>>>> the memory and if the memory is pinned down at driver load,
>> >>>>>> by request from userspace or by anything else.
>> >>>>>>
>> >>>>>> As far as I can see only the "mqd per userspace queue"
>> >>>>>> might be a bit questionable, everything else sounds
>> >>>>>> reasonable.
>> >>>>>
>> >>>>> Aside, i915 perspective again (i.e. how we solved this):
>> >>>>> When scheduling away from contexts we unpin them and put them
>> >>>>> into the lru. And in the shrinker we have a last-ditch
>> >>>>> callback to switch to a default context (since you can't ever
>> >>>>> have no context once you've started) which means we can evict
>> >>>>> any context object if it's
>> >> getting in the way.
>> >>>>
>> >>>> So Intel hardware report through some interrupt or some channel
>> >>>> when it is not using a context ? ie kernel side get
>> >>>> notification when some user context is done executing ?
>> >>>
>> >>> Yes, as long as we do the scheduling with the cpu we get
>> >>> interrupts for context switches. The mechanic is already
>> >>> published in the execlist patches currently floating around. We
>> >>> get a special context switch interrupt.
>> >>>
>> >>> But we have this unpin logic already on the current code where
>> >>> we switch contexts through in-line cs commands from the kernel.
>> >>> There we obviously use the normal batch completion events.
>> >>
>> >> Yeah and we can continue that going forward.  And of course if your
>> >> hw can do page faulting, you don't need to pin the normal data
>> >> buffers.
>> >>
>> >> Usually there are some special buffers that need to be pinned for
>> >> longer periods though, anytime the context could be active.  Sounds
>> >> like in this case the userland queues, which makes some sense.  But
>> >> maybe for smaller systems the size limit could be clamped to
>> >> something smaller than 128M.  Or tie it into the rlimit somehow,
>> >> just like we do for mlock() stuff.
>> >>
>> > Yeah, even the queues are in pageable memory, it's just a ~256 byte
>> > structure per queue (the Memory Queue Descriptor) that describes the
>> > queue to hardware, plus a couple of pages for each process using HSA
>> > to hold things like doorbells. Current thinking is to limit #
>> > processes using HSA to ~256 and #queues per process to ~1024 by
>> > default in the initial code, although my guess is that we could take
>> > the #queues per process default limit even lower.
>> >
>>
>> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
>> on 256 boundary.
>> I had in mind to reserve 64MB of gart by default, which translates to
>> 512 queues per process, with 128 processes. Add 2 kernel module
>> parameters, # of max-queues-per-process and # of max-processes (default
>> is, as I said, 512 and 128) for better control of system admin.
>>
>
> So as i said somewhere else in this thread, this should not be reserved
> but use a special allocation. Any HSA GPU use virtual address space for
> userspace so only issue is for kernel side GTT.
>
> What i would like is seeing radeon pinned GTT allocation at bottom of
> GTT space (ie all ring buffer and the ib pool buffer). Then have an
> allocator that allocate new queue from top of GTT address space and
> grow to the bottom.
>
> It should not staticly reserved 64M or anything. When doing allocation
> it should move any ttm buffer that are in the region it want to allocate
> to a different location.
>
>
> As this needs some work, i am not against reserving some small amount
> (couple MB) as a first stage but anything more would need a proper solution
> like the one i just described.

It's still a trade off.  Even if we reserve a couple of megs it'll be
wasted if we are not running HSA apps. And even today if we run a
compute job using the current interfaces we could end up in the same
case.  So while I think it's definitely a good goal to come up with
some solution for fragmentation, I don't think it should be a
show-stopper right now.

A better solution to deal with fragmentation of GTT and provide a
better way to allocate larger buffers in vram would be to break up
vram <-> system pool transfers into multiple transfers depending on
the available GTT size.  Or use GPUVM dynamically for  vram <-> system
transfers.

Alex