New KFD ioctls: taking the skeletons out of the closet

Wed Mar 7 19:55:20 UTC 2018

On Wed, Mar 7, 2018 at 11:38 AM, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Wed, Mar 07, 2018 at 09:38:03AM +0100, Christian K??nig wrote:
>> Am 07.03.2018 um 00:09 schrieb Dave Airlie:
>> > On 7 March 2018 at 08:44, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> > > Hi all,
>> > >
>> > > Christian raised two potential issues in a recent KFD upstreaming code
>> > > review that are related to the KFD ioctl APIs:
>> > >
>> > >   1. behaviour of -ERESTARTSYS
>> > >   2. transactional nature of KFD ioctl definitions, or lack thereof
>> > >
>> > > I appreciate constructive feedback, but I also want to encourage an
>> > > open-minded rather than a dogmatic approach to API definitions. So let
>> > > me take all the skeletons out of my closet and get these APIs reviewed
>> > > in the appropriate forum before we commit to them upstream. See the end
>> > > of this email for reference.
>> > >
>> > > The controversial part at this point is kfd_ioctl_map_memory_to_gpu. If
>> > > any of the other APIs raise concerns or questions, please ask.
>> > >
>> > > Because of the HSA programming model, KFD memory management APIs are
>> > > synchronous. There is no pipelining. Command submission to GPUs through
>> > > user mode queues does not involve KFD. This means KFD doesn't know what
>> > > memory is used by the GPUs and when it's used. That means, when the
>> > > map_memory_to_gpu ioctl returns to user mode, all memory mapping
>> > > operations are complete and the memory can be used by the CPUs or GPUs
>> > > immediately.
>> > I've got a few opinions, but first up I still dislike user-mode queues
>> > and everything
>> > they entail. I still feel they are solving a Windows problem and not a
>> > Linux problem,
>> > and it would be nice if we had some Linux numbers on what they gain us over
>> > a dispatch ioctl, because they sure bring a lot of memory management issues.
>>
>> Well user-mode queues are a problem as long as you don't have recoverable
>> page faults on the GPU.
>>
>> As soon as you got recoverable page faults and push the memory management
>> towards things like HMM I don't see an advantage of using a IOCTL based
>> command submission any more.
>>
>> So I would say that this is a problem which is slowly going away as the
>> hardware improves.
>
> Yeah, but up to the point where the hw actually works (instead of promises
> that maybe it'll work next generation, trust us, for like a few
> generations) it's much easier to hack up an ioctl with workarounds than
> intercepting an mmap write fault all the time (those are slower than
> ioctls).
>
> I think userspace queues are fine once we have known-working hw. Before
> that I'm kinda agreeing with Dave and not seeing the point. At least to my
> knowledge we still haven't arrived in the wonderful promised land of hw
> recoverable (well, restartable really) page faults on any vendors platform
> ...

I think user space queues are a bit of a distraction.  The original
point of KFD and HSA was to have a consistent programming model across
CPU and other devices with relatively seamless access to the same
memory pools.  KFD was originally focused on APUs and when we have an
IOMMUv2 with ATC available, we have support for recoverable page
faults.  It's been working for 3 generations of hw and has been
expanded to GPUVM on newer hw which doesn't have the dependency on
IOMMU and also support vram.  We added support for KFD for older dGPUs
that don't have this capability, but that is certainly not the only
use case we need to consider.

Alex

>
>> > That said amdkfd is here.
>> >
>> > The first question you should ask (which you haven't asked here at all) is
>> > what should userspace do with the ioctl result.
>> >
>> > > HSA also uses a shared virtual memory model, so typically memory gets
>> > > mapped on multiple GPUs and CPUs at the same virtual address.
>> > >
>> > > The point of contention seems to be the ability to map memory to
>> > > multiple GPUs in a single ioctl and the behaviour in failure cases. I'll
>> > > discuss two main failure cases:
>> > >
>> > > 1: Failure after all mappings have been dispatched via SDMA, but a
>> > > signal interrupts the wait for completion and we return -ERESTARTSYS.
>> > > Documentation/kernel-hacking/hacking.rst only says "[...] you should be
>> > > prepared to process the restart, e.g. if you're in the middle of
>> > > manipulating some data structure." I think we do that by ensuring that
>> > > memory that's already mapped won't be mapped again. So the restart will
>> > > become a no-op and just end up waiting for all the previous mappings to
>> > > complete.
>> > -ERESTARTSYS at that late stage points to a badly synchronous API,
>> > I'd have said you should have two ioctls, one that returns a fence after
>> > starting the processes, and one that waits for the fence separately.
>>
>> That is exactly what I suggested as well, but also exactly what Felix tries
>> to avoid :)
>>
>> > The overhead of ioctls isn't your enemy until you've measured it and
>> > proven it's a problem.
>> >
>> > > Christian has a stricter requirement, and I'd like to know where that
>> > > comes from: "An interrupted IOCTL should never have a visible effect."
>> > Christian might be taking things a bit further but synchronous gpu access
>> > APIs are bad, but I don't think undoing a bunch of work is a good plan either
>> > just because you got ERESTARTSYS. If you get ERESTARTSYS can you
>> > handle it, if I've fired off 5 SDMAs and wait for them will I fire off 5 more?
>> > will I wait for the original SDMAs if I reenter?
>>
>> Well it's not only the waiting for the SDMAs. If I understood it correctly
>> the IOCTL proposed by Felix allows adding multiple mappings of buffer
>> objects on multiple devices with just one IOCTL.
>>
>> Now the problem is without a lot of redesign of the driver this can fail at
>> any place in between those operations. E.g. we could run out of memory or
>> hit a permission restriction or an invalid handle etc.. etc...
>>
>> What would happen is that we end up with a halve complete IOCTL.
>>
>> A possible solution might be that we could maybe add some kind of feedback
>> noting which operations are already complete and then only retrying the one
>> which failed.
>
> Atomic ioctl behaviour is hard. Like reeeeeeaaaaaaaaaaalllllllly hard.
>
> Look at atomic kms if you don't believe, or the v4l equivalent, and that
> doesn't even try to do cross device atomic. Also, it explicitly isn't
> atomic wrt memory management stuff (like pinning scanout buffers into
> vram), because that was too hard - we simply try to pin and then roll back
> if it happens to not work out and apologize to userspace for the mess.
>
> Except when your career plan is to spend the next few decades on
> prototyping this as an R&D project, I recommend to not try :-)
>
>> > > 2: Failure to map on some but not all GPUs. This comes down to the
>> > > question, do all ioctl APIs or system calls in general need to be
>> > > transactional? As a counter example I'd give incomplete read or write
>> > > system calls that return how much was actually read or written. Our
>> > > current implementation of map_memory_to_gpu doesn't do this, but it
>> > > could be modified to return to user mode how many of the mappings, or
>> > > which mappings specifically failed or succeeded.
>> > What should userspace do? if it only get mappings on 3 of the gpus instead
>> > of say 4? Is there a sane resolution other than calling the ioctl again with
>> > the single GPU? Would it drop the GPU from the working set from that point on?
>> >
>> > Need more info to do what can come out of the API doing incomplete
>> > operations.
>>
>> Felix argument that when a mapping operations fails the VM ranges in
>> question would have been undefined before and are undefined after that
>> operation failed as well.
>>
>> So we could just need to retry the operation until all of it succeeds, but
>> that feels kind of strange.
>
> +1 on make your gpu apis async, we have drm_syncobj/sync_file/dma_fence as
> a standard way for this now.
>
>> > > The alternative would be to break multi-GPU mappings, and the final wait
>> > > for completion, into multiple ioctl calls. That would result in
>> > > additional system call overhead. I'd argue that the end result is the
>> > > same for user mode, so I don't see why I'd use multiple ioctls over a
>> > > single one.
>> > Again stop worrying about ioctl overhead, this isn't Windows. If you
>> > can show the overhead as being a problem then address it, but I
>> > think it's premature worrying about it at this stage.
>>
>> Well you go from one IOCTL doing everything towards one IOCTL per device per
>> mapping which can be a huge difference.
>>
>> One the other hand we internally had exactly the same discussion when we
>> implemented support for partially resident textures. The result was that we
>> first implement it with individual IOCTLs and implement the mass mapping
>> IOCTL if we ever find an use case where we need it.
>>
>> So far we haven't found a use case for this mass mapping IOCTL.
>
> Aligns with my expectations/experience/planning for i915.ko stuff very
> much.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel