New KFD ioctls: taking the skeletons out of the closet

Mon Mar 12 19:37:11 UTC 2018

On Mon, Mar 12, 2018 at 7:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
> On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>>> can show the overhead as being a problem then address it, but I
>>> think it's premature worrying about it at this stage.
>> I'd like syscall overhead to be small. But with recent kernel page table
>> isolation, NUMA systems and lots of GPUs, I think this may not be
>> negligible. For example we're working with some Intel NUMA systems and 8
>> GPUs for HPC or deep learning applications. I'll be measuring the
>> overhead on such systems and get back with results in a few days. I want
>> to have an API that can scale to such applications.
>
> I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and 8
> Vega10 GPUs. The kernel was 4.16-rc1 based with KPTI enabled and a
> kernel config based on a standard Ubuntu kernel. No debug options were
> enabled. My test application measures KFD memory management API
> performance for allocating, mapping, unmapping and freeing 1000 buffers
> of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
> system memory). The impact of ioctl overhead depended on whether the
> page table update was done by CPU or SDMA.
>
> I averaged 10 runs of the application and also calculated the standard
> deviation to see if my results were just random noise.
>
> With SDMA, using a single ioctl was about 5% faster for mapping and 10%
> faster for unmapping. The standard deviation was 2.5% and 7.5% respectively.
>
> With CPU, a single ioctl was 2.5% faster for mapping and 18% faster for
> unmapping. The standard deviation was 0.2% and 3% respectively.

Btw, for statistics, Student's t-test is usually the measure to
tell "is this the same distribution or not". Works much more robustly
if you're dealing with odd shapes of your measured distributions,
which can happen easily (e.g. if it bifurcates into a fast vs.
slow path or similar stuff).
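
To make the suggestion concrete, here is a rough sketch of such a
comparison using Welch's unequal-variance variant of the t-test, in
plain Python. The helper name welch_t and the timing values are made up
purely for illustration; they are not the measurements from the test
above. The test also assumes the per-run averages are roughly normally
distributed.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, dof) for Welch's unequal-variance t-test on two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance() is the sample variance (divides by n - 1).
    se_a = statistics.variance(a) / len(a)
    se_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, dof


# Hypothetical per-run averages in microseconds (NOT real measurements).
single_ioctl = [218.0, 221.5, 219.2, 223.1, 220.4]
ioctl_per_gpu = [229.8, 233.0, 231.4, 236.2, 230.9]

t, dof = welch_t(single_ioctl, ioctl_per_gpu)
print(f"t = {t:.2f}, dof = {dof:.1f}")
```

The resulting |t| is then compared against the critical value of the
t-distribution for dof degrees of freedom at the chosen significance
level (from a table, or from a stats package if one is at hand).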

Also for my understanding: This was 1 ioctl to map 1 buffer on 8 GPUs
vs. 8 ioctls, each mapping 1 buffer on 1 of the 8 GPUs?

Do we have benchmarks that show overall impact? I'm assuming that your
workloads won't spend all day long mapping/unmapping stuff, but also
will do some computing :-)

Can you also give numbers without KPTI? Afaiui AMD mostly doesn't need
it, and Intel will eventually fix it too, so this overhead should
disappear again. Just want to get a full picture here.
-Daniel

> For unmapping the difference was bigger than mapping because unmapping
> is faster to begin with, so the system call overhead is bigger in
> proportion. Mapping of a single buffer to 8 GPUs takes about 220us with
> SDMA or 190us with CPU with only minor dependence on buffer size and
> memory type. Unmapping takes about 35us with SDMA or 13us with CPU.
>
>>
>> Regards,
>>   Felix
>>
>>
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch