New KFD ioctls: taking the skeletons out of the closet

Mon Mar 12 20:20:20 UTC 2018

On 2018-03-12 03:37 PM, Daniel Vetter wrote:
> On Mon, Mar 12, 2018 at 7:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> On 2018-03-07 03:34 PM, Felix Kuehling wrote:
>>>> Again stop worrying about ioctl overhead, this isn't Windows. If you
>>>> can show the overhead as being a problem then address it, but I
>>>> think it's premature worrying about it at this stage.
>>> I'd like syscall overhead to be small. But with recent kernel page table
>>> isolation, NUMA systems and lots of GPUs, I think this may not be
>>> negligible. For example we're working with some Intel NUMA systems and 8
>>> GPUs for HPC or deep learning applications. I'll be measuring the
>>> overhead on such systems and get back with results in a few days. I want
>>> to have an API that can scale to such applications.
>> I ran some tests on a 2-socket Xeon E5-2680 v4 with 56 CPU threads and 8
>> Vega10 GPUs. The kernel was 4.16-rc1 based with KPTI enabled and a
>> kernel config based on a standard Ubuntu kernel. No debug options were
>> enabled. My test application measures KFD memory management API
>> performance for allocating, mapping, unmapping and freeing 1000 buffers
>> of different sizes (4K, 16K, 64K, 256K) and memory types (VRAM and
>> system memory). The impact of ioctl overhead depended on whether the
>> page table update was done by CPU or SDMA.
>>
>> I averaged 10 runs of the application and also calculated the standard
>> deviation to see if my results were just random noise.
>>
>> With SDMA, using a single ioctl was about 5% faster for mapping and 10%
>> faster for unmapping. The standard deviation was 2.5% and 7.5% respectively.
>>
>> With CPU, using a single ioctl was about 2.5% faster for mapping and 18%
>> faster for unmapping. The standard deviation was 0.2% and 3% respectively.
> btw, for statistics Student's t-distribution is usually the measure to
> tell "is this the same distribution or not". Works much more robustly
> if you're dealing with odd shapes of your measured distributions,
> which can happen easily (e.g. if it bifurcates into a fast vs.
> slowpath or similar stuff).
>
> Also for my understanding: This was 1 ioctl to map 1 buffer on 8 GPUs
> vs. 8 ioctls to map 1 buffer on 1 of the 8 GPUs each?

The task is the same in both cases: map one buffer on all 8 GPUs. In one
case it uses 9 ioctls (1 map call per GPU and 1 call to synchronize with
SDMA and flush GPU TLBs). In the other case it's 1 ioctl doing all those
things.
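
As a rough illustration of the 9-versus-1 ioctl comparison, the sketch
below models total time as fixed work plus a per-ioctl overhead and
derives the overhead implied by the figures quoted elsewhere in this
message. Two things are assumptions, not statements from the thread: that
the quoted absolute times are for the single-ioctl path, and that "X%
faster" means the multi-ioctl path takes X% longer. The function name is
made up for the example. Given the reported standard deviations, the
results are order-of-magnitude estimates at best.

```python
# Back-of-the-envelope model: total_time = work + n_ioctls * per_ioctl_overhead.
N_GPUS = 8
IOCTLS_MULTI = N_GPUS + 1   # 1 map call per GPU + 1 sync/TLB-flush call
IOCTLS_SINGLE = 1           # one call doing all of the above


def implied_overhead_us(single_time_us, faster_fraction):
    """Per-ioctl overhead (microseconds) implied by the simple model.

    ASSUMPTION: single_time_us is the single-ioctl time, and the
    multi-ioctl path takes single_time_us * (1 + faster_fraction).
    """
    extra_us = single_time_us * faster_fraction
    return extra_us / (IOCTLS_MULTI - IOCTLS_SINGLE)


# (approximate time in us, fraction faster) as quoted in this thread
cases = {
    "map, SDMA": (220, 0.05),
    "map, CPU": (190, 0.025),
    "unmap, SDMA": (35, 0.10),
    "unmap, CPU": (13, 0.18),
}

for name, (time_us, fraction) in cases.items():
    overhead = implied_overhead_us(time_us, fraction)
    print(f"{name}: roughly {overhead:.2f} us per extra ioctl")
```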

> Do we have benchmarks that show overall impact? I'm assuming that your
> workloads won't spend all day long mapping/unmapping stuff, but also
> will do some computing :-)

I don't. This was done with a microbenchmark. In real applications the
impact is going to be much smaller. I tested one application that I know
does a lot of memory mappings mixed in between computations (lulesh-cl
from https://github.com/AMDComputeLibraries/ComputeApps/). But it only
maps on one GPU, so the impact was minimal (maybe 1%) and probably not
statistically significant.

>
> Can you also give numbers without KPTI? Afaiui AMD mostly doesn't need
> it, and Intel will eventually fix it too, so this overhead should
> disappear again. Just want to get a full picture here.

Before I got time on the Intel system I ran less rigorous experiments on
an AMD Threadripper with KPTI off and KPTI forced on. I don't have exact
numbers from those tests. With KPTI off the ioctl overhead was not
measurable. With KPTI on it was about the same or slightly bigger than
on the Intel system.

Regards,
  Felix

> -Daniel
>
>> For unmapping the difference was bigger than mapping because unmapping
>> is faster to begin with, so the system call overhead is bigger in
>> proportion. Mapping of a single buffer to 8 GPUs takes about 220us with
>> SDMA or 190us with CPU with only minor dependence on buffer size and
>> memory type. Unmapping takes about 35us with SDMA or 13us with CPU.
>>
>>> Regards,
>>>   Felix
>>>
>>>