[PATCH v2 00/25] AMDKFD kernel driver

Oded Gabbay oded.gabbay at amd.com
Mon Jul 21 07:12:06 PDT 2014


On 21/07/14 16:39, Christian König wrote:
> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>> On 20/07/14 20:46, Jerome Glisse wrote:
>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>
>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>> restructured with a cleaner history and no totally-different-early-versions
>>>> of the code.
>>>>
>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>> are modifications to the radeon driver and 18 of them contain only amdkfd
>>>> code. No code is removed or even modified between patches; code is only added.
>>>>
>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>> generic hsa framework being implemented in the future and in that case, we
>>>> will adjust amdkfd to work within that framework.
>>>>
>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>> keep it as a separate driver from radeon. Therefore, the amdkfd code is
>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>> is the radeon driver. Having said that, we will probably need to move it
>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>> drivers.
>>>>
>>>> For people who like to review using git, the v2 patch set is located at:
>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>
>>>> Written by Oded Gabbay <oded.gabbay at amd.com>
>>>
>>> So, quick comments before I finish going over all the patches. There are many
>>> things that need more documentation, especially as right now there is no
>>> userspace I can go look at.
>> So, quick comments on some of your questions, but first of all, thanks for the
>> time you dedicated to reviewing the code.
>>>
>>> There are a few show stoppers. The biggest one is GPU memory pinning; this is
>>> a big no, and it would need serious arguments for any hope of convincing me on
>>> that side.
>> We only do GPU memory pinning for kernel objects. There are no userspace
>> objects that are pinned in GPU memory in our driver. If that is the case,
>> is it still a show stopper?
>>
>> The kernel objects are:
>> - pipelines (4 per device)
>> - mqd per hiq (only 1 per device)
>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done with
>> 256-byte alignment, so the total *possible* memory is 128MB.
>> - kernel queue (only 1 per device)
>> - fence address for kernel queue
>> - runlists for the CP (1 or 2 per device)
>
> The main questions here are whether it is avoidable to pin down the memory, and
> whether the memory is pinned down at driver load, by request from userspace, or
> by anything else.
>
> As far as I can see, only the "mqd per userspace queue" might be a bit
> questionable; everything else sounds reasonable.
>
> Christian.

Most of the pin-downs are done at device initialization.
The "mqd per userspace queue" pinning is done at userspace queue creation.
However, as I said, it has an upper limit of 128MB on KV, and considering the
2GB of local memory, I think it is OK.
The runlists are also pinned at userspace queue creation/deletion, but we only
have 1 or 2 runlists per device, so it is not that bad.

	Oded
>
>>>
>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>> stuff there.
>>>
>>> Given that this is not intended to be the final HSA API, AFAICT, I would say
>>> it is far better to avoid the whole kfd module and add ioctls to radeon. This
>>> would avoid crazy communication between radeon and kfd.
>>>
>>> The whole aperture business needs some serious explanation. Especially as
>>> you want to use userspace addresses, there is nothing to prevent a userspace
>>> program from allocating things at the addresses you reserve for LDS, scratch,
>>> ... The only sane way would be to move those LDS and scratch apertures inside
>>> the virtual address range reserved for the kernel (see the kernel memory map).
>>>
>>> The whole business of locking performance counters for exclusive per-process
>>> access is a big NO. Which leads me to the questionable usefulness of the
>>> userspace command ring.
>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>> find it analogous to a situation where a network maintainer NACKs a driver for
>> a network card because it is slower than a different network card. It doesn't
>> seem reasonable that this situation would happen. He would still put both
>> drivers in the kernel, because people want to use the H/W and its features.
>> So, I don't think this is a valid reason to NACK the driver.
>>
>>> I only see issues with that. First and foremost, I would need to see solid
>>> figures showing that a kernel ioctl or syscall has an overhead that is
>>> measurably higher, in any meaningful way, than a simple function call. I know
>>> the userspace command ring is a big marketing feature that pleases ignorant
>>> userspace programmers. But really, this only brings issues and absolutely no
>>> upside, AFAICT.
>> Really? You think that doing a context switch to kernel space, with all its
>> overhead, is _not_ more expensive than just calling a function in userspace
>> which only puts a buffer on a ring and writes a doorbell?
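
To make the comparison concrete, the userspace fast path amounts to roughly the
following sketch (illustrative only; these names are not the libhsakmt API, and
ring-wrap handling is omitted):

#include <stdint.h>
#include <string.h>

struct user_queue {
	void *ring_base;             /* ring buffer in ordinary userspace memory */
	uint32_t ring_size;          /* in bytes, power of two */
	volatile uint32_t *wptr;     /* write pointer shared with the HW */
	volatile uint32_t *doorbell; /* one page mapped from the doorbell BAR */
};

static void submit_packet(struct user_queue *q, const void *pkt, uint32_t len)
{
	uint32_t offset = *q->wptr % q->ring_size;

	memcpy((char *)q->ring_base + offset, pkt, len); /* place packet on the ring */
	*q->wptr += len;                                 /* advance the write pointer */
	*q->doorbell = *q->wptr;                         /* ring the doorbell; HW takes over */
}
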
>>>
>>> So I would rather see a very simple ioctl that writes the doorbell, and might
>>> do more than that in the case of ring/queue overcommit, where it would first
>>> have to wait for a free ring/queue to schedule stuff. This would also allow a
>>> sane implementation of things like performance counters, which could be
>>> acquired by the kernel for the duration of a job submitted by userspace. While
>>> still not optimal, this would be better than userspace locking.
>>>
>>>
>>> I might have more thoughts once I am done with all the patches.
>>>
>>> Cheers,
>>> Jérôme
>>>
>>>>
>>>> Original Cover Letter:
>>>>
>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>> for radeon-family GPUs.
>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc.) to share
>>>> system resources more effectively via HW features including shared pageable
>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>> in through the queues (aka rings).
>>>>
>>>> The code in this patch set is intended to serve both as a sample driver for
>>>> other HSA-compatible hardware devices and as a production driver for
>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>> each with connected GPUs, although the current implementation focuses on a
>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>> graphics driver (kgd).
>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>> interrupts, registers), while other functionality has been added
>>>> specifically for HSA compute (HW scheduler for virtualized compute rings).
>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>
>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>> and data structures are initialized for managing HSA device topology.
>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>> following points:
>>>>
>>>> - radeon_init (kfd_init)
>>>> - radeon_exit (kfd_fini)
>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>
>>>> During the probe and init processing per-device data structures are
>>>> established which connect to the associated graphics kernel driver. This
>>>> information is exposed to userspace via sysfs, along with a version number
>>>> allowing userspace to determine if a topology change has occurred while it
>>>> was reading from sysfs.
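
The intended userspace pattern for that version number is roughly the following
sketch (the sysfs path here is a placeholder; the real location is defined by
the topology patch):

#include <stdio.h>

/* Read the topology generation counter; the path is illustrative only. */
static unsigned long read_topology_generation(void)
{
	unsigned long gen = 0;
	FILE *f = fopen("/sys/class/kfd/topology/generation_id", "r");

	if (f) {
		if (fscanf(f, "%lu", &gen) != 1)
			gen = 0;
		fclose(f);
	}
	return gen;
}

void snapshot_topology(void)
{
	unsigned long before, after;

	do {
		before = read_topology_generation();
		/* ... parse the per-node sysfs entries here ... */
		after = read_topology_generation();
	} while (before != after); /* topology changed mid-read; retry */
}
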
>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>> kfd code since the interrupt block is shared between regular
>>>> graphics/compute and HSA compute subsystems in the GPU.
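
In rough shape, that interface is a function table of services kgd performs for
kfd, plus entry points kgd calls into kfd, along the lines of the sketch below
(the member names are placeholders, not the actual radeon_kfd.h definitions):

#include <linux/types.h>

struct kgd_dev;  /* opaque handle to the radeon (kgd) side */
struct kfd_dev;

struct kfd2kgd_calls {
	/* buffer management services the graphics driver performs for kfd */
	int  (*allocate_mem)(struct kgd_dev *kgd, size_t size, size_t alignment,
			     void **mem_obj);
	void (*free_mem)(struct kgd_dev *kgd, void *mem_obj);

	/* access to registers of the shared blocks owned by kgd */
	void (*write_register)(struct kgd_dev *kgd, u32 offset, u32 value);
};

/* interrupts flow the other way: kgd routes shared-block interrupts into kfd */
void kgd2kfd_interrupt(struct kfd_dev *kfd, const void *ih_ring_entry);
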
>>>>
>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>> is in the final stages of IP review and should be published in a separate
>>>> repo over the next few days.
>>>> The code operates in one of three modes, selectable via the sched_policy
>>>> module parameter:
>>>>
>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>> CP, and allows oversubscription (more queues than HW slots)
>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>> queues to HW slots by programming registers
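
The parameter itself is an ordinary module knob; a declaration sketch based on
the description above (the actual name and default handling in the patches may
differ):

#include <linux/module.h>
#include <linux/moduleparam.h>

static int sched_policy = 1; /* HW scheduling without oversubscription (current default) */
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
	"0 = HW scheduling with oversubscription, "
	"1 = HW scheduling without oversubscription (default), "
	"2 = no HW scheduling (debug/bringup only)");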
>>>>
>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>> has less test coverage than the other options. Default in the current code
>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>> most test coverage but we expect to change the default to "HW scheduling
>>>> with oversubscription" after further testing. This effectively removes the
>>>> HW limit on the number of work queues available to applications.
>>>>
>>>> Programs running on the GPU are associated with an address space through the
>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>> are partitioned (under control of the radeon kgd) between current
>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>> and by driver code if HW scheduling is not being used.
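
For the non-HW-scheduling case, the driver-side update is conceptually one
register write per VMID, along these lines (the register name, offset and write
helper are placeholders, not the real CIK definitions):

#include <linux/types.h>
#include <linux/bug.h>

#define VMID_PASID_MAPPING_BASE   0x0        /* placeholder register offset */
#define VMID_PASID_MAPPING_VALID  (1U << 31)

struct kfd_dev;
void kfd_write_reg(struct kfd_dev *kfd, u32 offset, u32 value); /* assumed helper */

static void set_pasid_for_vmid(struct kfd_dev *kfd, unsigned int vmid, u16 pasid)
{
	/* one of the 16 VMIDs; the current code splits them 8/8 between gfx and HSA */
	WARN_ON(vmid >= 16);

	kfd_write_reg(kfd, VMID_PASID_MAPPING_BASE + vmid * sizeof(u32),
		      pasid | VMID_PASID_MAPPING_VALID);
}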
>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>> mapped to userspace (each page mapped to only one user address space).
>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>> userspace code to safely manage work queues (rings) without requiring a
>>>> kernel call for every ring update.
>>>> The first step for an application process is to open the kfd device. Calls to
>>>> open create a kfd "process" structure only for the first thread of the
>>>> process. Subsequent open calls are checked to see if they are from processes
>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>> accessible to the GPU.
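
A sketch of that open() logic, with the lookup/create helpers assumed rather
than taken from the patches:

#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/err.h>

struct kfd_process;
struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm); /* assumed */
struct kfd_process *kfd_create_process(struct task_struct *task);         /* assumed */

static int kfd_open(struct inode *inode, struct file *filp)
{
	struct kfd_process *p;

	/* Only the first open() from a given mm_struct creates the per-process
	 * data (and allocates its PASID); later opens just find it again. */
	p = kfd_lookup_process_by_mm(current->mm);
	if (!p)
		p = kfd_create_process(current);
	if (IS_ERR(p))
		return PTR_ERR(p);

	filp->private_data = p;
	return 0;
}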
>>>> The next step is for the application to collect topology information via sysfs.
>>>> This gives userspace enough information to be able to identify specific
>>>> nodes (processors) in subsequent queue management calls. Application
>>>> processes can create queues on multiple processors, and processors support
>>>> queues from multiple processes.
>>>> At this point the application can create work queues in userspace memory and
>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>> queue slots so that commands written to the queues can be executed by the
>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>> is device-specific.
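
From the application's point of view, mapping a userspace queue onto a HW slot
boils down to a single ioctl on /dev/kfd; an illustrative sketch (the ioctl
request code and struct layout are placeholders, not the actual kfd_ioctl.h
contents):

#include <stdint.h>
#include <sys/ioctl.h>

struct create_queue_args {           /* hypothetical layout */
	uint64_t ring_base_address;  /* userspace ring buffer */
	uint64_t ring_size_bytes;
	uint32_t gpu_id;             /* topology node the queue runs on */
	uint32_t queue_id;           /* out: handle for later update/destroy */
};

/* placeholder request code, not the real one */
#define KFD_IOC_CREATE_QUEUE_SKETCH _IOWR('K', 1, struct create_queue_args)

int create_user_queue(int kfd_fd, void *ring, uint64_t size_bytes, uint32_t gpu_id)
{
	struct create_queue_args args = {
		.ring_base_address = (uintptr_t)ring,
		.ring_size_bytes   = size_bytes,
		.gpu_id            = gpu_id,
	};

	return ioctl(kfd_fd, KFD_IOC_CREATE_QUEUE_SKETCH, &args);
}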
>>>> Written by John Bridgman <John.Bridgman at amd.com>
>>>>
>>>>
>>>> Alexey Skidanov (1):
>>>>    amdkfd: Implement the Get Process Aperture IOCTL
>>>>
>>>> Andrew Lewycky (3):
>>>>    amdkfd: Add basic modules to amdkfd
>>>>    amdkfd: Add interrupt handling module
>>>>    amdkfd: Implement the Set Memory Policy IOCTL
>>>>
>>>> Ben Goz (8):
>>>>    amdkfd: Add queue module
>>>>    amdkfd: Add mqd_manager module
>>>>    amdkfd: Add kernel queue module
>>>>    amdkfd: Add module parameter of scheduling policy
>>>>    amdkfd: Add packet manager module
>>>>    amdkfd: Add process queue manager module
>>>>    amdkfd: Add device queue manager module
>>>>    amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>
>>>> Evgeny Pinchuk (3):
>>>>    amdkfd: Add topology module to amdkfd
>>>>    amdkfd: Implement the Get Clock Counters IOCTL
>>>>    amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>
>>>> Oded Gabbay (10):
>>>>    mm: Add kfd_process pointer to mm_struct
>>>>    drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>    drm/radeon/cik: Don't touch int of pipes 1-7
>>>>    drm/radeon: Report doorbell configuration to amdkfd
>>>>    drm/radeon: adding synchronization for GRBM GFX
>>>>    drm/radeon: Add radeon <--> amdkfd interface
>>>>    Update MAINTAINERS and CREDITS files with amdkfd info
>>>>    amdkfd: Add IOCTL set definitions of amdkfd
>>>>    amdkfd: Add amdkfd skeleton driver
>>>>    amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>
>>>>   CREDITS                                            |    7 +
>>>>   MAINTAINERS                                        |   10 +
>>>>   drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>   drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>   drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>   drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>   drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>   .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>   drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>   drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>   drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>   drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>   drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>   drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>   drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>   drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>   drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>   include/linux/mm_types.h                           |   14 +
>>>>   include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>   43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>   create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>
>>>> --
>>>> 1.9.1
>>>>
>>
>


