[PATCH 00/83] AMD HSA kernel driver

Daniel Vetter daniel at ffwll.ch
Sat Jul 12 04:10:32 PDT 2014


On Sat, Jul 12, 2014 at 11:24:49AM +0200, Christian König wrote:
> On 11.07.2014 23:18, Jerome Glisse wrote:
> >On Thu, Jul 10, 2014 at 10:51:29PM +0000, Gabbay, Oded wrote:
> >>On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> >>>On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> >>>>  This patch set implements a Heterogeneous System Architecture (HSA)
> >>>>  driver for radeon-family GPUs.
> >>>These are just quick comments on a few things. Given the size of this,
> >>>people will need time to review it.
> >>>>  HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to
> >>>>  share system resources more effectively via HW features including
> >>>>  shared pageable memory, userspace-accessible work queues, and
> >>>>  platform-level atomics. In addition to the memory protection
> >>>>  mechanisms in GPUVM and IOMMUv2, the Sea Islands family of GPUs
> >>>>  also performs HW-level validation of commands passed in through the
> >>>>  queues (aka rings).
> >>>>
> >>>>  The code in this patch set is intended to serve both as a sample
> >>>>  driver for other HSA-compatible hardware devices and as a
> >>>>  production driver for radeon-family processors. The code is
> >>>>  architected to support multiple CPUs each with connected GPUs,
> >>>>  although the current implementation focuses on a single
> >>>>  Kaveri/Berlin APU, and works alongside the existing radeon kernel
> >>>>  graphics driver (kgd).
> >>>>
> >>>>  AMD GPUs designed for use with HSA (Sea Islands and up) share some
> >>>>  hardware functionality between HSA compute and regular gfx/compute
> >>>>  (memory, interrupts, registers), while other functionality has been
> >>>>  added specifically for HSA compute (hw scheduler for virtualized
> >>>>  compute rings). All shared hardware is owned by the radeon graphics
> >>>>  driver, and an interface between kfd and kgd allows the kfd to make
> >>>>  use of those shared resources, while HSA-specific functionality is
> >>>>  managed directly by kfd by submitting packets into an HSA-specific
> >>>>  command queue (the "HIQ").
> >>>>
> >>>>  During kfd module initialization a char device node (/dev/kfd) is
> >>>>  created (surviving until module exit), with ioctls for queue
> >>>>  creation & management, and data structures are initialized for
> >>>>  managing HSA device topology.
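
Just to make the above concrete, a minimal sketch of how such a char device
node could be registered, assuming hypothetical kfd_ioctl()/kfd_open()
handlers implemented elsewhere (this is not lifted from the patches):

/*
 * Sketch only: expose /dev/kfd as a misc character device.
 * kfd_ioctl() and kfd_open() are assumed to exist elsewhere.
 */
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>

long kfd_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
int kfd_open(struct inode *inode, struct file *filep);

static const struct file_operations kfd_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = kfd_ioctl,    /* queue creation & management */
        .open           = kfd_open,     /* per-process setup */
};

static struct miscdevice kfd_device = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "kfd",                 /* shows up as /dev/kfd */
        .fops  = &kfd_fops,
};

int kfd_chardev_init(void)
{
        /* Node lives until misc_deregister() at module exit. */
        return misc_register(&kfd_device);
}
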
> >>>>  The rest of the initialization is driven by calls from the radeon
> >>>>  kgd at the following points:
> >>>>  - radeon_init (kfd_init)
> >>>>  - radeon_exit (kfd_fini)
> >>>>  - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >>>>  - radeon_driver_unload_kms (kfd_device_fini)
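
For the module-level call points in that list, the radeon side would
presumably end up looking roughly like this (kfd_init()/kfd_fini() names are
from the list above; the surrounding radeon code is elided and simplified):

/* Sketch of the module-level hooks; per-device probe/init hooks are
 * called analogously from radeon_driver_load_kms()/unload_kms(). */
#include <linux/module.h>

int kfd_init(void);     /* set up /dev/kfd and topology structures */
void kfd_fini(void);

static int __init radeon_init(void)
{
        int r;

        r = kfd_init();
        if (r)
                return r;

        /* ... existing radeon/drm module registration continues here ... */
        return 0;
}

static void __exit radeon_exit(void)
{
        kfd_fini();     /* tear down kfd before radeon goes away */
        /* ... existing radeon/drm unregistration ... */
}
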
> >>>>  During the probe and init processing per-device data structures are
> >>>>  established which connect to the associated graphics kernel driver.
> >>>>  This information is exposed to userspace via sysfs, along with a
> >>>>  version number allowing userspace to determine if a topology change
> >>>>  has occurred while it was reading from sysfs.
> >>>>
> >>>>  The interface between kfd and kgd also allows the kfd to request
> >>>>  buffer management services from kgd, and allows kgd to route
> >>>>  interrupt requests to kfd code since the interrupt block is shared
> >>>>  between regular graphics/compute and HSA compute subsystems in the
> >>>>  GPU.
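
The kfd<->kgd interface described here sounds like a pair of ops tables;
purely as an illustration (all structure and member names below are invented
for this sketch and do not claim to match the patchset):

/* Illustrative two-way interface between radeon (kgd) and kfd. */
#include <linux/types.h>

struct pci_dev;
struct kgd_dev;         /* opaque handle owned by radeon */
struct kfd_dev;         /* opaque handle owned by kfd */

/* Services kgd offers to kfd; shared hardware stays owned by radeon. */
struct kfd2kgd_calls {
        int  (*init_gtt_mem)(struct kgd_dev *kgd, size_t size,
                             void **mem_obj, u64 *gpu_addr);
        void (*free_gtt_mem)(struct kgd_dev *kgd, void *mem_obj);
        int  (*hqd_load)(struct kgd_dev *kgd, void *mqd,
                         u32 pipe, u32 queue);
};

/* Calls kgd makes into kfd, e.g. to forward shared interrupts. */
struct kgd2kfd_calls {
        struct kfd_dev *(*probe)(struct kgd_dev *kgd, struct pci_dev *pdev);
        void (*interrupt)(struct kfd_dev *kfd, const void *ih_ring_entry);
};
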
> >>>>  The kfd code works with an open source usermode library
> >>>>  ("libhsakmt") which is in the final stages of IP review and should
> >>>>  be published in a separate repo over the next few days.
> >>>>
> >>>>  The code operates in one of three modes, selectable via the
> >>>>  sched_policy module parameter:
> >>>>  - sched_policy=0 uses a hardware scheduler running in the MEC block
> >>>>    within CP, and allows oversubscription (more queues than HW slots)
> >>>>  - sched_policy=1 also uses HW scheduling but does not allow
> >>>>    oversubscription, so create_queue requests fail when we run out
> >>>>    of HW slots
> >>>>  - sched_policy=2 does not use HW scheduling, so the driver manually
> >>>>    assigns queues to HW slots by programming registers
> >>>>
> >>>>  The "no HW scheduling" option is for debug & new hardware bringup
> >>>>  only, so has less test coverage than the other options. Default in
> >>>>  the current code is "HW scheduling without oversubscription" since
> >>>>  that is where we have the most test coverage, but we expect to
> >>>>  change the default to "HW scheduling with oversubscription" after
> >>>>  further testing. This effectively removes the HW limit on the
> >>>>  number of work queues available to applications.
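
Presumably the module parameter boils down to something like the following;
the macro names here are guesses for illustration, not copied from the
patches:

/* Illustrative wiring of the three scheduling policies described above. */
#include <linux/module.h>
#include <linux/moduleparam.h>

#define KFD_SCHED_POLICY_HWS            0  /* HW scheduler, oversubscription */
#define KFD_SCHED_POLICY_HWS_NO_OVER    1  /* HW scheduler, no oversub. */
#define KFD_SCHED_POLICY_NO_HWS         2  /* driver programs HW slots */

/* Current default per the cover letter: HWS without oversubscription. */
static int sched_policy = KFD_SCHED_POLICY_HWS_NO_OVER;
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
        "0 = HWS with oversubscription, 1 = HWS without oversubscription, "
        "2 = no HWS (debug/bringup only)");
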
> >>>>  Programs running on the GPU are associated with an address space
> >>>>  through the VMID field, which is translated to a unique PASID at
> >>>>  access time via a set of 16 VMID-to-PASID mapping registers. The
> >>>>  available VMIDs (currently 16) are partitioned (under control of
> >>>>  the radeon kgd) between current gfx/compute and HSA compute, with
> >>>>  each getting 8 in the current code. The VMID-to-PASID mapping
> >>>>  registers are updated by the HW scheduler when used, and by driver
> >>>>  code if HW scheduling is not being used.
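
For the 8/8 split described above, the bookkeeping could look roughly like
this (constants and helper are hypothetical; the actual register programming
is elided):

/* Hypothetical sketch of the VMID partitioning described above. */
#include <linux/types.h>

#define KFD_NUM_VMIDS           16      /* total VMIDs in the hardware */
#define KFD_FIRST_COMPUTE_VMID  8       /* VMIDs 8..15 go to HSA compute */

/* VMIDs 0..7 stay with regular gfx/compute under the radeon kgd. */
static inline bool kfd_vmid_is_hsa(unsigned int vmid)
{
        return vmid >= KFD_FIRST_COMPUTE_VMID && vmid < KFD_NUM_VMIDS;
}

/*
 * With sched_policy=2 the driver would program the VMID-to-PASID mapping
 * register for each HSA VMID itself; with the HW scheduler enabled, the
 * MEC firmware updates the mapping.
 */
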
> >>>>  The Sea Islands compute queues use a new "doorbell" mechanism
> >>>>  instead of the earlier kernel-managed write pointer registers.
> >>>>  Doorbells use a separate BAR dedicated for this purpose, and pages
> >>>>  within the doorbell aperture are mapped to userspace (each page
> >>>>  mapped to only one user address space). Writes to the doorbell
> >>>>  aperture are intercepted by GPU hardware, allowing userspace code
> >>>>  to safely manage work queues (rings) without requiring a kernel
> >>>>  call for every ring update.
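
From the userspace side (normally hidden inside libhsakmt), ringing a
doorbell then reduces to an mmap of the doorbell page plus a plain store.
The mmap offset convention and the 32-bit doorbell format below are
assumptions made for the sketch:

/* Userspace sketch: ring a doorbell without a syscall per submission. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/dev/kfd", O_RDWR);
        if (fd < 0)
                return 1;

        /* Assume a queue-create ioctl returned doorbell_offset earlier. */
        off_t doorbell_offset = 0;      /* placeholder for the sketch */
        volatile uint32_t *doorbell = mmap(NULL, 4096,
                                           PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, doorbell_offset);
        if (doorbell == MAP_FAILED)
                return 1;

        /* After writing packets into the user-mode ring buffer ... */
        uint32_t new_wptr = 8;          /* example ring write pointer */
        *doorbell = new_wptr;           /* HW picks this up; no kernel call */

        return 0;
}
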
> >>>>  First step for an application process is to open the kfd device.
> >>>>  Calls to open create a kfd "process" structure only for the first
> >>>>  thread of the process. Subsequent open calls are checked to see if
> >>>>  they are from processes using the same mm_struct and, if so, don't
> >>>>  do anything. The kfd per-process data lives as long as the
> >>>>  mm_struct exists. Each mm_struct is associated with a unique PASID,
> >>>>  allowing the IOMMUv2 to make userspace process memory accessible to
> >>>>  the GPU.
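
In kernel terms that open path might look roughly like this;
kfd_lookup_process_by_mm() and kfd_create_process() are invented names for
the sketch:

/* Sketch: reuse the per-process structure when the mm_struct matches. */
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/sched.h>

struct kfd_process;
struct kfd_process *kfd_lookup_process_by_mm(struct mm_struct *mm);
struct kfd_process *kfd_create_process(struct task_struct *thread);

static int kfd_open(struct inode *inode, struct file *filep)
{
        struct kfd_process *p;

        /* One kfd_process per mm_struct, shared by all threads/opens. */
        p = kfd_lookup_process_by_mm(current->mm);
        if (!p)
                p = kfd_create_process(current); /* allocates a PASID too */
        if (IS_ERR(p))
                return PTR_ERR(p);

        filep->private_data = p;
        return 0;
}
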
> >>>>  Next step is for the application to collect topology information
> >>>>  via sysfs. This gives userspace enough information to be able to
> >>>>  identify specific nodes (processors) in subsequent queue management
> >>>>  calls. Application processes can create queues on multiple
> >>>>  processors, and processors support queues from multiple processes.
> >>>I am not a fan of using sysfs to discover the topology.
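
For reference, the version-number dance the cover letter describes would
look something like this from userspace; the sysfs paths below are
placeholders, not necessarily what the patchset actually exposes:

/* Userspace sketch: re-read topology if it changed while being read. */
#include <stdio.h>

static int read_generation(const char *path)
{
        int gen = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%d", &gen) != 1)
                        gen = -1;
                fclose(f);
        }
        return gen;
}

int main(void)
{
        /* Placeholder path; the real layout is defined by the driver. */
        const char *gen_path = "/sys/class/kfd/kfd/topology/generation_id";
        int before, after;

        do {
                before = read_generation(gen_path);
                /* ... walk the per-node topology directories here ... */
                after = read_generation(gen_path);
        } while (before != after);      /* retry if topology changed */

        return 0;
}
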
> >>>>  At this point the application can create work queues in userspace
> >>>>  memory and pass them through the usermode library to kfd to have
> >>>>  them mapped onto HW queue slots so that commands written to the
> >>>>  queues can be executed by the GPU. Queue operations specify a
> >>>>  processor node, and so the bulk of this code is device-specific.
> >>>>
> >>>>  Written by John Bridgman <John.Bridgman at amd.com>
> >>>So the general comment is that you need to collapse many patches.
> >>>Things like 58, fixing kernel style, should be collapsed, i.e. fix all
> >>>the previous patches that have broken style. Even worse are things
> >>>like 71, removing code you just added a few patches earlier in the
> >>>patchset.
> >>Not quite, the code was added in patch 11.
> >>
> >>>This means that if I start reviewing in patch order I might spend time
> >>>on code that is later deleted/modified/fixed, i.e. the time I spend
> >>>understanding and reading some code might just be wasted.
> >>The quick answer is that you are of course right, but having said that,
> >>I think it would not be simple at all to do that squashing.
> >>I squashed what I could, and I can probably do a little more (like
> >>58), but the major problem is that one of the main modules of the
> >>driver - the scheduler (QCM) - was completely re-written (patches
> >>46-53). Therefore, from patch 1 to 53 we use the old scheduler code,
> >>and from 54 we use the new QCM (and the old scheduler code was
> >>completely removed in 71). So I could maybe squash 71 into 54, but that
> >>won't help you much, I think.
> >>
> >>So, the current advice I can give is to completely ignore the
> >>following files because they do not exist in the final commit:
> >>- kfd_sched_cik_static.c (stopped being used in 54)
> >>- kfd_registers.c (stopped being used in 81 because we moved all
> >>register writes to radeon)
> >>- kfd_hw_pointer_store.c (alive from 46 to 67)
> >>- kfd_hw_pointer_store.h (alive from 46 to 67)
> >>
> >>         Oded
> >Ok, I'll try, but this is driving me crazy. I am juggling between the
> >fully applied patchset and individual patches, going back and forth
> >trying to find which patch is the one I want to comment on. Not to
> >mention that after that people would need to ack bad patches just
> >because they know a later good patch fixes the badness.
> >
> >This patchset can be reduced to fewer than 10 big patches. I would first
> >do the core patches that touch core things like the iommu and are
> >preparatory to the patchset. Then the kfd patches that do not touch
> >anything in radeon but implement the skeleton and core infrastructure.
> >Then the radeon-specific patches.
> >
> >This leads me to the design: all the core kfd helpers should be in an
> >hsa directory and should be helpers, while all the radeon-specific bits
> >should most likely be part of the radeon gfx kernel module. I am not
> >against a radeon_kfd, but I really do not see the point. There should
> >be an hsa.ko just like there is a drm.ko.
> 
> Yeah totally agree with Jerome here.
> 
> Maybe I can explain a bit further what the difference in the design is.
> First of all, for other good examples see not only the DRM
> infrastructure, but also V4L and other Linux driver subsystems.
> 
> In those subsystems it's not the helper functions that control the
> driver, but instead the driver that controls the helper functions. E.g.
> if HSA wants to be a new subsystem with a standardized IOCTL interface,
> it should provide functions that assist drivers with implementing that
> interface, instead of implementing it itself and then trying to talk to
> the drivers for the necessary resources/operations.
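
To make that "driver controls the helpers" point concrete: a core hsa.ko
following the DRM/V4L pattern might export something like the following for
drivers to call into. Every name here is invented for illustration; nothing
like this exists in the posted patches:

/*
 * Invented sketch of a "core HSA" helper API in the DRM/V4L style: the
 * driver registers itself and supplies ops; the core implements the
 * common char device / IOCTL plumbing on top of those ops.
 */
#include <linux/device.h>
#include <linux/types.h>

struct hsa_device;

struct hsa_device_ops {
        int  (*create_queue)(struct hsa_device *hdev, void *args);
        int  (*destroy_queue)(struct hsa_device *hdev, u32 queue_id);
        void (*interrupt)(struct hsa_device *hdev, const void *ih_entry);
};

/* Called by a driver (e.g. radeon) for each HSA-capable device. */
struct hsa_device *hsa_device_register(struct device *parent,
                                       const struct hsa_device_ops *ops,
                                       void *drv_priv);
void hsa_device_unregister(struct hsa_device *hdev);
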
> 
> Coming back to the structure of the patches, one of the very first
> patches should be the IOCTL definition a driver needs to implement to be
> HSA conformant.
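
Such a first patch would essentially be a uapi header. Purely as an
illustration (the ioctl numbers, names and argument layout below are made
up, not taken from the patchset), it could look like:

/* Invented example of a standardized HSA uapi header. */
#ifndef _UAPI_LINUX_HSA_H
#define _UAPI_LINUX_HSA_H

#include <linux/ioctl.h>
#include <linux/types.h>

struct hsa_create_queue_args {
        __u64 ring_base_address;        /* userspace ring buffer */
        __u64 doorbell_offset;          /* returned: mmap offset */
        __u32 ring_size;                /* bytes, power of two */
        __u32 gpu_id;                   /* topology node for the queue */
        __u32 queue_id;                 /* returned */
        __u32 pad;
};

#define HSA_IOCTL_BASE          'H'
#define HSA_IOC_CREATE_QUEUE    _IOWR(HSA_IOCTL_BASE, 1, \
                                      struct hsa_create_queue_args)
#define HSA_IOC_DESTROY_QUEUE   _IOW(HSA_IOCTL_BASE, 2, __u32)

#endif /* _UAPI_LINUX_HSA_H */
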
> 
> I think it would be a good idea to split out changes to core structures
> like the IOMMU into their own separately submitted patches, with only a
> notice that this functionality is needed by the following "core HSA" and
> "HSA for radeon" patchsets.

Hm, so the hsa part is a completely new driver/subsystem, not just an
additional ioctl tacked onto radeon? The history of drm is littered with
"generic" ioctls that turned out to be useful for exactly one driver.
Which is why _all_ the command submission is now done with driver-private
ioctls.

I'd be quite a bit surprised if that suddenly works differently, so before
we bless a generic hsa interface I really want to see some implementation
from a different vendor (e.g. nvidia or intel) using the same ioctls.
Otherwise we just repeat history, and I'm not terribly inclined to keep on
cleaning up cruft forever - one drm legacy is enough ;-)

Jesse is the guy from our side to talk to about this.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

