[git pull] habanalabs for drm-next-6.4
Oded Gabbay
ogabbay at kernel.org
Mon Mar 20 15:40:26 UTC 2023
Hi Dave, Daniel.
First pull request for 6.4.
Changes are all over the place - new uAPI, new features, optimizations, bug
fixes, cleanups, etc.
Full details are in the signed tag.
Thanks,
Oded
The following changes since commit 8bf6e20253b2d2b614f2c0b491f840e956fa6b05:
Merge tag 'drm-intel-next-2023-03-07' of git://anongit.freedesktop.org/drm/drm-intel into drm-next (2023-03-15 14:59:31 +1000)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux.git tags/drm-habanalabs-next-2023-03-20
for you to fetch changes up to 75b445753047872a69709cfba7e3939660f0ecc1:
accel/habanalabs: remove redundant TODOs (2023-03-20 17:35:34 +0200)
----------------------------------------------------------------
This tag contains habanalabs driver and accel changes for v6.4:
- uAPI changes:
- Add opcodes to the CS ioctl to allow user to stall/resume specific engines
inside Gaudi2. This is to allow the user to perform power
testing/measurements when training different topologies.
- Expose in the INFO ioctl the amount of device memory that the driver
and f/w reserve for themselves.
- Expose in the INFO ioctl a bit-mask of the available rotator engines
in Gaudi2. This is to align with other engines that are already exposed.
- Expose in the INFO ioctl the address of the f/w register that should
be used to trigger interrupts from within the user's code running in the
compute engines.
- Add a critical-event bit in the eventfd bitmask so the user will know that
the event that was received was critical, and that a reset will now occur.
- Expose in the INFO ioctl two new opcodes to fetch information on h/w and
f/w events. The events recorded are the events that were reported in the
eventfd.
- New features and improvements:
- Dedicate an MSI-X interrupt ID in the device to the notification of
an unexpected user-related event in Gaudi2. Handle it in the driver by
reporting this event.
- Allow the user to fetch the device memory current usage even when the
device is undergoing compute-reset (a reset type that only clears the
compute engines).
- Enable graceful reset mechanism for compute-reset. This will give the
user a few seconds before the device is reset. For example, the user can,
during that time, perform certain device operations (dump data for debug)
or close the device in an orderly fashion.
- Align the decoder with the rest of the engines in regard to notification
to the user about interrupts and in regard to performing graceful reset
when needed (instead of immediate reset).
- Add support for assert interrupt from the TPC engine.
- Get the reset type that must be performed for each event from the
auto-generated irq_map array.
- Print the specific reason why a device is still in use when notifying
the user about it (after the user closed the device's FD).
- Move to threaded IRQ when handling interrupts of workload completions.
- Firmware related fixes:
- Fix RAZWI event handler to match newest f/w version.
- Read error cause register in dma core events because the f/w doesn't
do that.
- Increase maximum time to wait for completion of Gaudi2 reset due to f/w
bug.
- Align to the latest firmware specs.
- Enforce the release order of the compute device and dma-buf.
i.e., increment the device file refcount for any dma-buf that was exported
for that device. This will make sure the compute device release function
won't be called until the user closes all the FDs of the relevant
dma-bufs. Without this change, closing the device's FD before/without
closing the dma-buf's FD would always lead to hard-reset of the device.
- Fix a link in the drm documentation to correctly point to the accel section.
- Compilation warnings cleanups
- Misc bug fixes and code cleanups
----------------------------------------------------------------
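The release-order item in the tag message above (take a reference on the device file for every exported dma-buf) can be illustrated with a small user-space model. This is only a sketch of the refcounting idea, not the driver's code: struct example_dev_file and every example_* function are hypothetical names invented for this illustration, and the "release" here is just a flag being set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state behind an open device FD. */
struct example_dev_file {
	int refcount;   /* 1 for the open FD, plus 1 per exported dma-buf */
	bool released;  /* set when the (modelled) release function runs */
};

/* Opening the device takes the initial reference. */
void example_open(struct example_dev_file *f)
{
	f->refcount = 1;
	f->released = false;
}

/* Exporting a dma-buf takes an extra reference on the device file. */
void example_get(struct example_dev_file *f)
{
	f->refcount++;
}

/* Closing either FD drops a reference; release runs only at zero. */
void example_put(struct example_dev_file *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->released = true;
}

/*
 * Close the device FD before the dma-buf FD and report whether the
 * release was deferred until the dma-buf was closed (1) or not (0).
 */
int example_release_order_demo(void)
{
	struct example_dev_file f;
	bool released_early;

	example_open(&f);
	example_get(&f);               /* export one dma-buf */
	example_put(&f);               /* user closes the device FD first */
	released_early = f.released;   /* expected: still false */
	example_put(&f);               /* user closes the dma-buf FD */

	return !released_early && f.released;
}
```

In this model, closing the device FD first leaves one reference held by the dma-buf, so the release is deferred until the dma-buf FD is closed as well; example_release_order_demo() returns 1 in that case.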
Bagas Sanjaya (1):
accel: Link to compute accelerator subsystem intro
Bjorn Helgaas (1):
accel/habanalabs: Drop redundant pci_enable_pcie_error_reporting()
Colin Ian King (1):
accel/habanalabs: Fix spelling mistake "maped" -> "mapped"
Dafna Hirschfeld (12):
accel/habanalabs: tiny refactor of hl_device_reset for readability
accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft'
accel/habanalabs: in hl_device_reset small refactor for readabilty
accel/habanalabs: don't trace cpu accessible dma alloc/free
accel/habanalabs: change hw_fini to return int to indicate error
accel/habanalabs: assert return value of hw_fini
accel/habanalabs: allow getting HL_INFO_DRAM_USAGE during soft-reset
accel/habanalabs: unify err log of hw-fini failure in dirty state
accel/habanalabs: move soft-reset wait to soft-reset execute
accel/habanalabs: in hw_fini return error code if polling timed-out
accel/habanalabs: fix use of var reset_sleep_ms
accel/habanalabs: in {e/p}dma_core events read the err cause reg
Dani Liberman (3):
accel/habanalabs: fix address decode RAZWI handling
accel/habanalabs: fix page fault event clear
accel/habanalabs: change razwi handle after fw fix
Koby Elbaz (12):
accel/habanalabs: capture RAZWI info only if HW indication detected
accel/habanalabs: unsecure CFG_TPC_ID register
accel/habanalabs: disable PCI when escalating compute to hard-reset
accel/habanalabs: rename security function parameters
accel/habanalabs: break is_idle function into per-engine sub-routines
accel/habanalabs: verify return code after scrubbing ARCs DCCMs
accel/habanalabs: remove a useless is_idle TPC flag
accel/habanalabs: fix register address on PDMA/EDMA idle check
accel/habanalabs: use a mutex rather than a spinlock
accel/habanalabs: add uapi to stall/resume engine
accel/habanalabs: do not verify engine modes after being changed
accel/habanalabs: return tlb inv error code upon failure
Moti Haimovski (2):
accel/habanalabs: add critical-event bit in notifier
accel/habanalabs: minimize error prints when mem map fails
Oded Gabbay (6):
accel/habanalabs: split cdev creation to separate function
accel/habanalabs: save class in hdev
accel/habanalabs: refactor debugfs init
accel/habanalabs: make gaudi2_is_device_idle() static
accel/habanalabs: align to latest firmware specs
accel/habanalabs: fix field names in hl_info_hw_ip_info
Ofir Bitton (9):
accel/habanalabs: increase user interrupt grace time
accel/habanalabs: expose engine core int reg address
accel/habanalabs: capture interrupt timestamp in handler
accel/habanalabs: add support for TPC assert
accel/habanalabs: increase reset poll timeout
accel/habanalabs: expose dram reserved size by kmd
accel/habanalabs: expose rotator mask to userspace
accel/habanalabs: add handling for unexpected user event
accel/habanalabs: remove redundant TODOs
Ohad Sharabi (3):
accel/habanalabs: get reset type indication from irq_map
accel/habanalabs: modify events reset policy
accel/habanalabs: regenerate gaudi2 ids_map_extended
Sagiv Ozeri (2):
accel/habanalabs: organize hl_device structure comment
accel/habanalabs: add device id to all threads names
Tal Cohen (1):
accel/habanalabs: change user interrupt to threaded IRQ
Tom Rix (2):
accel/habanalabs: change unused extern decl of hdev to forward decl of hl_device
accel/habanalabs: set hl_capture_*_err storage-class-specifier to static
Tomer Tayar (15):
accel/habanalabs: use memhash_node_export_put() in hl_release_dmabuf()
accel/habanalabs: add info when FD released while device still in use
accel/habanalabs: enforce release order of compute device and dma-buf
accel/habanalabs: enable graceful reset mechanism for compute-reset
accel/habanalabs: fix print in hl_irq_handler_eq()
accel/habanalabs: remove hl_irq_handler_default()
accel/habanalabs: improve readability of engines idle mask print
accel/habanalabs: remove unneeded irq_handler variable
accel/habanalabs: add helper function to get vm hash node
accel/habanalabs: use notifications and graceful reset for decoder
accel/habanalabs: use scnprintf() in print_device_in_use_info()
accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()
accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event()
accel/habanalabs: fix a maybe-uninitialized compilation warnings
accel/habanalabs: fix a missing-braces compilation warning
farah kassabri (1):
accel/habanalabs: fix few misspelled words in the code
.../accel/habanalabs/common/command_submission.c | 130 +-
drivers/accel/habanalabs/common/debugfs.c | 142 +-
drivers/accel/habanalabs/common/decoder.c | 22 +-
drivers/accel/habanalabs/common/device.c | 315 +-
drivers/accel/habanalabs/common/firmware_if.c | 2 +-
drivers/accel/habanalabs/common/habanalabs.h | 129 +-
drivers/accel/habanalabs/common/habanalabs_drv.c | 14 +-
drivers/accel/habanalabs/common/habanalabs_ioctl.c | 60 +-
drivers/accel/habanalabs/common/irq.c | 73 +-
drivers/accel/habanalabs/common/memory.c | 133 +-
drivers/accel/habanalabs/common/memory_mgr.c | 15 +-
drivers/accel/habanalabs/common/mmu/mmu.c | 6 +-
drivers/accel/habanalabs/common/security.c | 6 +-
drivers/accel/habanalabs/common/security.h | 2 +-
drivers/accel/habanalabs/gaudi/gaudi.c | 65 +-
drivers/accel/habanalabs/gaudi2/gaudi2.c | 1543 ++++--
drivers/accel/habanalabs/gaudi2/gaudi2P.h | 9 +-
drivers/accel/habanalabs/gaudi2/gaudi2_coresight.c | 2 +-
drivers/accel/habanalabs/gaudi2/gaudi2_masks.h | 3 +-
drivers/accel/habanalabs/gaudi2/gaudi2_security.c | 1 +
drivers/accel/habanalabs/goya/goya.c | 21 +-
drivers/accel/habanalabs/include/common/cpucp_if.h | 9 +-
.../accel/habanalabs/include/common/hl_boot_if.h | 47 +-
.../include/gaudi2/asic_reg/gaudi2_regs.h | 5 +
drivers/accel/habanalabs/include/gaudi2/gaudi2.h | 2 +
.../include/gaudi2/gaudi2_async_events.h | 4 +-
.../include/gaudi2/gaudi2_async_ids_map_extended.h | 5294 ++++++++++----------
.../accel/habanalabs/include/gaudi2/gaudi2_fw_if.h | 5 +-
include/drm/drm_file.h | 3 +-
include/uapi/drm/habanalabs_accel.h | 102 +-
30 files changed, 4541 insertions(+), 3623 deletions(-)