[PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block

Boris Brezillon boris.brezillon at collabora.com
Tue Aug 29 15:33:17 UTC 2023


On Mon, 14 Aug 2023 16:53:09 +0100
Steven Price <steven.price at arm.com> wrote:

> > +
> > +/**
> > + * struct panthor_vm_op_ctx - VM operation context
> > + *
> > + * With VM operations potentially taking place in a dma-signaling path, we
> > + * need to make sure everything that might require resource allocation is
> > + * pre-allocated upfront. This is what this operation context is for.
> > + *
> > + * We also collect resources that have been freed, so we can release them
> > + * asynchronously, and let the VM_BIND scheduler process the next VM_BIND
> > + * request.
> > + */
> > +struct panthor_vm_op_ctx {
> > +	/** @rsvd_page_tables: Pages reserved for the MMU page table update. */
> > +	struct {
> > +		/** @count: Number of pages reserved. */
> > +		u32 count;
> > +
> > +		/** @ptr: Index of the first unused page in the @pages table. */
> > +		u32 ptr;
> > +
> > +		/**
> > +		 * @pages: Array of pages that can be used for an MMU page table update.
> > +		 *
> > +		 * After a VM operation, there might be free pages left in this array.
> > +		 * They should be returned to the pt_cache as part of the op_ctx cleanup.
> > +		 */
> > +		void **pages;
> > +	} rsvd_page_tables;  
> 
> Two questions:
> 
> 1) Would a mempool simplify the implementation? It looks like a
> reasonable match.

Not sure what you mean by mempool, but I'm using a kmem_cache here for
all page table allocations. The pages that are passed to
panthor_vm_op_ctx::rsvd_page_tables::pages are allocated from this
pool. It's just that for each VM operation we pre-allocate page-tables,
and release those that were not used when the operation is done (we
over-allocate for the worst case scenario).
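To make that a bit more concrete, here's roughly what the pre-allocation and
cleanup look like (untested sketch, the helper names are made up for the
example; only pt_cache and the rsvd_page_tables fields come from the patch):

	static int prealloc_page_tables(struct panthor_vm_op_ctx *op_ctx,
					u32 worst_case)
	{
		u32 i;

		op_ctx->rsvd_page_tables.count = worst_case;
		op_ctx->rsvd_page_tables.ptr = 0;
		op_ctx->rsvd_page_tables.pages = kcalloc(worst_case, sizeof(void *),
							 GFP_KERNEL);
		if (!op_ctx->rsvd_page_tables.pages)
			return -ENOMEM;

		for (i = 0; i < worst_case; i++) {
			/* All page tables come from the global pt_cache. */
			op_ctx->rsvd_page_tables.pages[i] =
				kmem_cache_zalloc(pt_cache, GFP_KERNEL);
			if (!op_ctx->rsvd_page_tables.pages[i])
				return -ENOMEM;
		}

		return 0;
	}

	/* Anything past ->ptr was not consumed by the VM operation and goes
	 * back to pt_cache at op_ctx cleanup time.
	 */
	static void return_unused_page_tables(struct panthor_vm_op_ctx *op_ctx)
	{
		u32 i;

		for (i = op_ctx->rsvd_page_tables.ptr;
		     i < op_ctx->rsvd_page_tables.count; i++)
			kmem_cache_free(pt_cache, op_ctx->rsvd_page_tables.pages[i]);
	}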

> 
> 2) Does it really make sense to have a separate pool of memory for every
> operation? Instead of having a separate pool for each operation, it
> would be possible to just keep track of the total number needed for all
> outstanding operations. Then a single (per device or maybe per-VM if
> necessary) mempool could be resized to ensure it has the right amount of
> space.

The pool is per-driver (see the global pt_cache). rsvd_page_tables just
holds pages needed for a specific VM operation. To be more specific, it
holds pages for the worst case (page table tree is empty, except for the
root page table).
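The worst-case estimate is basically 'one new table per level per region
touched by the VA range, root excluded'. Illustrative sketch assuming the
AArch64 4K granule with 9 bits translated per level (not the actual helper):

	static u32 worst_case_pt_count(u64 va, u64 size)
	{
		u64 start = va, end = va + size - 1;
		u32 count = 0;
		int level;

		/* Levels 1 to 3 might each need new tables; the root
		 * (level 0) table always exists.
		 */
		for (level = 1; level <= 3; level++) {
			u32 shift = 12 + 9 * (4 - level);

			count += (end >> shift) - (start >> shift) + 1;
		}

		return count;
	}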

> 
> I'm also a little wary that the VM_BIND infrastructure could potentially
> be abused to trigger a large amount of kernel allocation as it allocates
> up-front for the worst case but those pages are not charged to the
> process (AFAICT). But I haven't fully got my head round that yet.

Yep, that's problematic, indeed. I considered allocating page tables
as GEM objects, but the overhead of a GEM object is quite big
(hundreds of bytes of meta-data) compared to the size of a page table
(4k), and kmem_cache was just super convenient for this page table
cache :-).

> 
> > +
> > +	/** @flags: Combination of drm_panthor_vm_bind_op_flags. */
> > +	u32 flags;
> > +
> > +	/** @va: Virtual range targeted by the VM operation. */
> > +	struct {
> > +		/** @addr: Start address. */
> > +		u64 addr;
> > +
> > +		/** @range: Range size. */
> > +		u64 range;
> > +	} va;
> > +
> > +	/**
> > +	 * @returned_vmas: List of panthor_vma objects returned after a VM operation.
> > +	 *
> > +	 * For unmap operations, this will contain all VMAs that were covered by the
> > +	 * specified VA range.
> > +	 *
> > +	 * For map operations, this will contain all VMAs that previously mapped to
> > +	 * the specified VA range.
> > +	 *
> > +	 * Those VMAs, and the resources they point to will be released as part of
> > +	 * the op_ctx cleanup operation.
> > +	 */
> > +	struct list_head returned_vmas;
> > +
> > +	/** @map: Fields specific to a map operation. */
> > +	struct {
> > +		/** @gem: GEM object information. */
> > +		struct {
> > +			/** @obj: GEM object to map. */
> > +			struct drm_gem_object *obj;
> > +
> > +			/** @offset: Offset in the GEM object. */
> > +			u64 offset;
> > +		} gem;
> > +
> > +		/**
> > +		 * @sgt: sg-table pointing to pages backing the GEM object.
> > +		 *
> > +		 * This is gathered at job creation time, such that we don't have
> > +		 * to allocate in ::run_job().
> > +		 */
> > +		struct sg_table *sgt;
> > +
> > +		/**
> > +		 * @prev_vma: Pre-allocated VMA object to deal with a remap situation.
> > +		 *
> > +		 * If the map request covers a region that's inside another VMA, the
> > +		 * previous VMA will be split, requiring instantiation of a maximum of
> > +		 * two new VMA objects.
> > +		 */
> > +		struct panthor_vma *prev_vma;
> > +
> > +		/**
> > +		 * @new_vma: The new VMA object that will be inserted to the VA tree.
> > +		 */
> > +		struct panthor_vma *new_vma;
> > +
> > +		/**
> > +		 * @next_vma: Pre-allocated VMA object to deal with a remap situation.
> > +		 *
> > +		 * See @prev_vma.
> > +		 */
> > +		struct panthor_vma *next_vma;  
> 
> It's probably premature optimization, but it feels like having a cache
> of these VMA structures might be an idea.

If it's needed, I'll probably go for a kmem_cache, but I need to
check whether it's worth it first (i.e. whether the closest kmalloc cache is
significantly bigger than the struct size).
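The check itself is trivial, something like this (sketch, the 2x threshold is
arbitrary):

	/* kmalloc_size_roundup() returns the size the closest kmalloc
	 * cache would actually hand out for this struct.
	 */
	static bool panthor_vma_cache_worth_it(void)
	{
		return kmalloc_size_roundup(sizeof(struct panthor_vma)) >
		       2 * sizeof(struct panthor_vma);
	}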

> I'm also struggling to
> understand how both a new prev and new next VMA are needed - but I
> haven't dug into the GPU VA manager.

prev/next are for mapping splits: an object is already mapped, and a new
object is mapped in the middle of this pre-existing mapping. In that
case, we need two VMA objects for the preceding and succeeding mappings,
since the old mapping object will be released.

new_vma is for the new mapping.
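Purely to illustrate the split:

	before:   |---------------- existing mapping ----------------|
	map req:             |-------- new_vma --------|
	after:    |-prev_vma-|-------- new_vma --------|--next_vma---|

The pre-existing VMA object ends up on returned_vmas and is released at
op_ctx cleanup time, while up to three new objects (prev_vma, new_vma,
next_vma) get inserted in the VA tree.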

> 
> > +	} map;
> > +};
> > +

[...]

> > +/**
> > + * panthor_vm_active() - Flag a VM as active
> > + * @VM: VM to flag as active.
> > + *
> > + * Assigns an address space to a VM so it can be used by the GPU/MCU.
> > + *
> > + * Return: 0 on success, a negative error code otherwise.
> > + */
> > +int panthor_vm_active(struct panthor_vm *vm)
> > +{
> > +	struct panthor_device *ptdev = vm->ptdev;
> > +	struct io_pgtable_cfg *cfg = &io_pgtable_ops_to_pgtable(vm->pgtbl_ops)->cfg;
> > +	int ret = 0, as, cookie;
> > +	u64 transtab, transcfg;
> > +
> > +	if (!drm_dev_enter(&ptdev->base, &cookie))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&ptdev->mmu->as.slots_lock);
> > +
> > +	as = vm->as.id;
> > +	if (as >= 0) {
> > +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> > +
> > +		if (ptdev->mmu->as.faulty_mask & mask) {
> > +			/* Unhandled pagefault on this AS, the MMU was
> > +			 * disabled. We need to re-enable the MMU after
> > +			 * clearing+unmasking the AS interrupts.
> > +			 */
> > +			gpu_write(ptdev, MMU_INT_CLEAR, mask);
> > +			ptdev->mmu->as.faulty_mask &= ~mask;
> > +			gpu_write(ptdev, MMU_INT_MASK, ~ptdev->mmu->as.faulty_mask);
> > +			goto out_enable_as;
> > +		}
> > +
> > +		goto out_unlock;
> > +	}
> > +
> > +	/* Check for a free AS */
> > +	if (vm->for_mcu) {
> > +		drm_WARN_ON(&ptdev->base, ptdev->mmu->as.alloc_mask & BIT(0));
> > +		as = 0;
> > +	} else {
> > +		as = ffz(ptdev->mmu->as.alloc_mask | BIT(0));
> > +	}
> > +
> > +	if (!(BIT(as) & ptdev->gpu_info.as_present)) {
> > +		struct panthor_vm *lru_vm;
> > +
> > +		lru_vm = list_first_entry_or_null(&ptdev->mmu->as.lru_list,
> > +						  struct panthor_vm,
> > +						  as.lru_node);
> > +		if (drm_WARN_ON(&ptdev->base, !lru_vm)) {
> > +			ret = -EBUSY;
> > +			goto out_unlock;
> > +		}
> > +
> > +		list_del_init(&lru_vm->as.lru_node);
> > +		as = lru_vm->as.id;  
> 
> Should this not set lru_vm->as.id = -1, so that the code knows the VM no
> longer has an address space?

Good catch!
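Will fix, i.e. something like this when reclaiming the AS (untested):

		list_del_init(&lru_vm->as.lru_node);
		as = lru_vm->as.id;
		lru_vm->as.id = -1;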

> 
> > +	} else {
> > +		set_bit(as, &ptdev->mmu->as.alloc_mask);
> > +	}
> > +
> > +	/* Assign the free or reclaimed AS to the FD */
> > +	vm->as.id = as;
> > +	ptdev->mmu->as.slots[as].vm = vm;
> > +
> > +out_enable_as:
> > +	transtab = cfg->arm_lpae_s1_cfg.ttbr;
> > +	transcfg = AS_TRANSCFG_PTW_MEMATTR_WB |
> > +		   AS_TRANSCFG_PTW_RA |
> > +		   AS_TRANSCFG_ADRMODE_AARCH64_4K;
> > +	if (ptdev->coherent)
> > +		transcfg |= AS_TRANSCFG_PTW_SH_OS;
> > +
> > +	ret = panthor_mmu_as_enable(vm->ptdev, vm->as.id, transtab, transcfg, vm->memattr);
> > +
> > +out_unlock:
> > +	mutex_unlock(&ptdev->mmu->as.slots_lock);
> > +	drm_dev_exit(cookie);
> > +	return ret;
> > +}
> > +

[...]

> > +
> > +static void panthor_mmu_irq_handler(struct panthor_device *ptdev, u32 status)
> > +{
> > +	status = panthor_mmu_fault_mask(ptdev, status);
> > +	while (status) {
> > +		u32 as = ffs(status | (status >> 16)) - 1;
> > +		u32 mask = panthor_mmu_as_fault_mask(ptdev, as);
> > +		u32 new_int_mask;
> > +		u64 addr;
> > +		u32 fault_status;
> > +		u32 exception_type;
> > +		u32 access_type;
> > +		u32 source_id;
> > +
> > +		fault_status = gpu_read(ptdev, AS_FAULTSTATUS(as));
> > +		addr = gpu_read(ptdev, AS_FAULTADDRESS_LO(as));
> > +		addr |= (u64)gpu_read(ptdev, AS_FAULTADDRESS_HI(as)) << 32;
> > +
> > +		/* decode the fault status */
> > +		exception_type = fault_status & 0xFF;
> > +		access_type = (fault_status >> 8) & 0x3;
> > +		source_id = (fault_status >> 16);
> > +
> > +		/* Page fault only */  
> 
> This comment makes no sense - it looks like it's copied over from panfrost.

Uh, it made sense before I dropped map/alloc-on-fault :-).

> 
> If I understand correctly we don't (currently) support growing on page
> fault - and it's not really needed now the MCU can handle the tiler heaps.

Exactly. Map/alloc-on-fault is a bit challenging because of the whole
'we have to guarantee that a job is done in finite time, and we must
make sure fence signaling is not blocked on allocation'. Given
drm_gem_get_pages() doesn't do non-blocking allocations, I thought it'd
be preferable to postpone map-on-fault until we actually decide we need
it. Note that i915 seems to have some sort of non-blocking page
allocator in shmem_sg_alloc_table()[1].
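Something along those lines (purely illustrative, not part of this series):
grab pages with GFP flags that never sleep, and defer the work if that fails:

	static struct page *
	try_get_backing_page_nonblocking(struct address_space *mapping,
					 pgoff_t idx)
	{
		/* GFP_NOWAIT never sleeps, so this could be called from a
		 * dma-signaling path; callers must handle failure by
		 * deferring to a context where blocking is allowed.
		 */
		return shmem_read_mapping_page_gfp(mapping, idx,
						   GFP_NOWAIT | __GFP_NOWARN);
	}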

> 
> > +		mutex_lock(&ptdev->mmu->as.slots_lock);
> > +
> > +		new_int_mask =
> > +			panthor_mmu_fault_mask(ptdev, ~ptdev->mmu->as.faulty_mask);
> > +
> > +		/* terminal fault, print info about the fault */
> > +		drm_err(&ptdev->base,
> > +			"Unhandled Page fault in AS%d at VA 0x%016llX\n"
> > +			"raw fault status: 0x%X\n"
> > +			"decoded fault status: %s\n"
> > +			"exception type 0x%X: %s\n"
> > +			"access type 0x%X: %s\n"
> > +			"source id 0x%X\n",
> > +			as, addr,
> > +			fault_status,
> > +			(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
> > +			exception_type, panthor_exception_name(ptdev, exception_type),
> > +			access_type, access_type_name(ptdev, fault_status),
> > +			source_id);
> > +
> > +		/* Ignore MMU interrupts on this AS until it's been
> > +		 * re-enabled.
> > +		 */
> > +		ptdev->mmu->irq.mask = new_int_mask;
> > +		gpu_write(ptdev, MMU_INT_MASK, new_int_mask);
> > +
> > +		/* Disable the MMU to kill jobs on this AS. */
> > +		panthor_mmu_as_disable(ptdev, as);
> > +		mutex_unlock(&ptdev->mmu->as.slots_lock);
> > +
> > +		status &= ~mask;
> > +	}
> > +}
> > +PANTHOR_IRQ_HANDLER(mmu, MMU, panthor_mmu_irq_handler);
> > +

[...]

> > +
> > +/**
> > + * panthor_mmu_unplug() - Unplug the MMU logic
> > + * @ptdev: Device.
> > + *
> > + * No access to the MMU regs should be done after this function is called.
> > + * We suspend the IRQ and disable all VMs to guarantee that.
> > + */
> > +void panthor_mmu_unplug(struct panthor_device *ptdev)
> > +{
> > +	if (ptdev->mmu->irq.irq > 0)  
> 
> In what situation is this not true? AFAICT the driver probe will fail if
> the IRQ can't be obtained.

Right, I'll drop this test.

[1]https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/i915/gem/i915_gem_shmem.c#L63

