[PATCH] drm/radeon: add user pointer support v2

Jerome Glisse j.glisse at gmail.com
Tue Jul 1 08:35:13 PDT 2014


On Tue, Jul 01, 2014 at 11:49:18AM +0200, Christian König wrote:
> Am 30.06.2014 19:35, schrieb Jerome Glisse:
> >On Mon, Jun 30, 2014 at 03:04:16PM +0200, Christian König wrote:
> >>From: Christian König <christian.koenig at amd.com>
> >>
> >>This patch adds an IOCTL for turning a pointer supplied by
> >>userspace into a buffer object.
> >>
> >>It works similarly to the recently added userptr support for i915,
> >>so it also imposes several restrictions upon the memory being mapped:
> >>
> >>1. It must be page aligned (both start/end addresses, i.e. ptr and size).
> >>
> >>2. It must be normal system memory, not a pointer into another map of IO
> >>space (e.g. it must not be a GTT mmapping of another object).
> >>
> >>3. The BO is mapped into GTT, so the maximum amount of memory mapped at
> >>all times is still the GTT limit.
> >>
> >>Exporting and sharing as well as mapping of buffer objects created by
> >>this function is forbidden and results in an -EPERM.
> >>
> >>In contrast to the i915 implementation we don't use an interval tree to
> >>manage all the buffer objects created and instead just register a separate
> >>MMU notifier callback for each BO, so that buffer objects are invalidated
> >>whenever any modification is made to the process's page table. This probably
> >>needs further optimization.
> >>
> >>Please review and / or comment.
> >NAK, this is kind of a horror show. I should probably take a look at
> >the Intel code too.
> 
> Yeah, I'm not too keen on the approach either and would rather
> prefer using the IOMMU, but unfortunately that isn't available in
> all the use cases we need to be able to handle.
> 
> BTW: Here is the link to the initial commit of the Intel
> implementation as it is in Linus tree: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5cc9ed4b9a7ac579362ccebac67f7a4cdb36de06
> 
> >First, registering one notifier per BO is a bad idea, it is not what you
> >want to do.
> 
> I know, the Intel guys use a single registration for each process
> and an interval tree to look up all the BOs affected.
> 
> I just wanted to keep the implementation simple and stupid for now
> and first get feedback about the approach in general (to save time
> in case the whole approach won't be accepted at all, and your reaction
> indeed confirms that this was a good idea).
> 
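For reference, the per-process scheme described above (one notifier per mm
plus an interval tree, as i915 does) would look roughly like the sketch
below. This is illustrative only; the structure and the names radeon_mn,
mn_it and radeon_bo_invalidate are invented here, not taken from any
posted patch.

/* needs <linux/mmu_notifier.h>, <linux/interval_tree.h>, <linux/mutex.h> */
struct radeon_mn {
        struct mmu_notifier     mn;
        struct rb_root          objects;        /* interval tree of userptr BOs */
        struct mutex            lock;
};

static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end)
{
        struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
        struct interval_tree_node *it;

        /* only touch the BOs overlapping [start, end) */
        mutex_lock(&rmn->lock);
        it = interval_tree_iter_first(&rmn->objects, start, end - 1);
        while (it) {
                struct radeon_bo *bo = container_of(it, struct radeon_bo, mn_it);

                radeon_bo_invalidate(bo);       /* move it back to the CPU domain */
                it = interval_tree_iter_next(it, start, end - 1);
        }
        mutex_unlock(&rmn->lock);
}
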
> >>[SNIP]
> >>@@ -176,8 +181,15 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
> >>  	if (p->cs_flags & RADEON_CS_USE_VM)
> >>  		p->vm_bos = radeon_vm_get_bos(p->rdev, p->ib.vm,
> >>  					      &p->validated);
> >>+	if (need_mmap_lock)
> >>+		down_read(&current->mm->mmap_sem);
> >>+
> >>+	r = radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
> >>-	return radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
> >>+	if (need_mmap_lock)
> >>+		up_read(&current->mm->mmap_sem);
> >>+
> >>+	return r;
> >>  }
> >So the mmap lock protects almost nothing here. Once things are validated
> >nothing forbids userspace from unmapping the range containing the user BO.
> 
> As far as I understand it (and I'm probably not as deep into the MM
> code as you, so correct me if I'm wrong) unmapping the range would
> result in an invalidate_range_start callback, and this callback in
> turn validates the BO into the CPU domain again. So it actually
> waits for the GPU operation to complete, marks the pages in
> question as dirty and then unpins them.
> 
> This should protect us from anything nasty happening in the case of an
> unmap(), fork() etc...

The thing is that nothing forbids a new CS ioctl from happening right after
the radeon range_start callback but before the core mm code has done
whatever it was about to do. The mmap_sem will protect you from fork
or munmap, but not from other things.
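
(For illustration only, and not something proposed in this thread: the usual
way such a window gets closed is an invalidation sequence counter that the
notifier bumps and that the CS path rechecks after pinning, retrying when it
changed. All names below are invented.)

struct radeon_userptr_guard {
        spinlock_t    lock;
        unsigned long notifier_seq;     /* bumped on every invalidation */
};

/* called from the invalidate_range_start notifier */
static void radeon_userptr_mark_stale(struct radeon_userptr_guard *g)
{
        spin_lock(&g->lock);
        g->notifier_seq++;
        spin_unlock(&g->lock);
}

/* CS path: sample the counter before get_user_pages() ... */
static unsigned long radeon_userptr_read_seq(struct radeon_userptr_guard *g)
{
        unsigned long seq;

        spin_lock(&g->lock);
        seq = g->notifier_seq;
        spin_unlock(&g->lock);
        return seq;
}

/* ... and recheck it once the pages are pinned and the BO is reserved;
 * if it moved, the pinned pages may be stale and the CS must retry. */
static bool radeon_userptr_went_stale(struct radeon_userptr_guard *g,
                                      unsigned long seq)
{
        bool stale;

        spin_lock(&g->lock);
        stale = (g->notifier_seq != seq);
        spin_unlock(&g->lock);
        return stale;
}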

> 
> >>  static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority)
> >>diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
> >>index 6e30174..355aa67 100644
> >>--- a/drivers/gpu/drm/radeon/radeon_drv.c
> >>+++ b/drivers/gpu/drm/radeon/radeon_drv.c
> >>@@ -112,6 +112,9 @@ int radeon_gem_object_open(struct drm_gem_object *obj,
> >>  				struct drm_file *file_priv);
> >>  void radeon_gem_object_close(struct drm_gem_object *obj,
> >>  				struct drm_file *file_priv);
> >>+struct dma_buf *radeon_gem_prime_export(struct drm_device *dev,
> >>+                                        struct drm_gem_object *gobj,
> >>+                                        int flags);
> >Use tabs, not spaces; there is already enough broken tab/space usage in the
> >radeon kernel code as it is. I have always tried to keep it consistent.
> 
> Fixed, thanks for the hint.
> 
> >>[SNIP]
> >>+static void radeon_bo_mn_release(struct mmu_notifier *mn,
> >>+				 struct mm_struct *mm)
> >>+{
> >>+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
> >>+
> >>+	bo->mn.ops = NULL;
> >>+}
> >>+
> >>+static void radeon_bo_mn_invalidate_range_start(struct mmu_notifier *mn,
> >>+						struct mm_struct *mm,
> >>+						unsigned long start,
> >>+						unsigned long end)
> >>+{
> >>+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
> >>+	int r;
> >>+
> >>+	/* TODO: optimize! */
> >>+
> >>+	r = radeon_bo_reserve(bo, true);
> >>+	if (r) {
> >>+		dev_err(bo->rdev->dev, "(%d) failed to reserve user bo\n", r);
> >>+		return;
> >>+	}
> >>+
> >>+	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
> >>+	r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
> >>+	if (r) {
> >>+		dev_err(bo->rdev->dev, "(%d) failed to validate user bo\n", r);
> >>+		return;
> >>+	}
> >>+
> >>+	radeon_bo_unreserve(bo);
> >>+}
> >So you most likely want ttm_bo_validate to be interruptible, but you also
> >want to call something like io_schedule so you are not just wasting the time
> >slice of the process for the selfishness of the GPU driver.
> >
> >Also, invalidate_range is a callback that is commonly called during the
> >process lifetime.
> 
> I know that this is horribly inefficient and error prone, but we
> can't make ttm_bo_validate interruptible because
> invalidate_range_start doesn't allow us to return an error code,
> and so we can't rely on userspace repeating the syscall that caused
> the mapping change in case of a signal.

My thinking was to retry inside the notifier callback, with an io_schedule
in between.
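
Roughly something like this (sketch only, reusing the helpers from the
patch; the only changes are that ttm_bo_validate is called interruptibly
and that we back off with io_schedule() and retry when it gets interrupted):

static void radeon_bo_mn_invalidate_range_start(struct mmu_notifier *mn,
                                                struct mm_struct *mm,
                                                unsigned long start,
                                                unsigned long end)
{
        struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
        int r;

        do {
                r = radeon_bo_reserve(bo, true);
                if (r) {
                        dev_err(bo->rdev->dev, "(%d) failed to reserve user bo\n", r);
                        return;
                }

                radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
                /* interruptible this time */
                r = ttm_bo_validate(&bo->tbo, &bo->placement, true, false);
                radeon_bo_unreserve(bo);

                if (r == -ERESTARTSYS)
                        io_schedule();  /* give the time slice back, then retry */
        } while (r == -ERESTARTSYS);
}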

> The Intel code does it the same way, and it actually sends a cold
> shiver down my back that this got upstream, but hey, I'm not the one
> who reviewed it and I don't have a better idea either.
> 
> >Moreover nothing prevents a CS that is submitted after the range_start
> >callback from pinning the buffer again before the range_end callback is
> >made. Thus this is extremely racy.
> 
> Oh really? Crap, I thought that in all the cases we are interested
> in, the mmap_sem would be held between range_start and range_end.

It can be taken in read mode. Not all calls to range_start/end happen
with the mmap_sem held in write mode.

> >>+
> >>+static const struct mmu_notifier_ops radeon_bo_notifier = {
> >>+	.release = radeon_bo_mn_release,
> >>+	.invalidate_range_start = radeon_bo_mn_invalidate_range_start,
> >>+};
> >What about the invalidate_page callback? You are breaking writeback to
> >disk, which is not a welcome move.
> 
> Why? Keep in mind that the pages are pinned down as soon as we make
> the first CS which uses them, and marked as dirty when we are done
> with them. In between those two events the pages shouldn't be written
> back to disk.

A page could be dirty prior to being pinned, or dirtied afterwards (by CPU
access). I am just pointing out that the GPU might be writing to the page
while the page is in use as the source for disk I/O.
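
For reference, wiring up the missing hook would look something like the
sketch below, mirroring the range_start logic from the patch. This alone
does not solve the dirty-while-pinned problem above, it just shows what the
callback could be; it is not code from any posted patch.

static void radeon_bo_mn_invalidate_page(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long address)
{
        /* treat the single page like a one-page range invalidation */
        radeon_bo_mn_invalidate_range_start(mn, mm, address,
                                            address + PAGE_SIZE);
}

static const struct mmu_notifier_ops radeon_bo_notifier = {
        .release                = radeon_bo_mn_release,
        .invalidate_page        = radeon_bo_mn_invalidate_page,
        .invalidate_range_start = radeon_bo_mn_invalidate_range_start,
};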

> 
> >>[SNIP]
> >>+static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
> >>+{
> >>+	struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
> >>+	struct radeon_ttm_tt *gtt = (void *)ttm;
> >>+	unsigned pinned = 0, nents;
> >>+	int r;
> >>+
> >>+	/* prepare the sg table with the user pages */
> >>+	if (current->mm != gtt->usermm)
> >>+		return -EPERM;
> >>+
> >>+	do {
> >>+		unsigned num_pages = ttm->num_pages - pinned;
> >>+		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
> >>+		struct page **pages = ttm->pages + pinned;
> >>+
> >>+		r = get_user_pages(current, current->mm, userptr, num_pages,
> >>+				   1, 0, pages, NULL);
> >>+		if (r < 0) {
> >>+			release_pages(ttm->pages, pinned, 0);
> >>+			return r;
> >>+		}
> >>+		pinned += r;
> >>+
> >>+	} while (pinned < ttm->num_pages);
> >>+
> >>+	r = -ENOMEM;
> >>+	ttm->sg = kcalloc(1, sizeof(struct sg_table), GFP_KERNEL);
> >>+	if (!ttm->sg)
> >>+		goto release_pages;
> >>+
> >>+	r = sg_alloc_table_from_pages(ttm->sg, ttm->pages, ttm->num_pages, 0,
> >>+				      ttm->num_pages << PAGE_SHIFT,
> >>+				      GFP_KERNEL);
> >>+	if (r)
> >>+		goto release_sg;
> >>+
> >>+	r = -ENOMEM;
> >>+	nents = dma_map_sg(rdev->dev, ttm->sg->sgl, ttm->sg->nents,
> >>+			   DMA_BIDIRECTIONAL);
> >>+	if (nents != ttm->sg->nents)
> >>+		goto release_sg;
> >>+
> >>+	return 0;
> >>+
> >>+release_sg:
> >>+	kfree(ttm->sg);
> >>+
> >>+release_pages:
> >>+	release_pages(ttm->pages, pinned, 0);
> >>+	return r;
> >>+}
> >So as said above, the pages backing the user mapping might change. As the gup
> >comment explains, there is no guarantee that the pages returned by gup are the
> >ones actually in use to back the address. Well, for things that will be used as
> >a read source by the GPU this is not much of an issue, as you can probably
> >assume that by the time you pin the BO the user mapping is done writing
> >to it.
> >
> >But the case of the GPU writing to it and then userspace expecting to read
> >it back through its user space mapping is just a game of luck. You will be
> >lucky a lot of the time because the kernel does not migrate pages just for the
> >kick of it. But in case it does, you will have inconsistencies and users of
> >that API will not understand why the hell they are having an issue.
> 
> Yeah, completely agree. I was aware of those limitations even before
> starting to work on it, but hoped that registering an
> invalidate_range_start callback like the Intel code does would
> solve the issue (well, this code indeed got upstream).
> 
> Unfortunately I'm not the one who decides if we do it or not, I'm
> just the one who volunteered to figure out how it is possible to do
> it in a clean way.
> 
> >This kind of API is really a no-go; the way the mm code works does not allow
> >for such a thing. I will go look at the Intel code but I fear the worst.
> 
> Please keep me updated on this; if we can't get a solution for this
> upstream I really need to be able to explain why.
> 

The biggest issue with it is the pinning of memory, which violates the memcg
accounting. Direct I/O pins memory too, but limits itself to a reasonable
amount at a time. A GPU buffer object might be several hundred megabytes,
which obviously can be nasty from a memory starvation point of view.

Most other issues can be worked out.

Cheers,
Jérôme


