[PATCH] drm/radeon: add user pointer support v2
Jerome Glisse
j.glisse at gmail.com
Tue Jul 1 08:35:13 PDT 2014
On Tue, Jul 01, 2014 at 11:49:18AM +0200, Christian König wrote:
> On 30.06.2014 19:35, Jerome Glisse wrote:
> >On Mon, Jun 30, 2014 at 03:04:16PM +0200, Christian König wrote:
> >>From: Christian König <christian.koenig at amd.com>
> >>
> >>This patch adds an IOCTL for turning a pointer supplied by
> >>userspace into a buffer object.
> >>
> >>It works similarly to the recently added userptr support for i915,
> >>so it also imposes several restrictions upon the memory being mapped:
> >>
> >>1. It must be page aligned (both start/end addresses, i.e. ptr and size).
> >>
> >>2. It must be normal system memory, not a pointer into another map of IO
> >>space (e.g. it must not be a GTT mmapping of another object).
> >>
> >>3. The BO is mapped into GTT, so the maximum amount of memory mapped at
> >>all times is still the GTT limit.
> >>
> >>Exporting and sharing as well as mapping of buffer objects created by
> >>this function is forbidden and results in -EPERM.
> >>
> >>In contrast to the i915 implementation we don't use an interval tree to
> >>manage all the buffer objects created and instead just register a separate
> >>MMU notifier callback for each BO, so that buffer objects are invalidated
> >>whenever any modification is made to the process's page table. This probably
> >>needs further optimization.
> >>
> >>Please review and / or comment.
> >NAK, this is kind of a horror show. I should probably take a look at
> >the Intel code too.
>
> Yeah, I'm not too keen on the approach either and would rather
> prefer using the IOMMU, but unfortunately that isn't available in
> all the use cases we need to be able to handle.
>
> BTW: Here is the link to the initial commit of the Intel
> implementation as it is in Linus tree: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5cc9ed4b9a7ac579362ccebac67f7a4cdb36de06
>
> >First, registering one notifier per BO is a bad idea; it is not what you
> >want to do.
>
> I know, the Intel guys use a single registration for each process
> and an interval tree to look up all the BOs affected.
>
> I just wanted to keep the implementation simple and stupid for now
> and first get feedback about the approach in general (to save time
> in case the whole approach isn't accepted at all, and your reaction
> indeed confirms that this was a good idea).
>
> >>[SNIP]
> >>@@ -176,8 +181,15 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
> >> if (p->cs_flags & RADEON_CS_USE_VM)
> >> p->vm_bos = radeon_vm_get_bos(p->rdev, p->ib.vm,
> >> &p->validated);
> >>+ if (need_mmap_lock)
> >>+ down_read(&current->mm->mmap_sem);
> >>+
> >>+ r = radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
> >>- return radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
> >>+ if (need_mmap_lock)
> >>+ up_read(&current->mm->mmap_sem);
> >>+
> >>+ return r;
> >> }
> >So the mmap lock protects almost nothing here. Once things are validated
> >nothing forbids userspace from unmapping the range containing the user BO.
>
> As far as I understand it (and I'm probably not as deep into the MM
> code as you, so correct me if I'm wrong) unmapping the range would
> result in an invalidate_range_start callback, and this callback in
> turn validates the BO into the CPU domain again. So it actually
> waits for the GPU operation to be completed, marks the pages in
> question as dirty and then unpins them.
>
> This should protect us from anything nasty happening in the case of an
> unmap(), fork(), etc...
The thing is, nothing forbids a new CS ioctl from happening right after
the radeon range_start callback but before the core mm code does whatever
it was about to do. The mmap_sem will protect you from fork or munmap,
but not from other things.
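To make it concrete, closing that window needs the CS path and the notifier
to synchronize on something, for instance along these lines (hand-waving
sketch; mn_lock, mn_seq and seq_seen_before_pinning are invented names, not
in the patch):

/* in invalidate_range_start(), before the mm actually changes anything */
spin_lock(&bo->mn_lock);
bo->mn_seq++;                   /* a mapping change is in flight */
spin_unlock(&bo->mn_lock);

/* in the CS ioctl, after get_user_pages()/validate */
spin_lock(&bo->mn_lock);
if (bo->mn_seq != seq_seen_before_pinning) {
        spin_unlock(&bo->mn_lock);
        goto retry;             /* range changed under us: unpin and retry */
}
spin_unlock(&bo->mn_lock);

The mmap_sem in read mode alone does not give you that.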
>
> >> static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority)
> >>diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
> >>index 6e30174..355aa67 100644
> >>--- a/drivers/gpu/drm/radeon/radeon_drv.c
> >>+++ b/drivers/gpu/drm/radeon/radeon_drv.c
> >>@@ -112,6 +112,9 @@ int radeon_gem_object_open(struct drm_gem_object *obj,
> >> struct drm_file *file_priv);
> >> void radeon_gem_object_close(struct drm_gem_object *obj,
> >> struct drm_file *file_priv);
> >>+struct dma_buf *radeon_gem_prime_export(struct drm_device *dev,
> >>+ struct drm_gem_object *gobj,
> >>+ int flags);
> >Use tabs, not spaces; there is already enough broken tab/space in the
> >radeon kernel as it is. I always tried to keep it consistent.
>
> Fixed, thanks for the hint.
>
> >>[SNIP]
> >>+static void radeon_bo_mn_release(struct mmu_notifier *mn,
> >>+ struct mm_struct *mm)
> >>+{
> >>+ struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
> >>+
> >>+ bo->mn.ops = NULL;
> >>+}
> >>+
> >>+static void radeon_bo_mn_invalidate_range_start(struct mmu_notifier *mn,
> >>+ struct mm_struct *mm,
> >>+ unsigned long start,
> >>+ unsigned long end)
> >>+{
> >>+ struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
> >>+ int r;
> >>+
> >>+ /* TODO: optimize! */
> >>+
> >>+ r = radeon_bo_reserve(bo, true);
> >>+ if (r) {
> >>+ dev_err(bo->rdev->dev, "(%d) failed to reserve user bo\n", r);
> >>+ return;
> >>+ }
> >>+
> >>+ radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
> >>+ r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
> >>+ if (r) {
> >>+ dev_err(bo->rdev->dev, "(%d) failed to validate user bo\n", r);
> >>+ return;
> >>+ }
> >>+
> >>+ radeon_bo_unreserve(bo);
> >>+}
> >So you most likely want ttm_bo_validate to be interruptible, but you also
> >want to call something like io_schedule so you are not just wasting the
> >time slice of the process for the selfishness of the GPU driver.
> >
> >Also, the invalidate_range callback is a common call during process
> >lifetime.
>
> I know that this is horribly inefficient and error prone, but we
> can't make ttm_bo_validate interruptible because
> invalidate_range_start doesn't allow us to return an error number,
> and so we can't rely on userspace repeating the syscall that caused
> the mapping change in case of a signal.
My thinking was to restart inside the notifier callback, with an io_schedule
in between.
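Roughly like this, inside the callback (sketch only; the validate becomes
interruptible and we retry it ourselves, since we cannot return the error
to anyone anyway):

do {
        r = ttm_bo_validate(&bo->tbo, &bo->placement, true, false);
        if (r == -ERESTARTSYS)
                io_schedule();  /* tell the scheduler we are waiting on I/O
                                 * instead of just burning the time slice */
} while (r == -ERESTARTSYS);

It is still not pretty when a signal stays pending, but it avoids blocking
uninterruptibly in the notifier.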
> The Intel code does it the same way, and it actually sends a cold
> shiver down my back that this got upstream, but hey, I'm not the one
> who reviewed it and I don't have a better idea either.
>
> >Moreover nothing prevents a CS that is made after the range_start callback
> >from pinning the buffer again before the range_end callback happens. Thus
> >this is extremely racy.
>
> Oh really? Crap, I thought that for all the cases we are interested
> in, the mmap_sem would be locked between range_start and range_end.
It can be taken in read mode. Not all calls to range_start/end happen
with the mmap_sem held in write mode.
> >>+
> >>+static const struct mmu_notifier_ops radeon_bo_notifier = {
> >>+ .release = radeon_bo_mn_release,
> >>+ .invalidate_range_start = radeon_bo_mn_invalidate_range_start,
> >>+};
> >What about the invalidate_page callback? You are breaking writeback to
> >disk, which is not a welcome move.
>
> Why? Keep in mind that the pages are pinned down as soon as we make
> the first CS which uses them and marked as dirty when we are done
> with them. In between those two events the pages shouldn't be written
> back to disk.
A page could be dirty prior to being pinned, or dirtied afterwards (by CPU
access). I am just pointing out that the GPU might be writing to the page
while the page is used as the source for disk I/O.
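For illustration, the usual unpin pattern for userptr pages is something
like the sketch below (not from this patch); note that it only helps once
the GPU is done, and does nothing for writeback that happens while the
pages are still pinned and possibly being written by the GPU:

unsigned i;

for (i = 0; i < ttm->num_pages; i++) {
        struct page *page = ttm->pages[i];

        set_page_dirty(page);           /* the GPU may have written to it */
        mark_page_accessed(page);
        page_cache_release(page);       /* drop the gup reference */
}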
>
> >>[SNIP]
> >>+static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
> >>+{
> >>+ struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
> >>+ struct radeon_ttm_tt *gtt = (void *)ttm;
> >>+ unsigned pinned = 0, nents;
> >>+ int r;
> >>+
> >>+ /* prepare the sg table with the user pages */
> >>+ if (current->mm != gtt->usermm)
> >>+ return -EPERM;
> >>+
> >>+ do {
> >>+ unsigned num_pages = ttm->num_pages - pinned;
> >>+ uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
> >>+ struct page **pages = ttm->pages + pinned;
> >>+
> >>+ r = get_user_pages(current, current->mm, userptr, num_pages,
> >>+ 1, 0, pages, NULL);
> >>+ if (r < 0) {
> >>+ release_pages(ttm->pages, pinned, 0);
> >>+ return r;
> >>+ }
> >>+ pinned += r;
> >>+
> >>+ } while (pinned < ttm->num_pages);
> >>+
> >>+ r = -ENOMEM;
> >>+ ttm->sg = kcalloc(1, sizeof(struct sg_table), GFP_KERNEL);
> >>+ if (!ttm->sg)
> >>+ goto release_pages;
> >>+
> >>+ r = sg_alloc_table_from_pages(ttm->sg, ttm->pages, ttm->num_pages, 0,
> >>+ ttm->num_pages << PAGE_SHIFT,
> >>+ GFP_KERNEL);
> >>+ if (r)
> >>+ goto release_sg;
> >>+
> >>+ r = -ENOMEM;
> >>+ nents = dma_map_sg(rdev->dev, ttm->sg->sgl, ttm->sg->nents,
> >>+ DMA_BIDIRECTIONAL);
> >>+ if (nents != ttm->sg->nents)
> >>+ goto release_sg;
> >>+
> >>+ return 0;
> >>+
> >>+release_sg:
> >>+ kfree(ttm->sg);
> >>+
> >>+release_pages:
> >>+ release_pages(ttm->pages, pinned, 0);
> >>+ return r;
> >>+}
> >So as said above, the pages backing the user mapping might change. As the
> >gup comment explains, there is no guarantee that the pages returned by gup
> >are the ones in use to back the address. Well, for things that will be used
> >as a read source by the GPU this is not much of an issue, as you can probably
> >assume that by the time you pin the BO the user mapping is done writing
> >to it.
> >
> >But the case of the GPU writing to it and then userspace expecting to read
> >back through its user space mapping is just a game of luck. You will be
> >lucky a lot of the time, because the kernel does not migrate pages just for
> >the kick of it. But in case it does, you will get inconsistencies and users
> >of that API will not understand why the hell they are having an issue.
>
> Yeah, completely agree. I was aware of those limitations even before
> starting to work on it, but hoped that registering an
> invalidate_range_start callback like the Intel code does would
> solve the issue (well, that code indeed got upstream).
>
> Unfortunately I'm not the one who decides if we do it or not, I'm
> just the one who volunteered to figure out how it is possible to do
> it in a clean way.
>
> >This kind of API is really a no-go; the way the mm code works does not
> >allow for such a thing. I will go look at the Intel code but I fear the worst.
>
> Please keep me updated on this; if we can't get a solution for
> this upstream I really need to be able to explain why.
>
The biggest issue with it is the pinning of memory, which violates memcg
accounting. Direct I/O also pins memory, but limits itself to a reasonable
amount at a time. A GPU buffer object might be several hundred megabytes,
which can obviously be nasty from a memory starvation point of view.
Most other issues can be worked out.
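For comparison, long-term pinning users like the infiniband umem code at
least account the pages against RLIMIT_MEMLOCK before calling gup, roughly
like the sketch below; that still does not make memcg aware of the pin, but
it bounds how much a single user can lock down:

unsigned long locked, lock_limit;

down_write(&current->mm->mmap_sem);
locked = current->mm->pinned_vm + ttm->num_pages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
        up_write(&current->mm->mmap_sem);
        return -ENOMEM;
}
current->mm->pinned_vm = locked;
up_write(&current->mm->mmap_sem);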
Cheers,
Jérôme