[PATCH] drm/radeon: add user pointer support v2

Christian König deathsimple at vodafone.de
Tue Jul 1 02:49:18 PDT 2014


Am 30.06.2014 19:35, schrieb Jerome Glisse:
> On Mon, Jun 30, 2014 at 03:04:16PM +0200, Christian König wrote:
>> From: Christian König <christian.koenig at amd.com>
>>
>> This patch adds an IOCTL for turning a pointer supplied by
>> userspace into a buffer object.
>>
>> It works similarly to the recently added userptr support for i915,
>> so it also imposes several restrictions upon the memory being mapped:
>>
>> 1. It must be page aligned (both start/end addresses, i.e. ptr and size).
>>
>> 2. It must be normal system memory, not a pointer into another map of IO
>> space (e.g. it must not be a GTT mmapping of another object).
>>
>> 3. The BO is mapped into GTT, so the maximum amount of memory mapped at
>> all times is still the GTT limit.
>>
>> Exporting and sharing as well as mapping of buffer objects created by
>> this function is forbidden and results in an -EPERM.
>>
>> In contrast to the i915 implementation we don't use an interval tree to
>> manage all the buffer objects created and instead just register a separate
>> MMU notifier callback for each BO, so that buffer objects are invalidated
>> whenever any modification is made to the process's page table. This probably
>> needs further optimization.
>>
>> Please review and / or comment.
> NAK, this is kind of a horror show. I should probably take a look at
> the Intel code too.

Yeah, I'm not too keen on the approach either and would rather use the 
IOMMU, but unfortunately that isn't available in all the use cases we 
need to be able to handle.

BTW: Here is the link to the initial commit of the Intel implementation 
as it is in Linus' tree: 
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5cc9ed4b9a7ac579362ccebac67f7a4cdb36de06

> First, registering one notifier per bo is a bad idea; it is not what you
> want to do.

I know, the Intel guys use a single registration for each process and an 
interval tree to look up all the BOs affected.

I just wanted to keep the implementation simple and stupid for now and 
first get feedback about the approach in general (to save time in case 
the whole approach wouldn't be accepted at all, and your reaction indeed 
confirms that this was a good idea).
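
Just to illustrate what I mean by the Intel approach, here is a rough
sketch (structure and names are made up by me, not taken from the i915
code): one notifier per mm plus an interval tree of userptr BOs, so a
single invalidate_range_start only walks the BOs that actually overlap
the invalidated range:

#include <linux/mmu_notifier.h>
#include <linux/interval_tree.h>
#include <linux/mutex.h>

/* one per struct mm_struct */
struct radeon_mn {
	struct mmu_notifier	mn;
	struct rb_root		objects; /* interval tree of radeon_mn_node */
	struct mutex		lock;
};

/* one per userptr BO */
struct radeon_mn_node {
	struct interval_tree_node	it; /* it.start/it.last = CPU VA range */
	struct radeon_bo		*bo;
};

static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
{
	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
	struct interval_tree_node *it;

	mutex_lock(&rmn->lock);
	/* only visit the BOs overlapping [start, end), not all of them */
	it = interval_tree_iter_first(&rmn->objects, start, end - 1);
	while (it) {
		struct radeon_mn_node *node =
			container_of(it, struct radeon_mn_node, it);

		/* evict node->bo from GTT, same as in the per-BO callback
		 * further down in the patch */

		it = interval_tree_iter_next(it, start, end - 1);
	}
	mutex_unlock(&rmn->lock);
}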

>> [SNIP]
>> @@ -176,8 +181,15 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
>>   	if (p->cs_flags & RADEON_CS_USE_VM)
>>   		p->vm_bos = radeon_vm_get_bos(p->rdev, p->ib.vm,
>>   					      &p->validated);
>> +	if (need_mmap_lock)
>> +		down_read(&current->mm->mmap_sem);
>> +
>> +	r = radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
>>   
>> -	return radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
>> +	if (need_mmap_lock)
>> +		up_read(&current->mm->mmap_sem);
>> +
>> +	return r;
>>   }
> So the mmap lock protects almost nothing here. Once things are validated
> nothing forbids userspace from unmapping the range containing the user bo.

As far as I understand it (and I'm probably not as deep into the MM code 
as you, so correct me if I'm wrong), unmapping the range would result in 
an invalidate_range_start callback, and this callback in turn validates 
the BO into the CPU domain again. So it actually blocks until the GPU 
operation is completed, marks the pages in question as dirty and then 
unpins them.

This should protect us from anything nasty happening in the case of an 
unmap(), fork() etc...

>>   
>>   static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority)
>> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
>> index 6e30174..355aa67 100644
>> --- a/drivers/gpu/drm/radeon/radeon_drv.c
>> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
>> @@ -112,6 +112,9 @@ int radeon_gem_object_open(struct drm_gem_object *obj,
>>   				struct drm_file *file_priv);
>>   void radeon_gem_object_close(struct drm_gem_object *obj,
>>   				struct drm_file *file_priv);
>> +struct dma_buf *radeon_gem_prime_export(struct drm_device *dev,
>> +                                        struct drm_gem_object *gobj,
>> +                                        int flags);
> Use tabs, not spaces; there is already enough broken tab/space in the
> radeon kernel as it is. I always tried to keep it consistent.

Fixed, thanks for the hint.

>> [SNIP]
>> +static void radeon_bo_mn_release(struct mmu_notifier *mn,
>> +				 struct mm_struct *mm)
>> +{
>> +	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
>> +
>> +	bo->mn.ops = NULL;
>> +}
>> +
>> +static void radeon_bo_mn_invalidate_range_start(struct mmu_notifier *mn,
>> +						struct mm_struct *mm,
>> +						unsigned long start,
>> +						unsigned long end)
>> +{
>> +	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
>> +	int r;
>> +
>> +	/* TODO: optimize! */
>> +
>> +	r = radeon_bo_reserve(bo, true);
>> +	if (r) {
>> +		dev_err(bo->rdev->dev, "(%d) failed to reserve user bo\n", r);
>> +		return;
>> +	}
>> +
>> +	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
>> +	r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
>> +	if (r) {
>> +		dev_err(bo->rdev->dev, "(%d) failed to validate user bo\n", r);
>> +		return;
>> +	}
>> +
>> +	radeon_bo_unreserve(bo);
>> +}
> So you most likely want ttm_bo_validate to be interruptible, but you also
> want to call something like io_schedule so you are not just wasting the
> time slice of the process for the selfishness of the gpu driver alone.
>
> Also, invalidate_range is a callback that is commonly called during the
> process lifetime.

I know that this is horribly inefficient and error prone, but we can't 
make ttm_bo_validate interruptible because invalidate_range_start 
doesn't allow us to return an error number, and so we can't rely on 
userspace repeating the syscall that caused the mapping change in case 
of a signal.

The Intel code does it the same way, and it actually sends a cold shiver 
down my back that this got upstream, but hey, I'm not the one who 
reviewed it and I don't have a better idea either.
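
To make that concrete, this is the callback signature we have to work
with in include/linux/mmu_notifier.h (abridged); it returns void, so an
-ERESTARTSYS from an interruptible ttm_bo_validate() would have nowhere
to go and the syscall that changed the mapping would never be retried:

struct mmu_notifier_ops {
	...
	void (*invalidate_range_start)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end);
	...
};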

> Moreover, nothing prevents a cs that is issued after the range_start
> callback from pinning the buffer again before the range_end callback is
> made. Thus this is extremely racy.

Oh really? Crap, I thought that for all the cases we are interested in 
the mmap_sem would be held between range_start and range_end.
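
If that's really the case, the only thing I can think of right now
(purely a hypothetical sketch, not part of this patch, and on its own it
only detects the race rather than fully closing it) would be a per-BO
invalidation counter: bump it in range_start and have the CS path sample
it before pinning and re-check it afterwards, retrying when an
invalidation sneaked in between:

/* hypothetical, names made up for illustration */
struct radeon_userptr_state {
	atomic_t	invalidate_seq;	/* bumped in invalidate_range_start */
};

static int example_cs_pin(struct radeon_bo *bo,
			  struct radeon_userptr_state *state)
{
	int seq, r;

retry:
	seq = atomic_read(&state->invalidate_seq);

	/* the patch's helper, normally called from ttm_tt_populate */
	r = radeon_ttm_tt_pin_userptr(bo->tbo.ttm);
	if (r)
		return r;

	if (atomic_read(&state->invalidate_seq) != seq) {
		/* an invalidation raced with us, drop the pages and retry */
		radeon_ttm_tt_unpin_userptr(bo->tbo.ttm);
		goto retry;
	}
	return 0;
}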

>> +
>> +static const struct mmu_notifier_ops radeon_bo_notifier = {
>> +	.release = radeon_bo_mn_release,
>> +	.invalidate_range_start = radeon_bo_mn_invalidate_range_start,
>> +};
> What about the invalidate_page callback? You are breaking writeback to
> disk, which is not a welcome move.

Why? Keep in mind that the pages are pinned down as soon as we make the 
first CS which uses them, and marked as dirty when we are done with 
them. In between those two events the pages shouldn't be written back 
to disk.

>> [SNIP]
>> +static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
>> +{
>> +	struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
>> +	struct radeon_ttm_tt *gtt = (void *)ttm;
>> +	unsigned pinned = 0, nents;
>> +	int r;
>> +
>> +	/* prepare the sg table with the user pages */
>> +	if (current->mm != gtt->usermm)
>> +		return -EPERM;
>> +
>> +	do {
>> +		unsigned num_pages = ttm->num_pages - pinned;
>> +		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
>> +		struct page **pages = ttm->pages + pinned;
>> +
>> +		r = get_user_pages(current, current->mm, userptr, num_pages,
>> +				   1, 0, pages, NULL);
>> +		if (r < 0) {
>> +			release_pages(ttm->pages, pinned, 0);
>> +			return r;
>> +		}
>> +		pinned += r;
>> +
>> +	} while (pinned < ttm->num_pages);
>> +
>> +	r = -ENOMEM;
>> +	ttm->sg = kcalloc(1, sizeof(struct sg_table), GFP_KERNEL);
>> +	if (!ttm->sg)
>> +		goto release_pages;
>> +
>> +	r = sg_alloc_table_from_pages(ttm->sg, ttm->pages, ttm->num_pages, 0,
>> +				      ttm->num_pages << PAGE_SHIFT,
>> +				      GFP_KERNEL);
>> +	if (r)
>> +		goto release_sg;
>> +
>> +	r = -ENOMEM;
>> +	nents = dma_map_sg(rdev->dev, ttm->sg->sgl, ttm->sg->nents,
>> +			   DMA_BIDIRECTIONAL);
>> +	if (nents != ttm->sg->nents)
>> +		goto release_sg;
>> +
>> +	return 0;
>> +
>> +release_sg:
>> +	kfree(ttm->sg);
>> +
>> +release_pages:
>> +	release_pages(ttm->pages, pinned, 0);
>> +	return r;
>> +}
> So as said above, the pages backing the user mapping might change. As the
> gup comment explains, there is no guarantee that the pages returned by gup
> are the ones in use to back the address. Well, for things that will be used
> as a read source by the gpu this is not much of an issue, as you can
> probably assume that by the time you pin the bo the user mapping is done
> writing to it.
>
> But the case of the GPU writing to it and then userspace expecting to read
> back through its user space mapping is just a game of luck. You will be
> lucky a lot of the time because the kernel does not migrate pages just for
> the kick of it. But in case it does you will have inconsistency, and users
> of that API will not understand why the hell they are having an issue.

Yeah, completely agree. I was aware of those limitations even before 
starting to work on it, but hoped that registering an 
invalidate_range_start callback like the Intel code does would solve 
the issue (well, that code indeed got upstream).

Unfortunately I'm not the one who decides whether we do it or not; I'm 
just the one who volunteered to figure out how it is possible to do it 
in a clean way.

> This kind of API is really a no-go; the way the mm code works does not
> allow for such a thing. I will go look at the intel code but I fear the
> worst.

Please keep me updated on this; if we can't get a solution for this 
upstream I really need to be able to explain why.

Thanks for the comments,
Christian.

>
> Cheers,
> Jérôme
>
>> +
>> +static void radeon_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
>> +{
>> +	struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
>> +	unsigned i;
>> +
>> +	/* free the sg table and pages again */
>> +	dma_unmap_sg(rdev->dev, ttm->sg->sgl,
>> +		     ttm->sg->nents, DMA_BIDIRECTIONAL);
>> +
>> +	sg_free_table(ttm->sg);
>> +	kfree(ttm->sg);
>> +
>> +	for (i = 0; i < ttm->num_pages; ++i)
>> +		set_page_dirty(ttm->pages[i]);
>> +
>> +	release_pages(ttm->pages, ttm->num_pages, 0);
>> +}
>> +
>>   static int radeon_ttm_tt_populate(struct ttm_tt *ttm)
>>   {
>>   	struct radeon_device *rdev;
>> @@ -599,6 +673,13 @@ static int radeon_ttm_tt_populate(struct ttm_tt *ttm)
>>   	if (ttm->state != tt_unpopulated)
>>   		return 0;
>>   
>> +	if (gtt->userptr) {
>> +		r = radeon_ttm_tt_pin_userptr(ttm);
>> +		if (r)
>> +			return r;
>> +		slave = true;
>> +	}
>> +
>>   	if (slave && ttm->sg) {
>>   		drm_prime_sg_to_page_addr_arrays(ttm->sg, ttm->pages,
>>   						 gtt->ttm.dma_address, ttm->num_pages);
>> @@ -648,6 +729,11 @@ static void radeon_ttm_tt_unpopulate(struct ttm_tt *ttm)
>>   	unsigned i;
>>   	bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG);
>>   
>> +	if (gtt->userptr) {
>> +		radeon_ttm_tt_unpin_userptr(ttm);
>> +		return;
>> +	}
>> +
>>   	if (slave)
>>   		return;
>>   
>> @@ -676,6 +762,28 @@ static void radeon_ttm_tt_unpopulate(struct ttm_tt *ttm)
>>   	ttm_pool_unpopulate(ttm);
>>   }
>>   
>> +int radeon_ttm_tt_set_userptr(struct ttm_tt *ttm, uint64_t userptr)
>> +{
>> +	struct radeon_ttm_tt *gtt = (void *)ttm;
>> +
>> +	if (gtt == NULL)
>> +		return -EINVAL;
>> +
>> +	gtt->userptr = userptr;
>> +	gtt->usermm = current->mm;
>> +	return 0;
>> +}
>> +
>> +bool radeon_ttm_tt_has_userptr(struct ttm_tt *ttm)
>> +{
>> +	struct radeon_ttm_tt *gtt = (void *)ttm;
>> +
>> +	if (gtt == NULL)
>> +		return false;
>> +
>> +	return !!gtt->userptr;
>> +}
>> +
>>   static struct ttm_bo_driver radeon_bo_driver = {
>>   	.ttm_tt_create = &radeon_ttm_tt_create,
>>   	.ttm_tt_populate = &radeon_ttm_tt_populate,
>> diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h
>> index 1cc0b61..54e7421 100644
>> --- a/include/uapi/drm/radeon_drm.h
>> +++ b/include/uapi/drm/radeon_drm.h
>> @@ -511,6 +511,7 @@ typedef struct {
>>   #define DRM_RADEON_GEM_BUSY		0x2a
>>   #define DRM_RADEON_GEM_VA		0x2b
>>   #define DRM_RADEON_GEM_OP		0x2c
>> +#define DRM_RADEON_GEM_IMPORT		0x2d
>>   
>>   #define DRM_IOCTL_RADEON_CP_INIT    DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_CP_INIT, drm_radeon_init_t)
>>   #define DRM_IOCTL_RADEON_CP_START   DRM_IO(  DRM_COMMAND_BASE + DRM_RADEON_CP_START)
>> @@ -554,6 +555,7 @@ typedef struct {
>>   #define DRM_IOCTL_RADEON_GEM_BUSY	DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy)
>>   #define DRM_IOCTL_RADEON_GEM_VA		DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va)
>>   #define DRM_IOCTL_RADEON_GEM_OP		DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op)
>> +#define DRM_IOCTL_RADEON_GEM_IMPORT	DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_IMPORT, struct drm_radeon_gem_import)
>>   
>>   typedef struct drm_radeon_init {
>>   	enum {
>> @@ -806,6 +808,13 @@ struct drm_radeon_gem_create {
>>   	uint32_t	flags;
>>   };
>>   
>> +struct drm_radeon_gem_import {
>> +	uint64_t		addr;
>> +	uint64_t		size;
>> +	uint32_t		flags;
>> +	uint32_t		handle;
>> +};
>> +
>>   #define RADEON_TILING_MACRO				0x1
>>   #define RADEON_TILING_MICRO				0x2
>>   #define RADEON_TILING_SWAP_16BIT			0x4
>> -- 
>> 1.9.1
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


