[RFC PATCH 2/2] drm: etnaviv: Unmap gems on gem_close

Fri Oct 30 14:17:35 UTC 2020

Hi,
On Fri, Oct 30, 2020 at 10:19:54AM +0100, Lucas Stach wrote:
> Am Donnerstag, den 29.10.2020, 19:20 +0100 schrieb Daniel Vetter:
> > On Thu, Oct 29, 2020 at 03:38:21PM +0100, Lucas Stach wrote:
> > > Hi Guido,
> > > 
> > > Am Donnerstag, den 29.10.2020, 15:20 +0100 schrieb Guido Günther:
> > > > So far the unmap from gpu address space only happened when dropping the
> > > > last ref in gem_free_object_unlocked, however that is skipped if there's
> > > > still multiple handles to the same GEM object.
> > > > 
> > > > Since userspace (here mesa) in the case of softpin hands back the memory
> > > > region to the pool of available GPU virtual memory closing the handle
> > > > via DRM_IOCTL_GEM_CLOSE this can lead to etnaviv_iommu_insert_exact
> > > > failing later since userspace thinks the vaddr is available while the
> > > > kernel thinks it isn't making the submit fail like
> > > > 
> > > >   [E] submit failed: -14 (No space left on device) (etna_cmd_stream_flush:244)
> > > > 
> > > > Fix this by unmapping the memory via the .gem_close_object callback.
> > > > 
> > > > Signed-off-by: Guido Günther <agx at sigxcpu.org>
> > > > ---
> > > >  drivers/gpu/drm/etnaviv/etnaviv_drv.c |  1 +
> > > >  drivers/gpu/drm/etnaviv/etnaviv_drv.h |  1 +
> > > >  drivers/gpu/drm/etnaviv/etnaviv_gem.c | 32 +++++++++++++++++++++++++++
> > > >  3 files changed, 34 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > > > index a9a3afaef9a1..fdcc6405068c 100644
> > > > --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > > > +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > > > @@ -491,6 +491,7 @@ static struct drm_driver etnaviv_drm_driver = {
> > > >  	.open               = etnaviv_open,
> > > >  	.postclose           = etnaviv_postclose,
> > > >  	.gem_free_object_unlocked = etnaviv_gem_free_object,
> > > > +	.gem_close_object   = etnaviv_gem_close_object,
> > > >  	.gem_vm_ops         = &vm_ops,
> > > >  	.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
> > > >  	.prime_fd_to_handle = drm_gem_prime_fd_to_handle,
> > > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.h b/drivers/gpu/drm/etnaviv/etnaviv_drv.h
> > > > index 4d8dc9236e5f..2226a9af0d63 100644
> > > > --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.h
> > > > +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.h
> > > > @@ -65,6 +65,7 @@ int etnaviv_gem_cpu_prep(struct drm_gem_object *obj, u32 op,
> > > >  		struct drm_etnaviv_timespec *timeout);
> > > >  int etnaviv_gem_cpu_fini(struct drm_gem_object *obj);
> > > >  void etnaviv_gem_free_object(struct drm_gem_object *obj);
> > > > +void etnaviv_gem_close_object(struct drm_gem_object *obj, struct drm_file *file);
> > > >  int etnaviv_gem_new_handle(struct drm_device *dev, struct drm_file *file,
> > > >  		u32 size, u32 flags, u32 *handle);
> > > >  int etnaviv_gem_new_userptr(struct drm_device *dev, struct drm_file *file,
> > > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> > > > index f06e19e7be04..5aec4a23c77e 100644
> > > > --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> > > > +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> > > > @@ -515,6 +515,38 @@ static const struct etnaviv_gem_ops etnaviv_gem_shmem_ops = {
> > > >  	.mmap = etnaviv_gem_mmap_obj,
> > > >  };
> > > >  
> > > > +void etnaviv_gem_close_object(struct drm_gem_object *obj, struct drm_file *unused)
> > > > +{
> > > > +	struct etnaviv_gem_object *etnaviv_obj = to_etnaviv_bo(obj);
> > > > +	struct etnaviv_vram_mapping *mapping, *tmp;
> > > > +
> > > > +	/* Handle this via etnaviv_gem_free_object */
> > > > +	if (obj->handle_count == 1)
> > > > +		return;
> > > > +
> > > > +	WARN_ON(is_active(etnaviv_obj));
> > > > +
> > > > +	/*
> > > > +	 * userspace wants to release the handle so we need to remove
> > > > +	 * the mapping from the gpu's virtual address space to stay
> > > > +	 * in sync.
> > > > +	 */
> > > > +	list_for_each_entry_safe(mapping, tmp, &etnaviv_obj->vram_list,
> > > > +				 obj_node) {
> > > > +		struct etnaviv_iommu_context *context = mapping->context;
> > > > +
> > > > +		WARN_ON(mapping->use);
> > > > +
> > > > +		if (context) {
> > > > +			etnaviv_iommu_unmap_gem(context, mapping);
> > > > +			etnaviv_iommu_context_put(context);
> > > 
> > > I see the issue you are trying to fix here, but this is not a viable
> > > fix. While userspace may close the handle, the GPU may still be
> > > processing jobs referening the BO, so we can't just unmap it from the
> > > address space.
> > > 
> > > I think what we need to do here is waiting for the current jobs/fences
> > > of the BO when we hit the case where userspace tries to assign a new
> > > address to the BO. Basically wait for current jobs -> unamp from the
> > > address space -> map at new userspace assigned address.
> > 
> > Yeah was about to say the same. There's two solutions to this:
> > - let the kernel manage the VA space. This is what amdgpu does in some
> >   cases (but still no relocations)
> > - pipeline the VA/PTE updates in your driver, because userspace has a
> >   somewhat hard time figuring out when a buffer is done. Doing that would
> >   either at complexity or stalls when the kernel is doing all the tracking
> >   already anyway. Minimal fix is to do what Lucas explained above, but
> >   importantly with the kernel solution we have the option to fully
> >   pipeline everything and avoid stalls. I think this is what everyone else
> >   who lets userspace manage VA does in their kernel side.
> 
> I thought a bit more about this and the issue is only about userspace
> reusing regions of its _private_ address space, before they are GPU
> inactive. So the problem of tracking BO active for figuring out when it
> is safe to close the handle (and thus forget about the last placement
> of the BO in the AS) goes from "is this BO globally active anywhere?"
> to "is this BO still active in one of the jobs I submitted?", which is
> not hard at all for the userspace to figure out.

It's about private address space, yes.

> With this in mind I think we can reasonably declare the kernel behavior
> of rejecting the submit as okay and just add the tiny bit of userspace
> fencing needed for the userspace driver to not close the handle until
> the last submit referencing this BO has finished.

Sounds reasonable, i'll have a look.
Cheers,
 -- Guido

> 
> Regards,
> Lucas
>