vm binding interfaces and parallel with mmap

Wed Aug 24 16:14:26 UTC 2022

On Mon, Aug 22, 2022 at 8:27 AM Christian König <christian.koenig at amd.com>
wrote:

> Am 22.08.22 um 10:34 schrieb Bas Nieuwenhuizen:
> > On Mon, Aug 22, 2022 at 9:28 AM Dave Airlie <airlied at gmail.com> wrote:
> >> On Mon, 22 Aug 2022 at 17:05, Dave Airlie <airlied at gmail.com> wrote:
> >>> Hey,
> >>>
> >>> I've just been looking at the vm bind type interfaces and wanted to at
> >>> least document how we think the unmapping API should work. I know I've
> >>> talked on irc before about this, but wanted to solidify things a bit
> >>> more around what is required vs what is a nice to have.
> >>>
> >>> My main concerns/thoughts are around the unbind interfaces and how
> >>> close to munmap they should be.
> >>>
> >>> I think the mapping operation is mostly consistent
> >>> MAP(bo handle, offset into bo, range, VM offset, VM flags)
> >>> which puts the range inside to bo at the offset in the current VM
> >>> (maybe take an optional vm_id).
> >>>
> >>> now the simplest unmap I can see if one that parallel munmap
> >>> UNMAP(vmaddr, range);
> >>>
> >>> But it begs the question on then how much the kernel needs to deal
> >>> with here, if we support random vmaddr,range then we really need to be
> >>> able to do everything munmap does for CPU VMA, which means splitting
> >>> ranges, joining ranges etc.
> >>>
> >>> like
> >>> MAP(1, 0, 0x8000, 0xc0000)
> >>> UNMAP(0xc1000, 0x1000)
> >>> should that be possible?
> >>>
> >>> Do we have any API usage (across Vulkan/CL/CUDA/ROCm etc) that
> >>> requires this sort of control, or should we be fine with only
> >>> unmapping objects exactly like how they were mapped in the first
> >>> place, and not have any splitting/joining?
> > Vulkan allows for this, though I haven't checked to what extent apps use
> it.
>
> This is massively used for partial resident textures under OpenGL as far
> as I know.
>
> E.g. you map a range like 1->10 as PRT and then then map real textures
> at 2, 5 and 7 or something like that.
>
> Saying that a functionality to map/enable PRT for a range is necessary
> as well. On amdgpu we have a special flag for that and in this case the
> BO to map can be NULL.
>

NVIDIA has similar bits.  I don't remember if intel does or not.  Handling
this as a map with BO=0 and a PRT flag of some sort seems like a perfectly
reasonable way to handle it.

> > We could technically split all mapping/unmapping to be per single tile
> > in the userspace driver, which avoids the need for splitting/merging,
> > but that could very much be a pessimization.
>
> That would be pretty much a NAK from my side. A couple of hardware
> optimizations require mappings to be as large as possible.
>
> Otherwise we wouldn't be able to use huge/giant (2MiB, 1GiB) pages,
> power of two TLB reach optimizations (8KiB, 16KiB, 32KiB.....) as well
> as texture fetcher optimizations.
>

Agreed.  Intel has such optimizations as well and they really do matter.
IDK about nvidia but I'd be surprised if they don't at least have a 2M
variant or something.  Reducing page-table depth matters a lot for latency.

> >> I suppose it also asks the question around paralleling
> >>
> >> fd = open()
> >> ptr = mmap(fd,)
> >> close(fd)
> >> the mapping is still valid.
> >>
> >> I suppose our equiv is
> >> handle = bo_alloc()
> >> gpu_addr = vm_bind(handle,)
> >> gem_close(handle)
> >> is the gpu_addr still valid does the VM hold a reference on the kernel
> >> bo internally.
> > For Vulkan it looks like this is undefined and the above is not
> necessary:
> >
> > "It is important to note that freeing a VkDeviceMemory object with
> > vkFreeMemory will not cause resources (or resource regions) bound to
> > the memory object to become unbound. Applications must not access
> > resources bound to memory that has been freed."
> > (32.7.6)
>

I'm not sure about this particular question.  We need to be sure that maps
get cleaned up eventually.  On the one hand, I think it's probably a valid
API implementation to have each mapped page hold a reference similar to
mmap and have vkDestroyImage or vkDestroyBuffer do an unmap to clean up the
range.  However, clients may be surprised when they destroy a large memory
object and can't reap the memory because of extra BO references they don't
know about.  If BOs unmap themselves on close or if we had a way to take a
VM+BO and say "unmap this BO from everywhere, please", we can clean up the
memory pretty easily.  Without that, it's a giant PITA to do entirely
inside the userspace driver because it requires us to globally track every
mapping and that means data structures and locks.  Yes, such an ioctl would
require the kernel to track things but the kernel is already tracking
everything that's bound, so hopefully it doesn't add much.

--Jason

> Additional to what was discussed here so far we need an array on in and
> out drm_syncobj for both map as well as unmap.
>
> E.g. when the mapping/unmapping should happen and when it is completed
> etc...
>
> Christian.
>
> >
> >
> >> Dave.
> >>> Dave.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20220824/4a682550/attachment.htm>