[Mesa-dev] [PATCH 00/11] [RFC v2] Solve the mapping bug

Tom Stellard tom at stellard.net
Mon Jun 23 13:30:46 PDT 2014

On Sun, Jun 22, 2014 at 04:05:49PM +0200, Francisco Jerez wrote:
> Bruno Jimenez <brunojimen at gmail.com> writes:
> > On Sat, 2014-06-21 at 17:39 +0200, Francisco Jerez wrote:
> >[...]
> >> The implementation of PIPE_TRANSFER_MAP_DIRECTLY introduced in PATCH 10
> >> has somewhat worrying semantics: A mapping with this flag might become
> >> stale unpredictably if a kernel is run, maybe from a different command
> >> queue.  Clover's transfer functions don't hit that path right now on
> >> single-threaded applications, but they might in the future as we start
> >> accelerating the APIs we currently implement with soft_copy_op().  This
> >> is a bug IMHO: even direct mappings should last until the corresponding
> >> unmap call.
> >
> > I think I'm not fully understanding you here. I tried to use
> > PIPE_TRANSFER_MAP_DIRECTLY only with clEnqueue{Write,Read} functions,
> > which map the resource, copy it and unmap it when finished. Is it
> > possible for another kernel to access the memory of a buffer that is
> > being read/written?
> >
> AFAICT, yes.  All command queues created on the same device share the
> same memory pool, so a kernel being queued for execution in one could
> invalidate a concurrent mapping done with PIPE_TRANSFER_MAP_DIRECTLY by
> one of the transfer functions.  On top of that the transfer functions
> might start queuing kernels themselves in the future to accelerate
> certain operations we currently do on the CPU, which would make this
> scenario more likely.
> > I had no intention of having user mappings made with that flag.
> > [Although a possible solution, with a lot of warnings of course, for the
> > above problem could be to allow a user to use this flag]
> >
> >> I'm not advocating a revert of the series because it fixes a serious
> >> bug, but please don't push patches 10-11, we should probably start
> >> looking for a different solution.  Some suggestions are:
> >
> > I also asked for them not to be pushed. And given your reasons, until
> > we find a better way or we change how buffers are handled, I won't
> > propose them again.
> >
> >>  - Why do you even need a pool?  Wouldn't it be possible to create a
> >>    huge RAT, e.g. covering a 4GB portion of the GPU memory and then use
> >>    a special memory domain or some sort of flag to tell the kernel to
> >>    allocate a buffer from that region (or relocate if it's already been
> >>    allocated elsewhere)?  This is especially easy on hardware with
> >>    virtual memory, as you could simply reserve an arbitrarily large
> >>    block of virtual memory, bind it as e.g. RAT0, and then map other
> >>    buffer objects into the block on-demand as they're bound to the
> >>    compute pipeline -- There would be no need to move the actual bits
> >>    around.  This is similar to the approach I used in my original
> >>    proof-of-concept implementation of the compute API on nv50.
> >
> > This is one of the things I have been wondering about recently: given
> > that radeonsi doesn't use a pool, why does r600 need one? I still have
> > to understand AMD's docs and how *exactly* everything works.
> >
> Probably because on SI compute kernels can access random locations of
> memory without going through an RAT?  I have little actual experience
> with radeons, Tom should know the low-level details.

The reason there is no memory pool in radeonsi is that SI and newer support
virtual memory, so there is already one contiguous address space, and also
that there is no limit on the number of resources that can be accessed by
a shader.

> > 4GB seems like a big amount of memory for me, my little cedar has only
> > 512MB :)
> >
> >>  - If you insist on using a pool, you could (mostly) avoid the storage
> >>    duplication and the mapping copies by allocating buffer objects
> >>    directly from the pool as it was before this series, and then keep
> >>    some sort of reference count specific to the pool storage that would
> >>    be incremented on map and decremented on unmap.  Once you need to
> >>    grow the pool you'd keep the old storage around temporarily and
> >>    migrate buffers to the new storage lazily as they are required or
> >>    unmapped.  Once the reference count drops to zero you'd be free to
> >>    release the backing BO to the system.  The fact that you'd keep both
> >>    storage buffers around for a bit means that you'd be able to use DMA
> >>    to migrate the pool contents instead of the CPU copies you're doing
> >>    now, which is likely to be substantially more efficient.
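The refcounting scheme described above can be sketched in plain C.  This is
a hypothetical illustration, not Mesa code: `struct pool_storage`,
`pool_grow()` and the other names are made up, `malloc` stands in for BO
allocation, and the DMA migration is only marked by a comment.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: pool storage with a per-storage map reference
 * count.  Growing the pool allocates new storage and keeps the old one
 * alive until its last mapping goes away, so active CPU mappings never
 * become stale and the live contents can be migrated by DMA. */
struct pool_storage {
    void *bo;           /* stand-in for the backing buffer object */
    unsigned map_refs;  /* outstanding user mappings into this storage */
};

struct pool {
    struct pool_storage *cur;
    struct pool_storage *old;   /* non-NULL while old mappings persist */
};

static struct pool_storage *storage_alloc(size_t size)
{
    struct pool_storage *s = calloc(1, sizeof(*s));
    s->bo = malloc(size);
    return s;
}

static void storage_release(struct pool_storage *s)
{
    free(s->bo);
    free(s);
}

static void *pool_map(struct pool_storage *s)
{
    s->map_refs++;
    return s->bo;
}

static void pool_unmap(struct pool *p, struct pool_storage *s)
{
    assert(s->map_refs > 0);
    if (--s->map_refs == 0 && s == p->old) {
        storage_release(s);     /* last mapping gone: free the old BO */
        p->old = NULL;
    }
}

static void pool_grow(struct pool *p, size_t new_size)
{
    struct pool_storage *ns = storage_alloc(new_size);
    /* A DMA blit of the live contents would go here. */
    if (p->cur->map_refs == 0) {
        storage_release(p->cur);
    } else {
        assert(p->old == NULL); /* sketch: at most one old storage */
        p->old = p->cur;        /* defer release until the last unmap */
    }
    p->cur = ns;
}
```

Since both storages are alive during the transition, the migration copy can
be queued as a DMA transfer rather than done with the CPU.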
> >
> > I see how this would solve the slow mappings problem, but I think that
> > it could mean higher memory usage. In the case of a user creating some
> > buffers, mapping one of them and then adding more so that the pool has
> > to grow, we would have to keep the full size of the old pool just for
> > one buffer, plus the new pool.
> >
> That's a fair point, this solution would only get rid of the extra
> copying but it wouldn't solve the memory usage problem in some situations
> (long-lived mappings).  IMHO the former is more worrying because it has
> an impact on every map operation no matter what, while the increased
> memory usage will only have an impact in situations of heavy memory
> pressure -- And even in those cases the kernel might be able to avoid an
> OOM situation by evicting the old pool storage from graphics memory
> (which, I must admit, might be rather punishing too), as we can be sure
> that there will be no further references from the command stream to the
> old storage once it's reallocated and its useful contents are copied
> over to the new storage.
> I think there's another variant of this solution that would solve the
> storage duplication problem assuming that most long-lived mappings are
> read-only (long-lived RW mappings aren't very useful AFAICT, because
> concurrent access to them from the device has unspecified effects
> according to the spec).  You'd keep a reference count of write mappings,
> when the pool is reallocated you'd migrate unmapped and read-mapped
> memory objects to the new pool using DMA as before, but once the
> reference count drops to zero you'd call mmap() with MAP_FIXED providing
> the same CPU address you had before to swap the pages of any active read
> mappings atomically with the new pool, then you'd be able to release the
> old pool storage immediately.  It may not be worth the hassle though...
> > The problem of growing the pool using resource_copy_region instead of
> > mappings was already on my mind, and it was something that I wanted to do
> > after solving this bug. :)
> >
> >>  - On hardware with virtual memory, I guess you could also devise a
> >>    copy-less mechanism for growing a given buffer object (used as pool
> >>    storage) by mapping additional pages at the end of the same region
> >>    without changing its virtual address, assuming you know the maximum
> >>    size of the pool beforehand.
> >
> > The final size of the pool is known (well, it will be once I implement
> > the defrag function; for now you can have a low and a high limit) when
> > all the needed items have been added, just before launching the kernel.
> >
> Well, or you could just reserve a virtual memory block of the maximum
> RAT size, if your address space is large enough.
> >[...]
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
