[Mesa-dev] [PATCH 00/11] [RFC v2] Solve the mapping bug

Sun Jun 22 07:05:49 PDT 2014

Bruno Jimenez <brunojimen at gmail.com> writes:

> On Sat, 2014-06-21 at 17:39 +0200, Francisco Jerez wrote:
>[...]
>> The implementation of PIPE_TRANSFER_MAP_DIRECTLY introduced in PATCH 10
>> has somewhat worrying semantics: A mapping with this flag might become
>> stale unpredictably if a kernel is run, maybe from a different command
>> queue.  Clover's transfer functions don't hit that path right now on
>> single-threaded applications, but they might in the future as we start
>> accelerating the APIs we currently implement with soft_copy_op().  This
>> is a bug IMHO: even direct mappings should last until the corresponding
>> unmap call.
>
> I think I'm not fully understanding you here. I tried to use
> PIPE_TRANSFER_MAP_DIRECTLY only with clEnqueue{Write,Read} functions,
> which map the resource, copy it and unmap it when finished. Is it
> possible for another kernel to access the memory of a buffer that is
> being read/written?
>
AFAICT, yes.  All command queues created on the same device share the
same memory pool, so a kernel being queued for execution in one could
invalidate a concurrent mapping done with PIPE_TRANSFER_MAP_DIRECTLY by
one of the transfer functions.  On top of that the transfer functions
might start queuing kernels themselves in the future to accelerate
certain operations we currently do on the CPU, which would make this
scenario more likely.

> I had no intention of having user mappings made with that flag.
> [Although a possible solution, with a lot of warnings of course, for the
> avobe problem could be to allow a user to use this flag]
>
>> I'm not advocating a revert of the series because it fixes a serious
>> bug, but please don't push patches 10-11, we should probably start
>> looking for a different solution.  Some suggestions are:
>
> I also asked for them to not to be pushed. And with your reasons, until
> we find a better way or we change how buffers are handled, I won't
> propose them again.
>
>>  - Why do you even need a pool?  Wouldn't it be possible to create a
>>    huge RAT, e.g. covering a 4GB portion of the GPU memory and then use
>>    a special memory domain or some sort of flag to tell the kernel to
>>    allocate a buffer from that region (or relocate if it's already been
>>    allocated elsewhere)?  This is especially easy on hardware with
>>    virtual memory, as you could simply reserve an arbitrarily large
>>    block of virtual memory, bind it as e.g. RAT0, and then map other
>>    buffer objects into the block on-demand as they're bound to the
>>    compute pipeline -- There would be no need to move the actual bits
>>    around.  This is similar to the approach I used in my original
>>    proof-of-concept implementation of the compute API on nv50.
>
> This is one of the things I have been wondering recently, given that
> radeonsi doesn't use a pool, why r600 needs one? I still have to
> understand AMD docs and how *exactly* everything works.
>

Probably because on SI compute kernels can access random locations of
memory without going through an RAT?  I have little actual experience
with radeons, Tom should know the low-level details.

> 4GB seems like a big amount of memory for me, my little cedar has only
> 512MB :)
>
>>  - If you insist on using a pool, you could (mostly) avoid the storage
>>    duplication and the mapping copies by allocating buffer objects
>>    directly from the pool as it was before this series, and then keep
>>    some sort of reference count specific to the pool storage that would
>>    be incremented on map and decremented on unmap.  Once you need to
>>    grow the pool you'd keep the old storage around temporarily and
>>    migrate buffers to the new storage lazily as they are required or
>>    unmapped.  Once the reference count drops to zero you'd be free to
>>    release the backing BO to the system.  The fact that you'd keep both
>>    storage buffers around for a bit means that you'd be able to use DMA
>>    to migrate the pool contents instead of the CPU copies you're doing
>>    now, which is likely to be substantially more efficient.
>
> I see how this would solve the slow mappings problem, but I think that
> it could mean a higher memory usage. In the case of a user creating some
> buffers, mapping one of them and them adding more so that the pool has
> to grow, we would have to keep the full size of the old pool just for a
> buffer, plus the new pool.
>

That's a fair point, this solution would only get rid of the extra
copying but it wouldn't solve memory usage problem in some situations
(long-lived mappings).  IMHO the former is more worrying because it has
an impact on every map operation no matter what, while the increased
memory usage will only have an impact in situations of heavy memory
pressure -- And even in those cases the kernel might be able to avoid an
OOM situation by evicting the old pool storage from graphics memory
(which, I must admit, might be rather punishing too), as we can be sure
that there will be no further references from the command stream to the
old storage once it's reallocated and its useful contents are copied
over to the new storage.

I think there's another variant of this solution that would solve the
storage duplication problem assuming that most long-lived mappings are
read-only (long-lived RW mappings aren't very useful AFAICT, because
concurrent access to them from the device has unspecified effects
according to the spec).  You'd keep a reference count of write mappings,
when the pool is reallocated you'd migrate unmapped and read-mapped
memory objects to the new pool using DMA as before, but once the
reference count drops to zero you'd call mmap() with MAP_FIXED providing
the same CPU address you had before to swap the pages of any active read
mappings atomically with the new pool, then you'd be able to release the
old pool storage immediately.  It may not be worth the hassle though...

> The problem of growing the pool using resource_copy_region instead of
> mappings was already in my mind and it was something that I wanted to do
> after solving this bug. :)
>
>>  - On hardware with virtual memory, I guess you could also devise a
>>    copy-less mechanism for growing a given buffer object (used as pool
>>    storage) by mapping additional pages at the end of the same region
>>    without changing its virtual address, assuming you know the maximum
>>    size of the pool beforehand.
>
> The final size of the pool is (will be when I implement the defrag
> function, for now you can have a low and high limit) known when all the
> needed items are going to be added, just before launching the kernel.
>
Well, or you could just reserve a virtual memory block of the maximum
RAT size, if your address space is large enough.

>[...]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 229 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20140622/7fd5326c/attachment-0001.sig>