[Mesa-dev] [PATCH 00/11] [RFC v2] Solve the mapping bug

Sat Jun 21 13:11:18 PDT 2014

On Sat, 2014-06-21 at 17:39 +0200, Francisco Jerez wrote:
> Bruno Jiménez <brunojimen at gmail.com> writes:
> 
> > Hi,
> >
> > This is my second attempt to fix the mapping bug adding all the
> > suggestions that Tom Stellard sent, and, so far, it seems that
> > it is resolved.
> >
> > This series changes completely how OpenCL buffers are handled
> > by the r600g driver. Before this, we would add them directly to
> > a pool, and this pool would grow whenever we needed more space.
> > But this process implied destroying the pool and creating a new
> > one. There could be cases where a buffer would be mapped and
> > the pool would grow, leaving one side of the mapping pointed
> > to where the item was. This is the 'mapping bug'
> >
> > Now, Items will have an intermediate resource, where all mappings
> > can be done, and when a buffer is going to be used with a kernel
> > it is promoted to the pool. In the case where a promoted item
> > is going to be mapped, it is previously demoted, so even if
> > the pool changes its location due to growing, the map remains
> > valid. In the case of a buffer mapped for reading, and used
> > by a kernel to read from it, we will duplicate this buffer,
> > having the intermediate buffer, where the user has its map, and
> > an item in the pool, which is the one that the kernel is going
> > to use.
> >
> > As a summary for v2:
> > Patches 1-8: These are the main part of the series, and solve
> >     the mapping bug.
> >     Patches 1 and 7 now use less explicit castings
> >     Patch 2 is new and introduces the 'is_item_in_pool'
> >         function, which is used in patches 3 and 8
> >
> > Patch 9: Is a complete rewrite of v1 patch 8 using gallium
> >     utils for double lists
> >
> > Patches 10 and 11: These are just a proof of concept for avoiding
> >     transfers GPU <-> GPU when using all CL Read/Write functions.
> >     They are v1 patch 9 splited in two to separate r600g changes
> >     from clover changes.
> >     Now, in clover's side it introduces and uses
> >     'CLOVER_TRANSFER_MAP_DIRECTLY' so it doesen't collide with
> >     any other OpenCL flag.
> >
> > Please review and Thanks :)
> >
> > Bruno Jiménez (11):
> >   r600g/compute: Add an intermediate resource for OpenCL buffers
> >   r600g/compute: Add an util function to know if an item is in the pool
> >   r600g/compute: Add statuses to the compute_memory_items
> >   r600g/compute: divide the item list in two
> >   r600g/compute: Only move to the pool the buffers marked for promoting
> >   r600g/compute: Avoid problems when promoting items mapped for reading
> >   r600g/compute: Implement compute_memory_demote_item
> >   r600g/compute: Map only against intermediate buffers
> >   r600g/compute: Use gallium util functions for double lists
> >   r600g/compute: Map directly the pool in some cases
> >   clover: Use PIPE_TRANSFER_MAP_DIRECTLY when writing/reading buffers
> >
> >  src/gallium/drivers/r600/compute_memory_pool.c     | 294 ++++++++++++---------
> >  src/gallium/drivers/r600/compute_memory_pool.h     |  31 ++-
> >  src/gallium/drivers/r600/evergreen_compute.c       |  38 ++-
> >  src/gallium/state_trackers/clover/api/transfer.cpp |   4 +-
> >  src/gallium/state_trackers/clover/core/object.hpp  |   4 +
> >  .../state_trackers/clover/core/resource.cpp        |   2 +
> >  6 files changed, 233 insertions(+), 140 deletions(-)
> >
> > -- 
> > 2.0.0
> >
> 
> Hi Bruno, 
> 
> I'm sorry for having taken so long to comment on this, I haven't had
> time to read through your patch series carefully until last night.

Hi Francisco,

It's Ok, we can't be always here to review everything :)

> I'm not convinced that this is going to be a satisfactory solution, it
> punishes well-behaving applications by duplicating memory usage under
> some circumstances, and by slowing down buffer mapping with (in most
> cases) unnecessary shadow copies: Even though using a buffer while
> mapped is something to keep in mind because it's allowed by the spec
> with some restrictions, I think it's reasonable to assume that we're
> only going to see a comparatively small number of active buffer mappings
> during the execution of any compute kernel, so it seems unfortunate to
> me that the rest of mappings will have to pay for this one corner case.

I have to say that I understand perfectly your concerns about memory
duplication and slow mappings. In fact, the memory duplication is mainly
caused by the possibility of having a buffer mapped for reading that a
kernel uses + relocation of the pool. And the slow mappings are caused
by any map that could stay alive while a kernel is launched +
relocation.

For the OpenCL code that I have seen, I haven't yet encountered one that
left a buffer mapped for reading and at the same time launched a kernel
that reads from it. But I do have encountered programs that left buffers
mapped while they  launched kernels that don't touch those buffers. But
I agree that punishing mappings made simply for read/write between
kernel executions is something that we have to find a way to avoid.

That's mainly why I tried to avoid the copies in the case of simple
Read/Writes without mappings with patches 10-11.

> The implementation of PIPE_TRANSFER_MAP_DIRECTLY introduced in PATCH 10
> has somewhat worrying semantics: A mapping with this flag might become
> stale unpredictably if a kernel is run, maybe from a different command
> queue.  Clover's transfer functions don't hit that path right now on
> single-threaded applications, but they might in the future as we start
> accelerating the APIs we currently implement with soft_copy_op().  This
> is a bug IMHO: even direct mappings should last until the corresponding
> unmap call.

I think I'm not fully understanding you here. I tried to use
PIPE_TRANSFER_MAP_DIRECTLY only with clEnqueue{Write,Read} functions,
which map the resource, copy it and unmap it when finished. Is it
possible for another kernel to access the memory of a buffer that is
being read/written?

I had no intention of having user mappings made with that flag.
[Although a possible solution, with a lot of warnings of course, for the
avobe problem could be to allow a user to use this flag]

> I'm not advocating a revert of the series because it fixes a serious
> bug, but please don't push patches 10-11, we should probably start
> looking for a different solution.  Some suggestions are:

I also asked for them to not to be pushed. And with your reasons, until
we find a better way or we change how buffers are handled, I won't
propose them again.

>  - Why do you even need a pool?  Wouldn't it be possible to create a
>    huge RAT, e.g. covering a 4GB portion of the GPU memory and then use
>    a special memory domain or some sort of flag to tell the kernel to
>    allocate a buffer from that region (or relocate if it's already been
>    allocated elsewhere)?  This is especially easy on hardware with
>    virtual memory, as you could simply reserve an arbitrarily large
>    block of virtual memory, bind it as e.g. RAT0, and then map other
>    buffer objects into the block on-demand as they're bound to the
>    compute pipeline -- There would be no need to move the actual bits
>    around.  This is similar to the approach I used in my original
>    proof-of-concept implementation of the compute API on nv50.

This is one of the things I have been wondering recently, given that
radeonsi doesn't use a pool, why r600 needs one? I still have to
understand AMD docs and how *exactly* everything works.

4GB seems like a big amount of memory for me, my little cedar has only
512MB :)

>  - If you insist on using a pool, you could (mostly) avoid the storage
>    duplication and the mapping copies by allocating buffer objects
>    directly from the pool as it was before this series, and then keep
>    some sort of reference count specific to the pool storage that would
>    be incremented on map and decremented on unmap.  Once you need to
>    grow the pool you'd keep the old storage around temporarily and
>    migrate buffers to the new storage lazily as they are required or
>    unmapped.  Once the reference count drops to zero you'd be free to
>    release the backing BO to the system.  The fact that you'd keep both
>    storage buffers around for a bit means that you'd be able to use DMA
>    to migrate the pool contents instead of the CPU copies you're doing
>    now, which is likely to be substantially more efficient.

I see how this would solve the slow mappings problem, but I think that
it could mean a higher memory usage. In the case of a user creating some
buffers, mapping one of them and them adding more so that the pool has
to grow, we would have to keep the full size of the old pool just for a
buffer, plus the new pool.

The problem of growing the pool using resource_copy_region instead of
mappings was already in my mind and it was something that I wanted to do
after solving this bug. :)

>  - On hardware with virtual memory, I guess you could also devise a
>    copy-less mechanism for growing a given buffer object (used as pool
>    storage) by mapping additional pages at the end of the same region
>    without changing its virtual address, assuming you know the maximum
>    size of the pool beforehand.

The final size of the pool is (will be when I implement the defrag
function, for now you can have a low and high limit) known when all the
needed items are going to be added, just before launching the kernel.

> Heh, sorry for the long rant.  Are you on IRC by the way?

No need to be sorry, these rants may be instructive to me. It is almost
obvious to me that you have a wider vision of the hardware
possibilities/clover/OpenCL than me. And I'd be thankful for any
suggestion you have.

Yes, I'm usually on IRC under the nick 'Noxbru' :)

If you need me for anything, just ask.

Thanks for all!
Bruno

> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev