[Mesa-dev] [PATCH 2/2] winsys/amdgpu: explicitly declare whether buffer_map is permanent or not
Marek Olšák
maraeo at gmail.com
Fri Nov 23 19:09:49 UTC 2018
On Thu, Nov 22, 2018 at 6:24 AM Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 21.11.18 21:27, Marek Olšák wrote:
> > On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> >
> > From: Nicolai Hähnle <nicolai.haehnle at amd.com>
> >
> > Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
> > that specifies whether the caller will use buffer_unmap or not. The
> > default behavior is set to permanent maps, because that's what drivers
> > do for Gallium buffer maps.
> >
> > This should eliminate the need for hacks in libdrm. Assertions are added
> > to catch when the buffer_unmap calls don't match the (temporary)
> > buffer_map calls.
> >
> > I did my best to update r600 and r300 as well for completeness (yes,
> > it's a no-op for r300 because it never calls buffer_unmap), even though
> > the radeon winsys ignores the new flag.
> >
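As a minimal sketch (not part of the patch) of how a caller would use the proposed flag against the buffer_map/buffer_unmap interface quoted below: the helper names, the data/size parameters, and the include paths are hypothetical; only the winsys callbacks and the PIPE_TRANSFER_*/RADEON_TRANSFER_* flags come from the patch itself.

#include <string.h>
#include <stdbool.h>
#include <stddef.h>
#include "pipe/p_defines.h"    /* PIPE_TRANSFER_* flags (path illustrative) */
#include "radeon_winsys.h"     /* struct radeon_winsys, RADEON_TRANSFER_TEMPORARY */

/* Temporary map: RADEON_TRANSFER_TEMPORARY is set, so the caller owns the
 * matching buffer_unmap call. */
static bool upload_data_temporary(struct radeon_winsys *ws, struct pb_buffer *buf,
                                  const void *data, size_t size)
{
   void *ptr = ws->buffer_map(buf, NULL,
                              PIPE_TRANSFER_WRITE |
                              PIPE_TRANSFER_UNSYNCHRONIZED |
                              RADEON_TRANSFER_TEMPORARY);
   if (!ptr)
      return false;

   memcpy(ptr, data, size);
   ws->buffer_unmap(buf);
   return true;
}

/* Default (permanent) map: no RADEON_TRANSFER_TEMPORARY and no unmap; the
 * winsys keeps the mapping alive until the buffer is destroyed. */
static void *map_permanently(struct radeon_winsys *ws, struct pb_buffer *buf)
{
   return ws->buffer_map(buf, NULL, PIPE_TRANSFER_WRITE);
}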
> >
> > You didn't make any changes to r300.
>
> Yeah, that's what I wrote :)
>
>
> > You can also drop all r600 changes, because the radeon winsys doesn't care.
>
> I don't think that's a good idea, though. The two winsys interfaces are
> separate, yes, but they're largely the same, and it makes sense to keep
> them that way conceptually. Not that it matters much for the code itself.
>
>
> [snip]
> > +enum radeon_transfer_flags {
> > + /* Indicates that the caller will unmap the buffer.
> > + *
> > + * Not unmapping buffers is an important performance optimization for
> > + * OpenGL (avoids kernel overhead for frequently mapped buffers). However,
> > + * if you only map a buffer once and then use it indefinitely from the GPU,
> > + * it is much better to unmap it so that the kernel is free to move it to
> > + * non-visible VRAM.
> >
> >
> > The second half of the comment is misleading. The kernel will move
> > buffers to invisible VRAM regardless of whether they're mapped, so CPU
> > mappings have no effect on the placement. Buffers are only moved back to
> > CPU-accessible memory on a CPU page fault. If a buffer is mapped and
> > there is no CPU access, it will stay in invisible VRAM forever. The general
> > recommendation is to keep those buffers mapped for CPU access just like
> > GTT buffers.
>
> Yeah, I'll change that.
>
>
> > + */
> > + RADEON_TRANSFER_TEMPORARY = (PIPE_TRANSFER_DRV_PRV << 0),
> > +};
> > +
> > #define RADEON_SPARSE_PAGE_SIZE (64 * 1024)
> >
> > enum ring_type {
> > RING_GFX = 0,
> > RING_COMPUTE,
> > RING_DMA,
> > RING_UVD,
> > RING_VCE,
> > RING_UVD_ENC,
> > RING_VCN_DEC,
> > @@ -287,23 +299,26 @@ struct radeon_winsys {
> > struct pb_buffer *(*buffer_create)(struct radeon_winsys *ws,
> > uint64_t size,
> > unsigned alignment,
> > enum radeon_bo_domain domain,
> > enum radeon_bo_flag flags);
> >
> > /**
> > * Map the entire data store of a buffer object into the client's address
> > * space.
> > *
> > + * Callers are expected to unmap buffers again if and only if the
> > + * RADEON_TRANSFER_TEMPORARY flag is set in \p usage.
> > + *
> > * \param buf A winsys buffer object to map.
> > * \param cs A command stream to flush if the buffer is referenced by it.
> > - * \param usage A bitmask of the PIPE_TRANSFER_* flags.
> > + * \param usage A bitmask of the PIPE_TRANSFER_* and RADEON_TRANSFER_* flags.
> > * \return The pointer at the beginning of the buffer.
> > */
> > void *(*buffer_map)(struct pb_buffer *buf,
> > struct radeon_cmdbuf *cs,
> > enum pipe_transfer_usage usage);
> >
> > /**
> > * Unmap a buffer object from the client's address space.
> > *
> > * \param buf A winsys buffer object to unmap.
> > diff --git a/src/gallium/drivers/radeonsi/si_shader.c b/src/gallium/drivers/radeonsi/si_shader.c
> > index 19522cc97b1..d455fb5db6a 100644
> > --- a/src/gallium/drivers/radeonsi/si_shader.c
> > +++ b/src/gallium/drivers/radeonsi/si_shader.c
> > @@ -5286,21 +5286,22 @@ int si_shader_binary_upload(struct si_screen *sscreen, struct si_shader *shader)
> > 0 : SI_RESOURCE_FLAG_READ_ONLY,
> > PIPE_USAGE_IMMUTABLE,
> > align(bo_size, SI_CPDMA_ALIGNMENT),
> > 256);
> > if (!shader->bo)
> > return -ENOMEM;
> >
> > /* Upload. */
> > ptr = sscreen->ws->buffer_map(shader->bo->buf, NULL,
> > PIPE_TRANSFER_READ_WRITE |
> > - PIPE_TRANSFER_UNSYNCHRONIZED);
> > + PIPE_TRANSFER_UNSYNCHRONIZED |
> > + RADEON_TRANSFER_TEMPORARY);
> >
> > /* Don't use util_memcpy_cpu_to_le32. LLVM binaries are
> > * endian-independent. */
> > if (prolog) {
> > memcpy(ptr, prolog->code, prolog->code_size);
> > ptr += prolog->code_size;
> > }
> > if (previous_stage) {
> > memcpy(ptr, previous_stage->code,
> > previous_stage->code_size);
> > ptr += previous_stage->code_size;
> > diff --git a/src/gallium/include/pipe/p_defines.h b/src/gallium/include/pipe/p_defines.h
> > index 693f041b1da..e99895d30d8 100644
> > --- a/src/gallium/include/pipe/p_defines.h
> > +++ b/src/gallium/include/pipe/p_defines.h
> > @@ -334,21 +334,27 @@ enum pipe_transfer_usage
> > */
> > PIPE_TRANSFER_PERSISTENT = (1 << 13),
> >
> > /**
> > * If PERSISTENT is set, this ensures any writes done by the device are
> > * immediately visible to the CPU and vice versa.
> > *
> > * PIPE_RESOURCE_FLAG_MAP_COHERENT must be set when creating
> > * the resource.
> > */
> > - PIPE_TRANSFER_COHERENT = (1 << 14)
> > + PIPE_TRANSFER_COHERENT = (1 << 14),
> > +
> > + /**
> > + * This and higher bits are reserved for private use by drivers. Drivers
> > + * should use this as (PIPE_TRANSFER_DRV_PRV << i).
> > + */
> > + PIPE_TRANSFER_DRV_PRV = (1 << 24)
> > };
> >
> > /**
> > * Flags for the flush function.
> > */
> > enum pipe_flush_flags
> > {
> > PIPE_FLUSH_END_OF_FRAME = (1 << 0),
> > PIPE_FLUSH_DEFERRED = (1 << 1),
> > PIPE_FLUSH_FENCE_FD = (1 << 2),
> > diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> > index 9f0d4c12482..355018c76fc 100644
> > --- a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> > +++ b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> > @@ -49,20 +49,21 @@
> > struct amdgpu_sparse_backing_chunk {
> > uint32_t begin, end;
> > };
> >
> > static struct pb_buffer *
> > amdgpu_bo_create(struct radeon_winsys *rws,
> > uint64_t size,
> > unsigned alignment,
> > enum radeon_bo_domain domain,
> > enum radeon_bo_flag flags);
> > +static void amdgpu_bo_unmap(struct pb_buffer *buf);
> >
> > static bool amdgpu_bo_wait(struct pb_buffer *_buf, uint64_t timeout,
> > enum radeon_bo_usage usage)
> > {
> > struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> > struct amdgpu_winsys *ws = bo->ws;
> > int64_t abs_timeout;
> >
> > if (timeout == 0) {
> > if (p_atomic_read(&bo->num_active_ioctls))
> > @@ -166,20 +167,26 @@ static void amdgpu_bo_remove_fences(struct amdgpu_winsys_bo *bo)
> > bo->max_fences = 0;
> > }
> >
> > void amdgpu_bo_destroy(struct pb_buffer *_buf)
> > {
> > struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> > struct amdgpu_winsys *ws = bo->ws;
> >
> > assert(bo->bo && "must not be called for slab entries");
> >
> > + if (!bo->is_user_ptr && bo->cpu_ptr) {
> > + bo->cpu_ptr = NULL;
> > + amdgpu_bo_unmap(&bo->base);
> > + }
> > + assert(bo->is_user_ptr || bo->u.real.map_count == 0);
> > +
> > if (ws->debug_all_bos) {
> > simple_mtx_lock(&ws->global_bo_list_lock);
> > LIST_DEL(&bo->u.real.global_list_item);
> > ws->num_buffers--;
> > simple_mtx_unlock(&ws->global_bo_list_lock);
> > }
> >
> > simple_mtx_lock(&ws->bo_export_table_lock);
> > util_hash_table_remove(ws->bo_export_table, bo->bo);
> > simple_mtx_unlock(&ws->bo_export_table_lock);
> > @@ -188,54 +195,66 @@ void amdgpu_bo_destroy(struct pb_buffer *_buf)
> > amdgpu_va_range_free(bo->u.real.va_handle);
> > amdgpu_bo_free(bo->bo);
> >
> > amdgpu_bo_remove_fences(bo);
> >
> > if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> > ws->allocated_vram -= align64(bo->base.size, ws->info.gart_page_size);
> > else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> > ws->allocated_gtt -= align64(bo->base.size, ws->info.gart_page_size);
> >
> > - if (bo->u.real.map_count >= 1) {
> > - if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> > - ws->mapped_vram -= bo->base.size;
> > - else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> > - ws->mapped_gtt -= bo->base.size;
> > - ws->num_mapped_buffers--;
> > - }
> > -
> > simple_mtx_destroy(&bo->lock);
> > FREE(bo);
> > }
> >
> > static void amdgpu_bo_destroy_or_cache(struct pb_buffer *_buf)
> > {
> > struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >
> > assert(bo->bo); /* slab buffers have a separate vtbl */
> >
> > if (bo->u.real.use_reusable_pool)
> > pb_cache_add_buffer(&bo->u.real.cache_entry);
> > else
> > amdgpu_bo_destroy(_buf);
> > }
> >
> > +static bool amdgpu_bo_do_map(struct amdgpu_winsys_bo *bo, void **cpu)
> > +{
> > + assert(!bo->sparse && bo->bo && !bo->is_user_ptr);
> > + int r = amdgpu_bo_cpu_map(bo->bo, cpu);
> > + if (r) {
> > + /* Clear the cache and try again. */
> > + pb_cache_release_all_buffers(&bo->ws->bo_cache);
> > + r = amdgpu_bo_cpu_map(bo->bo, cpu);
> > + if (r)
> > + return false;
> > + }
> > +
> > + if (p_atomic_inc_return(&bo->u.real.map_count) == 1) {
> > + if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> > + bo->ws->mapped_vram += bo->base.size;
> > + else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> > + bo->ws->mapped_gtt += bo->base.size;
> > + bo->ws->num_mapped_buffers++;
> > + }
> > +
> > + return true;
> > +}
> > +
> > static void *amdgpu_bo_map(struct pb_buffer *buf,
> > struct radeon_cmdbuf *rcs,
> > enum pipe_transfer_usage usage)
> > {
> > struct amdgpu_winsys_bo *bo = (struct amdgpu_winsys_bo*)buf;
> > struct amdgpu_winsys_bo *real;
> > struct amdgpu_cs *cs = (struct amdgpu_cs*)rcs;
> > - int r;
> > - void *cpu = NULL;
> > - uint64_t offset = 0;
> >
> > assert(!bo->sparse);
> >
> > /* If it's not unsynchronized bo_map, flush CS if needed and then wait. */
> > if (!(usage & PIPE_TRANSFER_UNSYNCHRONIZED)) {
> > /* DONTBLOCK doesn't make sense with UNSYNCHRONIZED. */
> > if (usage & PIPE_TRANSFER_DONTBLOCK) {
> > if (!(usage & PIPE_TRANSFER_WRITE)) {
> > /* Mapping for read.
> > *
> > @@ -306,63 +325,74 @@ static void *amdgpu_bo_map(struct pb_buffer *buf,
> > }
> >
> > amdgpu_bo_wait((struct pb_buffer*)bo, PIPE_TIMEOUT_INFINITE,
> > RADEON_USAGE_READWRITE);
> > }
> >
> > bo->ws->buffer_wait_time += os_time_get_nano() - time;
> > }
> > }
> >
> > - /* If the buffer is created from user memory, return the user pointer. */
> > - if (bo->user_ptr)
> > - return bo->user_ptr;
> > + /* Buffer synchronization has been checked, now actually map the buffer. */
> > + void *cpu = NULL;
> > + uint64_t offset = 0;
> >
> > if (bo->bo) {
> > real = bo;
> > } else {
> > real = bo->u.slab.real;
> > offset = bo->va - real->va;
> > }
> >
> > - r = amdgpu_bo_cpu_map(real->bo, &cpu);
> > - if (r) {
> > - /* Clear the cache and try again. */
> > - pb_cache_release_all_buffers(&real->ws->bo_cache);
> > - r = amdgpu_bo_cpu_map(real->bo, &cpu);
> > - if (r)
> > - return NULL;
> > + if (usage & RADEON_TRANSFER_TEMPORARY) {
> > + if (real->is_user_ptr) {
> > + cpu = real->cpu_ptr;
> > + } else {
> > + if (!amdgpu_bo_do_map(real, &cpu))
> > + return NULL;
> > + }
> > + } else {
> > + cpu = real->cpu_ptr;
> > + if (!cpu) {
> >
> >
> > I think this is unsafe on 64-bit CPUs, because the lower 32 bits can be
> > initialized but not the upper 32 bits. You would either have to check
> > both 32-bit halves separately (if they are both expected to be non-zero)
> > or you would have to lock the mutex first.
>
> Well, there is actually a potential problem, but it's not that. A 64-bit
> CPU will do aligned 64-bit loads/stores atomically, but
> amdgpu_bo_cpu_map doesn't necessarily guarantee that it'll set the
> address atomically. I think it does that today, but it's a bad idea to
> depend on it.
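To make that concern concrete, here is a minimal sketch of the general pattern, not the Mesa code; the struct and helper names are made up, and do_cpu_map() is only a placeholder for the real mapping call. The permanent pointer is published with a release store only after the mapping is fully set up, and the fast path reads it with an acquire load, falling back to a mutex otherwise.

#include <stdatomic.h>
#include <pthread.h>
#include <stdlib.h>

struct mapped_bo {
   _Atomic(void *) cpu_ptr;    /* published permanent mapping, NULL if unmapped */
   pthread_mutex_t map_mutex;  /* serializes the actual map call;
                                * init with PTHREAD_MUTEX_INITIALIZER */
};

/* Hypothetical stand-in for the real kernel mapping call. */
static void *do_cpu_map(struct mapped_bo *bo)
{
   (void)bo;
   return malloc(4096);  /* placeholder for mmap'ing the BO */
}

static void *get_permanent_map(struct mapped_bo *bo)
{
   /* Fast path: the acquire load pairs with the release store below, so a
    * non-NULL result means the mapping is fully initialized. */
   void *ptr = atomic_load_explicit(&bo->cpu_ptr, memory_order_acquire);
   if (ptr)
      return ptr;

   /* Slow path: map under the mutex and publish the pointer last. */
   pthread_mutex_lock(&bo->map_mutex);
   ptr = atomic_load_explicit(&bo->cpu_ptr, memory_order_relaxed);
   if (!ptr) {
      ptr = do_cpu_map(bo);  /* may fail and return NULL */
      if (ptr)
         atomic_store_explicit(&bo->cpu_ptr, ptr, memory_order_release);
   }
   pthread_mutex_unlock(&bo->map_mutex);
   return ptr;
}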
>
>
> > If all my remarks are addressed, I'll ack this, but I still think that
> > this patch adds unnecessary cruft, complexity, and fragility in order to
> > fix an issue that is very trivial in nature. The overflow check in
> > libdrm is simple, clean, and robust.
>
> The overflow check, or actually the saturating behavior, is also very
> surprising and unusual. Reference counting doesn't usually work that way.
>
It's not reference counting. It's not even the map count. It's the map call
count. (I wrote the original code)
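For readers following along, a rough sketch of the two disciplines being contrasted, illustrative only: this is neither the libdrm nor the winsys code, and the names are made up. A saturating map-call counter simply stops growing so that callers who never unmap cannot overflow it, while the approach in this patch requires every temporary map to be paired with an unmap and asserts otherwise.

#include <assert.h>
#include <limits.h>

/* Saturating map-call count: tolerant of callers that map "permanently"
 * and never call unmap; the count just stops growing at INT_MAX. */
static void count_map_call(int *map_call_count)
{
   if (*map_call_count < INT_MAX)
      (*map_call_count)++;
}

/* Matched counting, as the assertions in this patch enforce: an unmap
 * without a corresponding temporary map is treated as a caller bug. */
static void count_unmap_call(int *map_count)
{
   assert(*map_count > 0 && "buffer_unmap without a matching buffer_map");
   (*map_count)--;
}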
Marek