[Mesa-dev] [PATCH 2/2] winsys/amdgpu: explicitly declare whether buffer_map is permanent or not

Marek Olšák maraeo at gmail.com
Fri Nov 23 19:09:49 UTC 2018


On Thu, Nov 22, 2018 at 6:24 AM Nicolai Hähnle <nhaehnle at gmail.com> wrote:

> On 21.11.18 21:27, Marek Olšák wrote:
> > On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle <nhaehnle at gmail.com
> > <mailto:nhaehnle at gmail.com>> wrote:
> >
> >     From: Nicolai Hähnle <nicolai.haehnle at amd.com
> >     <mailto:nicolai.haehnle at amd.com>>
> >
> >     Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
> >     that specifies whether the caller will use buffer_unmap or not. The
> >     default behavior is set to permanent maps, because that's what drivers
> >     do for Gallium buffer maps.
> >
> >     This should eliminate the need for hacks in libdrm. Assertions are added
> >     to catch when the buffer_unmap calls don't match the (temporary)
> >     buffer_map calls.
> >
> >     I did my best to update r600 and r300 as well for completeness (yes,
> >     it's a no-op for r300 because it never calls buffer_unmap), even though
> >     the radeon winsys ignores the new flag.
> >
> >
> > You didn't make any changes to r300.
>
> Yeah, that's what I wrote :)
>
>
> > You can also drop all r600 changes, because the radeon winsys doesn't care.
>
> I don't think that's a good idea, though. The interfaces of the two winsyses
> are formally separate, yes, but they're largely the same, and it makes sense
> to keep them that way conceptually. Not that it matters much for the code itself.
>
>
> [snip]
> >     +enum radeon_transfer_flags {
> >     +   /* Indicates that the caller will unmap the buffer.
> >     +    *
> >     +    * Not unmapping buffers is an important performance optimization for
> >     +    * OpenGL (avoids kernel overhead for frequently mapped buffers). However,
> >     +    * if you only map a buffer once and then use it indefinitely from the GPU,
> >     +    * it is much better to unmap it so that the kernel is free to move it to
> >     +    * non-visible VRAM.
> >
> >
> > The second half of the comment is misleading. The kernel will move
> > buffers to invisible VRAM regardless of whether they're mapped, so CPU
> > mappings have no effect on the placement. Buffers are only moved back to
> > CPU-accessible memory on a CPU page fault. If a buffer is mapped and
> > there is no CPU access, it will stay in invisible VRAM forever. The general
> > recommendation is to keep those buffers mapped for CPU access just like
> > GTT buffers.
>
> Yeah, I'll change that.
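
For anyone skimming the thread, the calling convention being proposed boils
down to roughly the following sketch (illustrative only, using the names from
this patch rather than actual driver call sites):

   /* Temporary map: the caller promises to call buffer_unmap afterwards.
    * Useful for one-off uploads where keeping the CPU mapping has no benefit. */
   void *ptr = ws->buffer_map(buf, NULL,
                              PIPE_TRANSFER_WRITE |
                              PIPE_TRANSFER_UNSYNCHRONIZED |
                              RADEON_TRANSFER_TEMPORARY);
   memcpy(ptr, data, size);
   ws->buffer_unmap(buf);

   /* Default (permanent) map: the pointer stays valid for the lifetime of the
    * buffer and buffer_unmap must not be called (the new assertions catch
    * mismatched unmaps). */
   void *persistent = ws->buffer_map(buf, NULL, PIPE_TRANSFER_WRITE);
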
>
>
> >     +    */
> >     +   RADEON_TRANSFER_TEMPORARY = (PIPE_TRANSFER_DRV_PRV << 0),
> >     +};
> >     +
> >       #define RADEON_SPARSE_PAGE_SIZE (64 * 1024)
> >
> >       enum ring_type {
> >           RING_GFX = 0,
> >           RING_COMPUTE,
> >           RING_DMA,
> >           RING_UVD,
> >           RING_VCE,
> >           RING_UVD_ENC,
> >           RING_VCN_DEC,
> >     @@ -287,23 +299,26 @@ struct radeon_winsys {
> >           struct pb_buffer *(*buffer_create)(struct radeon_winsys *ws,
> >                                              uint64_t size,
> >                                              unsigned alignment,
> >                                              enum radeon_bo_domain domain,
> >                                              enum radeon_bo_flag flags);
> >
> >           /**
> >            * Map the entire data store of a buffer object into the client's address
> >            * space.
> >            *
> >     +     * Callers are expected to unmap buffers again if and only if the
> >     +     * RADEON_TRANSFER_TEMPORARY flag is set in \p usage.
> >     +     *
> >            * \param buf       A winsys buffer object to map.
> >            * \param cs        A command stream to flush if the buffer is referenced by it.
> >     -     * \param usage     A bitmask of the PIPE_TRANSFER_* flags.
> >     +     * \param usage     A bitmask of the PIPE_TRANSFER_* and RADEON_TRANSFER_* flags.
> >            * \return          The pointer at the beginning of the buffer.
> >            */
> >           void *(*buffer_map)(struct pb_buffer *buf,
> >                               struct radeon_cmdbuf *cs,
> >                               enum pipe_transfer_usage usage);
> >
> >           /**
> >            * Unmap a buffer object from the client's address space.
> >            *
> >            * \param buf       A winsys buffer object to unmap.
> >     diff --git a/src/gallium/drivers/radeonsi/si_shader.c b/src/gallium/drivers/radeonsi/si_shader.c
> >     index 19522cc97b1..d455fb5db6a 100644
> >     --- a/src/gallium/drivers/radeonsi/si_shader.c
> >     +++ b/src/gallium/drivers/radeonsi/si_shader.c
> >     @@ -5286,21 +5286,22 @@ int si_shader_binary_upload(struct si_screen *sscreen, struct si_shader *shader)
> >                                                      0 : SI_RESOURCE_FLAG_READ_ONLY,
> >                                                     PIPE_USAGE_IMMUTABLE,
> >                                                     align(bo_size, SI_CPDMA_ALIGNMENT),
> >                                                     256);
> >              if (!shader->bo)
> >                      return -ENOMEM;
> >
> >              /* Upload. */
> >              ptr = sscreen->ws->buffer_map(shader->bo->buf, NULL,
> >                                              PIPE_TRANSFER_READ_WRITE |
> >     -                                       PIPE_TRANSFER_UNSYNCHRONIZED);
> >     +                                       PIPE_TRANSFER_UNSYNCHRONIZED |
> >     +                                       RADEON_TRANSFER_TEMPORARY);
> >
> >              /* Don't use util_memcpy_cpu_to_le32. LLVM binaries are
> >               * endian-independent. */
> >              if (prolog) {
> >                      memcpy(ptr, prolog->code, prolog->code_size);
> >                      ptr += prolog->code_size;
> >              }
> >              if (previous_stage) {
> >                      memcpy(ptr, previous_stage->code, previous_stage->code_size);
> >                      ptr += previous_stage->code_size;
> >     diff --git a/src/gallium/include/pipe/p_defines.h b/src/gallium/include/pipe/p_defines.h
> >     index 693f041b1da..e99895d30d8 100644
> >     --- a/src/gallium/include/pipe/p_defines.h
> >     +++ b/src/gallium/include/pipe/p_defines.h
> >     @@ -334,21 +334,27 @@ enum pipe_transfer_usage
> >           */
> >          PIPE_TRANSFER_PERSISTENT = (1 << 13),
> >
> >          /**
> >           * If PERSISTENT is set, this ensures any writes done by the device are
> >           * immediately visible to the CPU and vice versa.
> >           *
> >           * PIPE_RESOURCE_FLAG_MAP_COHERENT must be set when creating
> >           * the resource.
> >           */
> >     -   PIPE_TRANSFER_COHERENT = (1 << 14)
> >     +   PIPE_TRANSFER_COHERENT = (1 << 14),
> >     +
> >     +   /**
> >     +    * This and higher bits are reserved for private use by drivers. Drivers
> >     +    * should use this as (PIPE_TRANSFER_DRV_PRV << i).
> >     +    */
> >     +   PIPE_TRANSFER_DRV_PRV = (1 << 24)
> >       };
> >
> >       /**
> >        * Flags for the flush function.
> >        */
> >       enum pipe_flush_flags
> >       {
> >          PIPE_FLUSH_END_OF_FRAME = (1 << 0),
> >          PIPE_FLUSH_DEFERRED = (1 << 1),
> >          PIPE_FLUSH_FENCE_FD = (1 << 2),
> >     diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     index 9f0d4c12482..355018c76fc 100644
> >     --- a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     +++ b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> >     @@ -49,20 +49,21 @@
> >       struct amdgpu_sparse_backing_chunk {
> >          uint32_t begin, end;
> >       };
> >
> >       static struct pb_buffer *
> >       amdgpu_bo_create(struct radeon_winsys *rws,
> >                        uint64_t size,
> >                        unsigned alignment,
> >                        enum radeon_bo_domain domain,
> >                        enum radeon_bo_flag flags);
> >     +static void amdgpu_bo_unmap(struct pb_buffer *buf);
> >
> >       static bool amdgpu_bo_wait(struct pb_buffer *_buf, uint64_t timeout,
> >                                  enum radeon_bo_usage usage)
> >       {
> >          struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >          struct amdgpu_winsys *ws = bo->ws;
> >          int64_t abs_timeout;
> >
> >          if (timeout == 0) {
> >             if (p_atomic_read(&bo->num_active_ioctls))
> >     @@ -166,20 +167,26 @@ static void amdgpu_bo_remove_fences(struct amdgpu_winsys_bo *bo)
> >          bo->max_fences = 0;
> >       }
> >
> >       void amdgpu_bo_destroy(struct pb_buffer *_buf)
> >       {
> >          struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >          struct amdgpu_winsys *ws = bo->ws;
> >
> >          assert(bo->bo && "must not be called for slab entries");
> >
> >     +   if (!bo->is_user_ptr && bo->cpu_ptr) {
> >     +      bo->cpu_ptr = NULL;
> >     +      amdgpu_bo_unmap(&bo->base);
> >     +   }
> >     +   assert(bo->is_user_ptr || bo->u.real.map_count == 0);
> >     +
> >          if (ws->debug_all_bos) {
> >             simple_mtx_lock(&ws->global_bo_list_lock);
> >             LIST_DEL(&bo->u.real.global_list_item);
> >             ws->num_buffers--;
> >             simple_mtx_unlock(&ws->global_bo_list_lock);
> >          }
> >
> >          simple_mtx_lock(&ws->bo_export_table_lock);
> >          util_hash_table_remove(ws->bo_export_table, bo->bo);
> >          simple_mtx_unlock(&ws->bo_export_table_lock);
> >     @@ -188,54 +195,66 @@ void amdgpu_bo_destroy(struct pb_buffer *_buf)
> >          amdgpu_va_range_free(bo->u.real.va_handle);
> >          amdgpu_bo_free(bo->bo);
> >
> >          amdgpu_bo_remove_fences(bo);
> >
> >          if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> >             ws->allocated_vram -= align64(bo->base.size, ws->info.gart_page_size);
> >          else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> >             ws->allocated_gtt -= align64(bo->base.size, ws->info.gart_page_size);
> >
> >     -   if (bo->u.real.map_count >= 1) {
> >     -      if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> >     -         ws->mapped_vram -= bo->base.size;
> >     -      else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> >     -         ws->mapped_gtt -= bo->base.size;
> >     -      ws->num_mapped_buffers--;
> >     -   }
> >     -
> >          simple_mtx_destroy(&bo->lock);
> >          FREE(bo);
> >       }
> >
> >       static void amdgpu_bo_destroy_or_cache(struct pb_buffer *_buf)
> >       {
> >          struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >
> >          assert(bo->bo); /* slab buffers have a separate vtbl */
> >
> >          if (bo->u.real.use_reusable_pool)
> >             pb_cache_add_buffer(&bo->u.real.cache_entry);
> >          else
> >             amdgpu_bo_destroy(_buf);
> >       }
> >
> >     +static bool amdgpu_bo_do_map(struct amdgpu_winsys_bo *bo, void **cpu)
> >     +{
> >     +   assert(!bo->sparse && bo->bo && !bo->is_user_ptr);
> >     +   int r = amdgpu_bo_cpu_map(bo->bo, cpu);
> >     +   if (r) {
> >     +      /* Clear the cache and try again. */
> >     +      pb_cache_release_all_buffers(&bo->ws->bo_cache);
> >     +      r = amdgpu_bo_cpu_map(bo->bo, cpu);
> >     +      if (r)
> >     +         return false;
> >     +   }
> >     +
> >     +   if (p_atomic_inc_return(&bo->u.real.map_count) == 1) {
> >     +      if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> >     +         bo->ws->mapped_vram += bo->base.size;
> >     +      else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> >     +         bo->ws->mapped_gtt += bo->base.size;
> >     +      bo->ws->num_mapped_buffers++;
> >     +   }
> >     +
> >     +   return true;
> >     +}
> >     +
> >       static void *amdgpu_bo_map(struct pb_buffer *buf,
> >                                  struct radeon_cmdbuf *rcs,
> >                                  enum pipe_transfer_usage usage)
> >       {
> >          struct amdgpu_winsys_bo *bo = (struct amdgpu_winsys_bo*)buf;
> >          struct amdgpu_winsys_bo *real;
> >          struct amdgpu_cs *cs = (struct amdgpu_cs*)rcs;
> >     -   int r;
> >     -   void *cpu = NULL;
> >     -   uint64_t offset = 0;
> >
> >          assert(!bo->sparse);
> >
> >          /* If it's not unsynchronized bo_map, flush CS if needed and
> >     then wait. */
> >          if (!(usage & PIPE_TRANSFER_UNSYNCHRONIZED)) {
> >             /* DONTBLOCK doesn't make sense with UNSYNCHRONIZED. */
> >             if (usage & PIPE_TRANSFER_DONTBLOCK) {
> >                if (!(usage & PIPE_TRANSFER_WRITE)) {
> >                   /* Mapping for read.
> >                    *
> >     @@ -306,63 +325,74 @@ static void *amdgpu_bo_map(struct pb_buffer *buf,
> >                   }
> >
> >                   amdgpu_bo_wait((struct pb_buffer*)bo, PIPE_TIMEOUT_INFINITE,
> >                                  RADEON_USAGE_READWRITE);
> >                }
> >
> >                bo->ws->buffer_wait_time += os_time_get_nano() - time;
> >             }
> >          }
> >
> >     -   /* If the buffer is created from user memory, return the user pointer. */
> >     -   if (bo->user_ptr)
> >     -      return bo->user_ptr;
> >     +   /* Buffer synchronization has been checked, now actually map the buffer. */
> >     +   void *cpu = NULL;
> >     +   uint64_t offset = 0;
> >
> >          if (bo->bo) {
> >             real = bo;
> >          } else {
> >             real = bo->u.slab.real;
> >             offset = bo->va - real->va;
> >          }
> >
> >     -   r = amdgpu_bo_cpu_map(real->bo, &cpu);
> >     -   if (r) {
> >     -      /* Clear the cache and try again. */
> >     -      pb_cache_release_all_buffers(&real->ws->bo_cache);
> >     -      r = amdgpu_bo_cpu_map(real->bo, &cpu);
> >     -      if (r)
> >     -         return NULL;
> >     +   if (usage & RADEON_TRANSFER_TEMPORARY) {
> >     +      if (real->is_user_ptr) {
> >     +         cpu = real->cpu_ptr;
> >     +      } else {
> >     +         if (!amdgpu_bo_do_map(real, &cpu))
> >     +            return NULL;
> >     +      }
> >     +   } else {
> >     +      cpu = real->cpu_ptr;
> >     +      if (!cpu) {
> >
> >
> > I think this is unsafe on 64-bit CPUs, because the lower 32 bits can be
> > initialized but not the upper 32 bits. You would either have to check
> > both 32-bit halves separately (if they are both expected to be non-zero)
> > or you would have to lock the mutex first.
>
> Well, there is actually a potential problem, but it's not that. A 64-bit
> CPU will do aligned 64-bit loads/stores atomically, but
> amdgpu_bo_cpu_map doesn't necessarily guarantee that it'll set the
> address atomically. I think it does that today, but it's a bad idea to
> depend on it.
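
A pattern along these lines would avoid relying on it (a rough sketch only;
bo->lock and the field names are taken from the quoted patch, and the final
version may well do it differently):

   } else {
      /* Permanent map: cache the pointer in the BO. */
      cpu = p_atomic_read(&real->cpu_ptr);
      if (!cpu) {
         simple_mtx_lock(&real->lock);
         /* Re-check under the lock in case another thread mapped it first. */
         cpu = real->cpu_ptr;
         if (!cpu) {
            if (!amdgpu_bo_do_map(real, &cpu)) {
               simple_mtx_unlock(&real->lock);
               return NULL;
            }
            /* Publish the pointer only after the mapping fully succeeded. */
            p_atomic_set(&real->cpu_ptr, cpu);
         }
         simple_mtx_unlock(&real->lock);
      }
   }
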
>
>
> > If all my remarks are addressed, I'll ack this, but I still think that
> > this patch adds unnecessary cruft, complexity, and fragility in order to
> > fix an issue that is very trivial in nature. The overflow check in
> > libdrm is simple, clean, and robust.
>
> The overflow check, or actually the saturating behavior, is also very
> surprising and unusual. Reference counting doesn't usually work that way.
>

It's not reference counting. It's not even the map count. It's the map call
count. (I wrote the original code)
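
To make the distinction concrete: the counter tracks how many times map was
called, not how many mappings are outstanding, and the check simply stops it
from wrapping. Roughly (illustrative pseudo-code, not the actual libdrm
source):

   /* A driver that maps "permanently" calls map repeatedly and never unmaps,
    * so the call count only grows; saturating it avoids integer overflow. */
   if (bo->cpu_map_count < INT_MAX)
      bo->cpu_map_count++;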

Marek