<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Thu, Nov 22, 2018 at 6:24 AM Nicolai Hähnle <<a href="mailto:nhaehnle@gmail.com">nhaehnle@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 21.11.18 21:27, Marek Olšák wrote:<br>
> > On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle <nhaehnle@gmail.com> wrote:
> >
> > From: Nicolai Hähnle <nicolai.haehnle@amd.com>
> >
> > Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
> > that specifies whether the caller will use buffer_unmap or not. The
> > default behavior is set to permanent maps, because that's what drivers
> > do for Gallium buffer maps.
> >
> > This should eliminate the need for hacks in libdrm. Assertions are added
> > to catch when the buffer_unmap calls don't match the (temporary)
> > buffer_map calls.
> >
> > I did my best to update r600 and r300 as well for completeness (yes,
> > it's a no-op for r300 because it never calls buffer_unmap), even though
> > the radeon winsys ignores the new flag.
> >
> >
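> To make the intended calling convention explicit, here's a rough
> illustration (not part of the patch; error handling omitted):
>
>    /* Temporary access: the caller promises to call buffer_unmap. */
>    ptr = ws->buffer_map(buf, NULL,
>                         PIPE_TRANSFER_WRITE | RADEON_TRANSFER_TEMPORARY);
>    memcpy(ptr, data, size);
>    ws->buffer_unmap(buf);
>
>    /* Default behavior: a permanent map, buffer_unmap is never called. */
>    ptr = ws->buffer_map(buf, NULL, PIPE_TRANSFER_WRITE);
>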
> > You didn't make any changes to r300.
>
> Yeah, that's what I wrote :)
>
>
> > You can also drop all r600 changes, because the radeon winsys doesn't care.
>
> I don't think it's a good idea, though. The interface of the two winsys
> is different, yes, but it's largely the same and it makes sense to keep
> it that way conceptually. Not that it matters much for the code itself.
>
>
> [snip]
> > +enum radeon_transfer_flags {
> > + /* Indicates that the caller will unmap the buffer.
> > + *
> > + * Not unmapping buffers is an important performance optimization for
> > + * OpenGL (avoids kernel overhead for frequently mapped buffers). However,
> > + * if you only map a buffer once and then use it indefinitely from the GPU,
> > + * it is much better to unmap it so that the kernel is free to move it to
> > + * non-visible VRAM.
> >
> >
> > The second half of the comment is misleading. The kernel will move
> > buffers to invisible VRAM regardless of whether they're mapped, so CPU
> > mappings have no effect on the placement. Buffers are only moved back to
> > CPU-accessible memory on a CPU page fault. If a buffer is mapped and
> > there is no CPU access, it will stay in invisible VRAM forever. The general
> > recommendation is to keep those buffers mapped for CPU access just like
> > GTT buffers.
>
> Yeah, I'll change that.
>
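> Maybe something along these lines for v2 -- just a sketch of the new
> wording, based on your description of the kernel's behavior:
>
>    /* Indicates that the caller will unmap the buffer.
>     *
>     * Not unmapping buffers is an important performance optimization for
>     * OpenGL (it avoids kernel overhead for frequently mapped buffers).
>     * Note that CPU mappings don't affect placement: the kernel moves
>     * buffers to invisible VRAM regardless, and only migrates them back
>     * to CPU-accessible memory on a CPU page fault, so buffers that need
>     * CPU access should generally stay mapped, just like GTT buffers.
>     */
>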
>
> > + */
> > + RADEON_TRANSFER_TEMPORARY = (PIPE_TRANSFER_DRV_PRV << 0),
> > +};
> > +
> > #define RADEON_SPARSE_PAGE_SIZE (64 * 1024)
> >
> > enum ring_type {
> > RING_GFX = 0,
> > RING_COMPUTE,
> > RING_DMA,
> > RING_UVD,
> > RING_VCE,
> > RING_UVD_ENC,
> > RING_VCN_DEC,
> > @@ -287,23 +299,26 @@ struct radeon_winsys {
> > struct pb_buffer *(*buffer_create)(struct radeon_winsys *ws,
> > uint64_t size,
> > unsigned alignment,
> > enum radeon_bo_domain domain,
> > enum radeon_bo_flag flags);
> >
> > /**
> > * Map the entire data store of a buffer object into the client's address
> > * space.
> > *
> > + * Callers are expected to unmap buffers again if and only if the
> > + * RADEON_TRANSFER_TEMPORARY flag is set in \p usage.
> > + *
> > * \param buf A winsys buffer object to map.
> > * \param cs A command stream to flush if the buffer is referenced by it.
> > - * \param usage A bitmask of the PIPE_TRANSFER_* flags.
> > + * \param usage A bitmask of the PIPE_TRANSFER_* and RADEON_TRANSFER_* flags.
> > * \return The pointer at the beginning of the buffer.
> > */
> > void *(*buffer_map)(struct pb_buffer *buf,
> > struct radeon_cmdbuf *cs,
> > enum pipe_transfer_usage usage);
> >
> > /**
> > * Unmap a buffer object from the client's address space.
> > *
> > * \param buf A winsys buffer object to unmap.
> > diff --git a/src/gallium/drivers/radeonsi/si_shader.c b/src/gallium/drivers/radeonsi/si_shader.c
> > index 19522cc97b1..d455fb5db6a 100644
> > --- a/src/gallium/drivers/radeonsi/si_shader.c
> > +++ b/src/gallium/drivers/radeonsi/si_shader.c
> > @@ -5286,21 +5286,22 @@ int si_shader_binary_upload(struct si_screen *sscreen, struct si_shader *shader)
> > 0 : SI_RESOURCE_FLAG_READ_ONLY,
> > PIPE_USAGE_IMMUTABLE,
> > align(bo_size, SI_CPDMA_ALIGNMENT),
> > 256);
> > if (!shader->bo)
> > return -ENOMEM;
> >
> > /* Upload. */
> > ptr = sscreen->ws->buffer_map(shader->bo->buf, NULL,
> > PIPE_TRANSFER_READ_WRITE |
> > - PIPE_TRANSFER_UNSYNCHRONIZED);
> > + PIPE_TRANSFER_UNSYNCHRONIZED |
> > + RADEON_TRANSFER_TEMPORARY);
> >
> > /* Don't use util_memcpy_cpu_to_le32. LLVM binaries are
> > * endian-independent. */
> > if (prolog) {
> > memcpy(ptr, prolog->code, prolog->code_size);
> > ptr += prolog->code_size;
> > }
> > if (previous_stage) {
> > memcpy(ptr, previous_stage->code,
> > previous_stage->code_size);
> > ptr += previous_stage->code_size;
> > diff --git a/src/gallium/include/pipe/p_defines.h b/src/gallium/include/pipe/p_defines.h
> > index 693f041b1da..e99895d30d8 100644
> > --- a/src/gallium/include/pipe/p_defines.h
> > +++ b/src/gallium/include/pipe/p_defines.h
> > @@ -334,21 +334,27 @@ enum pipe_transfer_usage
> > */
> > PIPE_TRANSFER_PERSISTENT = (1 << 13),
> >
> > /**
> > * If PERSISTENT is set, this ensures any writes done by the device are
> > * immediately visible to the CPU and vice versa.
> > *
> > * PIPE_RESOURCE_FLAG_MAP_COHERENT must be set when creating
> > * the resource.
> > */
> > - PIPE_TRANSFER_COHERENT = (1 << 14)
> > + PIPE_TRANSFER_COHERENT = (1 << 14),
> > +
> > + /**
> > + * This and higher bits are reserved for private use by drivers. Drivers
> > + * should use this as (PIPE_TRANSFER_DRV_PRV << i).
> > + */
> > + PIPE_TRANSFER_DRV_PRV = (1 << 24)
> > };
> >
> > /**
> > * Flags for the flush function.
> > */
> > enum pipe_flush_flags
> > {
> > PIPE_FLUSH_END_OF_FRAME = (1 << 0),
> > PIPE_FLUSH_DEFERRED = (1 << 1),
> > PIPE_FLUSH_FENCE_FD = (1 << 2),
> > diff --git a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> > index 9f0d4c12482..355018c76fc 100644
> > --- a/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> > +++ b/src/gallium/winsys/amdgpu/drm/amdgpu_bo.c
> > @@ -49,20 +49,21 @@
> > struct amdgpu_sparse_backing_chunk {
> > uint32_t begin, end;
> > };
> >
> > static struct pb_buffer *
> > amdgpu_bo_create(struct radeon_winsys *rws,
> > uint64_t size,
> > unsigned alignment,
> > enum radeon_bo_domain domain,
> > enum radeon_bo_flag flags);
> > +static void amdgpu_bo_unmap(struct pb_buffer *buf);
> >
> > static bool amdgpu_bo_wait(struct pb_buffer *_buf, uint64_t timeout,
> > enum radeon_bo_usage usage)
> > {
> > struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> > struct amdgpu_winsys *ws = bo->ws;
> > int64_t abs_timeout;
> >
> > if (timeout == 0) {
> > if (p_atomic_read(&bo->num_active_ioctls))
> > @@ -166,20 +167,26 @@ static void amdgpu_bo_remove_fences(struct amdgpu_winsys_bo *bo)
> > bo->max_fences = 0;
> > }
> >
> > void amdgpu_bo_destroy(struct pb_buffer *_buf)
> > {
> > struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> > struct amdgpu_winsys *ws = bo->ws;
> >
> > assert(bo->bo && "must not be called for slab entries");
> >
> > + if (!bo->is_user_ptr && bo->cpu_ptr) {
> > + bo->cpu_ptr = NULL;
> > + amdgpu_bo_unmap(&bo->base);
> > + }
> > + assert(bo->is_user_ptr || bo->u.real.map_count == 0);
> > +
> > if (ws->debug_all_bos) {
> > simple_mtx_lock(&ws->global_bo_list_lock);
> > LIST_DEL(&bo->u.real.global_list_item);
> > ws->num_buffers--;
> > simple_mtx_unlock(&ws->global_bo_list_lock);
> > }
> >
> > simple_mtx_lock(&ws->bo_export_table_lock);
> > util_hash_table_remove(ws->bo_export_table, bo->bo);
> > simple_mtx_unlock(&ws->bo_export_table_lock);
> > @@ -188,54 +195,66 @@ void amdgpu_bo_destroy(struct pb_buffer *_buf)
> > amdgpu_va_range_free(bo->u.real.va_handle);
> > amdgpu_bo_free(bo->bo);
> >
> > amdgpu_bo_remove_fences(bo);
> >
> > if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> > ws->allocated_vram -= align64(bo->base.size, ws->info.gart_page_size);
> > else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> > ws->allocated_gtt -= align64(bo->base.size, ws->info.gart_page_size);
> >
> > - if (bo->u.real.map_count >= 1) {
> > - if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> > - ws->mapped_vram -= bo->base.size;
> > - else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> > - ws->mapped_gtt -= bo->base.size;
> > - ws->num_mapped_buffers--;
> > - }
> > -
> > simple_mtx_destroy(&bo->lock);
> > FREE(bo);
> > }
> >
> > static void amdgpu_bo_destroy_or_cache(struct pb_buffer *_buf)
> > {
> > struct amdgpu_winsys_bo *bo = amdgpu_winsys_bo(_buf);
> >
> > assert(bo->bo); /* slab buffers have a separate vtbl */
> >
> > if (bo->u.real.use_reusable_pool)
> > pb_cache_add_buffer(&bo->u.real.cache_entry);
> > else
> > amdgpu_bo_destroy(_buf);
> > }
> >
> > +static bool amdgpu_bo_do_map(struct amdgpu_winsys_bo *bo, void **cpu)
> > +{
> > + assert(!bo->sparse && bo->bo && !bo->is_user_ptr);
> > + int r = amdgpu_bo_cpu_map(bo->bo, cpu);
> > + if (r) {
> > + /* Clear the cache and try again. */
> > + pb_cache_release_all_buffers(&bo->ws->bo_cache);
> > + r = amdgpu_bo_cpu_map(bo->bo, cpu);
> > + if (r)
> > + return false;
> > + }
> > +
> > + if (p_atomic_inc_return(&bo->u.real.map_count) == 1) {
> > + if (bo->initial_domain & RADEON_DOMAIN_VRAM)
> > + bo->ws->mapped_vram += bo->base.size;
> > + else if (bo->initial_domain & RADEON_DOMAIN_GTT)
> > + bo->ws->mapped_gtt += bo->base.size;
> > + bo->ws->num_mapped_buffers++;
> > + }
> > +
> > + return true;
> > +}
> > +
> > static void *amdgpu_bo_map(struct pb_buffer *buf,
> > struct radeon_cmdbuf *rcs,
> > enum pipe_transfer_usage usage)
> > {
> > struct amdgpu_winsys_bo *bo = (struct amdgpu_winsys_bo*)buf;
> > struct amdgpu_winsys_bo *real;
> > struct amdgpu_cs *cs = (struct amdgpu_cs*)rcs;
> > - int r;
> > - void *cpu = NULL;
> > - uint64_t offset = 0;
> >
> > assert(!bo->sparse);
> >
> > /* If it's not unsynchronized bo_map, flush CS if needed and then wait. */
> > if (!(usage & PIPE_TRANSFER_UNSYNCHRONIZED)) {
> > /* DONTBLOCK doesn't make sense with UNSYNCHRONIZED. */
> > if (usage & PIPE_TRANSFER_DONTBLOCK) {
> > if (!(usage & PIPE_TRANSFER_WRITE)) {
> > /* Mapping for read.
> > *
> > @@ -306,63 +325,74 @@ static void *amdgpu_bo_map(struct pb_buffer *buf,
> > }
> >
> > amdgpu_bo_wait((struct pb_buffer*)bo,
> > PIPE_TIMEOUT_INFINITE,
> > RADEON_USAGE_READWRITE);
> > }
> >
> > bo->ws->buffer_wait_time += os_time_get_nano() - time;
> > }
> > }
> >
> > - /* If the buffer is created from user memory, return the user pointer. */
> > - if (bo->user_ptr)
> > - return bo->user_ptr;
> > + /* Buffer synchronization has been checked, now actually map the buffer. */
> > + void *cpu = NULL;
> > + uint64_t offset = 0;
> >
> > if (bo->bo) {
> > real = bo;
> > } else {
> > real = bo->u.slab.real;
> > offset = bo->va - real->va;
> > }
> >
> > - r = amdgpu_bo_cpu_map(real->bo, &cpu);
> > - if (r) {
> > - /* Clear the cache and try again. */
> > - pb_cache_release_all_buffers(&real->ws->bo_cache);
> > - r = amdgpu_bo_cpu_map(real->bo, &cpu);
> > - if (r)
> > - return NULL;
> > + if (usage & RADEON_TRANSFER_TEMPORARY) {
> > + if (real->is_user_ptr) {
> > + cpu = real->cpu_ptr;
> > + } else {
> > + if (!amdgpu_bo_do_map(real, &cpu))
> > + return NULL;
> > + }
> > + } else {
> > + cpu = real->cpu_ptr;
> > + if (!cpu) {
> >
> >
> > I think this is unsafe on 64-bit CPUs, because the lower 32 bits can be
> > initialized but not the upper 32 bits. You would either have to check
> > both 32-bit halves separately (if they are both expected to be non-zero)
> > or you would have to lock the mutex first.
>
> Well, there is actually a potential problem, but it's not that. A 64-bit
> CPU will do aligned 64-bit loads/stores atomically, but
> amdgpu_bo_cpu_map doesn't necessarily guarantee that it'll set the
> address atomically. I think it does that today, but it's a bad idea to
> depend on it.
>
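> What I have in mind for v2 is roughly the following -- an untested sketch
> (exact field names aside): do the mapping into a local variable under the
> per-BO mutex and only publish the pointer afterwards, so a lock-free reader
> can never observe a half-written value no matter what amdgpu_bo_cpu_map
> does internally:
>
>    static void *amdgpu_bo_map_permanent(struct amdgpu_winsys_bo *real)
>    {
>       /* Lock-free fast path: a single aligned pointer-sized load. */
>       void *cpu = real->cpu_ptr;
>       if (cpu)
>          return cpu;
>
>       simple_mtx_lock(&real->lock);
>       /* Re-check under the lock in case another thread won the race. */
>       cpu = real->cpu_ptr;
>       if (!cpu) {
>          if (!amdgpu_bo_do_map(real, &cpu)) {
>             simple_mtx_unlock(&real->lock);
>             return NULL;
>          }
>          /* Publish only after the map has fully succeeded, with a single
>           * aligned store that we control (instead of whatever
>           * amdgpu_bo_cpu_map does with its out-pointer). */
>          real->cpu_ptr = cpu;
>       }
>       simple_mtx_unlock(&real->lock);
>       return cpu;
>    }
>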
>
> > If all my remarks are addressed, I'll ack this, but I still think that
> > this patch adds unnecessary cruft, complexity, and fragility in order to
> > fix an issue that is very trivial in nature. The overflow check in
> > libdrm is simple, clean, and robust.
>
> The overflow check, or actually the saturating behavior, is also very
> surprising and unusual. Reference counting doesn't usually work that way.

It's not reference counting. It's not even the map count. It's the map call
count. (I wrote the original code)

Marek