[Bug 103304] multi-threaded usage of Gallium RadeonSI leads to NULL pointer exception in pb_cache_reclaim_buffer

Tue Oct 17 06:48:27 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=103304

            Bug ID: 103304
           Summary: multi-threaded usage of Gallium RadeonSI leads to NULL
                    pointer exception in pb_cache_reclaim_buffer
           Product: Mesa
           Version: 17.0
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: Drivers/Gallium/radeonsi
          Assignee: dri-devel at lists.freedesktop.org
          Reporter: lper.home at gmail.com
        QA Contact: dri-devel at lists.freedesktop.org

The issue is not present in Mesa 11.X. It is, however, present in Mesa 13.0.X
and 17.0.X, and as far as I can see from the code, it is probably also present
in the latest Mesa 17.2.X.
Our code is very similar to the second example in
https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading : we have two
shared contexts. Rendering is done in one context/thread and texture uploading
in the other context/thread. This is the case in which we hit the race that
causes the crash (on average it takes about an hour to hit the issue).

The crash has the following backtrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  pb_cache_reclaim_buffer (mgr=mgr@entry=0x1e8dd30, size=size@entry=2088960,
alignment=alignment@entry=4096, usage=usage@entry=20,
    bucket_index=bucket_index@entry=3) at pipebuffer/pb_cache.c:183
#1  0x00007fe2671c50e7 in amdgpu_bo_create (rws=0x1e8dbf0, size=<optimized
out>, alignment=4096, domain=RADEON_DOMAIN_VRAM_GTT, flags=RADEON_FLAG_GTT_WC)
    at amdgpu_bo.c:754
#2  0x00007fe2671db666 in r600_alloc_resource (rscreen=rscreen@entry=0x1e8f0c0,
res=res@entry=0x7fe24c2d3100) at r600_buffer_common.c:197
#3  0x00007fe2671e6eff in r600_texture_invalidate_storage
(rctx=rctx@entry=0x1f9e900, rtex=rtex@entry=0x7fe24c2d3100) at
r600_texture.c:1414
#4  0x00007fe2671eb474 in r600_texture_transfer_map (ctx=0x1f9e900,
texture=0x7fe24c2d3100, level=0, usage=258, box=0x7fe265bca970,
    ptransfer=0x7fe265bca898) at r600_texture.c:1483
#5  0x00007fe267041807 in u_transfer_map_vtbl (context=<optimized out>,
resource=<optimized out>, level=<optimized out>, usage=<optimized out>,
    box=<optimized out>, transfer=<optimized out>) at util/u_transfer.c:138
#6  0x00007fe267041732 in u_default_texture_subdata (pipe=0x1f9e900,
resource=0x7fe24c2d3100, level=<optimized out>, usage=<optimized out>,
    box=0x7fe265bca970, data=0x7fe218ac05e0, stride=1920, layer_stride=2088960)
at util/u_transfer.c:59
#7  0x00007fe266e51137 in st_TexSubImage (ctx=<optimized out>, dims=2,
texImage=<optimized out>, xoffset=0, yoffset=0, zoffset=0, width=1920,
    height=1088, depth=1, format=6403, type=5121, pixels=0x7fe218ac05e0,
unpack=0x2000fc0) at state_tracker/st_cb_texture.c:1412
#8  0x00007fe266dd75bf in _mesa_texture_sub_image (ctx=ctx@entry=0x1fe5d50,
dims=dims@entry=2, texObj=texObj@entry=0x7fe24c2d2ca0,
    texImage=0x7fe24c2cda20, target=target@entry=3553, level=level@entry=0,
xoffset=xoffset@entry=0, yoffset=yoffset@entry=0, zoffset=zoffset@entry=0,
    width=width@entry=1920, height=height@entry=1088, depth=depth@entry=1,
format=format@entry=6403, type=type@entry=5121,
    pixels=pixels@entry=0x7fe218ac05e0, dsa=dsa@entry=false) at
main/teximage.c:3239
#9  0x00007fe266dd7787 in texsubimage (ctx=0x1fe5d50, dims=dims@entry=2,
target=3553, level=0, xoffset=0, yoffset=0, zoffset=zoffset@entry=0,
    width=1920, height=1088, depth=depth@entry=1, format=format@entry=6403,
type=type@entry=5121, pixels=pixels@entry=0x7fe218ac05e0,
    callerName=callerName@entry=0x7fe26723c036 "glTexSubImage2D") at
main/teximage.c:3297
#10 0x00007fe266dd7b49 in _mesa_TexSubImage2D (target=<optimized out>,
level=<optimized out>, xoffset=<optimized out>, yoffset=<optimized out>,
    width=<optimized out>, height=<optimized out>, format=6403, type=5121,
pixels=0x7fe218ac05e0) at main/teximage.c:3438

If we enable assert() handling in the Mesa library, this crash does not occur,
because an assert is triggered first:

#0  0x00007fd388fed124 in raise () from /lib64/libc.so.6
#1  0x00007fd388fee58a in abort () from /lib64/libc.so.6
#2  0x00007fd388fe5e47 in ?? () from /lib64/libc.so.6
#3  0x00007fd388fe5ef2 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fd373986091 in pipe_reference_described (get_desc=<optimized out>,
reference=0x7fd35801b100, ptr=0x0)
    at gallium/auxiliary/util/u_inlines.h:82
#5  pipe_reference (reference=0x7fd35801b100, ptr=0x0) at
gallium/auxiliary/util/u_inlines.h:102
#6  pb_reference (src=0x7fd35801b100, dst=0x2a260d0) at
gallium/auxiliary/pipebuffer/pb_buffer.h:241
#7  amdgpu_winsys_bo_reference (src=0x7fd35801b100, dst=0x2a260d0) at
amdgpu_bo.h:116
#8  amdgpu_lookup_or_add_real_buffer (acs=0x3fea9d0, bo=0x7fd35801b100) at
amdgpu_cs.c:358
#9  0x00007fd3739863ac in amdgpu_cs_add_buffer (rcs=<optimized out>,
buf=<optimized out>, usage=10, domains=<optimized out>,
    priority=RADEON_PRIO_SAMPLER_TEXTURE) at amdgpu_cs.c:450
#10 0x00007fd3738d79fd in radeon_add_to_buffer_list
(priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ,
rbo=0x7fd358019cd0, ring=0x1eedeb8,
    rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:77
#11 radeon_add_to_buffer_list_check_mem (check_mem=false,
priority=RADEON_PRIO_SAMPLER_TEXTURE, usage=RADEON_USAGE_READ,
rbo=0x7fd358019cd0,
    ring=0x1eedeb8, rctx=0x1eedb60) at gallium/drivers/radeon/r600_cs.h:114
#12 si_sampler_view_add_buffer (sctx=sctx@entry=0x1eedb60,
resource=0x7fd358019cd0, usage=usage@entry=RADEON_USAGE_READ,
    is_stencil_sampler=<optimized out>, check_mem=check_mem@entry=false) at
si_descriptors.c:316
#13 0x00007fd3738d7cb2 in si_sampler_views_begin_new_cs
(sctx=sctx@entry=0x1eedb60, views=views@entry=0x1eef360) at
si_descriptors.c:350
#14 0x00007fd3738dfd5a in si_all_descriptors_begin_new_cs
(sctx=sctx@entry=0x1eedb60) at si_descriptors.c:2019
#15 0x00007fd3738e0983 in si_begin_new_cs (ctx=ctx@entry=0x1eedb60) at
si_hw_context.c:227
#16 0x00007fd3738e14d3 in si_context_gfx_flush (context=0x1eedb60, flags=0,
fence=0x0) at si_hw_context.c:162
#17 0x00007fd37399c2a7 in r600_flush_from_st (ctx=0x1eedb60, fence=0x0,
flags=<optimized out>) at r600_pipe_common.c:381
#18 0x00007fd3735587ff in st_flush (st=st@entry=0x3e33870,
fence=fence@entry=0x0, flags=flags@entry=0) at state_tracker/st_cb_flush.c:87
#19 0x00007fd37355881e in st_glFlush (ctx=<optimized out>) at
state_tracker/st_cb_flush.c:121
#20 0x00007fd3733f7d71 in _mesa_flush (ctx=0x42cb4d0) at main/context.c:1838
#21 0x00007fd3733f8436 in _mesa_Flush () at main/context.c:1870

What happens is a race between the texture-upload thread, which calls
r600_texture_invalidate_storage(), and the glFlush call in the rendering
thread, which calls radeon_add_to_buffer_list().
radeon_add_to_buffer_list() executes the following code:

  return rctx->ws->cs_add_buffer(
                  ring->cs, rbo->buf,
                  (enum radeon_bo_usage)(usage | RADEON_USAGE_SYNCHRONIZED),
                  rbo->domains, priority) * 4;

Meanwhile, r600_alloc_resource() executes the following code:

        /* Replace the pointer such that if res->buf wasn't NULL, it won't be
         * NULL. This should prevent crashes with multiple contexts using
         * the same buffer where one of the contexts invalidates it while
         * the others are using it. */
        old_buf = res->buf;
        res->buf = new_buf; /* should be atomic */

Here the rbo variable in radeon_add_to_buffer_list and the res variable in
r600_alloc_resource refer to the same object. While cs_add_buffer is still
processing the buffer, that buffer is no longer the one attached to rbo,
because the other thread has swapped it. r600_alloc_resource then drops its
reference to the old buffer, so the reference count reaches zero, which
triggers the assert in the other thread (the assert checks the reference
count).
Without asserts enabled, the buf object is cleaned up, which sets its
prev/next pointers to NULL and causes a crash in pb_cache_reclaim_buffer when
it walks its bucket/cache list of buffers.
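
To make the lifetime problem concrete, here is a minimal sketch of one way a
shared, swappable buffer pointer can be made safe. It is not the Mesa code and
not a proposed patch: struct buf, struct resource and all function names below
are hypothetical, and the mutex is only an assumption about how the pointer
could be protected. The point it illustrates is that the reader must take its
own reference to the buffer before using it, so that a concurrent swap cannot
drop the last reference underneath it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted winsys buffer. */
struct buf {
    atomic_int refcount;
    int id;
};

/* Hypothetical stand-in for a resource whose storage can be replaced. */
struct resource {
    pthread_mutex_t lock;   /* protects the 'buf' pointer only */
    struct buf *buf;
};

static struct buf *buf_create(int id)
{
    struct buf *b = malloc(sizeof(*b));
    assert(b != NULL);
    atomic_init(&b->refcount, 1);   /* the creator owns one reference */
    b->id = id;
    return b;
}

static void buf_unref(struct buf *b)
{
    /* fetch_sub returns the previous value: 1 means we held the last ref. */
    if (atomic_fetch_sub(&b->refcount, 1) == 1)
        free(b);
}

static void resource_init(struct resource *res, struct buf *initial)
{
    pthread_mutex_init(&res->lock, NULL);
    res->buf = initial;             /* the resource takes over that reference */
}

/* Reader side (the "flush" thread): pin the current buffer before use. */
static struct buf *resource_get_buf(struct resource *res)
{
    pthread_mutex_lock(&res->lock);
    struct buf *b = res->buf;
    atomic_fetch_add(&b->refcount, 1);
    pthread_mutex_unlock(&res->lock);
    return b;                       /* caller must buf_unref() when done */
}

/* Writer side (the "invalidate" thread): swap, then drop the old reference. */
static void resource_replace_buf(struct resource *res, struct buf *new_buf)
{
    pthread_mutex_lock(&res->lock);
    struct buf *old = res->buf;
    res->buf = new_buf;
    pthread_mutex_unlock(&res->lock);
    buf_unref(old);                 /* freed only if no reader still pins it */
}
```

With this pattern a reader that obtained the old buffer just before the swap
still holds a valid object afterwards; the old buffer is freed only when that
reader calls buf_unref().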

We performed two tests:
-       Performing the texture upload in the render thread (done by a dirty
hack in our code): the stability issue is gone.
-       Making r600_can_invalidate_texture() always return false, so that the
reallocation is never done: the stability issue is gone.

These two tests prove that the race condition comes from the multi-threading
aspect and the texture invalidation during texture upload.
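
For illustration only, the first test (moving the upload into the render
thread) could be structured along the following lines on the application
side. This is a sketch under assumptions, not our actual code: the queue, the
job struct and the callback are hypothetical names, and count_upload merely
stands in for the real upload call (glTexSubImage2D in our case), which would
have to run in the render thread with its context current.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 16

/* Hypothetical description of one pending texture upload. */
struct upload_job {
    const void *pixels;
    int width, height;
};

struct upload_queue {
    pthread_mutex_t lock;
    struct upload_job jobs[QUEUE_CAP];
    size_t count;
};

typedef void (*upload_fn)(const struct upload_job *job, void *user);

static void queue_init(struct upload_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->count = 0;
}

/* Called from the producer thread. Returns 0 on success, -1 if full. */
static int queue_push(struct upload_queue *q, struct upload_job job)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->count < QUEUE_CAP) {
        q->jobs[q->count++] = job;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Called from the render thread before drawing: take all pending jobs
 * under the lock, then run the uploads outside it. Returns the job count. */
static size_t queue_drain(struct upload_queue *q, upload_fn upload, void *user)
{
    struct upload_job local[QUEUE_CAP];
    size_t n, i;

    assert(upload != NULL);
    pthread_mutex_lock(&q->lock);
    n = q->count;
    for (i = 0; i < n; i++)
        local[i] = q->jobs[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);

    for (i = 0; i < n; i++)
        upload(&local[i], user);
    return n;
}

/* Stand-in for the real upload: just tallies the number of pixels. */
static void count_upload(const struct upload_job *job, void *user)
{
    *(size_t *)user += (size_t)job->width * (size_t)job->height;
}
```

Because every upload and every flush then happens in the same thread, the
invalidation and the buffer-list update can no longer overlap, which matches
what we observed in the test.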

I suspect the problem lies in this check in r600_texture_transfer_map():

                        if (r600_can_invalidate_texture(rctx->screen, rtex,
                                                        usage, box))
                                r600_texture_invalidate_storage(rctx, rtex);
                        else
                                use_staging_texture = true;

That is, r600_can_invalidate_texture() returns true although it should not,
because shortly afterwards the texture is used in another thread by the
glFlush call.
