Mesa (staging/22.1): iris: fix race condition during busy tracking

GitLab Mirror gitlab-mirror at kemper.freedesktop.org
Fri Apr 22 16:02:24 UTC 2022


Module: Mesa
Branch: staging/22.1
Commit: de0c61c03bbdb1c7654a5bed1f69a3bf3ac68656
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=de0c61c03bbdb1c7654a5bed1f69a3bf3ac68656

Author: Paulo Zanoni <paulo.r.zanoni at intel.com>
Date:   Wed Feb  9 12:56:35 2022 -0800

iris: fix race condition during busy tracking

The Iris code that deals with implicit tracking is protected by
bufmgr->bo_deps_lock. Before this patch, we hold this lock during
update_batch_syncobjs() but don't keep it held until we actually
submit the batch in the execbuf ioctl. This can lead to the following
race condition:

  - Context C1 generates a batch B1 that signals syncobj S1.
  - Context C2 generates a batch B2 that depends on something that B1
    from C1 is using, so we mark B2 as having to wait syncobj S1.
  - C2 calls submit_batch() before C1 does it.
  - The Kernel detects it was told to wait on syncobj S1 that was
    never even submitted, so it returns EINVAL to the execbuf ioctl.
  - We run abort() at the end of _iris_batch_flush().
    - If DEBUG is defined, we also print:
      iris: Failed to submit batchbuffer: Invalid argument

I couldn't figure out a way to reproduce this issue with real
workloads, but I was able to write a small reproducer to trigger this.
Basically it's a little GL program that has lots of contexts running
in different threads submitting compute shaders that keep using the
same SSBOs. I'll submit this as a piglit test. Edit: Tapani found a
dEQP test case which fails intermintently without this fix, so I'm not
sure a new Piglit is worth it now.

The solution itself is quite simple: just keep bo_deps_lock held all
the way from update_batch_syncobjs() until ioctl(). In order to make
that easier we just call update_batch_syncobjs() a little later. We
have to drop the lock as soon as the ioctl returns because removing
the references on the buffers would trigger other functions to try to
grab the lock again, leading to deadlocks.

Thanks to Kenneth Graunke for pointing out this issue.

This has also been confirmed to fix a dEQP test that was giving
intermittent failures:
  dEQP-EGL.functional.sharing.gles2.multithread.random.images.copyteximage2d.12

v2: Move decode_batch() out, just to be safe (Jason).
v3: Do it all after assembling validation_list (Ken).

Cc: mesa-stable
Fixes: 89a34cb8450a ("iris: switch to explicit busy tracking")
Tested-by: Tapani Pälli <tapani.palli at intel.com>
Reviewed-by: Jason Ekstrand <jason.ekstrand at collabora.com>
Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
Signed-off-by: Paulo Zanoni <paulo.r.zanoni at intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14964>
(cherry picked from commit 3532c374de2de1655b7d56aca69588c3f5de19ee)

---

 .pick_status.json                     |  2 +-
 src/gallium/drivers/iris/iris_batch.c | 28 ++++++++++++++++------------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/.pick_status.json b/.pick_status.json
index eb56f524b63..900f6034816 100644
--- a/.pick_status.json
+++ b/.pick_status.json
@@ -427,7 +427,7 @@
         "description": "iris: fix race condition during busy tracking",
         "nominated": true,
         "nomination_type": 1,
-        "resolution": 0,
+        "resolution": 1,
         "main_sha": null,
         "because_sha": "89a34cb8450a6fdaceb0e537613871fa86d93132"
     },
diff --git a/src/gallium/drivers/iris/iris_batch.c b/src/gallium/drivers/iris/iris_batch.c
index 26a07df1a02..f4a52f04ce6 100644
--- a/src/gallium/drivers/iris/iris_batch.c
+++ b/src/gallium/drivers/iris/iris_batch.c
@@ -882,11 +882,6 @@ update_bo_syncobjs(struct iris_batch *batch, struct iris_bo *bo, bool write)
 static void
 update_batch_syncobjs(struct iris_batch *batch)
 {
-   struct iris_bufmgr *bufmgr = batch->screen->bufmgr;
-   simple_mtx_t *bo_deps_lock = iris_bufmgr_get_bo_deps_lock(bufmgr);
-
-   simple_mtx_lock(bo_deps_lock);
-
    for (int i = 0; i < batch->exec_count; i++) {
       struct iris_bo *bo = batch->exec_bos[i];
       bool write = BITSET_TEST(batch->bos_written, i);
@@ -896,7 +891,6 @@ update_batch_syncobjs(struct iris_batch *batch)
 
       update_bo_syncobjs(batch, bo, write);
    }
-   simple_mtx_unlock(bo_deps_lock);
 }
 
 /**
@@ -905,6 +899,9 @@ update_batch_syncobjs(struct iris_batch *batch)
 static int
 submit_batch(struct iris_batch *batch)
 {
+   struct iris_bufmgr *bufmgr = batch->screen->bufmgr;
+   simple_mtx_t *bo_deps_lock = iris_bufmgr_get_bo_deps_lock(bufmgr);
+
    iris_bo_unmap(batch->bo);
 
    struct drm_i915_gem_exec_object2 *validation_list =
@@ -938,15 +935,22 @@ submit_batch(struct iris_batch *batch)
 
    free(index_for_handle);
 
+   /* The decode operation may map and wait on the batch buffer, which could
+    * in theory try to grab bo_deps_lock. Let's keep it safe and decode
+    * outside the lock.
+    */
+   if (INTEL_DEBUG(DEBUG_BATCH))
+      decode_batch(batch);
+
+   simple_mtx_lock(bo_deps_lock);
+
+   update_batch_syncobjs(batch);
+
    if (INTEL_DEBUG(DEBUG_BATCH | DEBUG_SUBMIT)) {
       dump_fence_list(batch);
       dump_bo_list(batch);
    }
 
-   if (INTEL_DEBUG(DEBUG_BATCH)) {
-      decode_batch(batch);
-   }
-
    /* The requirement for using I915_EXEC_NO_RELOC are:
     *
     *   The addresses written in the objects must match the corresponding
@@ -984,6 +988,8 @@ submit_batch(struct iris_batch *batch)
        intel_ioctl(batch->screen->fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf))
       ret = -errno;
 
+   simple_mtx_unlock(bo_deps_lock);
+
    for (int i = 0; i < batch->exec_count; i++) {
       struct iris_bo *bo = batch->exec_bos[i];
 
@@ -1029,8 +1035,6 @@ _iris_batch_flush(struct iris_batch *batch, const char *file, int line)
 
    iris_finish_batch(batch);
 
-   update_batch_syncobjs(batch);
-
    if (INTEL_DEBUG(DEBUG_BATCH | DEBUG_SUBMIT | DEBUG_PIPE_CONTROL)) {
       const char *basefile = strstr(file, "iris/");
       if (basefile)



More information about the mesa-commit mailing list