[Bug 99209] [EGL, i965] dEQP-EGL.functional.sharing.gles2.multithread.simple_egl_server_sync.textures.copyteximage2d_texsubimage2d_render

Mon Jan 9 15:58:33 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=99209

--- Comment #7 from Jason Ekstrand <jason at jlekstrand.net> ---
(In reply to Tapani Pälli from comment #6)
> Created attachment 128824 [details] [review]
> patch to remove meta usage
> 
> Here's a patch to remove meta usage, this fixes the issue for me. I don't
> spot any regression in CI, only bunch of tests that start to pass when
> expecting a failure. Could we do such move or is meta version preferred?

Removing meta shouldn't fix anything here since meta isn't really doing
anything the user couldn't already do.  If removing meta fixes it, then we're
just papering over some other bug.

(In reply to Kenneth Graunke from comment #4)
> (In reply to Chad Versace from comment #3)
> > +ken +jason
> > 
> > Helgrind complains about potential read/write races on
> > drm_intel_gem_bo::offset64. According to helgrind, the conflict occurs when
> > intel_batchbuffer_flush() updates the offset in thread1 and
> > brw_update_texture_surfaces() reads the offset during batchbuffer
> > construction in thread2.

This has me concerned...

> I think that should be harmless.  offset64 is the presumed location of the
> buffer - i.e. our guess where the kernel relocated it to on the last
> execbuf.  If we guess correctly, the kernel sees that everything's where we
> think it is and skips relocating things.  If we guess wrong, it goes ahead
> and does relocations anyway.

Not so much.  The race is actually way more subtle.  When we add a relocation
to the BO, we write some value into the BO based on a read of offset64.  We
then pass the list of BOs off to the kernel with their offsets.  If the offsets
in the list match the kernel's view of addresses, then it skips doing any
relocations.  Suppose we have the following:

1) We create a texture.  By default it has offset64 == 0
2) Thread A emits a relocation for some reason
3) Thread B does stuff, flushes the batch, and writes the new value into
offset64
4) Thread A finishes what it was doing and flushes the batch

Given that the kernel doesn't move things around on us like crazy, there's a
decent chance that offset64 hasn't changed between (3) and (4) so the offsets
will match the kernel's view when A submits in (4).  However, because we
emitted a relocation in (2) and the offset changed in (3) before we submit the
batch in (4), the offset we provide to the kernel in the BO list is different
from the offset with which the relocation was emitted.  The kernel doesn't
relocate and things explode.
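
To make the sequence concrete, here is a toy replay of steps (1) through (4).
It is only a sketch: the toy_* structs and functions are made up for
illustration and are not the libdrm or i965 API, the "kernel" is reduced to a
single comparison, and the two threads are run one after the other in the
problematic order rather than actually racing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a buffer object; NOT the real drm_intel_bo. */
struct toy_bo {
    uint64_t offset64;     /* shared presumed offset, as in the bug */
    uint64_t kernel_addr;  /* where the toy kernel actually placed the BO */
};

/* Toy batch holding a single relocation to one target BO. */
struct toy_batch {
    struct toy_bo *target;
    uint64_t reloc_value;  /* address baked into the batch at emit time */
};

/* Emit a relocation: write the currently presumed address into the batch. */
static void toy_emit_reloc(struct toy_batch *batch, struct toy_bo *bo)
{
    batch->target = bo;
    batch->reloc_value = bo->offset64;
}

/*
 * Toy execbuf: the kernel patches the batch only when the presumed offset
 * handed to it differs from the real address. Returns true if the address
 * the GPU would end up reading is correct.
 */
static bool toy_submit(struct toy_batch *batch, uint64_t presumed)
{
    struct toy_bo *bo = batch->target;

    if (presumed != bo->kernel_addr)
        batch->reloc_value = bo->kernel_addr;  /* kernel relocates */
    bo->offset64 = bo->kernel_addr;            /* userspace records new offset */
    return batch->reloc_value == bo->kernel_addr;
}

/* Replays steps (1)-(4); returns true if thread A's batch is correct. */
static bool toy_replay_shared_offset(void)
{
    struct toy_bo tex = { .offset64 = 0, .kernel_addr = 0x10000 };  /* (1) */
    struct toy_batch a = {0}, b = {0};

    toy_emit_reloc(&a, &tex);                  /* (2) A bakes in offset 0 */
    toy_emit_reloc(&b, &tex);
    bool b_ok = toy_submit(&b, tex.offset64);  /* (3) B flushes, offset64 changes */
    assert(b_ok);
    return toy_submit(&a, tex.offset64);       /* (4) A presumes the new offset */
}
```

In this model toy_replay_shared_offset() returns false: A's batch still holds
the address 0 from step (2), but the presumed offset it hands over in step (4)
already matches the kernel's view, so nothing gets patched.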

In other words, the problem isn't so much a race between threads A and B as it
is that all relocations need to happen with the same offset as the one we send
off to the kernel.  This is fine in a single-context environment because we
only have one batch in flight at any given time and so the value of offset64 is
always exactly what was returned by the kernel last time and will remain that
way until the current batch is submitted and we update it.  In a multi-context
environment we're just toast.

The solution here is that each context needs to track its own set of offsets. 
There's no real utility in sharing them since each hardware context has its
own GTT on modern hardware.  On older hardware, we'll do the relocation for the
first batch we submit with the BO but then the kernel will tell us the offset
and future batches won't relocate.  In either case, tracking it globally in
libdrm isn't gaining us anything and is, in fact, the problem.
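
A rough sketch of what per-context tracking could look like, under the same
toy model of the kernel. Everything named sketch_* is hypothetical and is not
a proposal for the actual libdrm or i965 interfaces; for simplicity the BO has
one address in a single address space and each context holds one relocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_BOS 4

/* Hypothetical BO: carries no shared presumed offset at all. */
struct sketch_bo {
    unsigned handle;       /* index into each context's offset table */
    uint64_t kernel_addr;  /* where the toy kernel placed the BO */
};

/* Hypothetical per-context state with its own view of BO offsets. */
struct sketch_ctx {
    uint64_t presumed[SKETCH_MAX_BOS];  /* this context's presumed offsets */
    struct sketch_bo *target;           /* single relocation target */
    uint64_t reloc_value;               /* address baked into the batch */
};

/* Emit a relocation using only this context's presumed offset. */
static void sketch_emit_reloc(struct sketch_ctx *ctx, struct sketch_bo *bo)
{
    assert(bo->handle < SKETCH_MAX_BOS);
    ctx->target = bo;
    ctx->reloc_value = ctx->presumed[bo->handle];
}

/*
 * Toy execbuf: hand the kernel the same presumed offset the relocation was
 * emitted with, then record the result in this context only.
 */
static bool sketch_submit(struct sketch_ctx *ctx)
{
    struct sketch_bo *bo = ctx->target;
    uint64_t *presumed = &ctx->presumed[bo->handle];

    if (*presumed != bo->kernel_addr)
        ctx->reloc_value = bo->kernel_addr;  /* toy kernel relocates */
    *presumed = bo->kernel_addr;             /* other contexts are untouched */
    return ctx->reloc_value == bo->kernel_addr;
}

/* Same interleaving as steps (1)-(4), but with per-context offsets. */
static bool sketch_replay_per_context(void)
{
    struct sketch_bo tex = { .handle = 0, .kernel_addr = 0x10000 };
    struct sketch_ctx a = {0}, b = {0};

    sketch_emit_reloc(&a, &tex);    /* A bakes in its own presumed 0 */
    sketch_emit_reloc(&b, &tex);
    bool b_ok = sketch_submit(&b);  /* B's flush updates only B's table */
    bool a_ok = sketch_submit(&a);  /* A still presumes 0, kernel relocates */
    return a_ok && b_ok;
}
```

Because B's flush no longer changes the value A presumes, the offset A submits
is the one its relocation was emitted with, and both batches end up correct
in this model.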
