[Intel-gfx] sna: buffer overrun

Sun Nov 3 20:43:48 CET 2013

On Sun, Nov 03, 2013 at 01:22:52PM +0100, Mark Kettenis wrote:
> I ran into a "regression" in xf86-video-intel master.  X would spin
> for several seconds and eventually I'd see a message like:
> 
> [   170.724] kgem_bo_write: failed to write 3600 bytes into BO handle=175: 14
> 
> in Xorg.0.log
> 
> Bisected it down to the following commit:
> 
> 
> commit 4f41bf3de059c4e0a03fb161fb2e78d94be69e3f
> Author: Chris Wilson <chris at chris-wilson.co.uk>
> Date:   Tue Oct 29 09:56:10 2013 +0000
> 
>     sna: Try harder to complete writes
>     
>     Expunge our caches if we fail to write into a bo (presuming that
>     allocation failure is the likely fixable cause).
>     
>     Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> 
> 
> It's obviously trying really really hard to write into that bo now ;).
> But failing eventually anyway.  So I started digging deeper.  The
> pwrite is failing with EFAULT.  Some kernel debugging revealed that
> the fault happens when copying data from user space.  Adding some
> debugging printf's shows that
> 
> pwrite.data_ptr = 0x1cc19d9831f0
> pwrite.size = 3648
> 
> and that the fault happens at address 0x1cc19d984000.  This is the
> same 3600 byte buffer from the message.  Obviously pwrite is reading
> beyond the end of the buffer, running into the next page, which isn't
> there.
> 
> Now I'm seeing this on OpenBSD.  I'm guessing this is actually a
> malloc()'ed buffer.  And on OpenBSD malloc() is extremely nasty.  It
> tries to align the allocated space such that the end lies right at the
> end of a page and inserts a guard page.  It does this to catch buffer
> overruns like this.  You're much less likely to hit something like
> this on Linux, unless you're using a special debug malloc library.
> 
> This problem was introduced with the following commit:
> 
> 
> commit 95f4da647a4055545b09cae0834df0fa2127a458
> Author: Chris Wilson <chris at chris-wilson.co.uk>
> Date:   Wed Nov 30 11:59:31 2011 +0000
> 
>     sna: Align pwrite to transfer whole cachelines
>     
>     Daniel claims that this is will be faster, or will be once he has
>     completed rewriting pwrite!
>     
>     Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> 
> 
> I fear that optimization simply isn't safe.

It was. All the local allocations are page-aligned, the issue are the
couple of places where we upload from other buffers.

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 264a379..1217367 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -482,7 +482,7 @@ bool kgem_bo_write(struct kgem *kgem, struct kgem_bo *bo,
 
        assert(length <= bytes(bo));
 retry:
-       if (gem_write(kgem->fd, bo->handle, 0, length, data)) {
+       if (__gem_write(kgem->fd, bo->handle, 0, length, data)) {
                int err = errno;
 
                assert(err != EINVAL);
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre