[Intel-gfx] sna: buffer overrun

Sun Nov 3 23:06:55 CET 2013

> Date: Sun, 3 Nov 2013 19:43:48 +0000
> From: Chris Wilson <chris at chris-wilson.co.uk>
> 
> On Sun, Nov 03, 2013 at 01:22:52PM +0100, Mark Kettenis wrote:
> > I ran into a "regression" in xf86-video-intel master.  X would spin
> > for several seconds and eventually I'd see a message like:
> > 
> > [   170.724] kgem_bo_write: failed to write 3600 bytes into BO handle=175: 14
> > 
> > in Xorg.0.log
> > 
> > Bisected it down to the following commit:
> > 
> > 
> > commit 4f41bf3de059c4e0a03fb161fb2e78d94be69e3f
> > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > Date:   Tue Oct 29 09:56:10 2013 +0000
> > 
> >     sna: Try harder to complete writes
> >     
> >     Expunge our caches if we fail to write into a bo (presuming that
> >     allocation failure is the likely fixable cause).
> >     
> >     Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > 
> > 
> > It's obviously trying really really hard to write into that bo now ;).
> > But failing eventually anyway.  So I started digging deeper.  The
> > pwrite is failing with EFAULT.  Some kernel debugging revealed that
> > the fault happens when copying data from user space.  Adding some
> > debugging printf's shows that
> > 
> > pwrite.data_ptr = 0x1cc19d9831f0
> > pwrite.size = 3648
> > 
> > and that the fault happens at address 0x1cc19d984000.  This is the
> > same 3600 byte buffer from the message.  Obviously pwrite is reading
> > beyond the end of the buffer, running into the next page, which isn't
> > there.
> > 
> > Now I'm seeing this on OpenBSD.  I'm guessing this is actually a
> > malloc()'ed buffer.  And on OpenBSD malloc() is extremely nasty.  It
> > tries to align the allocated space such that the end lies right at the
> > end of a page and inserts a guard page.  It does this to catch buffer
> > overruns like this.  You're much less likely to hit something like
> > this on Linux, unless you're using a special debug malloc library.
> > 
> > This problem was introduced with the following commit:
> > 
> > 
> > commit 95f4da647a4055545b09cae0834df0fa2127a458
> > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > Date:   Wed Nov 30 11:59:31 2011 +0000
> > 
> >     sna: Align pwrite to transfer whole cachelines
> >     
> >     Daniel claims that this is will be faster, or will be once he has
> >     completed rewriting pwrite!
> >     
> >     Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > 
> > 
> > I fear that optimization simply isn't safe.
> 
> It was. All the local allocations are page-aligned, the issue are the
> couple of places where we upload from other buffers.
> 
> diff --git a/src/sna/kgem.c b/src/sna/kgem.c
> index 264a379..1217367 100644
> --- a/src/sna/kgem.c
> +++ b/src/sna/kgem.c
> @@ -482,7 +482,7 @@ bool kgem_bo_write(struct kgem *kgem, struct kgem_bo *bo,
>  
>         assert(length <= bytes(bo));
>  retry:
> -       if (gem_write(kgem->fd, bo->handle, 0, length, data)) {
> +       if (__gem_write(kgem->fd, bo->handle, 0, length, data)) {
>                 int err = errno;
>  
>                 assert(err != EINVAL);

Fixed it the same way here, and things seem stable enough.  So I guess
that's a:

Tested-by: Mark Kettenis <kettenis at openbsd.org>
Reviewed-by: Mark Kettenis <kettenis at openbsd.org>

Thanks,

Mark