i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

Sat Mar 17 17:43:55 PDT 2012

On Sat, 17 Mar 2012 15:52:15 -0700, Linus Torvalds <torvalds at linux-foundation.org> wrote:

> I do not believe we actually ever uncovered the original problem with
> _MOVABLE: the problem was bisected down to the stable-backported
> version of commit 4bdadb978569 ("drm/i915: Selectively enable
> self-reclaim"), and I looked at the changes and decided that one of
> the main ones was the removal of the mapping_set_gfp_mask() - which
> resulted in __GFP_MOVABLE being on for the mapping.

Can anyone explain what __GFP_MOVABLE even does? I can't understand what
this flag would be for; if the page is pinned (with get_page), then the
page cannot move. If it isn't pinned, then it's subject to swapping, in
which case the page will almost certainly move when it returns from
disk. Is it that the page won't move if it isn't swapped? That doesn't
seem all that useful to me.

> but I didn't actually see why the i915 page pinning would be defeated
> by __GFP_MOVABLE. The code does get a reference to them afaik.

GTT mapping and page pinning are done in lock-step in the driver:

i915_gem_object_bind_to_gtt
        i915_gem_object_get_pages_gtt
                pins the pages
        i915_gem_gtt_bind_object
                maps to GTT

i915_gem_object_unbind
        i915_gem_gtt_unbind_object
                unmaps from GTT
        i915_gem_object_put_pages_gtt
                unpins the pages

There are no other code paths to unmapping objects from the GTT or
unpinning the pages that I can find.
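
To make the pairing above concrete, here is a minimal user-space sketch of
it. This is not driver code: every model_* name, the struct layouts and the
starting reference count of 1 are assumptions made up for illustration, and
the real functions do a good deal more. It only shows the ordering (pin
before map, unmap before unpin) and the invariant that ordering is meant to
preserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NPAGES 4

/* Hypothetical stand-in for struct page: only a reference count. */
struct model_page {
    int refcount;
};

/* Hypothetical stand-in for a GEM object and its backing pages. */
struct model_obj {
    struct model_page pages[MODEL_NPAGES];
    bool pages_pinned;  /* the object holds an extra reference on each page */
    bool gtt_bound;     /* the pages are mapped into the GTT */
};

void model_obj_init(struct model_obj *obj)
{
    /* Assumption: each page starts with one reference held elsewhere. */
    for (size_t i = 0; i < MODEL_NPAGES; i++)
        obj->pages[i].refcount = 1;
    obj->pages_pinned = false;
    obj->gtt_bound = false;
}

/* Models i915_gem_object_get_pages_gtt: take a reference on every page. */
static void model_get_pages(struct model_obj *obj)
{
    assert(!obj->pages_pinned);
    for (size_t i = 0; i < MODEL_NPAGES; i++)
        obj->pages[i].refcount++;
    obj->pages_pinned = true;
}

/* Models i915_gem_object_put_pages_gtt: drop those references. */
static void model_put_pages(struct model_obj *obj)
{
    assert(obj->pages_pinned && !obj->gtt_bound);
    for (size_t i = 0; i < MODEL_NPAGES; i++)
        obj->pages[i].refcount--;
    obj->pages_pinned = false;
}

/* Models i915_gem_object_bind_to_gtt: pin first, then map. */
void model_bind_to_gtt(struct model_obj *obj)
{
    model_get_pages(obj);
    obj->gtt_bound = true;   /* stands in for i915_gem_gtt_bind_object */
}

/* Models i915_gem_object_unbind: unmap first, then unpin. */
void model_unbind(struct model_obj *obj)
{
    assert(obj->gtt_bound);
    obj->gtt_bound = false;  /* stands in for i915_gem_gtt_unbind_object */
    model_put_pages(obj);
}

/* Invariant: bound and pinned always agree, and pinned pages hold the
 * extra reference on top of the assumed initial one. */
bool model_invariant_holds(const struct model_obj *obj)
{
    if (obj->gtt_bound != obj->pages_pinned)
        return false;
    for (size_t i = 0; i < MODEL_NPAGES; i++) {
        int expected = obj->pages_pinned ? 2 : 1;
        if (obj->pages[i].refcount != expected)
            return false;
    }
    return true;
}
```

Calling model_obj_init, then model_bind_to_gtt, then model_unbind leaves
model_invariant_holds true after each step. If some other path dropped the
page references while gtt_bound was still set, the invariant would fail,
which is the kind of mismatch to look for in the real driver.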

> So for example, i915_gem_object_get_pages_gtt() will use
> shmem_read_mapping_page_gfp() which will increment the page count for
> the page it gets, so all the obj->pages[] entries should have properly
> incremented page counts. And they get released by
> i915_gem_object_put_pages_gtt(), but maybe that is called too early
> while the pages are still in use by the GFX unit...

This seems the most likely problem -- there are so many caches and
buffers involved. However, we're seeing trouble on resume from hibernate, at
which point there isn't any acceleration going on, just two fbdev
drivers poking the hardware. Which really reduces the complexity quite a
bit; it's just CPU reads/writes through the GTT aperture created for the
two console frame buffers. That makes this an interesting place to look
for trouble; we can ignore vast areas within the driver that deal with
acceleration, at least for this case.

-- 
keith.packard at intel.com