[Intel-gfx] i-g-t prime_self_import & gem_flink_race: leaked -1 objects

Mateo Lozano, Oscar oscar.mateo at intel.com
Fri Nov 1 13:55:54 CET 2013


Hi Ben,

I'll switch the conversation to the mailing list...

In the case of prime_self_import, the problem is self-contained (it doesn't really need a previous test A): the first subtest opens the first fd, which provokes a context switch (gem_quiescent_gpu). This switch actually completes (gem_quiescent_gpu makes sure of that with a gem_sync) and the old context is disposed of, but its backing object stays alive until a retire_work kicks in (which in my case usually happens in the middle of the prime_self_import/export-vs-gem_close-race subtest, hence the "leaked -1 objects"). The comment in do_switch says it all:

	/* The backing object for the context is done after switching to the
	 * *next* context. Therefore we cannot retire the previous context until
	 * the next context has already started running. In fact, the below code
	 * is a bit suboptimal because the retiring can occur simply after the
	 * MI_SET_CONTEXT instead of when the next seqno has completed.
	 */
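For reference, the leak check that trips here boils down to counting GEM objects via debugfs before and after the subtest body. A minimal sketch of that pattern (the debugfs path, parsing and helper names are my own illustration, not the exact code from prime_self_import.c):

	#include <assert.h>
	#include <stdio.h>

	/* Count i915 GEM objects by parsing the debugfs summary line,
	 * which starts with something like "NN objects, MM bytes". */
	static int get_object_count(void)
	{
		FILE *f = fopen("/sys/kernel/debug/dri/0/i915_gem_objects", "r");
		int count = 0;

		assert(f);
		fscanf(f, "%d objects", &count);
		fclose(f);
		return count;
	}

	static void leak_check_sketch(void)
	{
		int obj_count = get_object_count();

		/* ... run the export-vs-gem_close race here ... */

		obj_count = get_object_count() - obj_count;
		/* If a stale context backing object is retired between the two
		 * reads, the count drops by one: "leaked -1 objects". */
		assert(obj_count == 0);
	}

With the old context's backing bo still on the object list at the first read and gone by the second, obj_count ends up at -1 even though the subtest itself leaked nothing.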

I'll send a fix for prime_self_import, but... maybe we should make sure that the GPU is really quiescent, rather than fixing individual tests? (retire requests via drop caches at the end of gem_quiescent_gpu?).
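Something along these lines at the end of gem_quiescent_gpu(), i.e. poking the i915 drop_caches debugfs file to force a retire after the final sync. The path and the DROP_RETIRE value are assumptions about the debugfs interface, not lifted from the i-g-t or kernel sources:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	#define DROP_RETIRE 0x4	/* assumed to match i915_gem_drop_caches */

	/* Ask the kernel to retire completed requests so that stale context
	 * backing objects are freed before any subtest starts counting. */
	static void force_retire(void)
	{
		const char *path = "/sys/kernel/debug/dri/0/i915_gem_drop_caches";
		char buf[16];
		int fd = open(path, O_WRONLY);

		if (fd < 0)
			return;	/* debugfs not mounted or no i915 */

		snprintf(buf, sizeof(buf), "%d", DROP_RETIRE);
		write(fd, buf, strlen(buf));
		close(fd);
	}

gem_quiescent_gpu() could call this right after its gem_sync(), so every test that counts objects starts from a truly idle and fully retired state.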

-- Oscar

> -----Original Message-----
> From: Ben Widawsky [mailto:benjamin.widawsky at intel.com]
> Sent: Friday, November 01, 2013 3:38 AM
> To: Mateo Lozano, Oscar
> Cc: Chris Wilson; Daniel Vetter
> Subject: Re: new PPGTT patches pushed
> 
> That wasn't clear... I fixed gem_flink_close, I didn't touch the prime test.
> 
> On Thu, Oct 31, 2013 at 8:29 PM, Ben Widawsky
> <benjamin.widawsky at intel.com> wrote:
> > Yep, I agree this looks like the problem here.
> >
> > What likely happens is a context from a previous run (which was
> > created for the fd) finally dies. So for example:
> >
> > Test A creates context, runs, finishes.
> > (The context is not destroyed yet since we didn't switch away.) Test
> > prime_self_import runs, opens the fd, and creates the context, but
> > doesn't yet switch. The switch will kill the context from test A. That
> > is how we end up at minus one.
> >
> > I've pushed a fix to my intel-gpu-tools PPGTT branch which uses drop
> > caches to switch back to the global default context when needed.
> >
> > Would you like to fix prime_self_import? I hope it's the same issue.
> >
> > On Wed, Oct 30, 2013 at 04:31:11PM +0000, Mateo Lozano, Oscar wrote:
> >> Hi Ben,
> >>
> >> I think I homed in on the cause for the regression in prime_self_import
> >> (maybe gem_flink_race as well):
> >>
> >> When the first fd is opened with drm_open_any(), it calls
> >> gem_quiescent_gpu(), which in turn sends a nop execbuf to all the rings.
> >> During the do_switch() for the render ring, the backing object for the old
> >> context happens to be kept alive until later on. If this backing bo is freed
> >> between consecutive calls to get_object_count(), then we get a "false"
> >> leaking-object report.
> >>
> >> Printk messages during the do_switch():
> >>
> >> [  428.389210] ACHTUNG! do_switch render ring, to ctx object ffff8800a21d4a80
> >> [  428.391717]     Is object pinned now?: no
> >> [  428.393199]     Object set to gtt domain
> >> [  428.394601]     GTT offset: 020dd000, size: 00011000, table: ggtt)
> >> [  428.394630]     hw_flags |= MI_RESTORE_INHIBIT
> >> [  428.396605]     mi_set_context succesful
> >> [  428.397284] ACHTUNG! : from ctx object ffff8800a21d5800           <--- backing bo for the old context
> >> [  428.397918]     GTT offset: 020cb000, size: 00011000, table: ggtt) <---
> >> [  428.397931]     Done!
> >>
> >> GTT just after the first get_object_count() in the "export-vs-gem_close-race" test:
> >>
> >>    ffff880240cde000: p g       68KiB 10 00 0 0 0 L3+LLC dirty (pinned x 1) (ggtt offset: 00000000, size: 00011000)
> >>    ffff880240cde180: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00011000, size: 00001000) (p mappable)
> >>    ffff880240cde300: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00012000, size: 00020000) (p mappable)
> >>    ffff880240cde480: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00032000, size: 00001000) (p mappable)
> >>    ffff880240cde600: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00033000, size: 00001000) (p mappable)
> >>    ffff880240cde780: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00034000, size: 00020000) (p mappable)
> >>    ffff880240cde900: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00054000, size: 00001000) (p mappable)
> >>    ffff880240cdea80: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00055000, size: 00020000) (p mappable)
> >>    ffff880240cdec00: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00075000, size: 00001000) (p mappable)
> >>    ffff880240cded80: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00076000, size: 00020000) (p mappable)
> >>    ffff880240cdef00: p g     8100KiB 41 00 0 0 0 uncached (pinned x 2) (display) (ggtt offset: 00096000, size: 007e9000) (stolen: 00000000) (p mappable)
> >>    ffff8800a21d5800:   g       68KiB 10 00 272 0 0 L3+LLC dirty (ggtt offset: 020cb000, size: 00011000) (render ring) <--- backing bo still alive and kicking
> >>    ffff8800a21d4a80: p g       68KiB 41 00 0 0 0 L3+LLC (pinned x 1) (ggtt offset: 020dd000, size: 00011000)
> >>    ffff8800a21d4480:   g       16KiB 41 00 0 0 0 snooped or LLC (ggtt offset: 020ee000, size: 00004000) (f mappable)
> >> Total 14 objects, 9064448 bytes, 9064448 GTT size
> >>
> >> GTT just before the second get_object_count() in the "export-vs-gem_close-race" test:
> >>
> >>    ffff880240cde000: p g       68KiB 10 00 0 0 0 L3+LLC dirty (pinned x 1) (ggtt offset: 00000000, size: 00011000)
> >>    ffff880240cde180: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00011000, size: 00001000) (p mappable)
> >>    ffff880240cde300: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00012000, size: 00020000) (p mappable)
> >>    ffff880240cde480: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00032000, size: 00001000) (p mappable)
> >>    ffff880240cde600: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00033000, size: 00001000) (p mappable)
> >>    ffff880240cde780: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00034000, size: 00020000) (p mappable)
> >>    ffff880240cde900: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00054000, size: 00001000) (p mappable)
> >>    ffff880240cdea80: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00055000, size: 00020000) (p mappable)
> >>    ffff880240cdec00: p g        4KiB 01 01 0 0 0 snooped or LLC (pinned x 1) (ggtt offset: 00075000, size: 00001000) (p mappable)
> >>    ffff880240cded80: p g      128KiB 40 40 0 0 0 snooped or LLC dirty (pinned x 1) (ggtt offset: 00076000, size: 00020000) (p mappable)
> >>    ffff880240cdef00: p g     8100KiB 41 00 0 0 0 uncached (pinned x 2) (display) (ggtt offset: 00096000, size: 007e9000) (stolen: 00000000) (p mappable)
> >>    ffff8800a21d4a80: p g       68KiB 41 00 0 0 0 L3+LLC (pinned x 1) (ggtt offset: 020dd000, size: 00011000)
> >>    ffff8800a21d4480:   g       16KiB 41 00 0 0 0 snooped or LLC (ggtt offset: 020ee000, size: 00004000) (f mappable)
> >> Total 13 objects, 8994816 bytes, 8994816 GTT size
> >>
> >> Results of the test:
> >>
> >> leaked -1 objects
> >> Test assertion failure function test_export_close_race, file prime_self_import.c:392:
> >> Failed assertion: obj_count == 0
> >> Subtest export-vs-gem_close-race: FAIL
> >>
> >> I'm struggling to understand how this happens exactly, but I can avoid it
> >> by sending an extra nop execbuffer to the render ring right after
> >> gem_quiescent_gpu(). I'm not saying this is a fix, but rather a (meaningful?)
> >> thought experiment.
> >> It looks to me like this is not a problem with the KMD, but rather with the
> >> way the test is written. What do you think?
> >> -- Oscar


