[Intel-gfx] [PATCH i-g-t] gem_flink_race/prime_self_import: Improve test reliability

Mon Dec 14 01:38:11 PST 2015

>
>
>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of Daniel Vetter
>Sent: Friday, December 11, 2015 5:06 PM
>To: Morton, Derek J
>Cc: Daniel Vetter; intel-gfx at lists.freedesktop.org; Wood, Thomas
>Subject: Re: [Intel-gfx] [PATCH i-g-t] gem_flink_race/prime_self_import: Improve test reliability
>
>On Fri, Dec 11, 2015 at 10:33:46AM +0000, Morton, Derek J wrote:
>> >
>> >
>> >-----Original Message-----
>> >From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of 
>> >Daniel Vetter
>> >Sent: Thursday, December 10, 2015 12:53 PM
>> >To: Morton, Derek J
>> >Cc: Daniel Vetter; intel-gfx at lists.freedesktop.org; Wood, Thomas
>> >Subject: Re: [Intel-gfx] [PATCH i-g-t] 
>> >gem_flink_race/prime_self_import: Improve test reliability
>> >
>> >On Thu, Dec 10, 2015 at 11:51:29AM +0000, Morton, Derek J wrote:
>> >> >
>> >> >
>> >> >-----Original Message-----
>> >> >From: Daniel Vetter [mailto:daniel.vetter at ffwll.ch] On Behalf Of 
>> >> >Daniel Vetter
>> >> >Sent: Thursday, December 10, 2015 10:13 AM
>> >> >To: Morton, Derek J
>> >> >Cc: intel-gfx at lists.freedesktop.org; Wood, Thomas
>> >> >Subject: Re: [Intel-gfx] [PATCH i-g-t]
>> >> >gem_flink_race/prime_self_import: Improve test reliability
>> >> >
>> >> >On Tue, Dec 08, 2015 at 12:44:44PM +0000, Derek Morton wrote:
>> >> >> gem_flink_race and prime_self_import have subtests which read 
>> >> >> the number of open gem objects from debugfs to determine if 
>> >> >> objects have leaked during the test. However the test can fail 
>> >> >> sporadically if the number of gem objects changes due to other process activity.
>> >> >> This patch introduces a change to check the number of gem 
>> >> >> objects several times to filter out any fluctuations.
>> >> >
>> >> >Why exactly does this happen? IGT tests should be run on bare 
>> >> >metal, with everything else killed/subdued/shutup. If there's 
>> >> >still things going on that create objects, we need to stop them from doing that.
>> >> >
>> >> >If this only applies to Android, or some special Android deamon 
>> >> >them imo check for that at runtime and igt_skip("your setup is 
>> >> >invalid, deamon %s running\n"); is the correct fix. After all just 
>> >> >because you sampled for a bit doesn't mean that it wont still 
>> >> >change right when you start running the test for real, so this is still fragile.
>> >> 
>> >> Before running tests on android we do stop everything possible. I 
>> >> suspect the culprit is coreu getting automatically restarted after 
>> >> it is stopped. I had additional debug while developing this patch 
>> >> and what I saw was the system being mostly quiescent but with some 
>> >> very low level background activity. 1 extra object being created 
>> >> and then deleted occasionally. Depending on whether it occurred at 
>> >> the start or end of the test it was resulting in a reported leak of either 1 or -1 objects.
>> >> The patch fixes that issue by taking several samples and requiring 
>> >> them to be the same, therefore filtering out the low level background noise.
>> >> It would not help if something in the background allocated an 
>> >> object and kept it allocated, but I have not seen that happen. I 
>> >> only saw once the object count increasing for 2 consecutive reads 
>> >> hence the count to 4 to give a margin. The test was failing about 
>> >> 10%. With this patch I got 100% pass across 300 runs of each of the tests.
>> >
>> >Hm, piglit checks that there's no other drm clients running. Have you tried re-running that check to zero in on the culprit?
>> 
>> We don't use piglet to run IGT tests on Android. I have had a look at 
>> what piglet does and added the same check to our scripts. (It reads a list of clients from /sys/kernel/debug/dri/0/clients) For CHV it shows a process called 'y', though that seems to be some issue on CHV that all driver clients are called 'y'. I checked on BXT which properly shows the process names and it looks like it is the binder process (which  is handling some inter process communication). I don't think this is something we can stop.
>
>Nah, you definitely can't stop binder, won't have an android left after that ;-)
>
>But it is strange that binder owns these buffers. Binder is just IPC, but like unix domain sockets you can also throw around file descriptors. So something on your system is moving open drm fd devices still around. I don't have an idea what kind of audit/debug tooling binder offers, but there should be a way to figure out who really owns that file descriptor.
>If you're lucky lsof (if android has that, otherwise walk /proc/*/fd/* symlinks manually) should help.
>
>Cheers, Daniel

I am still suspicious that it is coreu that is the real culprit. I ran the tests Friday and did some reading of i915_gem_objects in parallel to get the full content of what it reported. I once caught coreu being listed as owning an active gem object during the test. Coreu gets stopped explicitly before tests are run but I suspect the init demon is detecting that it is stopped and is restarting it automatically. Something is reporting in logcat when coreu is stopped anyway.

I will address your comments on V2 and submit another patch.

//Derek

>
>> >> If you are concerned about the behaviour when running the test with 
>> >> a load of background activity I could add code to limit to the 
>> >> reset of the count and fail the test in that instance. That would 
>> >> give a benefit of distinguishing a test fail due to excessive 
>> >> background activity from a detected leak.
>> >
>> >I'm also concerned for the overhead this causes everyone else. If this really is some Android trouble then I think it'd be good to only compile this on Android. But would still be much better if you can get to a reliably clean test environment.
>> 
>> I will make the loop part android specific.
>> 
>> 
>> //Derek
>> 
>> >
>> >> I would not want to just have the test skip as that introduces a 
>> >> hole in our test coverage.
>> >> 
>> >> >Also would be good to extract get_stable_obj_count to a proper igt 
>> >> >library function, if it indeed needs to be this tricky. And then 
>> >> >add the explanation for why we need this in the gtkdoc.
>> >> 
>> >> I  can move the code to an igt library. Which library would you suggest? Igt_debugfs ?
>> >
>> >Hm yeah, it's a bit the dumping ground for all things debugfs access 
>> >;-) -Daniel
>> >--
>> >Daniel Vetter
>> >Software Engineer, Intel Corporation
>> >http://blog.ffwll.ch
>> >
>
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>http://blog.ffwll.ch
>