[Intel-gfx] [igt-dev] [PATCH i-g-t 2/2] tests/gem_exec_await: Add a memory pressure subtest

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Nov 19 15:54:44 UTC 2018


On 19/11/2018 15:36, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-11-19 15:22:29)
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> The memory pressure subtest attempts to provoke system overload which can
>> cause GPU hangs, especially when combined with spin batches which do
>> not contain any nop instructions to provide relief.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> ---
>>   tests/i915/gem_exec_await.c | 107 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 107 insertions(+)
>>
>> diff --git a/tests/i915/gem_exec_await.c b/tests/i915/gem_exec_await.c
>> index 3ea5b5903c6b..ccb5159a6fe1 100644
>> --- a/tests/i915/gem_exec_await.c
>> +++ b/tests/i915/gem_exec_await.c
>> @@ -30,6 +30,11 @@
>>   
>>   #include <sys/ioctl.h>
>>   #include <sys/signal.h>
>> +#include <sys/types.h>
>> +#include <sys/stat.h>
>> +#include <fcntl.h>
>> +#include <pthread.h>
>> +#include <sched.h>
>>   
>>   #define LOCAL_I915_EXEC_NO_RELOC (1<<11)
>>   #define LOCAL_I915_EXEC_HANDLE_LUT (1<<12)
>> @@ -227,6 +232,92 @@ static void wide(int fd, int ring_size, int timeout, unsigned int flags)
>>          free(exec);
>>   }
>>   
>> +struct thread {
>> +       pthread_t thread;
>> +       volatile bool done;
>> +};
>> +
>> +static unsigned long get_avail_ram_mb(void)
> 
> intel_get_avail_ram_mb() ?

I thought so, but when things went slow I looked inside and concluded it 
is not suitable.
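
A minimal sketch of such a helper, assuming it boils down to parsing 
MemAvailable out of /proc/meminfo (illustrative only, the actual 
implementation may differ):

	static unsigned long get_avail_ram_mb(void)
	{
		unsigned long mem_kb = 0;
		char line[128];
		FILE *f;

		f = fopen("/proc/meminfo", "r");
		igt_assert(f);

		/* MemAvailable, unlike MemFree, accounts for reclaimable
		 * caches, so it tracks what the test can realistically
		 * allocate before reclaim kicks in. */
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "MemAvailable: %lu kB", &mem_kb) == 1)
				break;

		fclose(f);

		return mem_kb / 1024; /* KiB to MiB */
	}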

>> +#define PAGE_SIZE 4096
>> +static void *mempressure(void *arg)
>> +{
>> +       struct thread *thread = arg;
>> +       const unsigned int sz_mb = 2;
>> +       const unsigned int sz = sz_mb << 20;
>> +       unsigned int n = 0, max = 0;
>> +       unsigned int blocks;
>> +       void **ptr = NULL;
>> +
>> +       while (!thread->done) {
> 
> You can use READ_ONCE(thread->done) here for familiarity.

Okay, didn't realize we copied it to IGT.
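
So the loop head becomes something like this, assuming our copy takes the 
lvalue directly like the kernel macro:

	while (!READ_ONCE(thread->done)) {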

>> +               unsigned long ram_mb = get_avail_ram_mb();
>> +
>> +               if (!ptr) {
>> +                       blocks = ram_mb / sz_mb;
>> +                       ptr = calloc(blocks, sizeof(void *));
>> +                       igt_assert(ptr);
>> +               } else if (ram_mb < 384) {
>> +                       blocks = max + 1;
>> +               }
>> +
>> +               if (ptr[n])
>> +                       munmap(ptr[n], sz);
>> +
>> +               ptr[n] = mmap(NULL, sz, PROT_WRITE,
>> +                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> +               igt_assert(ptr[n] != MAP_FAILED);
>> +
>> +               madvise(ptr[n], sz, MADV_HUGEPAGE);
>> +
>> +               for (size_t page = 0; page < sz; page += PAGE_SIZE)
>> +                       *(volatile uint32_t *)((unsigned char *)ptr[n] + page) =
>> +                               0;
>> +
>> +               if (n > max)
>> +                       max = n;
>> +
>> +               n++;
>> +
>> +               if (n >= blocks)
>> +                       n = 0;
> 
> Another method would be to use mlock to force exhaustion.
> 
> However, as the supposition is that rcu is part of the underlying
> mechanism if you fill the dentry cache we'll exercise both the shrinker
> and RCU.

As I said in my previous reply, in my testing, at least for the one issue 
I was able to reproduce which has the same symptoms as the bug, the 
problem went away with the addition of nops.

But yeah, maybe that could be an indirect effect.
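
If I do need to add more reclaim activity back, a sketch along the lines 
of your dentry cache idea could look like this (untested, thread struct 
reused from above, helper name made up):

	static void *dentrypressure(void *arg)
	{
		struct thread *thread = arg;
		unsigned long i = 0;
		struct stat st;
		char path[64];

		while (!READ_ONCE(thread->done)) {
			/* A failed lookup leaves a negative dentry behind,
			 * so cycling through unique names grows the dentry
			 * cache until the shrinker, and RCU on the freeing
			 * side, have to step in. */
			snprintf(path, sizeof(path), "/tmp/.dcache-%lu", i++);
			stat(path, &st);
		}

		return NULL;
	}

Alternatively, as you say, mlock() on the mmap'ed blocks would pin them 
and force real exhaustion.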

Also, this cleaned up patch does not cut it any longer. :( It seems I've 
lost the magic ingredient to reproduce the stalls during cleanups. I 
will have to go back and add stuff to get it back.

Regards,

Tvrtko

