[PATCH i-g-t 1/1] lib/xe/xe_query: Wait for xe_supports_faults

Thu May 9 14:27:17 UTC 2024

-----Original Message-----
From: Welty, Brian <brian.welty at intel.com> 
Sent: Wednesday, May 8, 2024 6:28 PM
To: Kamil Konieczny <kamil.konieczny at linux.intel.com>; igt-dev at lists.freedesktop.org; Cavitt, Jonathan <jonathan.cavitt at intel.com>; Gupta, saurabhg <saurabhg.gupta at intel.com>; Mistat, Tomasz <tomasz.mistat at intel.com>; Girotra, Himanshu <himanshu.girotra at intel.com>
Subject: Re: [PATCH i-g-t 1/1] lib/xe/xe_query: Wait for xe_supports_faults
> On 5/8/2024 10:11 AM, Kamil Konieczny wrote:
> > Hi,
> > On 2024-05-08 at 16:43:04 +0000, Cavitt, Jonathan wrote:
> >> -----Original Message-----
> >> From: Kamil Konieczny <kamil.konieczny at linux.intel.com>
> >> Sent: Wednesday, May 8, 2024 9:24 AM
> >> To: igt-dev at lists.freedesktop.org
> >> Cc: Cavitt, Jonathan <jonathan.cavitt at intel.com>; Gupta, saurabhg <saurabhg.gupta at intel.com>; Welty, Brian <brian.welty at intel.com>; Mistat, Tomasz <tomasz.mistat at intel.com>; Girotra, Himanshu <himanshu.girotra at intel.com>
> >> Subject: Re: [PATCH i-g-t 1/1] lib/xe/xe_query: Wait for xe_supports_faults
> >>>
> >>> Hi Jonathan,
> >>> On 2024-05-03 at 12:37:14 -0700, Jonathan Cavitt wrote:
> >>>> It's possible for xe_supports_faults to return false if the system is
> >>>> busy with multiple running tests.  This is because the check looks for
> >>>> all active VMs and searches for VMs that do not have faults enabled,
> >>>> returning false if any exist.  Recently, this check has been changed to
> >>>> return EBUSY when the check fails in this way, so wait for up to ten
> >>>> seconds for all the active VMs to flush out before proceeding.
> >>>>
> >>>> Suggested-by: Brian Welty <brian.welty at intel.com>
> >>>> Signed-off-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> >>>> ---
> >>>>   lib/xe/xe_query.c | 9 +++++++--
> >>>>   1 file changed, 7 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/lib/xe/xe_query.c b/lib/xe/xe_query.c
> >>>> index 6df8f42649..5458c73417 100644
> >>>> --- a/lib/xe/xe_query.c
> >>>> +++ b/lib/xe/xe_query.c
> >>>> @@ -314,8 +314,13 @@ bool xe_supports_faults(int fd)
> >>>>   		.flags = DRM_XE_VM_CREATE_FLAG_LR_MODE |
> >>>>   			 DRM_XE_VM_CREATE_FLAG_FAULT_MODE,
> >>>>   	};
> >>>> -
> >>>> -	supports_faults = !igt_ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create);
> >>>> +	struct timespec tv = {};
> >>>> +	int result, timeout;
> >>>> +	do {
> >>>> +		result = igt_ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create);
> >>>> +		supports_faults = !result;
> >>>> +		timeout = igt_seconds_elapsed(&tv);
> >>>> +	} while (result == -EBUSY && timeout < 10);
> >>> -------------------------------- ^^^^^^^^^^^^
> >>>
> >>> Waiting any number of seconds in library function is way too much,
> >>> imho this is ok in test itself, not on lib.
> >>
> >> Is the suggestion here that we should perform the wait "on" xe_supports_faults,
> >> rather than "in" xe_supports_faults?  I.E. that we do something like this instead
> >> in xe_exec_fault_mode.c:
> >>
> >>          igt_fixture {
> >>                  bool supports_faults;
> >>                  struct timespec tv = {};
> >>                  fd = drm_open_driver(DRIVER_XE);
> >>                  do {
> >>                          supports_faults = xe_supports_faults(fd);
> >>                  } while (!supports_faults && igt_seconds_elapsed(&tv) < 10);
> >>                  igt_require(supports_faults);
> >>          }
> >>
> >> I can do this, though if xe_supports_faults returns false for any non-EBUSY related
> >> reasons, we won't be able to detect it from here and we'll spend ten seconds waiting
> >> for xe_supports_faults to return true when it strictly cannot.  Is this an acceptable
> >> tradeoff?
> >> -Jonathan Cavitt
> >>
> > 
> > If you are doing this just after igt_main you do not expect
> > it will return EBUSY. Other way would be to return errno from
> > lib function and let subtest decide if it want to wait.
> > There you could also account for simulation, where it can
> > take longer. Btw is there any sysfs param for it?
> > 
> > Regards,
> > Kamil
> > 
> 
> Problem is the issue is that 2 VMs cannot be created in the 2 different 
> modes.
> Cannot have one VM in 'fault mode' at same time as one is in 'non-fault 
> mode'.   So the correct place for the check is not even inside 
> xe_supports_fault_mode(), but in xe_vm_create().
> 
> As every single IGT test that creates an IGT can fail with -EBUSY from
> xe_vm_create() as creating 2 such  VMs is not allowed.  Whoever tries to 
> create the VM second loses (fails).

If it's the case that each calling of xe_vm_create has the potential to return
an error due to this race condition, I'd expect that we'd see this issue crop up
from time to time in cases where the issue with xe_supports_faults is not seen,
especially in tests that don't use fault mode vms.

However, I couldn't see any reports affirming as such on a cursory inspection.
There were a couple of xe_vm_create failure reports, but it looked like they
were caused by interface divergence and not by the above race condition.
-Jonathan Cavitt

> 
> Maybe a good first step is to have xe_vm_create() raise a SKIP on -EBUSY 
> errors?
> 
> Maybe a good second step is to figure when and where adding a 
> timeout+retry might make sense.....   maybe since this is a shard run (I 
> think?),  this test can be attempted again later in the batch of tests?
> 
> -Brian
> 
> >>>
> >>> Regards,
> >>> Kamil
> >>>
> >>>>   
> >>>>   	if (supports_faults)
> >>>>   		xe_vm_destroy(fd, create.vm_id);
> >>>> -- 
> >>>> 2.25.1
> >>>>
> >>>
>