[RFC v4 i-g-t] tests/intel/xe_sec_exec_queue_timeslice: Timeslice Abuse on Exec Queues

Wed May 14 21:26:28 UTC 2025

On Thu, May 08, 2025 at 10:51:43AM +0200, Peter Senna Tschudin wrote:
> The objective is to test the behavior of the GPU scheduler under
> conditions where one execution queue ("attacker") is configured with an
> abnormally large timeslice, potentially disrupting the normal execution
> of another queue ("attacked"). The explicit attack point is the ioctl
> DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY.
> 
> This disables drm and xe logging to prevent it from slowing down and
> serializing the tests using igt_drm_debug_level_update() that installs
> an exit handler to restore the logging to its previous value.
> 
> This RFC implements the first step that is reading timeslice values for
> each engine and run the following tests on timeslice limits:
>  - Boundary Value Analysis: Focuses on testing at the boundaries of
>    input values with values such as max - 1, max, max + 1.

That should already be covered by xe_exec_queue_property test, right?

>  - Equivalence Partitioning:  Divide values into partitions where all
>    values behave similarly and test one value from each partition.

I don't understand this one, but looking at the actual implementation,
it looks pretty similar to boundary value analysis (except with min - 1
and min).
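
To illustrate why the two look alike to me, here is a minimal sketch in
plain C. It is not taken from the patch: timeslice_in_range() and
boundary_values() are hypothetical helpers, and the sketch assumes the
kernel simply accepts values in [min, max] and rejects everything else.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel-side range check (assumption). */
bool timeslice_in_range(uint64_t value, uint64_t min, uint64_t max)
{
	return value >= min && value <= max;
}

/*
 * Fill out[] with the classic boundary values around [min, max].
 * Assumes 0 < min, min + 1 < max and max < UINT64_MAX, so that none
 * of the computed values wraps around.
 *
 * Seen as equivalence partitions, out[0] represents "below min",
 * out[1]..out[4] represent "valid" and out[5] represents "above max".
 * Returns the number of values written (always 6).
 */
size_t boundary_values(uint64_t min, uint64_t max, uint64_t out[6])
{
	out[0] = min - 1; /* invalid: just below the range */
	out[1] = min;     /* valid: lower bound */
	out[2] = min + 1; /* valid: just inside the lower bound */
	out[3] = max - 1; /* valid: just inside the upper bound */
	out[4] = max;     /* valid: upper bound */
	out[5] = max + 1; /* invalid: just above the range */
	return 6;
}
```

With only three partitions (below, inside, above), picking one value
per partition near the edges gives the same inputs as the boundary
checks.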

>  - Fuzz Testing: Basic fuzz tester for
>       - Large numbers
>       - Empty value
>       - '\0'

But those are all constants, and an empty value is just 0 :)
I fail to see what we're "fuzzing" by trying the same constants multiple
times (again, this looks more like testing invalid values, similar to
the boundary value checks).

>       - 1M u64 random values
>  - Fuzz stress test: Create 50k threads for the fuzzing test with 500
>    random numbers each taking about 30 seconds to run

If we want to go beyond boundary testing, perhaps we can extend the
xe_exec_queue_property test? We could then hit other properties as well.
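
As a rough sketch of what a property-agnostic random check could look
like (plain C, no driver involved): struct prop_range, the two setter
callbacks and fuzz_properties() are all hypothetical names, and the
ranges would have to come from wherever the real limits are exposed.
The setter callback stands in for "issue the ioctl and report whether
it was accepted".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one property and its valid range. */
struct prop_range {
	const char *name;
	uint64_t min;
	uint64_t max;
};

/* Returns true if the (hypothetical) setter accepted the value. */
typedef bool (*set_fn)(const struct prop_range *p, uint64_t value);

/* Well-behaved setter: accepts only values inside [min, max]. */
bool range_checking_setter(const struct prop_range *p, uint64_t value)
{
	return value >= p->min && value <= p->max;
}

/* Deliberately buggy setter, only here to show that it gets caught. */
bool accept_everything_setter(const struct prop_range *p, uint64_t value)
{
	(void)p;
	(void)value;
	return true;
}

/* xorshift64: small seeded PRNG so that a failing run can be replayed.
 * The state must be non-zero. */
uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/*
 * Feed `iterations` random u64 values to every property and count how
 * often the setter's accept/reject decision disagrees with the range.
 */
size_t fuzz_properties(const struct prop_range *props, size_t count,
		       uint64_t seed, size_t iterations, set_fn set)
{
	uint64_t state = seed ? seed : 1; /* avoid the all-zero state */
	size_t mismatches = 0;

	for (size_t i = 0; i < count; i++) {
		for (size_t n = 0; n < iterations; n++) {
			uint64_t v = xorshift64(&state);
			bool expected = v >= props[i].min &&
					v <= props[i].max;

			if (set(&props[i], v) != expected)
				mismatches++;
		}
	}
	return mismatches;
}
```

Adding another property is then just one more table entry, and a fixed
seed keeps any failure reproducible.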

> 
> The proposed steps are (this patch goes until 2a):
> 
> 1. Determine the values for the following parameters:
>         - `timeslice_duration_us`: Default timeslice duration in
>            microseconds.
>         - `timeslice_duration_min`: Minimum allowable timeslice duration.
>         - `timeslice_duration_max`: Maximum allowable timeslice duration.
> 
> 2. Create two execution queues with the following configurations:
>         - `attacked`: Queue with standard/default settings.
>         - `attacker`: Queue configured with an extended timeslice
>                       duration. The goal is to:
>                 a) Try to set timeslice to invalid values. This is
>                    expected to fail.
> 
>                 b) Create the attacker queue setting the timeslice to
>                    `timeslice_duration_max`.
> 
> 3. Submit tasks to both queues:
>         - Submit a workload to the `attacked` queue with normal
>            operations.
>         - Submit a workload to the `attacker` queue designed to
>           maximize its timeslice and potentially disrupt the GPU
>           scheduler.
> 
> 4. Verify the behavior of the `attacked` queue:
>         - Ensure that tasks in the `attacked` queue execute within the
>           expected time constraints and are not delayed or blocked due
>           to the extended timeslice of the `attacker` queue.
>         - Specifically, confirm that tasks in the `attacked` queue do
>           not exceed `timeslice_duration_max` in terms of execution
>           delays or interruptions.

I think we should separate functional timeslice testing (which verifies
whether the Xe driver indeed respects the valid timeslice set by the
user, and is a proposed future extension of this test) from checking
uAPI abuse attempts (which is what this test currently does).

Thanks,
-Michał