[RFC v4 i-g-t] tests/intel/xe_sec_exec_queue_timeslice: Timeslice Abuse on Exec Queues

Thu May 15 10:34:38 UTC 2025

On 5/14/2025 11:26 PM, Michał Winiarski wrote:
> On Thu, May 08, 2025 at 10:51:43AM +0200, Peter Senna Tschudin wrote:
>> The objective is to test the behavior of the GPU scheduler under
>> conditions where one execution queue ("attacker") is configured with an
>> abnormally large timeslice, potentially disrupting the normal execution
>> of another queue ("attacked"). The explicit attack point is the ioctl
>> DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY.
>>
>> This disables drm and xe logging to prevent it from slowing down and
>> serializing the tests using igt_drm_debug_level_update() that installs
>> an exit handler to restore the logging to its previous value.
>>
>> This RFC implements the first step that is reading timeslice values for
>> each engine and run the following tests on timeslice limits:
>>  - Boundary Value Analysis: Focuses on testing at the boundaries of
>>    input values with values such as max - 1, max, max + 1.
> 
> That should already be covered by xe_exec_queue_property test, right?
> 

The idea is to make a test that is complete on its own and does not
depend on other tests. I am considering these "security igts" as a new
category of test so that we can decide what to do with them.

>>  - Equivalence Partitioning:  Divide values into partitions where all
>>    values behave similarly and test one value from each partition.
> 
> I don't understand this one - but looking at actual implementation, it
> looks pretty similar to boundary value analysis (except with min - 1,
> min).

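Equivalence Partitioning and Boundary Value Analysis are two different
methods for choosing test values. They are not my creation; they are
standard practice. They are similar but complementary: Boundary Value
Analysis probes the edges of the valid range, while Equivalence
Partitioning picks one representative from each class of values that
should behave the same. The idea is to use common practice tests to
check more boxes that can be useful in the context of an audit.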
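
To make the difference concrete, here is a rough sketch of how the two
methods pick values for a [min, max] range. This is illustration only,
not code from the patch: timeslice_is_valid(), bva_values() and
ep_values() are hypothetical names, and the validity check is just my
assumption of a plain inclusive range check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a timeslice is valid iff min <= v <= max. */
static bool timeslice_is_valid(uint64_t v, uint64_t min, uint64_t max)
{
	return v >= min && v <= max;
}

/* Boundary Value Analysis: probe both sides of each edge of the range. */
static size_t bva_values(uint64_t min, uint64_t max, uint64_t out[6])
{
	/* Assumption: there is room below min and above max. */
	assert(min > 0 && min < max && max < UINT64_MAX);

	out[0] = min - 1;	/* just below the range: invalid */
	out[1] = min;		/* lower edge: valid */
	out[2] = min + 1;	/* just inside: valid */
	out[3] = max - 1;	/* just inside: valid */
	out[4] = max;		/* upper edge: valid */
	out[5] = max + 1;	/* just above the range: invalid */
	return 6;
}

/* Equivalence Partitioning: one representative per partition. */
static size_t ep_values(uint64_t min, uint64_t max, uint64_t out[3])
{
	assert(min > 0 && min < max && max < UINT64_MAX);

	out[0] = min / 2;				/* "too small" partition */
	out[1] = min + (max - min) / 2;			/* "valid" partition */
	out[2] = max + 1 + (UINT64_MAX - max - 1) / 2;	/* "too large" partition */
	return 3;
}
```

With min = 10 and max = 100, bva_values() yields 9, 10, 11, 99, 100, 101
while ep_values() yields 5, 55 and a value roughly halfway between max
and UINT64_MAX. The overlap is small, which is why I see them as
complementary.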

> 
>>  - Fuzz Testing: Basic fuzz tester for
>>       - Large numbers
>>       - Empty value
>>       - '\0'
> 
> But those are all constants. And empty value is just 0 :)
> I fail to see what we're "fuzzing" by trying same constants multiple
> times (again looks more like invalid values, similar to boundary value
> checks).

Similar to the previous one. These are the corner cases that I could come
up with. The idea is to maximize coverage, in contrast to an efficient test
that makes sure each case gets tested only once. These all come from a
shallow analysis I did of what the best practices are for our context.
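
As a rough sketch of what I mean by mixing fixed corner cases with random
values, something along these lines. None of this is from the patch:
fuzz_timeslice(), model_accepts() and fake_set_timeslice() are
hypothetical, and the fake limits are made-up numbers standing in for
whatever the driver reports. A seeded PRNG is used so that a failure can
be reproduced from the seed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift64: tiny deterministic PRNG, state must be non-zero. */
static uint64_t xorshift64(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* Hypothetical model of the expected range check. */
static bool model_accepts(uint64_t v, uint64_t min, uint64_t max)
{
	return v >= min && v <= max;
}

/* Hypothetical stand-in for "try to create a queue with this timeslice". */
static bool fake_set_timeslice(uint64_t v)
{
	return v >= 1 && v <= 10000000; /* made-up limits */
}

typedef bool (*set_timeslice_fn)(uint64_t value);

/* Returns how often the function under test disagrees with the model. */
static size_t fuzz_timeslice(set_timeslice_fn try_set, uint64_t min,
			     uint64_t max, uint64_t seed, size_t iterations)
{
	static const uint64_t corner_cases[] = {
		0, 1, UINT32_MAX, (uint64_t)UINT32_MAX + 1, UINT64_MAX,
	};
	uint64_t state = seed ? seed : 1;
	size_t mismatches = 0;
	size_t i;

	/* Fixed corner cases first: cheap and always the same. */
	for (i = 0; i < sizeof(corner_cases) / sizeof(corner_cases[0]); i++)
		if (try_set(corner_cases[i]) != model_accepts(corner_cases[i], min, max))
			mismatches++;

	/* Then random u64 values derived from the seed. */
	for (i = 0; i < iterations; i++) {
		uint64_t v = xorshift64(&state);

		if (try_set(v) != model_accepts(v, min, max))
			mismatches++;
	}

	return mismatches;
}
```

In a real test the callback would wrap the exec queue creation and the
limits would come from the values read for each engine; a non-zero return
would then be reported together with the seed.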

> 
>>       - 1M u64 random values
>>  - Fuzz stress test: Create 50k threads for the fuzzing test with 500
>>    random numbers each taking about 30 seconds to run
> 
> Perhaps we can extend the xe_exec_queue_property test? If we want to go
> beyond boundary testing? We could then hit other properties as well.

To answer that I guess we need alignment on the goal. The goal for me is
to have as many check boxes as we can that are meaningful in the context of a
security audit. This is why I am very happy to have overlapping tests such as
Equivalence Partitioning and Boundary Value Analysis: they are similar but
they check for different potential issues.

Please note that I am not claiming that my implementation is good. So far
the RFC represents about a day of coding effort.

> 
>>
>> The proposed steps are (this patch goes until 2a):
>>
>> 1. Determine the values for the following parameters:
>>         - `timeslice_duration_us`: Default timeslice duration in
>>            microseconds.
>>         - `timeslice_duration_min`: Minimum allowable timeslice duration.
>>         - `timeslice_duration_max`: Maximum allowable timeslice duration.
>>
>> 2. Create two execution queues with the following configurations:
>>         - `attacked`: Queue with standard/default settings.
>>         - `attacker`: Queue configured with an extended timeslice
>>                       duration. The goal is to:
>>                 a) Try to set timeslice to invalid values. This is
>>                    expected to fail.
>>
>>                 b) Create the attacker queue setting the timeslice to
>>                    `timeslice_duration_max`.
>>
>> 3. Submit tasks to both queues:
>>         - Submit a workload to the `attacked` queue with normal
>>            operations.
>>         - Submit a workload to the `attacker` queue designed to
>>           maximize its timeslice and potentially disrupt the GPU
>>           scheduler.
>>
>> 4. Verify the behavior of the `attacked` queue:
>>         - Ensure that tasks in the `attacked` queue execute within the
>>           expected time constraints and are not delayed or blocked due
>>           to the extended timeslice of the `attacker` queue.
>>         - Specifically, confirm that tasks in the `attacked` queue do
>>           not exceed `timeslice_duration_max` in terms of execution
>>           delays or interruptions.
> 
> I think we should separate functional timeslice testing (which is
> verifying whether Xe driver is indeed respecting what was set as a valid 
> timeslice by the user, and is a proposed future extension of this test),
> from checking uAPI abuse attempts (which is what this test is currently
> doing).

I am fine either way, but the current suggestion is based on the specification
of the adversary and attack point. My list of proposed actions uses the same
adversary and attack point. Why do you want to separate them?

> 
> Thanks,
> -Michał