[PATCH v8 0/8] drm/xe/vf: Post-migration recovery of queues and jobs
Lis, Tomasz
tomasz.lis at intel.com
Tue Aug 12 16:38:47 UTC 2025
On 8/12/2025 5:55 PM, Thomas Hellström wrote:
> On Mon, 2025-08-11 at 21:47 -0700, Matthew Brost wrote:
>> On Fri, Aug 01, 2025 at 03:50:37AM +0200, Tomasz Lis wrote:
>>
>> +Maintainers, public, consolidating threads.
>>
>> IMO this needs to be fully reverted or at minimum a subset of
>> problematic patches, along with [1].
>>
>> To start, it was brought to my attention that this code was not
>> tested
>> due to a lack of internal infrastructure. I believe it is completely
>> unacceptable for untested code to be merged, especially for a complex
>> feature like this. After spending some time reviewing the feature, I
>> found clear bugs, and the overall software architecture appears
>> ill-conceived.
>>
> Tomasz Lis, Michał Wajdeczko, what type of testing is missing here? Is
> the code not tested at all, not even internally or do we lack igts for
> this specific functionality?
Most of the patches were tested with simplified VF migration scenario -
where
both PF and VF driver is probed on the same kernel. This scenario
limited the
ability to test objects flowing due to the GPU being busy - like jobs or CTB
messages. This is because the scenario doesn't use VM, therefore VM cannot
be paused. But GGTT address change was tested.
Recently (circa 3 weeks ago) we've enabled complete testing environment,
with VF KMD running within a VM. We are now testing the solution in an
environment where we can see the ring emit position bug noticed by
Matthew.
To conclude this was tested, and is now going through more thorough
testing. The issues are:
* A bug with position of CTB messages within ring was not spotted by the
testing.
* The patch "drm/xe/vf: Pause submissions during RESFIX fixups" is
too naive with its approach - this one patch is introducing problems
which may end up in deadlock.
All VF migration code is only reachable in debug builds:
```
static bool vf_migration_supported(struct xe_device *xe)
{
return IS_ENABLED(CONFIG_DRM_XE_DEBUG);
}
```
The feature is not enabled, because it is not fully merged yet.
-Tomasz
>
> /Thomas
>
>
More information about the Intel-xe
mailing list