GVT Scheduler

Julian Stecklina julian.stecklina at cyberus-technology.de
Fri Oct 9 09:26:08 UTC 2020


Hi Zhi,

Your explanation is really helpful. Thank you! See my comments below.

On Thu, 2020-10-08 at 16:41 +0000, Wang, Zhi A wrote:
> Now let's see the timeline:
> 
> - GVT-g submits a workload to i915.
> - i915 appends the breadcrumb to the workload.
> - i915 submits the workload to HW.
> - GVT-g calls i915_wait_request to wait until GPU execution has passed the
> breadcrumb. (But the context might not be switched out at this point.)
> - GVT-g waits for the context to be switched out via
> shadow_context_status_change. (GVT-g needs to copy the contents of the
> shadow context back to the guest context, so the shadow context must be
> idle at this point.)
> - No one touches the shadow context anymore, and GVT-g calls
> complete_current_workload.
> 
> The race between shadow_context_status_change and complete_current_workload
> should be addressed in our design. So this problem might be caused by an i915
> change, e.g. the timing of calling shadow_context_status_change changed. But
> we will double-check in GVT-g as well.

We definitely see shadow_context_status_change being called for a workload that
has already passed the wait_event(workload->shadow_ctx_status_wq, ...); in
complete_current_workload.
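
For reference, here is the handshake as we read it in gvt/scheduler.c,
abridged (details vary a bit between kernel versions):

static int shadow_context_status_change(struct notifier_block *nb,
                                        unsigned long action, void *data)
{
        struct i915_request *rq = data;
        struct intel_vgpu_workload *workload;

        /* ... workload = scheduler->current_workload[rq->engine->id],
         * plus SCHEDULE_PREEMPTED and error handling, omitted ... */

        switch (action) {
        case INTEL_CONTEXT_SCHEDULE_IN:
                atomic_set(&workload->shadow_ctx_active, 1);
                break;
        case INTEL_CONTEXT_SCHEDULE_OUT:
                save_ring_hw_state(workload->vgpu, rq->engine);
                atomic_set(&workload->shadow_ctx_active, 0);
                break;
        }
        wake_up(&workload->shadow_ctx_status_wq);
        return NOTIFY_OK;
}

/* complete_current_workload() relies on SCHEDULE_OUT having already
 * happened once this wait returns: */
wait_event(workload->shadow_ctx_status_wq,
           !atomic_read(&workload->shadow_ctx_active));

If the notifier fires for this workload after the wait_event has returned, it
reads and writes a workload that complete_current_workload is concurrently
tearing down, which is exactly the situation described above.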

> The patch you mentioned is for a corner case in GPU reset. But this shouldn't
> happen in a normal submission flow unless someone breaks the flow above.

The problem for us is that we can only reproduce this issue in a hardened Linux
build after many hours. So it's not exactly the most friendly issue to debug. :)

We are currently using a workaround that serializes the actual completion of the
workload against the schedule-out handling in shadow_context_status_change:
https://github.com/blitz/linux/commit/50a1cfd0695f7c141d16377c087a3642faee9b99
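
The idea is roughly the following. This is a sketch of the approach rather
than the literal commit, and the lock name is made up. As far as we can tell
the status notifier runs in atomic context (the execlists submission
tasklet), so a sleeping lock is not an option:

/* Hypothetical lock, e.g. per engine in the GVT scheduler: */
spinlock_t notifier_lock;       /* spin_lock_init() at setup time */

/* shadow_context_status_change(): look up and touch the current
 * workload only while holding the lock. */
spin_lock(&scheduler->notifier_lock);
workload = scheduler->current_workload[rq->engine->id];
if (workload) {
        /* ... existing SCHEDULE_IN / SCHEDULE_OUT handling ... */
}
spin_unlock(&scheduler->notifier_lock);

/* complete_current_workload(): detach the workload under the same
 * lock before tearing it down, so a late notifier call finds NULL
 * instead of a half-destroyed workload. (_bh variants because the
 * notifier runs from a tasklet and this is process context.) */
spin_lock_bh(&scheduler->notifier_lock);
scheduler->current_workload[ring_id] = NULL;
spin_unlock_bh(&scheduler->notifier_lock);
/* ... copy the shadow context back to the guest, free the workload ... */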

This is not pretty, but so far this has prevented the issue from popping up
again.

Thanks,
Julian


