GVT Scheduler

Wang, Zhi A zhi.a.wang at intel.com
Mon Oct 19 06:11:43 UTC 2020


Hi Julian:

Following our discussion last time, I reviewed all the code paths of execlist context schedule-in and schedule-out in your code repo. Based on our assumption, there might be an extra execlist schedule-out status notification. Is it possible for you to enable the tracepoints in execlist_context_schedule_in and execlist_context_schedule_out in intel_lrc.c, so that we can check whether there are unpaired schedule_in and schedule_out events? Also, it would be better to move trace_i915_request_in into __execlists_schedule_in, as there is seqlock-like synchronization (try_cmpxchg) around it.
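[Editor's note: to spot unpaired events in the resulting trace output, a small script like the following can tally schedule_in against schedule_out per context. This is a hypothetical sketch; the exact line format emitted by the i915 tracepoints differs, and the `ctx=` token used here is an assumed placeholder.]

```python
from collections import Counter

def find_unpaired(trace_lines):
    """Count schedule_in/schedule_out events per context ID and return
    the contexts whose counts do not balance out.

    Assumes each relevant trace line contains 'schedule_in' or
    'schedule_out' plus a 'ctx=<id>' token; adapt the parsing to the
    real tracepoint format before use.
    """
    balance = Counter()
    for line in trace_lines:
        ctx = None
        for tok in line.split():
            if tok.startswith("ctx="):
                ctx = tok[len("ctx="):]
        if ctx is None:
            continue
        # Check schedule_out first: neither string is a substring of the other,
        # but being explicit keeps the intent clear.
        if "schedule_out" in line:
            balance[ctx] -= 1
        elif "schedule_in" in line:
            balance[ctx] += 1
    # A non-zero balance means an extra (or missing) notification.
    return {ctx: n for ctx, n in balance.items() if n != 0}

lines = [
    "execlists_context_schedule_in ctx=5",
    "execlists_context_schedule_out ctx=5",
    "execlists_context_schedule_out ctx=7",  # extra schedule-out, no matching schedule-in
]
unpaired = find_unpaired(lines)
```

Running this over a captured trace would flag any context with a surplus schedule-out, which is exactly the symptom suspected above.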

-----Original Message-----
From: intel-gvt-dev <intel-gvt-dev-bounces at lists.freedesktop.org> On Behalf Of Julian Stecklina
Sent: Friday, October 9, 2020 12:26 PM
To: Wang, Zhi A <zhi.a.wang at intel.com>; Intel GVT Dev <intel-gvt-dev at lists.freedesktop.org>
Cc: Thomas Prescher <thomas.prescher at cyberus-technology.de>
Subject: Re: GVT Scheduler

Hi Zhi,

your explanation is really helpful. Thank you! See my comments below.

On Thu, 2020-10-08 at 16:41 +0000, Wang, Zhi A wrote:
> Now let's see the timeline:
> 
> - GVT-g submits a workload to i915.
> - i915 appends the breadcrumb to the workload.
> - i915 submits the workload to HW.
> - GVT-g calls i915_wait_request to wait for the GPU execution to pass 
> the breadcrumb. (But the context might not be switched out at this 
> time.)
> - GVT-g waits for the context to be switched out via 
> shadow_context_status_change. (This is because GVT-g needs to copy the 
> content of the shadow context back to the guest context, and the shadow 
> context must be idle at this time.)
> - Once no one is going to touch the shadow context anymore, GVT-g calls 
> complete_current_workload.
> 
> The race between shadow_context_status_change and 
> complete_current_workload should be addressed in our design. So this 
> problem might be caused by an i915 change, e.g. the timing of calling 
> shadow_context_status_change has changed. But we will double-check in GVT-g as well.

We definitely see shadow_context_status_change being called for a workload that has already passed beyond wait_event(workload->shadow_ctx_status_wq, ...); in complete_current_workload.

> The patch you mentioned is for a corner case in GPU reset. But this 
> shouldn't happen in a normal submission flow unless someone breaks the flow above.

The problem for us is that we can only reproduce this issue in a hardened Linux build after many hours. So it's not exactly the most friendly issue to debug. :)

We are currently using a workaround that serializes the actual completion of the workload against the handling of schedule-out in shadow_context_status_change:
https://github.com/blitz/linux/commit/50a1cfd0695f7c141d16377c087a3642faee9b99

This is not pretty, but so far this has prevented the issue from popping up again.

Thanks,
Julian

_______________________________________________
intel-gvt-dev mailing list
intel-gvt-dev at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev
