[Intel-xe] [PATCH v3] drm/xe: Fix pagefault and access counter worker functions

Fri Nov 3 15:20:43 UTC 2023

On Fri, Nov 03, 2023 at 03:08:41PM +0000, Zeng, Oak wrote:
> 
> 
> > -----Original Message-----
> > From: Summers, Stuart <stuart.summers at intel.com>
> > Sent: Friday, November 3, 2023 10:51 AM
> > To: intel-xe at lists.freedesktop.org; Welty, Brian <brian.welty at intel.com>; Zeng,
> > Oak <oak.zeng at intel.com>
> > Subject: Re: [Intel-xe] [PATCH v3] drm/xe: Fix pagefault and access counter
> > worker functions
> > 
> > On Fri, 2023-11-03 at 02:46 +0000, Zeng, Oak wrote:
> > > Hi,
> > >
> > > Can someone explain why we introduced an circular buffer for page
> > > fault handling in xekmd?
> > >
> > > I915 page fault is also handled in a work queue/worker way. Whenever
> > > we have a guc2host request, we queue_work to process the incoming
> > > request. No circular buffer. We didn't have a problem there.
> > 
> > i915 also uses a circular buffer:
> 
> By circular buffer, I meant the struct pf_queue::data. Xekmd copy the page fault request to this buffer and process it in a worker. This buffer is a xekmd internal thing.
> 
> While in below code i915 handle request directly in a loop. The request circular buffer is co-maintained by kmd and guc.  No above intermediate buffer.
> 

In Xe we have a dedicated worker / queue to sink page faults into
(pf_queue::data). This is needed as faults cannot be directly processed
in the CT handler as this in the dma-fence signaling path (no memomry
allocations allowed) as this will deadlock if the CT can't independently
make forward progress. Fault are pushed this worker / queue from the CT
handler. As we can't allocate memory in the CT this worker / queue is a
preallocated circular buffer. I hope this clears this up.

Matt

> > static noinline void ct_incoming_request_worker_func(struct work_struct
> > *w)
> > {
> >         struct intel_guc_ct *ct =
> >                 container_of(w, struct intel_guc_ct, requests.worker);
> >         bool done;
> > 
> >         do {
> >                 done = ct_process_incoming_requests(ct, &ct-
> > >requests.incoming);
> >         } while (!done);
> > }
> > 
> > The issue with Xe here is the queue_work function is used to queue each
> > incoming fault. If we get too many concurrent faults though, there's a
> > chance we could queue something while a prior fault is in the process
> > of being launched. In that case the queue_work() function will return
> > false and we miss the fault handling.
> > 
> > In the i915 case (see above), we do handle this in a worker, but we
> > continue to handle requests from the GuC until nothing is left to
> > handle - circular buffer. In that case, the worker queue processing
> > those incoming CT requests shouldn't ever hit that corner case of
> > queue_work() returning false.
> 
> So isn't i915 way better?
> 
> Oak
> > 
> > Thanks,
> > Stuart
> > 
> > >
> > > Thanks,
> > > Oak
> > >
> > > > -----Original Message-----
> > > > From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf
> > > > Of Brian Welty
> > > > Sent: Thursday, November 2, 2023 9:43 PM
> > > > To: intel-xe at lists.freedesktop.org
> > > > Subject: [Intel-xe] [PATCH v3] drm/xe: Fix pagefault and access
> > > > counter worker
> > > > functions
> > > >
> > > > When processing G2H messages for pagefault or access counters, we
> > > > queue a
> > > > work item and call queue_work(). This fails if the worker thread is
> > > > already
> > > > queued to run.
> > > > The expectation is that the worker function will do more than
> > > > process a
> > > > single item and return. It needs to either process all pending
> > > > items or
> > > > requeue itself if items are pending. But requeuing will add latency
> > > > and
> > > > potential context switch can occur.
> > > >
> > > > We don't want to add unnecessary latency and so the worker should
> > > > process
> > > > as many faults as it can within a reasonable duration of time.
> > > > We also do not want to hog the cpu core, so here we execute in a
> > > > loop
> > > > and requeue if still running after more than 20 ms.
> > > > This seems reasonable framework and easy to tune this further if
> > > > needed.
> > > >
> > > > This resolves issues seen with several igt at xe_exec_fault_mode
> > > > subtests
> > > > where the GPU will hang when KMD ignores a pending pagefault.
> > > >
> > > > v2: requeue the worker instead of having an internal processing
> > > > loop.
> > > > v3: implement hybrid model of v1 and v2
> > > >     now, run for 20 msec before we will requeue if still running
> > > >
> > > > Signed-off-by: Brian Welty <brian.welty at intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_gt_pagefault.c | 80 ++++++++++++++++--------
> > > > ----
> > > >  1 file changed, 46 insertions(+), 34 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > index ab6daebbd77c..c99af751d1fb 100644
> > > > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > > > @@ -276,10 +276,10 @@ static void print_pagefault(struct xe_device
> > > > *xe, struct
> > > > pagefault *pf)
> > > >
> > > >  #define PF_MSG_LEN_DW  4
> > > >
> > > > -static int get_pagefault(struct pf_queue *pf_queue, struct
> > > > pagefault *pf)
> > > > +static bool get_pagefault(struct pf_queue *pf_queue, struct
> > > > pagefault *pf)
> > > >  {
> > > >         const struct xe_guc_pagefault_desc *desc;
> > > > -       int ret = 0;
> > > > +       bool ret = false;
> > > >
> > > >         spin_lock_irq(&pf_queue->lock);
> > > >         if (pf_queue->head != pf_queue->tail) {
> > > > @@ -303,8 +303,7 @@ static int get_pagefault(struct pf_queue
> > > > *pf_queue, struct
> > > > pagefault *pf)
> > > >
> > > >                 pf_queue->head = (pf_queue->head + PF_MSG_LEN_DW) %
> > > >                         PF_QUEUE_NUM_DW;
> > > > -       } else {
> > > > -               ret = -1;
> > > > +               ret = true;
> > > >         }
> > > >         spin_unlock_irq(&pf_queue->lock);
> > > >
> > > > @@ -348,6 +347,8 @@ int xe_guc_pagefault_handler(struct xe_guc
> > > > *guc, u32 *msg,
> > > > u32 len)
> > > >         return full ? -ENOSPC : 0;
> > > >  }
> > > >
> > > > +#define USM_QUEUE_MAX_RUNTIME_MS       20
> > > > +
> > > >  static void pf_queue_work_func(struct work_struct *w)
> > > >  {
> > > >         struct pf_queue *pf_queue = container_of(w, struct
> > > > pf_queue, worker);
> > > > @@ -355,31 +356,37 @@ static void pf_queue_work_func(struct
> > > > work_struct *w)
> > > >         struct xe_device *xe = gt_to_xe(gt);
> > > >         struct xe_guc_pagefault_reply reply = {};
> > > >         struct pagefault pf = {};
> > > > +       unsigned long threshold;
> > > >         int ret;
> > > >
> > > > -       ret = get_pagefault(pf_queue, &pf);
> > > > -       if (ret)
> > > > -               return;
> > > > +       threshold = jiffies +
> > > > msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> > > >
> > > > -       ret = handle_pagefault(gt, &pf);
> > > > -       if (unlikely(ret)) {
> > > > -               print_pagefault(xe, &pf);
> > > > -               pf.fault_unsuccessful = 1;
> > > > -               drm_dbg(&xe->drm, "Fault response: Unsuccessful
> > > > %d\n", ret);
> > > > -       }
> > > > +       while (get_pagefault(pf_queue, &pf)) {
> > > > +               ret = handle_pagefault(gt, &pf);
> > > > +               if (unlikely(ret)) {
> > > > +                       print_pagefault(xe, &pf);
> > > > +                       pf.fault_unsuccessful = 1;
> > > > +                       drm_dbg(&xe->drm, "Fault response:
> > > > Unsuccessful %d\n",
> > > > ret);
> > > > +               }
> > > > +
> > > > +               reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
> > > > +                       FIELD_PREP(PFR_SUCCESS,
> > > > pf.fault_unsuccessful) |
> > > > +                       FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> > > > +                       FIELD_PREP(PFR_DESC_TYPE,
> > > > FAULT_RESPONSE_DESC) |
> > > > +                       FIELD_PREP(PFR_ASID, pf.asid);
> > > >
> > > > -       reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
> > > > -               FIELD_PREP(PFR_SUCCESS, pf.fault_unsuccessful) |
> > > > -               FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> > > > -               FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
> > > > -               FIELD_PREP(PFR_ASID, pf.asid);
> > > > +               reply.dw1 = FIELD_PREP(PFR_VFID, pf.vfid) |
> > > > +                       FIELD_PREP(PFR_ENG_INSTANCE,
> > > > pf.engine_instance) |
> > > > +                       FIELD_PREP(PFR_ENG_CLASS, pf.engine_class)
> > > > |
> > > > +                       FIELD_PREP(PFR_PDATA, pf.pdata);
> > > >
> > > > -       reply.dw1 = FIELD_PREP(PFR_VFID, pf.vfid) |
> > > > -               FIELD_PREP(PFR_ENG_INSTANCE, pf.engine_instance) |
> > > > -               FIELD_PREP(PFR_ENG_CLASS, pf.engine_class) |
> > > > -               FIELD_PREP(PFR_PDATA, pf.pdata);
> > > > +               send_pagefault_reply(&gt->uc.guc, &reply);
> > > >
> > > > -       send_pagefault_reply(&gt->uc.guc, &reply);
> > > > +               if (time_after(jiffies, threshold)) {
> > > > +                       queue_work(gt->usm.pf_wq, w);
> > > > +                       break;
> > > > +               }
> > > > +       }
> > > >  }
> > > >
> > > >  static void acc_queue_work_func(struct work_struct *w);
> > > > @@ -544,10 +551,10 @@ static int handle_acc(struct xe_gt *gt,
> > > > struct acc *acc)
> > > >
> > > >  #define ACC_MSG_LEN_DW        4
> > > >
> > > > -static int get_acc(struct acc_queue *acc_queue, struct acc *acc)
> > > > +static bool get_acc(struct acc_queue *acc_queue, struct acc *acc)
> > > >  {
> > > >         const struct xe_guc_acc_desc *desc;
> > > > -       int ret = 0;
> > > > +       bool ret = false;
> > > >
> > > >         spin_lock(&acc_queue->lock);
> > > >         if (acc_queue->head != acc_queue->tail) {
> > > > @@ -567,8 +574,7 @@ static int get_acc(struct acc_queue *acc_queue,
> > > > struct acc
> > > > *acc)
> > > >
> > > >                 acc_queue->head = (acc_queue->head +
> > > > ACC_MSG_LEN_DW) %
> > > >                                   ACC_QUEUE_NUM_DW;
> > > > -       } else {
> > > > -               ret = -1;
> > > > +               ret = true;
> > > >         }
> > > >         spin_unlock(&acc_queue->lock);
> > > >
> > > > @@ -581,16 +587,22 @@ static void acc_queue_work_func(struct
> > > > work_struct *w)
> > > >         struct xe_gt *gt = acc_queue->gt;
> > > >         struct xe_device *xe = gt_to_xe(gt);
> > > >         struct acc acc = {};
> > > > +       unsigned long threshold;
> > > >         int ret;
> > > >
> > > > -       ret = get_acc(acc_queue, &acc);
> > > > -       if (ret)
> > > > -               return;
> > > > +       threshold = jiffies +
> > > > msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> > > >
> > > > -       ret = handle_acc(gt, &acc);
> > > > -       if (unlikely(ret)) {
> > > > -               print_acc(xe, &acc);
> > > > -               drm_warn(&xe->drm, "ACC: Unsuccessful %d\n", ret);
> > > > +       while (get_acc(acc_queue, &acc)) {
> > > > +               ret = handle_acc(gt, &acc);
> > > > +               if (unlikely(ret)) {
> > > > +                       print_acc(xe, &acc);
> > > > +                       drm_warn(&xe->drm, "ACC: Unsuccessful
> > > > %d\n", ret);
> > > > +               }
> > > > +
> > > > +               if (time_after(jiffies, threshold)) {
> > > > +                       queue_work(gt->usm.acc_wq, w);
> > > > +                       break;
> > > > +               }
> > > >         }
> > > >  }
> > > >
> > > > --
> > > > 2.38.0
> > >
>