[Intel-xe] [PATCH] drm/xe: Use large rings for compute contexts

Hellstrom, Thomas thomas.hellstrom at intel.com
Thu Jun 8 12:54:45 UTC 2023


On Mon, 2023-06-05 at 17:04 +0000, Matthew Brost wrote:
> On Mon, Jun 05, 2023 at 10:33:29AM +0200, Thomas Hellström wrote:
> > Hi, Niranjana,
> > 
> > On 6/2/23 20:07, Niranjana Vishwanathapura wrote:
> > > Allow compute contexts to submit the maximal amount of work
> > > without blocking userspace.
> > > 
> > > The original size for user LRC rings (SZ_16K) was chosen to
> > > minimise memory consumption, without being so small as to
> > > frequently stall in the middle of workloads. With the main
> > > consumers being GL / media pipelines of 2 or 3 batches per frame,
> > > we want to support ~10 requests in flight to allow the
> > > application to control throttling without stalling within a
> > > frame.
> > > 
> > > Signed-off-by: Chris Wilson <chris.p.wilson at intel.com>
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
> > > Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
> > > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura at intel.com>
> > 
> > With the DRM scheduler, jobs are written to the ring from a drm
> > scheduler notifier when all dependencies are satisfied and there is
> > ring space available, and I believe that's the case even after
> > Matthew Brost's changes to xe / drm-scheduler on how compute jobs
> > are handled.
> > 
> > So this patch shouldn't be needed for xe.
> > 
> > /Thomas
> 
> On the tip of Xe we just hold jobs in the DRM scheduler if there
> isn't room in the ring and write the ring in run_job once space
> frees up. But I guess we can't do this because of the dma-fencing
> rules for long-running jobs (compute or fault-mode VMs), so after my
> update we return -EWOULDBLOCK in the exec IOCTL [1]. So yes, this is
> needed if the user will regularly write more than 85 jobs (16K /
> 192) at a time on long-running VMs.
> 
> [1]
> https://patchwork.freedesktop.org/patch/538665/?series=118091&rev=3

Hmm. Yes, indeed, I had forgotten about the -EWOULDBLOCK since the
review. OTOH, IIRC this patch was meant to handle a situation where a
compute batch was made to wait on a later batch being submitted, but
that submission never happened due to lack of ring space. So deadlock.
But the patch was considered a temporary solution, and everybody agreed
that the correct solution would be for the UMD to implement the needed
buffering here.

Now that this is going upstream is actually a good time to make that
happen, IMO, but it probably needs a sync with the compute UMD folks.
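
As a side note on the numbers Matt quotes: the ~85 figure assumes
roughly 192 bytes of ring space per job, so SZ_16K / 192 ~= 85, while
SZ_512K / 192 would be ~2730. A rough illustration only; the 192-byte
per-job footprint is just the figure used in this thread, not a
constant taken from the driver:

    #include <linux/sizes.h>
    #include <linux/types.h>

    /* ~192 bytes of ring space per job, as assumed in this thread. */
    #define APPROX_JOB_RING_BYTES 192u

    static inline u32 approx_jobs_per_ring(u32 ring_size)
    {
            /* SZ_16K / 192 ~= 85, SZ_512K / 192 ~= 2730 */
            return ring_size / APPROX_JOB_RING_BYTES;
    }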

/Thomas

> 
> > 
> > > ---
> > >   drivers/gpu/drm/xe/xe_engine.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_engine.c b/drivers/gpu/drm/xe/xe_engine.c
> > > index b3036c4a8ec3..47c9f119fb29 100644
> > > --- a/drivers/gpu/drm/xe/xe_engine.c
> > > +++ b/drivers/gpu/drm/xe/xe_engine.c
> > > @@ -29,6 +29,7 @@ static struct xe_engine *__xe_engine_create(struct xe_device *xe,
> > >   {
> > >         struct xe_engine *e;
> > >         struct xe_gt *gt = hwe->gt;
> > > +       u32 ring_size;
> > >         int err;
> > >         int i;
> > > @@ -65,8 +66,9 @@ static struct xe_engine *__xe_engine_create(struct xe_device *xe,
> > >                 e->bind.fence_seqno = XE_FENCE_INITIAL_SEQNO;
> > >         }
> > > +       ring_size = (e->class == XE_ENGINE_CLASS_COMPUTE) ? SZ_512K : SZ_16K;
> 
> SZ_512K seems like overkill. Also we'd want to check whether the VM
> is a long-running one, as Thomas is correct that this is not needed
> for normal VMs.
> 
> Matt
> 
> > >         for (i = 0; i < width; ++i) {
> > > -               err = xe_lrc_init(e->lrc + i, hwe, e, vm, SZ_16K);
> > > +               err = xe_lrc_init(e->lrc + i, hwe, e, vm, ring_size);
> > >                 if (err)
> > >                         goto err_lrc;
> > >         }
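
For reference, a minimal sketch of the direction Matt suggests above
(only bump the ring size for compute engines on long-running VMs). The
helper name xe_engine_ring_size() and the long-running check
xe_vm_no_dma_fences() are assumptions for illustration; the exact
predicate and the final size would be up to review:

    /*
     * Sketch only: pick a larger ring for compute engines on
     * long-running (dma-fence-free) VMs, where the exec IOCTL returns
     * -EWOULDBLOCK instead of waiting for ring space. The helper below
     * and xe_vm_no_dma_fences() are illustrative names, not code from
     * this series.
     */
    static u32 xe_engine_ring_size(struct xe_engine *e, struct xe_vm *vm)
    {
            if (e->class == XE_ENGINE_CLASS_COMPUTE &&
                xe_vm_no_dma_fences(vm))
                    return SZ_512K; /* or a smaller bump, per review */

            return SZ_16K;
    }

    /* In __xe_engine_create(), something like: */
    /*     err = xe_lrc_init(e->lrc + i, hwe, e, vm,
     *                       xe_engine_ring_size(e, vm)); */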
