[Intel-xe] [PATCH 1/3] drm/xe: Include hardware prefetch buffer in batchbuffer allocations

Matt Roper matthew.d.roper at intel.com
Fri Mar 31 20:34:30 UTC 2023


On Fri, Mar 31, 2023 at 01:14:15PM -0700, Lucas De Marchi wrote:
> On Wed, Mar 29, 2023 at 10:33:32AM -0700, Matt Roper wrote:
> > The hardware prefetches several cachelines of data from batchbuffers
> > before they are parsed.  This prefetching only stops when the parser
> > encounters an MI_BATCH_BUFFER_END instruction (or a nested
> > MI_BATCH_BUFFER_START), so we must ensure that there is enough padding
> > at the end of the batchbuffer to prevent the prefetcher from running
> > past the end of the allocation and potentially faulting.
> > 
> > Bspec: 45717
> > Signed-off-by: Matt Roper <matthew.d.roper at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_bb.c | 25 +++++++++++++++++++++++--
> > 1 file changed, 23 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_bb.c b/drivers/gpu/drm/xe/xe_bb.c
> > index 5b24018e2a80..f326f117ba3b 100644
> > --- a/drivers/gpu/drm/xe/xe_bb.c
> > +++ b/drivers/gpu/drm/xe/xe_bb.c
> > @@ -8,11 +8,26 @@
> > #include "regs/xe_gpu_commands.h"
> > #include "xe_device.h"
> > #include "xe_engine_types.h"
> > +#include "xe_gt.h"
> > #include "xe_hw_fence.h"
> > #include "xe_sa.h"
> > #include "xe_sched_job.h"
> > #include "xe_vm_types.h"
> > 
> > +static int bb_prefetch(struct xe_gt *gt)
> > +{
> > +	struct xe_device *xe = gt->xe;
> > +
> > +	if (GRAPHICS_VERx100(xe) >= 1250 && !xe_gt_is_media_type(gt))
> > +		/*
> > +		 * RCS and CCS require 1K, although other engines would be
> > +		 * okay with 512.
> > +		 */
> > +		return SZ_1K;
> > +	else
> > +		return SZ_512;
> > +}
> > +
> > struct xe_bb *xe_bb_new(struct xe_gt *gt, u32 dwords, bool usm)
> > {
> > 	struct xe_bb *bb = kmalloc(sizeof(*bb), GFP_KERNEL);
> > @@ -21,8 +36,14 @@ struct xe_bb *xe_bb_new(struct xe_gt *gt, u32 dwords, bool usm)
> > 	if (!bb)
> > 		return ERR_PTR(-ENOMEM);
> > 
> > -	bb->bo = xe_sa_bo_new(!usm ? &gt->kernel_bb_pool :
> > -			      &gt->usm.bb_pool, 4 * dwords + 4);
> > +	/*
> > +	 * We need to allocate space for the requested number of dwords,
> > +	 * one additional MI_BATCH_BUFFER_END dword, and additional buffer
> > +	 * space to accommodate the platform-specific hardware prefetch
> > +	 * requirements.
> > +	 */
> > +	bb->bo = xe_sa_bo_new(!usm ? &gt->kernel_bb_pool : &gt->usm.bb_pool,
> > +			      4 * (dwords + 1) + bb_prefetch(gt));
> 
> if the command buffer for the CS is 512 or 1024, wouldn't it be
> sufficient to just align the end rather than increase it by that?

I thought that initially, but the spec indicates that the prefetch
doesn't just happen in .5K chunks or whatever, but that they're
continuously happening as each command in the batch buffer is executed.
So if you complete a 1 DW instruction, a new DW is prefetched at the
current HEAD + 0.5K.  If that falls outside the batchbuffer (and
potentially outside the bound pages), the hardware can fault.

The MI_BATCH_BUFFER_START bspec page (45718) also gives some more
clarity on this:

        "It keeps fetching command data as and when space is available
        in the storage upon execution of commands. In case of batch
        buffer execution, DMA engine stops prefetching the command data
        only on executing MI_BATCH_BUFFER_END command and
        MI_BATCH_BUFFER_START in case of chained batch buffers."


Matt

> 
> Lucas De Marchi

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation
