[Intel-gfx] [RFC 2/2] drm/i915/migrate: Evict and restore the ccs data

Hellstrom, Thomas thomas.hellstrom at intel.com
Mon Feb 7 15:22:33 UTC 2022


On Mon, 2022-02-07 at 20:44 +0530, Ramalingam C wrote:
> On 2022-02-07 at 20:25:42 +0530, Hellstrom, Thomas wrote:
> > Hi, Ram,
> > 
> > A couple of quick questions before starting a more detailed review:
> > 
> > 1) Does this also support migrating of compressed data LMEM->LMEM?
> > What-about inter-tile?
> Honestly this series mainly facused on eviction of lmem into smem and
> restoration of same.
> 
> To cover migration, we need to handle this differently from eviction.
> Becasue when we migrate the compressed content we need to be able to
> use
> that from that new placement. can't keep the ccs data separately.
> 
> Migration of lmem->smem needs decompression incorportated.
> Migration of lmem_m->lmem_n needs to maintain the
> compressed/decompressed state as it is.
> 
> So we need to pass the information upto emit_copy to differentiate
> eviction and migration
> 
> If you dont have objection I would like to take the migration once we
> have the eviction of lmem in place.

Sure NP. I was thinking that in the final solution we might also need
to think about the possibility that we might evict to another lmem
region, although I figure that won't be enabled until we support multi-
tile.

> 
> > 
> > 2) Do we need to block faulting of compressed data in the fault
> > handler
> > as a follow-up patch?
> 
> In case of evicted compressed data we dont need to treat it
> differently
> from the evicted normal data. So I dont think this needs a special
> treatment. Sorry if i dont understand your question.

My question wasn't directly related to eviction actually, but does
user-space need to have mmap access to compressed data? If not, block
it?

Thanks,
Thomas



> 
> Ram
> > 
> > /Thomas
> > 
> > 
> > On Mon, 2022-02-07 at 15:07 +0530, Ramalingam C wrote:
> > > When we are swapping out the local memory obj on flat-ccs capable
> > > platform,
> > > we need to capture the ccs data too along with main meory and we
> > > need
> > > to
> > > restore it when we are swapping in the content.
> > > 
> > > Extracting and restoring the CCS data is done through a special
> > > cmd
> > > called
> > > XY_CTRL_SURF_COPY_BLT
> > > 
> > > Signed-off-by: Ramalingam C <ramalingam.c at intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_migrate.c | 283 +++++++++++++-----
> > > ----
> > > --
> > >  1 file changed, 155 insertions(+), 128 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > index 5bdab0b3c735..e60ae6ff1847 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> > > @@ -449,14 +449,146 @@ static bool wa_1209644611_applies(int ver,
> > > u32
> > > size)
> > >         return height % 4 == 3 && height <= 8;
> > >  }
> > > 
> > > +/**
> > > + * DOC: Flat-CCS - Memory compression for Local memory
> > > + *
> > > + * On Xe-HP and later devices, we use dedicated compression
> > > control
> > > state (CCS)
> > > + * stored in local memory for each surface, to support the 3D
> > > and
> > > media
> > > + * compression formats.
> > > + *
> > > + * The memory required for the CCS of the entire local memory is
> > > 1/256 of the
> > > + * local memory size. So before the kernel boot, the required
> > > memory
> > > is reserved
> > > + * for the CCS data and a secure register will be programmed
> > > with
> > > the CCS base
> > > + * address.
> > > + *
> > > + * Flat CCS data needs to be cleared when a lmem object is
> > > allocated.
> > > + * And CCS data can be copied in and out of CCS region through
> > > + * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > directly.
> > > + *
> > > + * When we exaust the lmem, if the object's placements support
> > > smem,
> > > then we can
> > > + * directly decompress the compressed lmem object into smem and
> > > start using it
> > > + * from smem itself.
> > > + *
> > > + * But when we need to swapout the compressed lmem object into a
> > > smem region
> > > + * though objects' placement doesn't support smem, then we copy
> > > the
> > > lmem content
> > > + * as it is into smem region along with ccs data (using
> > > XY_CTRL_SURF_COPY_BLT).
> > > + * When the object is referred, lmem content will be swaped in
> > > along
> > > with
> > > + * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > corresponding
> > > + * location.
> > > + *
> > > + *
> > > + * Flat-CCS Modifiers for different compression formats
> > > + * ----------------------------------------------------
> > > + *
> > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > + * render compression formats. Though the general layout is same
> > > as
> > > + * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > algorithm is
> > > + * used. Render compression uses 128 byte compression blocks
> > > + *
> > > + * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > buffers
> > > of Flat CCS
> > > + * media compression formats. Though the general layout is same
> > > as
> > > + * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > algorithm is
> > > + * used. Media compression uses 256 byte compression blocks.
> > > + *
> > > + * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > buffers of Flat
> > > + * CCS clear color render compression formats. Unified
> > > compression
> > > format for
> > > + * clear color render compression. The genral layout is a tiled
> > > layout using
> > > + * 4Kb tiles i.e Tile4 layout.
> > > + */
> > > +
> > > +static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > +{
> > > +       /* Mask the 3 LSB to use the PPGTT address space */
> > > +       *cmd++ = MI_FLUSH_DW | flags;
> > > +       *cmd++ = lower_32_bits(dst);
> > > +       *cmd++ = upper_32_bits(dst);
> > > +
> > > +       return cmd;
> > > +}
> > > +
> > > +static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > *i915,
> > > int size)
> > > +{
> > > +       u32 num_cmds, num_blks, total_size;
> > > +
> > > +       if (!GET_CCS_SIZE(i915, size))
> > > +               return 0;
> > > +
> > > +       /*
> > > +        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > +        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > +        * trnasfer upto 1024 blocks.
> > > +        */
> > > +       num_blks = GET_CCS_SIZE(i915, size);
> > > +       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > 10;
> > > +       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > +
> > > +       /*
> > > +        * We need to add a flush before and after
> > > +        * XY_CTRL_SURF_COPY_BLT
> > > +        */
> > > +       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > +       return total_size;
> > > +}
> > > +
> > > +static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > dst_addr,
> > > +                                    u8 src_mem_access, u8
> > > dst_mem_access,
> > > +                                    int src_mocs, int dst_mocs,
> > > +                                    u16 num_ccs_blocks)
> > > +{
> > > +       int i = num_ccs_blocks;
> > > +
> > > +       /*
> > > +        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > the
> > > CCS
> > > +        * data in and out of the CCS region.
> > > +        *
> > > +        * We can copy at most 1024 blocks of 256 bytes using one
> > > +        * XY_CTRL_SURF_COPY_BLT instruction.
> > > +        *
> > > +        * In case we need to copy more than 1024 blocks, we need
> > > to
> > > add
> > > +        * another instruction to the same batch buffer.
> > > +        *
> > > +        * 1024 blocks of 256 bytes of CCS represent a total
> > > 256KB of
> > > CCS.
> > > +        *
> > > +        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > +        */
> > > +       do {
> > > +               /*
> > > +                * We use logical AND with 1023 since the size
> > > field
> > > +                * takes values which is in the range of 0 - 1023
> > > +                */
> > > +               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > +                         (src_mem_access <<
> > > SRC_ACCESS_TYPE_SHIFT) |
> > > +                         (dst_mem_access <<
> > > DST_ACCESS_TYPE_SHIFT) |
> > > +                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > +               *cmd++ = lower_32_bits(src_addr);
> > > +               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > +                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > +               *cmd++ = lower_32_bits(dst_addr);
> > > +               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > +                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > +               src_addr += SZ_64M;
> > > +               dst_addr += SZ_64M;
> > > +               i -= NUM_CCS_BLKS_PER_XFER;
> > > +       } while (i > 0);
> > > +
> > > +       return cmd;
> > > +}
> > > +
> > >  static int emit_copy(struct i915_request *rq,
> > > -                    u32 dst_offset, u32 src_offset, int size)
> > > +                    bool dst_is_lmem, u32 dst_offset,
> > > +                    bool src_is_lmem, u32 src_offset, int size)
> > >  {
> > > +       struct drm_i915_private *i915 = rq->engine->i915;
> > >         const int ver = GRAPHICS_VER(rq->engine->i915);
> > >         u32 instance = rq->engine->instance;
> > > +       u32 num_ccs_blks, ccs_ring_size;
> > > +       u8 src_access, dst_access;
> > >         u32 *cs;
> > > 
> > > -       cs = intel_ring_begin(rq, ver >= 8 ? 10 : 6);
> > > +       ccs_ring_size = ((src_is_lmem || dst_is_lmem) &&
> > > HAS_FLAT_CCS(i915)) ?
> > > +                        calc_ctrl_surf_instr_size(i915, size) :
> > > 0;
> > > +
> > > +       cs = intel_ring_begin(rq, ver >= 8 ? 10 + ccs_ring_size :
> > > 6);
> > >         if (IS_ERR(cs))
> > >                 return PTR_ERR(cs);
> > > 
> > > @@ -492,6 +624,25 @@ static int emit_copy(struct i915_request
> > > *rq,
> > >                 *cs++ = src_offset;
> > >         }
> > > 
> > > +       if (ccs_ring_size) {
> > > +               /* TODO: Migration needs to be handled with
> > > resolve
> > > of compressed data */
> > > +               num_ccs_blks = (GET_CCS_SIZE(i915, size) +
> > > +                               NUM_CCS_BYTES_PER_BLOCK - 1) >>
> > > 8;
> > > +
> > > +               src_access = !src_is_lmem && dst_is_lmem;
> > > +               dst_access = !src_access;
> > > +
> > > +               if (src_access) /* Swapin of compressed data */
> > > +                       src_offset += size;
> > > +               else
> > > +                       dst_offset += size;
> > > +
> > > +               cs = _i915_ctrl_surf_copy_blt(cs, src_offset,
> > > dst_offset,
> > > +                                             src_access,
> > > dst_access,
> > > +                                             1, 1,
> > > num_ccs_blks);
> > > +               cs = i915_flush_dw(cs, dst_offset, MI_FLUSH_LLC |
> > > MI_FLUSH_CCS);
> > > +       }
> > > +
> > >         intel_ring_advance(rq, cs);
> > >         return 0;
> > >  }
> > > @@ -578,7 +729,8 @@ intel_context_migrate_copy(struct
> > > intel_context
> > > *ce,
> > >                 if (err)
> > >                         goto out_rq;
> > > 
> > > -               err = emit_copy(rq, dst_offset, src_offset, len);
> > > +               err = emit_copy(rq, dst_is_lmem, dst_offset,
> > > +                               src_is_lmem, src_offset, len);
> > > 
> > >                 /* Arbitration is re-enabled between requests. */
> > >  out_rq:
> > > @@ -596,131 +748,6 @@ intel_context_migrate_copy(struct
> > > intel_context
> > > *ce,
> > >         return err;
> > >  }
> > > 
> > > -/**
> > > - * DOC: Flat-CCS - Memory compression for Local memory
> > > - *
> > > - * On Xe-HP and later devices, we use dedicated compression
> > > control
> > > state (CCS)
> > > - * stored in local memory for each surface, to support the 3D
> > > and
> > > media
> > > - * compression formats.
> > > - *
> > > - * The memory required for the CCS of the entire local memory is
> > > 1/256 of the
> > > - * local memory size. So before the kernel boot, the required
> > > memory
> > > is reserved
> > > - * for the CCS data and a secure register will be programmed
> > > with
> > > the CCS base
> > > - * address.
> > > - *
> > > - * Flat CCS data needs to be cleared when a lmem object is
> > > allocated.
> > > - * And CCS data can be copied in and out of CCS region through
> > > - * XY_CTRL_SURF_COPY_BLT. CPU can't access the CCS data
> > > directly.
> > > - *
> > > - * When we exaust the lmem, if the object's placements support
> > > smem,
> > > then we can
> > > - * directly decompress the compressed lmem object into smem and
> > > start using it
> > > - * from smem itself.
> > > - *
> > > - * But when we need to swapout the compressed lmem object into a
> > > smem region
> > > - * though objects' placement doesn't support smem, then we copy
> > > the
> > > lmem content
> > > - * as it is into smem region along with ccs data (using
> > > XY_CTRL_SURF_COPY_BLT).
> > > - * When the object is referred, lmem content will be swaped in
> > > along
> > > with
> > > - * restoration of the CCS data (using XY_CTRL_SURF_COPY_BLT) at
> > > corresponding
> > > - * location.
> > > - *
> > > - *
> > > - * Flat-CCS Modifiers for different compression formats
> > > - * ----------------------------------------------------
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS - used to indicate the
> > > buffers
> > > of Flat CCS
> > > - * render compression formats. Though the general layout is same
> > > as
> > > - * I915_FORMAT_MOD_Y_TILED_GEN12_RC_CCS, new hashing/compression
> > > algorithm is
> > > - * used. Render compression uses 128 byte compression blocks
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_MC_CCS -used to indicate the
> > > buffers
> > > of Flat CCS
> > > - * media compression formats. Though the general layout is same
> > > as
> > > - * I915_FORMAT_MOD_Y_TILED_GEN12_MC_CCS, new hashing/compression
> > > algorithm is
> > > - * used. Media compression uses 256 byte compression blocks.
> > > - *
> > > - * I915_FORMAT_MOD_F_TILED_DG2_RC_CCS_CC - used to indicate the
> > > buffers of Flat
> > > - * CCS clear color render compression formats. Unified
> > > compression
> > > format for
> > > - * clear color render compression. The genral layout is a tiled
> > > layout using
> > > - * 4Kb tiles i.e Tile4 layout.
> > > - */
> > > -
> > > -static inline u32 *i915_flush_dw(u32 *cmd, u64 dst, u32 flags)
> > > -{
> > > -       /* Mask the 3 LSB to use the PPGTT address space */
> > > -       *cmd++ = MI_FLUSH_DW | flags;
> > > -       *cmd++ = lower_32_bits(dst);
> > > -       *cmd++ = upper_32_bits(dst);
> > > -
> > > -       return cmd;
> > > -}
> > > -
> > > -static u32 calc_ctrl_surf_instr_size(struct drm_i915_private
> > > *i915,
> > > int size)
> > > -{
> > > -       u32 num_cmds, num_blks, total_size;
> > > -
> > > -       if (!GET_CCS_SIZE(i915, size))
> > > -               return 0;
> > > -
> > > -       /*
> > > -        * XY_CTRL_SURF_COPY_BLT transfers CCS in 256 byte
> > > -        * blocks. one XY_CTRL_SURF_COPY_BLT command can
> > > -        * trnasfer upto 1024 blocks.
> > > -        */
> > > -       num_blks = GET_CCS_SIZE(i915, size);
> > > -       num_cmds = (num_blks + (NUM_CCS_BLKS_PER_XFER - 1)) >>
> > > 10;
> > > -       total_size = (XY_CTRL_SURF_INSTR_SIZE) * num_cmds;
> > > -
> > > -       /*
> > > -        * We need to add a flush before and after
> > > -        * XY_CTRL_SURF_COPY_BLT
> > > -        */
> > > -       total_size += 2 * MI_FLUSH_DW_SIZE;
> > > -       return total_size;
> > > -}
> > > -
> > > -static u32 *_i915_ctrl_surf_copy_blt(u32 *cmd, u64 src_addr, u64
> > > dst_addr,
> > > -                                    u8 src_mem_access, u8
> > > dst_mem_access,
> > > -                                    int src_mocs, int dst_mocs,
> > > -                                    u16 num_ccs_blocks)
> > > -{
> > > -       int i = num_ccs_blocks;
> > > -
> > > -       /*
> > > -        * The XY_CTRL_SURF_COPY_BLT instruction is used to copy
> > > the
> > > CCS
> > > -        * data in and out of the CCS region.
> > > -        *
> > > -        * We can copy at most 1024 blocks of 256 bytes using one
> > > -        * XY_CTRL_SURF_COPY_BLT instruction.
> > > -        *
> > > -        * In case we need to copy more than 1024 blocks, we need
> > > to
> > > add
> > > -        * another instruction to the same batch buffer.
> > > -        *
> > > -        * 1024 blocks of 256 bytes of CCS represent a total
> > > 256KB of
> > > CCS.
> > > -        *
> > > -        * 256 KB of CCS represents 256 * 256 KB = 64 MB of LMEM.
> > > -        */
> > > -       do {
> > > -               /*
> > > -                * We use logical AND with 1023 since the size
> > > field
> > > -                * takes values which is in the range of 0 - 1023
> > > -                */
> > > -               *cmd++ = ((XY_CTRL_SURF_COPY_BLT) |
> > > -                         (src_mem_access <<
> > > SRC_ACCESS_TYPE_SHIFT) |
> > > -                         (dst_mem_access <<
> > > DST_ACCESS_TYPE_SHIFT) |
> > > -                         (((i - 1) & 1023) << CCS_SIZE_SHIFT));
> > > -               *cmd++ = lower_32_bits(src_addr);
> > > -               *cmd++ = ((upper_32_bits(src_addr) & 0xFFFF) |
> > > -                         (src_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > -               *cmd++ = lower_32_bits(dst_addr);
> > > -               *cmd++ = ((upper_32_bits(dst_addr) & 0xFFFF) |
> > > -                         (dst_mocs << XY_CTRL_SURF_MOCS_SHIFT));
> > > -               src_addr += SZ_64M;
> > > -               dst_addr += SZ_64M;
> > > -               i -= NUM_CCS_BLKS_PER_XFER;
> > > -       } while (i > 0);
> > > -
> > > -       return cmd;
> > > -}
> > > -
> > >  static int emit_clear(struct i915_request *rq,
> > >                       u64 offset,
> > >                       int size,
> > 

----------------------------------------------------------------------
Intel Sweden AB
Registered Office: Isafjordsgatan 30B, 164 40 Kista, Stockholm, Sweden
Registration Number: 556189-6027

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


More information about the Intel-gfx mailing list