[PATCH v3 00/15] CCS static load balance

Daniel Vetter daniel.vetter at ffwll.ch
Tue Aug 27 17:31:21 UTC 2024


On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> Hi,
> 
> This patch series introduces static load balancing for GPUs with
> multiple compute engines. It's a lengthy series, and some
> challenging aspects still need to be resolved.

Do we have an actual user for this, where just reloading the entire driver
(or well, rebinding, if you only want to change the value for a specific
device) with a new module option isn't enough?

There's some really gnarly locking and lifetime fun in there, and it needs
a corresponding justification. Which needs to be enormous for this case,
meaning actual customers willing to shout on dri-devel that they really,
absolutely need this, or their machines will go up in flames.

Otherwise this is a nack from me.

Thanks, Sima

> 
> I have tried to split the work as much as possible to facilitate
> the review process.
> 
> To summarize, in patches 1 to 14, no functional changes occur
> except for the addition of the num_cslices interface. The
> significant changes happen in patch 15, which is the core part of
> the CCS mode setting, utilizing the groundwork laid in the
> earlier patches.
> 
> In this updated approach, the focus is now on managing the UABI
> engine list, which controls the engines exposed to userspace.
> Instead of manipulating physical engines and their memory, we now
> handle engine exposure through this list.
> 
> I would greatly appreciate further input from all reviewers who
> have already assisted with the previous work.
> 
> IGT tests have also been developed, but I haven't sent them yet.
> 
> Thank you Chris for the offline reviews.
> 
> Thanks,
> Andi
> 
> Changelog:
> ==========
> PATCHv2 -> PATCHv3
> ------------------
>  - Fix a NULL pointer dereference during module unload.
>    In i915_gem_driver_remove() I was accessing the gt after the
>    gt was removed. Use the dev_priv, instead (obviously!).
>  - Fix a lockdep issue: Some of the uabi_engines_mutex unlocks
>    were not correctly placed in the exit paths.
>  - Fix a checkpatch error for spaces after and before parentheses
>    in the for_each_enabled_engine() definition.
> 
> PATCHv1 -> PATCHv2
> ------------------
>  - Use uabi_mutex to protect the uabi_engines, not the engine
>    itself. Rename it to uabi_engines_mutex.
>  - Use kobject_add/kobject_del for adding and removing
>    interfaces, this way we don't need to destroy and recreate the
>    engines, anymore. Refactor intel_engine_add_single_sysfs() to
>    reflect this scenario.
>  - After adding engines to the rb_tree check that they have been
>    added correctly.
>  - Fix rb_find_add() compare function to also take the class
>    into account, not just the instance.
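
For illustration only, a class-then-instance ordering like the one described in the last item could be sketched in plain userspace C as below. This is not the code from the series: struct uabi_key and uabi_key_cmp() are hypothetical names, and the real compare callback works on rb_node entries rather than on a bare key.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the uabi identity of an engine. */
struct uabi_key {
	uint16_t uabi_class;
	uint16_t uabi_instance;
};

/*
 * Order by class first and by instance second. Returns a negative
 * value, zero or a positive value, which is the shape a tree compare
 * callback usually expects. Comparing the instance alone would make
 * two engines of different classes with the same instance look equal.
 */
static int uabi_key_cmp(const struct uabi_key *a, const struct uabi_key *b)
{
	assert(a && b);

	if (a->uabi_class != b->uabi_class)
		return a->uabi_class < b->uabi_class ? -1 : 1;
	if (a->uabi_instance != b->uabi_instance)
		return a->uabi_instance < b->uabi_instance ? -1 : 1;
	return 0;
}
```

With this ordering, two keys compare equal only when both class and instance match, so a find-or-add helper would reject a true duplicate without confusing engines of different classes.
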
> 
> RFCv2 -> PATCHv1
> ----------------
>  - Removed gt->ccs.mutex
>  - Rename m -> width, ccs_id -> engine in
>    intel_gt_apply_ccs_mode().
>  - In the CCS register value calculation
>    (intel_gt_apply_ccs_mode()) the engine (ccs_id) needs to move
>    along the ccs_mask (set by the user) instead of the
>    cslice_mask.
>  - Add GEM_BUG_ON after calculating the new ccs_mask
>    (update_ccs_mask()) to make sure all engines have been
>    evaluated (i.e. ccs_mask must be '0' at the end of the
>    algorithm).
>  - Move wakeref lock before evaluating intel_gt_pm_is_awake() and
>    fix exit path accordingly.
>  - Use a more compact form in intel_gt_sysfs_ccs_init() and
>    add_uabi_ccs_engines() when evaluating sysfs_create_file(): no
>    need to store the return value to the err variable which is
>    unused. Get rid of err.
>  - Print a warning instead of a debug message if we fail to
>    create the sysfs files.
>  - If engine files creation fails in
>    intel_engine_add_single_sysfs(), print a warning, not an
>    error.
>  - Rename gt->ccs.ccs_mask to gt->ccs.id_mask and add a comment
>    to explain its purpose.
>  - During uabi engine creation, in
>    intel_engines_driver_register(), the uabi_ccs_instance is
>    redundant because the ccs_instances is already tracked in
>    engine->uabi_instance.
>  - Mark add_uabi_ccs_engines() and remove_uabi_ccs_engines() as
>    __maybe_unused not to break bisectability. They wouldn't
>    compile in their own commit. They will be used in the next
>    patch and the __maybe_unused is removed.
>  - Update engine's workaround every time a new mode is set in
>    update_ccs_mask().
>  - Mark engines as valid or invalid using their status as
>    rb_node. Invalid engines are marked as invalid using
>    RB_CLEAR_NODE(). Execbufs will check for their validity when
>    selecting the engine to be combined to a context.
>  - Create for_each_enabled_engine() which skips the invalid
>    engines and use it in selftests.
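
As a rough userspace sketch of the last two items, an iterator that skips engines marked invalid might look like the following. Everything here is an assumption made for illustration: struct engine, its valid flag, next_enabled() and for_each_enabled() are hypothetical, and the series tracks validity through the rb_node state (RB_CLEAR_NODE()) rather than a boolean.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical engine record; 'valid' stands in for the rb_node state. */
struct engine {
	const char *name;
	bool valid;
};

/* Return the index of the first valid engine at or after i, or n if none. */
static size_t next_enabled(const struct engine *e, size_t n, size_t i)
{
	while (i < n && !e[i].valid)
		i++;
	return i;
}

/* Visit only the valid engines of an array of n entries. */
#define for_each_enabled(i, e, n) \
	for ((i) = next_enabled((e), (n), 0); \
	     (i) < (n); \
	     (i) = next_enabled((e), (n), (i) + 1))

/* Example consumer: count the engines the iterator visits. */
static size_t count_enabled(const struct engine *e, size_t n)
{
	size_t i, count = 0;

	for_each_enabled(i, e, n)
		count++;
	return count;
}
```

Callers that loop with such a macro never see an engine that has been taken out of the exposed set, which is the property the selftests would rely on.
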
> 
> RFCv1 -> RFCv2
> --------------
> Compared to the first version I've taken a completely different
> approach to adding and removing engines. In v1 physical engines
> were directly added and removed, along with the memory allocated
> to them, each time the user changed the CCS mode (from the
> previous cover letter).
> 
> Andi Shyti (15):
>   drm/i915/gt: Avoid using masked workaround for CCS_MODE setting
>   drm/i915/gt: Move the CCS mode variable to a global position
>   drm/i915/gt: Allow the creation of multi-mode CCS masks
>   drm/i915/gt: Refactor uabi engine class/instance list creation
>   drm/i915/gem: Mark and verify UABI engine validity
>   drm/i915/gt: Introduce for_each_enabled_engine() and apply it in
>     selftests
>   drm/i915/gt: Manage CCS engine creation within UABI exposure
>   drm/i915/gt: Remove cslices mask value from the CCS structure
>   drm/i915/gt: Expose the number of total CCS slices
>   drm/i915/gt: Store engine-related sysfs kobjects
>   drm/i915/gt: Store active CCS mask
>   drm/i915: Protect access to the UABI engines list with a mutex
>   drm/i915/gt: Isolate single sysfs engine file creation
>   drm/i915/gt: Implement creation and removal routines for CCS engines
>   drm/i915/gt: Allow the user to change the CCS mode through sysfs
> 
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   |   3 +
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  28 +-
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  23 --
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  |   2 +
>  drivers/gpu/drm/i915/gt/intel_engine_user.c   |  62 ++-
>  drivers/gpu/drm/i915/gt/intel_gt.c            |   3 +
>  drivers/gpu/drm/i915/gt/intel_gt.h            |  12 +
>  drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.c   | 353 +++++++++++++++++-
>  drivers/gpu/drm/i915/gt/intel_gt_ccs_mode.h   |   5 +-
>  drivers/gpu/drm/i915/gt/intel_gt_sysfs.c      |   2 +
>  drivers/gpu/drm/i915/gt/intel_gt_types.h      |  19 +-
>  drivers/gpu/drm/i915/gt/intel_workarounds.c   |   8 +-
>  drivers/gpu/drm/i915/gt/selftest_context.c    |   6 +-
>  drivers/gpu/drm/i915/gt/selftest_engine_cs.c  |   4 +-
>  .../drm/i915/gt/selftest_engine_heartbeat.c   |   6 +-
>  drivers/gpu/drm/i915/gt/selftest_engine_pm.c  |   6 +-
>  drivers/gpu/drm/i915/gt/selftest_execlists.c  |  52 +--
>  drivers/gpu/drm/i915/gt/selftest_gt_pm.c      |   2 +-
>  drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |  22 +-
>  drivers/gpu/drm/i915/gt/selftest_lrc.c        |  18 +-
>  drivers/gpu/drm/i915/gt/selftest_mocs.c       |   6 +-
>  drivers/gpu/drm/i915/gt/selftest_rc6.c        |   4 +-
>  drivers/gpu/drm/i915/gt/selftest_reset.c      |   8 +-
>  .../drm/i915/gt/selftest_ring_submission.c    |   2 +-
>  drivers/gpu/drm/i915/gt/selftest_rps.c        |  14 +-
>  drivers/gpu/drm/i915/gt/selftest_timeline.c   |  14 +-
>  drivers/gpu/drm/i915/gt/selftest_tlb.c        |   2 +-
>  .../gpu/drm/i915/gt/selftest_workarounds.c    |  14 +-
>  drivers/gpu/drm/i915/gt/sysfs_engines.c       |  79 ++--
>  drivers/gpu/drm/i915/gt/sysfs_engines.h       |   2 +
>  drivers/gpu/drm/i915/i915_cmd_parser.c        |   2 +
>  drivers/gpu/drm/i915/i915_debugfs.c           |   4 +
>  drivers/gpu/drm/i915/i915_drv.h               |   5 +
>  drivers/gpu/drm/i915/i915_gem.c               |   4 +
>  drivers/gpu/drm/i915/i915_perf.c              |   8 +-
>  drivers/gpu/drm/i915/i915_pmu.c               |  11 +-
>  drivers/gpu/drm/i915/i915_query.c             |  21 +-
>  37 files changed, 643 insertions(+), 193 deletions(-)
> 
> -- 
> 2.45.2
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

