[PATCH v3 00/15] CCS static load balance
Daniel Vetter
daniel.vetter at ffwll.ch
Mon Sep 2 11:13:33 UTC 2024
On Wed, Aug 28, 2024 at 05:35:45PM +0200, Andi Shyti wrote:
> Hi Sima,
>
> On Wed, Aug 28, 2024 at 03:47:21PM +0200, Daniel Vetter wrote:
> > On Wed, Aug 28, 2024 at 10:20:15AM +0200, Andi Shyti wrote:
> > > Hi Sima,
> > >
> > > first of all, thanks for looking into this series.
> > >
> > > On Tue, Aug 27, 2024 at 07:31:21PM +0200, Daniel Vetter wrote:
> > > > On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote:
> > > > > Hi,
> > > > >
> > > > > This patch series introduces static load balancing for GPUs with
> > > > > multiple compute engines. It's a lengthy series, and some
> > > > > challenging aspects still need to be resolved.
> > > >
> > > > Do we have an actual user for this, where just reloading the entire driver
> > > > (or well, rebinding, if you only want to change the value for a specific
> > > > device) with a new module option isn't enough?
> > >
> > > Yes, we have users for this, and this has already been agreed with
> > > architects and maintainers.
> >
> > So my understanding is that for upstream, this only applies to dg2,
> > because the other platforms don't have enough CCS engines to make this a
> > real issue.
> >
> > Do we really have upstream demand for this feature on dg2 only?
>
> That's my understanding.
>
> > Also how hard would it be to make these users happy with xe-on-dg2 in
> > upstream instead?
>
> I don't know this, I think the user is already on i915.
>
> > > Why are you saying that we are reloading/rebinding the driver?
> >
> > That's the other alternate solution.
>
> But that's not how XE does it, though.
>
> The use case is that userspace has an environment variable that
> they change on demand to choose the CCS mode. They want to change
> the value of that variable on the fly and, as we are only adding
> or removing a few engines, this is done without reprobing the
> whole driver.
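
For illustration only, here is a minimal sketch of how such an on-the-fly
switch could look on the kernel side, assuming a per-GT "ccs_mode" sysfs-style
knob; the example_gt type and the helpers gt_is_idle() and apply_ccs_mode()
are hypothetical stand-ins, not the actual code from this series:

	/* Hypothetical sysfs store callback for a per-GT "ccs_mode" knob. */
	static ssize_t ccs_mode_store(struct device *dev,
				      struct device_attribute *attr,
				      const char *buf, size_t count)
	{
		struct example_gt *gt = dev_get_drvdata(dev); /* illustrative type */
		u32 mode;
		int err;

		err = kstrtou32(buf, 0, &mode);
		if (err)
			return err;

		/* Only a handful of CCS modes make sense; sanity-check the request. */
		if (mode == 0 || mode > gt->num_ccs_engines)
			return -EINVAL;

		mutex_lock(&gt->ccs_mode_lock);		/* serialize against readers */
		if (!gt_is_idle(gt)) {			/* hypothetical: is a wakeref held? */
			mutex_unlock(&gt->ccs_mode_lock);
			return -EBUSY;			/* refuse while the GPU is in use */
		}
		apply_ccs_mode(gt, mode);		/* hypothetical: re-expose engines */
		mutex_unlock(&gt->ccs_mode_lock);

		return count;
	}

The point of the sketch is only that the mode change is rejected outright
unless the GT is idle, so no running context ever observes the engine set
changing underneath it.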
>
> In a previous implementation (from which both I and Niranjana, for
> XE, took inspiration) the CCS mode was passed during compute
> execbuf.
>
> > > I'm only removing the exposure of user engines, which is
> > > basically a flag in the engines data structure.
> > >
> > > > There's some really gnarly locking and lifetime fun in there, and it needs
> > > > a corresponding justification.
> > >
> > > What locking are you referring to?
> > >
> > > I only added one single mutex that has a comment and a
> > > justification. If you think that's not enough, I can of course
> > > improve it (please note that the changes have a good amount of
> > > comments and I tried to be as descriptive as I could).
> > >
> > > When I change the engine configuration, only for the compute
> > > engines and only on DG2 platforms, I need to make sure that no
> > > other user is affected by the change. Thus I need to make sure
> > > that access to some of the structures is properly serialized.
> > >
> > > > Which needs to be enormous for this case,
> > > > meaning actual customers willing to shout on dri-devel that they really,
> > > > absolutely need this, or their machines will go up in flames.
> > > > Otherwise this is a nack from me.
> > >
> > > Would you please tell me why you are nacking the patch, so that
> > > I can address your comments for v4?
> >
> > So for one, this is substantially more flexible than the solution merged
> > into xe. And the patch set doesn't explain why (the commit messages
> > actually describe the design xe has).
>
> I think in XE we might have missed a few things, and my plan is to
> check the XE implementation once I'm done with i915 (I was one of
> the XE reviewers). Also, many of the things in XE are so different
> that the solution can't be taken as it is.
Please fix XE first. XE is supposed to lead here with all new platform
support and uapi additions; i915-gem is supposed to be completely in
feature freeze. And exceptions need really good reasons.
> > That does not inspire confidence at all.
>
> Consider that most of the patches are refactoring; only the last
> patch does the real job. That's because the first workaround was
> already merged a while ago, while XE didn't need the refactorings
> I made.
>
> > Second, I don't think anyone understands the entire engine/ctx locking
> > design in i915-gem. And the fix for that was to make as much as absolutely
> > possible immutable. Yes the implementation looks correct, but when I
> > looked at the much, much simpler xe implementation I'm pretty sure I've
> > found an issue there too. Here I can't even tell.
>
> The locking is fairly simple: when the user wants to set a
> specific CCS mode, I take the wakeref lock and check that no one
> is holding it. This way I am sure that I am the only user of the
> GPU (if the wakeref is not held, the GPU is off).
>
> I added one single lock to be used around for_each_uabi_engine.
> It's not strictly required, but I really want to be sure that I am
> not changing the CCS mode while someone else is using the uabi
> engines.
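
As a rough illustration of that serialization (again with hypothetical
names, not the actual patch): any path that walks the uabi engine list
takes the same mutex the CCS-mode update path holds, so the set of
exposed engines cannot change while it is being iterated.

	/*
	 * Illustrative reader side: count the engines currently exposed to
	 * userspace while holding the same lock the CCS-mode writer takes.
	 */
	static unsigned int count_exposed_engines(struct example_device *i915)
	{
		struct example_engine *engine;
		unsigned int count = 0;

		mutex_lock(&i915->uabi_engines_lock);	/* hypothetical lock from the series */
		list_for_each_entry(engine, &i915->uabi_engines_list, uabi_link)
			if (engine->exposed)		/* hypothetical per-engine flag */
				count++;
		mutex_unlock(&i915->uabi_engines_lock);

		return count;
	}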
>
> I'm also adding Joonas in Cc, with whom I discussed many details
> of the implementation. I would really appreciate knowing what
> exactly is wrong here and what changes are needed to get the
> series merged.
This is i915-gem. Locking is never simple here, and I think that entire
journey on the mmap confusion shows how little people understand just how
much complexity, confusion, and i915-gem special-casing there really is.
You're really not making a good case that you understand this stuff here,
which really doesn't make me enthusiastic about considering an uapi
extension for i915-gem.
Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch