[Intel-gfx] [PATCH 0/1] Fix gem_ctx_persistence failures with GuC submission

Daniel Vetter daniel at ffwll.ch
Wed Aug 18 09:49:38 UTC 2021


On Tue, Aug 17, 2021 at 05:08:02PM -0700, John Harrison wrote:
> On 8/9/2021 23:38, Daniel Vetter wrote:
> > On Wed, Jul 28, 2021 at 05:33:59PM -0700, Matthew Brost wrote:
> > > Should fix below failures with GuC submission for the following tests:
> > > gem_exec_balancer --r noheartbeat
> > > gem_ctx_persistence --r heartbeat-close
> > > 
> > > Not going to fix:
> > > gem_ctx_persistence --r heartbeat-many
> > > gem_ctx_persistence --r heartbeat-stop
> > After looking at that big thread and being very confused: Are we fixing an
> > actual use-case here, or is this another case of blindly following igts
> > tests just because they exist?
> My understanding is that this is established behaviour and therefore must be
> maintained because the UAPI (whether documented or not) is inviolate.
> Therefore IGTs have been written to validate this past behaviour and now we
> must conform to the IGTs in order to keep the existing behaviour unchanged.

No, we do not need to blindly conform to igts. We've found enough examples
in the past few months where the igt tests where just testing stuff
because it's possible, not because any UMD actually needs the behaviour.

And drm subsystem rules are very clear that low-level tests do _not_
qualify as userspace, so if they're wrong we just have to fix them.

> Whether anybody actually makes use of this behaviour or not is another
> matter entirely. I am certainly not aware of any vital use case. Others
> might have more recollection. I do know that we tell the UMD teams to
> explicitly disable persistence on every context they create.

Does that include mesa?

> > I'm leaning towards that we should stall on this, and first document what
> > exactly is the actual intention behind all this, and then fix up the tests
> I'm not sure there ever was an 'intention'. The rumour I heard way back when
> was that persistence was a bug on earlier platforms (or possibly we didn't
> have hardware support for doing engine resets?). But once the bug was
> realised (or the hardware support was added), it was too late to change the
> default behaviour because existing kernel behaviour must never change on
> pain of painful things. Thus the persistence flag was added so that people
> could opt out of the broken, leaky behaviour and have their contexts clean
> up properly.
> 
> Feel free to document what you believe should be the behaviour from a
> software architect point of view. Any documentation I produce is basically
> going to be created by reverse engineering the existing code. That is the
> only 'spec' that I am aware of and as I keep saying, I personally think it
> is a totally broken concept that should just be removed.

There is most likely no spec except "what does current userspace actually
expect". Yes this sucks. Also if you expect me to do this, I'm backlogged
by a few months on random studies here, and largely this boils down to
checking all the umds and checking what they actually need.

Important: What igt does doesn't matter if there's not a corresponding
real world umd use-case.

> > to match (if needed). And only then fix up GuC to match whatever we
> > actually want to do.
> I also still maintain there is no 'fix up the GuC'. This is not behaviour we
> should be adding to a hardware scheduler. It is behaviour that should be
> implemented at the front end not the back end. If we absolutely need to do
> this then we need to do it solely at the context management level not at the
> back end submission level. And the solution should work by default on any
> submission back end.

With "Fix up GuC" I dont mean necessarily the guc fw, but our entire
backend. We can very much fix that to fix most anything reasonable.

Also we don't actually need the same solution on all backends, because the
uapi can have slight differences across platforms. That's why changing the
defaults is so hard once they're set in stone.
-Daniel

> 
> John.
> 
> 
> > -Daniel
> > 
> > > As the above tests change the heartbeat value to 0 (off) after the
> > > context is closed and we have no way to detect that with GuC submission
> > > unless we keep a list of closed but running contexts which seems like
> > > overkill for a non-real world use case. We likely should just skip these
> > > tests with GuC submission.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > > 
> > > Matthew Brost (1):
> > >    drm/i915: Check if engine has heartbeat when closing a context
> > > 
> > >   drivers/gpu/drm/i915/gem/i915_gem_context.c   |  5 +++--
> > >   drivers/gpu/drm/i915/gt/intel_context_types.h |  2 ++
> > >   drivers/gpu/drm/i915/gt/intel_engine.h        | 21 ++-----------------
> > >   .../drm/i915/gt/intel_execlists_submission.c  | 14 +++++++++++++
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 +++++-
> > >   .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 --
> > >   6 files changed, 26 insertions(+), 24 deletions(-)
> > > 
> > > -- 
> > > 2.28.0
> > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


More information about the Intel-gfx mailing list