[Intel-gfx] [PATCH v3 21/22] drm/atomic: Introduce drm_atomic_helper_duplicate_commited_state()

Mon Jul 10 13:56:10 UTC 2017

On Mon, Jul 10, 2017 at 03:26:18PM +0200, Maarten Lankhorst wrote:
> Op 10-07-17 om 14:18 schreef Ville Syrjälä:
> > On Mon, Jul 10, 2017 at 11:31:55AM +0200, Maarten Lankhorst wrote:
> >> Op 10-07-17 om 08:43 schreef Daniel Vetter:
> >>> On Fri, Jul 07, 2017 at 06:18:12PM +0300, Ville Syrjälä wrote:
> >>>> On Fri, Jul 07, 2017 at 04:05:28PM +0200, Daniel Vetter wrote:
> >>>>> On Fri, Jul 7, 2017 at 3:21 PM, Ville Syrjälä
> >>>>> <ville.syrjala at linux.intel.com> wrote:
> >>>>>> On Fri, Jul 07, 2017 at 02:03:38PM +0200, Daniel Vetter wrote:
> >>>>>>> On Thu, Jul 06, 2017 at 11:24:41PM +0300, ville.syrjala at linux.intel.com wrote:
> >>>>>>>> From: Ville Syrjälä <ville.syrjala at linux.intel.com>
> >>>>>>>>
> >>>>>>>> For i915 GPU reset handling we'll want to be able to duplicate the state
> >>>>>>>> that was last commited to the hardware. For that purpose let's start to
> >>>>>>>> track the commited state for each object and provide a way to duplicate
> >>>>>>>> the commmited state into a new drm_atomic_state. The locking for
> >>>>>>>> .commited_state must to be provided by the driver.
> >>>>>>>>
> >>>>>>>> drm_atomic_helper_duplicate_commited_state() duplicates the state
> >>>>>>>> to both old_state and new_state. For the purposes of i915 GPU reset we
> >>>>>>>> would only need one of them, but we actually need two top level states;
> >>>>>>>> one for disabling everything (which would need the duplicated state to
> >>>>>>>> be old_state), and another to reenable everything (which would need the
> >>>>>>>> duplicated state to be new_state). So to make it less comples I figured
> >>>>>>>> I'd just always duplicate both. Might want to rethink this if for no
> >>>>>>>> other reason that reducing the chances of memory allocation failure.
> >>>>>>>> Due to the double state duplication we need
> >>>>>>>> drm_atomic_helper_clean_commited_state() to clean up the duplicated
> >>>>>>>> old_state since that's not handled by the normal drm_atomic_state
> >>>>>>>> cleanup code.
> >>>>>>>>
> >>>>>>>> TODO: do we want this in the helper, or maybe it should be just in i915?
> >>>>>>>>
> >>>>>>>> v2: s/commited/committed/ everywhere (checkpatch)
> >>>>>>>>     Handle state duplication errors better
> >>>>>>>> v3: Even more care in dealing with memory allocation errors
> >>>>>>>>     Handle private objs too
> >>>>>>>>     Deal with the potential ordering issues between swap_state()
> >>>>>>>>     and hw_done() by keeping track of which state was swapped in
> >>>>>>>>     last
> >>>>>>>>
> >>>>>>>> Signed-off-by: Ville Syrjälä <ville.syrjala at linux.intel.com>
> >>>>>>> I still don't get why we need to duplicate the committed state for gpu
> >>>>>>> reset. When I said I'm not against adding all that complexity long-term I
> >>>>>>> meant when we actually really need it. Imo g4x gpu reset isn't a good
> >>>>>>> justification for that, reworking the atomic world for that seems
> >>>>>>> massively out of proportion.
> >>>>>> Well, I still don't see what's so "massive" about a couple of extra state
> >>>>>> pointers hanging around.
> >>>>>>
> >>>>>> Also while doing that state duplication stuff, my old idea of
> >>>>>> splitting the crtc disable and enable phases into separate atomic
> >>>>>> commits popped up again in my head. For that being able to duplicate
> >>>>>> arbitrary states would seem like a nice thing to have. So the
> >>>>>> refactoring I had to do can have other uses.
> >>>>> I fully realize that you're unhappy with how atomic ended up getting
> >>>>> merged and that you think it's a grave mistake. And maybe it is, maybe
> >>>>> it isn't, I have no idea.
> >>>> I don't think I ever said that. I've said that it has certain design
> >>>> problems that we ought to fix. This one being one, and another being
> >>>> to separate the user state from the internal state. The latter I think
> >>>> we'll have to tackle rather soon thanks to some new hardware in the
> >>>> pipeline, or we need to come up with some other way to achieve the
> >>>> same effect.
> >>>>
> >>>>> But right now we have _nothing_ asking for
> >>>>> that reorg afaik, and using gen4 reset to justify it is in my opinion
> >>>>> simply not solid engineering practice. Maybe we need this in the
> >>>>> future, and then we can add it, but not before. Refactoring stuff to
> >>>>> prettify the architecture isn't really useful work.
> >>>> Neither is having to throw out code that already exists and works. If
> >>>> you're so worried about time being wasted on pre-g4x GPU reset, then
> >>>> we could just as well merge my code and move on to more productive
> >>>> endeavors.
> >>> I'm worried about future time wasted on this, not current time.
> >>>
> >>>>>>> Why exactly can't we do this simpler? I still don't get that part. Is
> >>>>>>> there really no solution that doesn't break atomic's current assumption
> >>>>>>> that commits are fully ordered on a given crtc?
> >>>>>> From the point of view of the old and new states it doesn't actually
> >>>>>> break that. The commits done from the reset path are essentially
> >>>>>> invisible to the normal modeset operation.
> >>>>> You insert something into a fully ordered queue. That does break the
> >>>>> entire concept and needs a pile of locks and stuff to make it work.
> >>>> Exactly one lock. Well two if you could the spinlock to protect the
> >>>> committed_state pointer update from parallel updates touching the same
> >>>> kms object. That latter one could be removed if atomic wouldn't allow
> >>>> parallel commits to touch the same object.
> >>>>
> >>>>> Yes it's doable, but it's a redesign with all the implications of
> >>>>> subtle breakage all over.
> >>>> What? It doesn't really even do anything unless you do the
> >>>> duplicate_committed state(). Everything else is just assigning pointers.
> >>>> So unless there's some really obvious bug somewhere it can't break
> >>>> anything outside the GPU reset path. And really the only way to break
> >>>> to GPU reset path is to have actual bugs in the normal display commit
> >>>> code.
> >>> It's the gpu reset I'm worried about. There's no point in fixing it if it
> >>> immediately breaks again.
> >>>
> >>>>> While we do have open bugs for the current
> >>>>> design. Rewriting the world to fix a bug needs seriously better
> >>>>> justification imo.
> >>>>>
> >>>>>> The one alternative proposed idea of allowing gem and kms sides go
> >>>>>> out of whack scares me a bit. I think that might land us in more
> >>>>>> trouble when I finally get around to making the video overlay a
> >>>>>> drm_plane.
> >>>>> We've run perfectly fine with this idea for years.
> >>>> Not perfectly. I've had to fix it several times. And I don't think I was
> >>>> the only one.
> >>> The problem is that no one tests against gen4, and everyone forgets that
> >>> it exists. That's why I'd like something with the minimal amount of
> >>> invasiveness, since that would at least be easier to patch up when we
> >>> inevitably break it. Also, something entirely contained to i915
> >>> conceptually, without imposing more restrictions on shared code.
> >>>>>> And I think trying to keep the GPU reset paths as similar as possible
> >>>>>> between all the platforms would be a nice thing. Just whacking
> >>>>>> everything on the head with a hammer on one platform but not on
> >>>>>> another one seems to me like extra variation in behaviour that we
> >>>>>> don't necessarily want.
> >>>>>>
> >>>>>> But like I said, if someone can come up with a better solution I
> >>>>>> probably wouldn't object too much. It's not going to be coming from me
> >>>>>> though since I have plenty of other things to do and vacation time is
> >>>>>> coming up very soon. So unless someone else comes up with something nice
> >>>>>> soon I think we should just go with my solution because a) it's already
> >>>>>> available, and b) works quite decently from what I can see.
> >>>>> I guess I'll have to retype the old thing in the new world, but it
> >>>>> really shouldn't be more than the quick draft I've laid down in the
> >>>>> old thread. This here is imo no-go with all the core changes, and even
> >>>>> just done within i915 I think it's highly dubious that it provides a
> >>>>> real benefit, since defacto it means we'll have to abandon the atomic
> >>>>> helpers entirely.
> >>>> Now I think you're just being difficult for the sake of it. Have you
> >>>> looked at the code at all? It's fully done from the atomic helpers
> >>>> right now. And even moving the committed state tracking to i915
> >>>> wouldn't require abandoning the helpers. We could just update the
> >>>> committed state pointers when we call hw_done(), and we'd have to
> >>>> update the state seqno/age timestamp when we call swap_state().
> >>>> That's all there is to this.
> >>> I'm concerned with the maintainenace burden you impose on all future i915
> >>> atomic work, that's all. Yes it looks simple to you and right now, but
> >>> it's another little thing to keep working, and we're really good at
> >>> breaking stuff all the time.
> >>>
> >>> But if you strongly think this is the best possible approach overall,
> >>> taking into account long-term impact, then go ahead with implementing this
> >>> in i915. Adding it the concept of a committed state and being able to
> >>> duplicate that and squeeze another commit in to the shared atomic helpers
> >>> doesn't make sense imo.
> >>> -Daniel
> >> I think the problem is about struct_mutex usage by atomic commit during reset.
> >> GPU reset has to wait for all previous atomic updates to complete, but
> >> cleanup_planes and prepare_plane_fb both require struct_mutex, which can lead
> >> to a deadlock. #99093
> > The deadlocks I've seen recently didn't necessarily involve
> > struct_mutex IIRC. Just the modeset locks.
> >
> >> The real fix is not taking struct_mutex during atomic commit, which will mean
> >> no deadlock can happen.
> >>
> >> Is this the bug being fixed here or am I missing something?
> > This would avoid both struct_mutex and modeset locks in the display
> > reset path, so I guess it should help with struct_mutex issues
> > as well.
> >
> I think fixing i915 to not require struct_mutex for vma pinning/unpinning will be a better use of our time, and it should also fix all deadlocks. :)

It won't help with the modeset locks.

> 
> And it's far better than duplicating drm_atomic_commit functionality in our reset handlers.

Well I'm not actually duplicating anything. Just reusing what's
already there. We just skip all the useless stuff and that allows
us to skip the locking which is where all the problems have been
up to now.

And it means we won't allow gem and kms to get totally out of sync,
which I think will be a good thing for stuff like the video
overlay which straddles both worlds.

-- 
Ville Syrjälä
Intel OTC