[Intel-gfx] [PATCH v3 21/22] drm/atomic: Introduce drm_atomic_helper_duplicate_commited_state()

Mon Jul 10 13:26:18 UTC 2017

Op 10-07-17 om 14:18 schreef Ville Syrjälä:
> On Mon, Jul 10, 2017 at 11:31:55AM +0200, Maarten Lankhorst wrote:
>> Op 10-07-17 om 08:43 schreef Daniel Vetter:
>>> On Fri, Jul 07, 2017 at 06:18:12PM +0300, Ville Syrjälä wrote:
>>>> On Fri, Jul 07, 2017 at 04:05:28PM +0200, Daniel Vetter wrote:
>>>>> On Fri, Jul 7, 2017 at 3:21 PM, Ville Syrjälä
>>>>> <ville.syrjala at linux.intel.com> wrote:
>>>>>> On Fri, Jul 07, 2017 at 02:03:38PM +0200, Daniel Vetter wrote:
>>>>>>> On Thu, Jul 06, 2017 at 11:24:41PM +0300, ville.syrjala at linux.intel.com wrote:
>>>>>>>> From: Ville Syrjälä <ville.syrjala at linux.intel.com>
>>>>>>>>
>>>>>>>> For i915 GPU reset handling we'll want to be able to duplicate the state
>>>>>>>> that was last commited to the hardware. For that purpose let's start to
>>>>>>>> track the commited state for each object and provide a way to duplicate
>>>>>>>> the commmited state into a new drm_atomic_state. The locking for
>>>>>>>> .commited_state must to be provided by the driver.
>>>>>>>>
>>>>>>>> drm_atomic_helper_duplicate_commited_state() duplicates the state
>>>>>>>> to both old_state and new_state. For the purposes of i915 GPU reset we
>>>>>>>> would only need one of them, but we actually need two top level states;
>>>>>>>> one for disabling everything (which would need the duplicated state to
>>>>>>>> be old_state), and another to reenable everything (which would need the
>>>>>>>> duplicated state to be new_state). So to make it less comples I figured
>>>>>>>> I'd just always duplicate both. Might want to rethink this if for no
>>>>>>>> other reason that reducing the chances of memory allocation failure.
>>>>>>>> Due to the double state duplication we need
>>>>>>>> drm_atomic_helper_clean_commited_state() to clean up the duplicated
>>>>>>>> old_state since that's not handled by the normal drm_atomic_state
>>>>>>>> cleanup code.
>>>>>>>>
>>>>>>>> TODO: do we want this in the helper, or maybe it should be just in i915?
>>>>>>>>
>>>>>>>> v2: s/commited/committed/ everywhere (checkpatch)
>>>>>>>>     Handle state duplication errors better
>>>>>>>> v3: Even more care in dealing with memory allocation errors
>>>>>>>>     Handle private objs too
>>>>>>>>     Deal with the potential ordering issues between swap_state()
>>>>>>>>     and hw_done() by keeping track of which state was swapped in
>>>>>>>>     last
>>>>>>>>
>>>>>>>> Signed-off-by: Ville Syrjälä <ville.syrjala at linux.intel.com>
>>>>>>> I still don't get why we need to duplicate the committed state for gpu
>>>>>>> reset. When I said I'm not against adding all that complexity long-term I
>>>>>>> meant when we actually really need it. Imo g4x gpu reset isn't a good
>>>>>>> justification for that, reworking the atomic world for that seems
>>>>>>> massively out of proportion.
>>>>>> Well, I still don't see what's so "massive" about a couple of extra state
>>>>>> pointers hanging around.
>>>>>>
>>>>>> Also while doing that state duplication stuff, my old idea of
>>>>>> splitting the crtc disable and enable phases into separate atomic
>>>>>> commits popped up again in my head. For that being able to duplicate
>>>>>> arbitrary states would seem like a nice thing to have. So the
>>>>>> refactoring I had to do can have other uses.
>>>>> I fully realize that you're unhappy with how atomic ended up getting
>>>>> merged and that you think it's a grave mistake. And maybe it is, maybe
>>>>> it isn't, I have no idea.
>>>> I don't think I ever said that. I've said that it has certain design
>>>> problems that we ought to fix. This one being one, and another being
>>>> to separate the user state from the internal state. The latter I think
>>>> we'll have to tackle rather soon thanks to some new hardware in the
>>>> pipeline, or we need to come up with some other way to achieve the
>>>> same effect.
>>>>
>>>>> But right now we have _nothing_ asking for
>>>>> that reorg afaik, and using gen4 reset to justify it is in my opinion
>>>>> simply not solid engineering practice. Maybe we need this in the
>>>>> future, and then we can add it, but not before. Refactoring stuff to
>>>>> prettify the architecture isn't really useful work.
>>>> Neither is having to throw out code that already exists and works. If
>>>> you're so worried about time being wasted on pre-g4x GPU reset, then
>>>> we could just as well merge my code and move on to more productive
>>>> endeavors.
>>> I'm worried about future time wasted on this, not current time.
>>>
>>>>>>> Why exactly can't we do this simpler? I still don't get that part. Is
>>>>>>> there really no solution that doesn't break atomic's current assumption
>>>>>>> that commits are fully ordered on a given crtc?
>>>>>> From the point of view of the old and new states it doesn't actually
>>>>>> break that. The commits done from the reset path are essentially
>>>>>> invisible to the normal modeset operation.
>>>>> You insert something into a fully ordered queue. That does break the
>>>>> entire concept and needs a pile of locks and stuff to make it work.
>>>> Exactly one lock. Well two if you could the spinlock to protect the
>>>> committed_state pointer update from parallel updates touching the same
>>>> kms object. That latter one could be removed if atomic wouldn't allow
>>>> parallel commits to touch the same object.
>>>>
>>>>> Yes it's doable, but it's a redesign with all the implications of
>>>>> subtle breakage all over.
>>>> What? It doesn't really even do anything unless you do the
>>>> duplicate_committed state(). Everything else is just assigning pointers.
>>>> So unless there's some really obvious bug somewhere it can't break
>>>> anything outside the GPU reset path. And really the only way to break
>>>> to GPU reset path is to have actual bugs in the normal display commit
>>>> code.
>>> It's the gpu reset I'm worried about. There's no point in fixing it if it
>>> immediately breaks again.
>>>
>>>>> While we do have open bugs for the current
>>>>> design. Rewriting the world to fix a bug needs seriously better
>>>>> justification imo.
>>>>>
>>>>>> The one alternative proposed idea of allowing gem and kms sides go
>>>>>> out of whack scares me a bit. I think that might land us in more
>>>>>> trouble when I finally get around to making the video overlay a
>>>>>> drm_plane.
>>>>> We've run perfectly fine with this idea for years.
>>>> Not perfectly. I've had to fix it several times. And I don't think I was
>>>> the only one.
>>> The problem is that no one tests against gen4, and everyone forgets that
>>> it exists. That's why I'd like something with the minimal amount of
>>> invasiveness, since that would at least be easier to patch up when we
>>> inevitably break it. Also, something entirely contained to i915
>>> conceptually, without imposing more restrictions on shared code.
>>>>>> And I think trying to keep the GPU reset paths as similar as possible
>>>>>> between all the platforms would be a nice thing. Just whacking
>>>>>> everything on the head with a hammer on one platform but not on
>>>>>> another one seems to me like extra variation in behaviour that we
>>>>>> don't necessarily want.
>>>>>>
>>>>>> But like I said, if someone can come up with a better solution I
>>>>>> probably wouldn't object too much. It's not going to be coming from me
>>>>>> though since I have plenty of other things to do and vacation time is
>>>>>> coming up very soon. So unless someone else comes up with something nice
>>>>>> soon I think we should just go with my solution because a) it's already
>>>>>> available, and b) works quite decently from what I can see.
>>>>> I guess I'll have to retype the old thing in the new world, but it
>>>>> really shouldn't be more than the quick draft I've laid down in the
>>>>> old thread. This here is imo no-go with all the core changes, and even
>>>>> just done within i915 I think it's highly dubious that it provides a
>>>>> real benefit, since defacto it means we'll have to abandon the atomic
>>>>> helpers entirely.
>>>> Now I think you're just being difficult for the sake of it. Have you
>>>> looked at the code at all? It's fully done from the atomic helpers
>>>> right now. And even moving the committed state tracking to i915
>>>> wouldn't require abandoning the helpers. We could just update the
>>>> committed state pointers when we call hw_done(), and we'd have to
>>>> update the state seqno/age timestamp when we call swap_state().
>>>> That's all there is to this.
>>> I'm concerned with the maintainenace burden you impose on all future i915
>>> atomic work, that's all. Yes it looks simple to you and right now, but
>>> it's another little thing to keep working, and we're really good at
>>> breaking stuff all the time.
>>>
>>> But if you strongly think this is the best possible approach overall,
>>> taking into account long-term impact, then go ahead with implementing this
>>> in i915. Adding it the concept of a committed state and being able to
>>> duplicate that and squeeze another commit in to the shared atomic helpers
>>> doesn't make sense imo.
>>> -Daniel
>> I think the problem is about struct_mutex usage by atomic commit during reset.
>> GPU reset has to wait for all previous atomic updates to complete, but
>> cleanup_planes and prepare_plane_fb both require struct_mutex, which can lead
>> to a deadlock. #99093
> The deadlocks I've seen recently didn't necessarily involve
> struct_mutex IIRC. Just the modeset locks.
>
>> The real fix is not taking struct_mutex during atomic commit, which will mean
>> no deadlock can happen.
>>
>> Is this the bug being fixed here or am I missing something?
> This would avoid both struct_mutex and modeset locks in the display
> reset path, so I guess it should help with struct_mutex issues
> as well.
>
I think fixing i915 to not require struct_mutex for vma pinning/unpinning will be a better use of our time, and it should also fix all deadlocks. :)

And it's far better than duplicating drm_atomic_commit functionality in our reset handlers.