[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Wed Sep 1 08:18:12 UTC 2021

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu at amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).

For me your project has existed for a few weeks at most, because that is
when your team showed up on dri-devel. That you already spent 6 months
on this within amd, on a code area that very much affects shared code,
without kicking off any thread on dri-devel isn't great, but also not
something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where
we are. From my understanding you've done a bunch of changes to the
scheduler code. As far as I can see there have been two related things
your team has done:

- remove some allocations from scheduler code, because that can lead
to deadlocks. I brought up this topic quite a while ago here:

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/

This is just one patch of the entire series. This is an area where we
really need a consistent solution across all drm/sched drivers, not
something that individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from
panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

There's been more than just this, also quite some doc patches from
Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now
suddenly pops up out of nowhere and demands that its approach
land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing
context for outsiders. The patches sometimes come with zero commit
message, for tricky concurrency bugs. And there's no context with what
you've done already on the amdgpu side (since that never showed up on
dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed
to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on
dri-devel (or whatever the topic really is about) about the problem
you have and how you're trying to solve it. This can be just text if
it's a big thing, but it can also already include some proof of
concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code
_together_. Not "merge amdgpu code first, then get annoyed when the
core changes don't land immediately after you've practically finished
the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have
upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and quite painful to me ….

Yes this is painful :-(

I think the best way forward is to go through the above process again
and essentially restart. So submit a complete patch series with
problem descriptions, solution you picked, why you picked that, all
the amdgpu patches to get there and the core patches too. Since it
sounds like a bunch of this has all landed already you probably need a
patch 1 that goes back to 6 months ago so that we can see the overall
direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion
we had just a few months ago had involved your team too; I'm
sure there would have been some good insights from amd's side. I'd
really want you and your engineers involved here, so let's do this
properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here together, let’s go through them one by one, it’s too hard for us to reply each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other thread , regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch