[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
Liu, Monk
Monk.Liu at amd.com
Wed Sep 1 10:27:39 UTC 2021
>> For me your project has existed for a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking off any thread on dri-devel isn't great, but it's also not something we can fix, since time machines don't exist.
This is partially true: for the past months our changes lived only inside the AMD driver, and it is only now that we have found we also need to make changes at the drm/sched level.
>> Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
If our changes to the DRM-level code cannot be merged soon, that's fine, we can discuss them further. But this didn't suddenly pop up out of nowhere; we have already been working on it for months inside the AMD driver.
>> I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.
We are not objecting to this approach. We didn't do that before because previously everything we needed lived inside the AMD driver ... we were trying to avoid changing the DRI/DRM interface part ...
As for the patches you showed us via those links, I'm sorry that due to some IT infrastructure reasons my team and I didn't see them before (we mostly work in the AMD repo ... the patches you pointed to haven't been merged into our repo yet ...)
One thing I also want to emphasize here: if any code needs to change inside the AMD driver, please always let us know and give us a chance to review it.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <Monk.Liu at amd.com>
Cc: Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, JingWen <JingWen.Chen2 at amd.com>; DRI Development <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
Hi Monk,
On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu at amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks to me like the feedback from you three is spread across that flood of emails. The feature we are working on (the diagnostic TDR scheme) has been pending for more than 6 months (we started it in February 2021).
For me your project has existed for a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking off any thread on dri-devel isn't great, but it's also not something we can fix, since time machines don't exist.
So we have to make the best of the situation and move ahead from where we are. From my understanding you've made a bunch of changes to the scheduler code. As far as I can see there are two related things your team has done:
- remove some allocations from the scheduler code, because they can lead to deadlocks. I kicked off this topic quite a while ago here
https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/
This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way. (A rough sketch of the pattern follows below, after the second point.)
- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
There's been more than just this: also quite a few doc patches from Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.
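To make the first point a bit more concrete for readers without the amdgpu context: the scheduler's ->run_job() and timeout paths sit inside the dma-fence signalling critical section, where memory reclaim may already be blocked waiting on the very fences the driver is about to signal, so allocating memory there can deadlock. Below is a rough sketch of the direction only, not code from the actual series; the my_* names are invented for illustration and only the drm_sched/dma_fence types and calls are real:

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <drm/gpu_scheduler.h>

struct my_hw_fence {
	struct dma_fence base;
	/* driver-specific ring/seqno state would live here */
};

struct my_job {
	struct drm_sched_job base;
	struct my_hw_fence *hw_fence;	/* pre-allocated at submit time */
};

/* invented helper: writes the job's fence to the ring, no allocations */
void my_hw_fence_emit(struct my_hw_fence *fence);

/* ioctl/submit path: GFP_KERNEL allocations are still fine here */
static int my_job_prepare(struct my_job *job)
{
	job->hw_fence = kzalloc(sizeof(*job->hw_fence), GFP_KERNEL);
	if (!job->hw_fence)
		return -ENOMEM;
	return 0;
}

/* drm_sched ->run_job() callback: runs inside the dma-fence signalling
 * critical section, so it must not allocate memory */
static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	my_hw_fence_emit(job->hw_fence);	/* only touches pre-allocated memory */
	return &job->hw_fence->base;
}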
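For the second point, the interesting races live in the driver's timedout_job handler, which has to stop the scheduler, reset the hardware and restart everything without tripping over jobs that complete concurrently. A rough sketch of the typical flow, again with invented my_* names and only the skeleton of what a real driver does:

#include <drm/gpu_scheduler.h>

/* my_device, my_job_to_device() and my_gpu_reset() are invented for this
 * sketch; only the drm_sched_* calls are the real API. */
struct my_device {
	struct drm_gpu_scheduler sched;
};

struct my_device *my_job_to_device(struct drm_sched_job *sched_job);
void my_gpu_reset(struct my_device *mdev);

static enum drm_gpu_sched_stat
my_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_device *mdev = my_job_to_device(sched_job);

	/* stop the scheduler so no new jobs reach the hardware while we reset */
	drm_sched_stop(&mdev->sched, sched_job);

	/* only now is it safe to reset the hardware */
	my_gpu_reset(mdev);

	/* push unfinished jobs back to the hardware and re-arm the timeout */
	drm_sched_resubmit_jobs(&mdev->sched);
	drm_sched_start(&mdev->sched, true);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}

The serialization question in [PATCH 2/2] is then essentially about making sure this handler and the scheduler's normal job-completion path cannot run concurrently against each other.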
Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.
The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.
Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:
- Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how you're trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.
- Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practically finished the project".
- This might mean changes to other drivers if we need to adjust interfaces.
On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.
> Honestly speaking, the way we are using email now is not friendly and it is quite painful to me ....
Yes this is painful :-(
I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.
The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion we had just a few months ago had involved your team too; I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!
Cheers, Daniel
> Can we try to put all our opinions, suggestions, or even objections together here and go through them one by one? It's too hard for us to reply to each email about different questions.
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fix for the timeout timer in the scheduler. Can we complete this one first? It should already resolve all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I have already answered the questions Daniel raised in the other thread regarding why I use __kthread_should_park()
>
> For the other aspects, can we gather all our opinions together here?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/