[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Wed Sep 1 10:19:32 UTC 2021

[AMD Official Use Only]

Daniel 

>From the link you share it looks you(or someone else) have quite a bunch patches that changes DRM_SCHED or even amdgpu, by that case before they are merged to kernel tree I'm wondering if any AMD develop reviewed them ?

They looks to me somehow conflicting with what we changed in our repo.... 

It is really a chaos for AMDer if someone else out side of AMD changes our kernel driver (or/and scheduler) without reviewed by AMDer, just like we are requiring your review if we tend to change scheduler's logic here .... 

This one changes AMD's code: https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
And I didn't see any reviewed-by from AMDers ...

This one also touches AMD's code: https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
Which is conflicting with one patch we submitted (in our repo rightnow), and neither see AMDder gave a review-by on this one (let me know if I missed it)

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <Monk.Liu at amd.com>
Cc: Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, JingWen <JingWen.Chen2 at amd.com>; DRI Development <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <Monk.Liu at amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, the feature we are working on (diagnostic TDR scheme) is pending there for more than 6 month (we started it from feb 2021).

For me your project exists since a few weeks at most, because that is when your team showed up on dri-devel. That you already spent 6 months on this within amd, on a code area that very much affects shared code, without kicking of any thread on dri-devel isn't great, but also not something we can fix, since time machines don't exist.

So we have to make the best out of the situation and move ahead where we are. From my understanding you've done a bunch of changes to the scheduler code. As far as I can see there's been two related things your team has done:

- remove some allocations from scheduler code, because that can lead to deadlocks. I've kicked up this topic quite a while ago here

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20200604081224.863494-10-daniel.vetter%40ffwll.ch%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=pG5sG5pjVXEAMaahvfNS11VwbHkYWRuWrtHFXM9mEyo%3D&reserved=0

This is just one patch of the entire series. This is an area where we really need a consistent solution across all drm/sched drivers, not something that individual drivers just fix in their own way.

- the other one is the timeout issue for the patches you cite here.
Again there's been discussion on this on dri-devel with Boris from panfrost about how we can handle at least some of the races in tdr.
That resulted in lots of discussions and documentation improvements.
Those patches are merged now, link
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fdri-devel%2F20210625133327.2598825-2-boris.brezillon%40collabora.com%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=m6U6tJbX2x38xiwQXE1oV0sz2BxXZfPlcouyqIqPZNU%3D&reserved=0

There's been more than just this, also quite some doc patches from Boris that explain how it's all supposed to work and be race-free.
Again your driver isn't the only one with interesting TDR races.

Your team hasn't been active in any of these discussions, but now suddenly pops up out of nowhere and demands that your approach needs to land asap. That's really not how upstream works.

The other thing where I'm struggling is that there's a lot of missing context for outsiders. The patches sometimes come with zero commit message, for tricky concurrency bugs. And there's no context with what you've done already on the amdgpu side (since that never showed up on dri-devel), which makes constructive discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work when touching shared infrastructure is:

- Before you start merging anything kick off an RFC thread on dri-devel (or whatever the topic really is about) about the problem you have and how your trying to solve it. This can be just text if it's a big thing, but it can also already include some proof of concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code _together_. Not "merge amdgpu code first, then get annoyed when the core changes don't land immediately after you've practially finished the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have upstream buy-in before you start to put in real work on the project.

> Honestly speaking the email ways that we are using now is not friendly and quite painful to me ....

Yes this is painful :-(

I think the best way forward is to go through the above process again and essentially restart. So submit a complete patch series with problem descriptions, solution you picked, why you picked that, all the amdgpu patches to get there and the core patches too. Since it sounds like a bunch of this has all landed already you probably need a patch 1 that goes back to 6 months ago so that we can see the overall direction, and review whether that's the right one or not.

The not-so-painful approach would have been to do this from the start,
6 months ago. It would definitely have helped if the tdr discussion we've had just a few months ago would have involved your team too, I'm sure there would have been some good insights from amd's side. I'd really want you and your engineers involved here, so let's do this properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objects here together, let's go through them one by one, it's too hard for us to reply each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete this one first ? it should already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other 
> thread , regarding why I use __kthread_should_park()
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>

--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=04%7C01%7Cmonk.liu%40amd.com%7Cd90ad990ac1a499c266208d96d21138d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660811106940372%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=NA3iopIUYFOuTokczRA%2BNBcwVrvMMMHGPM96%2B%2Bm0nEg%3D&reserved=0