[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Jingwen Chen Jingwen.Chen2 at amd.com
Mon Sep 6 10:35:20 UTC 2021


Hi Christian/Andrey/Daniel,

I read Boris's patch about ordered workqueue and I think maybe we can
leverage this change.
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

The TDR race condition we are talking about is caused by a bailing
job being deleted from the pending list. If we use the ordered
workqueue for the timeout handler in the driver, there will be no
bailing job.
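
To illustrate the serialization argument, here is a minimal user-space
sketch. It is a Python model only, not scheduler or driver code: the
single-worker executor stands in for an ordered workqueue, and
run_serialized_timeouts, timedout_handler and the scheduler names are
made up for the example. The only point it shows is that handlers queued
on such a queue run one at a time, in submission order, so no two timeout
handlers can touch shared state concurrently.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_serialized_timeouts(sched_names, work_seconds=0.01):
    """Queue one hypothetical timeout handler per scheduler on a
    single-worker executor and report how they actually ran."""
    active = 0       # handlers currently running
    max_active = 0   # highest concurrency observed
    order = []       # completion order of the handlers
    lock = threading.Lock()

    def timedout_handler(name):
        nonlocal active, max_active
        with lock:
            active += 1
            max_active = max(max_active, active)
        time.sleep(work_seconds)  # stand-in for the reset work
        with lock:
            order.append(name)
            active -= 1

    # max_workers=1 models an ordered workqueue: one item at a time, FIFO.
    with ThreadPoolExecutor(max_workers=1) as ordered_wq:
        for name in sched_names:
            ordered_wq.submit(timedout_handler, name)
    # Leaving the "with" block waits for all queued handlers to finish.
    return order, max_active


if __name__ == "__main__":
    print(run_serialized_timeouts(["gfx", "compute", "sdma"]))
```

Raising max_workers above 1 in this model lets the handlers overlap,
which is the situation the ordered workqueue is meant to rule out.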

Do you have any suggestions?

Best Regards,
JingWen Chen

On Mon Sep 06, 2021 at 02:36:52PM +0800, Liu, Monk wrote:
> [AMD Official Use Only]
> 
> > I fear that I'm just repeating what Alex said, but to make it clear 
> > once more: That is *not* necessary!
> >
> > The shared repository is owned by upstream maintainers and they are 
> > usually free to do restructuring work without getting an acknowledgment from 
> > every single driver maintainer.
> 
> Hi Daniel
> 
> Anyway, thanks for officially confirming the working model and policy in the community. I don't want to give my opinion here, since it is not my call to change it either way.
> I only want to bring this diagnostic TDR scheme to a good end for AMD, or even for all DRM vendors.
> 
> How about this: we still have a final patch that has not landed in the DRM scheduler, and I would like Jingwen to present it to you and AlexD/Christian/Andrey. I believe you will have concerns or objections regarding this patch, but that's fine; let us figure out together how to make it acceptable to you and the other vendors working with the DRM scheduler.
> 
> P.S.: I have to repeat myself again: we are not suddenly popping up with a new idea; it is a disconnection issue. We didn't have changes (or plans for changes) in the DRM scheduler before, but eventually we found that we must make job_timeout and sched_main work in a serialized way, otherwise it won't work with the current scheduler's code structure.
> 
> Thanks 
> 
> ------------------------------------------
> Monk Liu | Cloud-GPU Core team
> ------------------------------------------
> 
> -----Original Message-----
> From: Daniel Vetter <daniel at ffwll.ch> 
> Sent: Friday, September 3, 2021 12:11 AM
> To: Koenig, Christian <Christian.Koenig at amd.com>
> Cc: Liu, Monk <Monk.Liu at amd.com>; Dave Airlie <airlied at gmail.com>; Alex Deucher <alexdeucher at gmail.com>; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Chen, JingWen <JingWen.Chen2 at amd.com>; DRI Development <dri-devel at lists.freedesktop.org>; amd-gfx at lists.freedesktop.org
> Subject: Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
> 
> On Thu, Sep 2, 2021 at 1:00 PM Christian König <christian.koenig at amd.com> wrote:
> >
> > Hi Monk,
> >
> > Am 02.09.21 um 07:52 schrieb Liu, Monk:
> > > [AMD Official Use Only]
> > >
> > > I'm not sure I can add much to help this along, I'm sure Alex has 
> > > some internal training, Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> > > Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> > > The best time to interact with upstream was 6 months ago, the second best time is now.
> > > <<<
> > >
> > > Daniel/AlexD
> > >
> > > I didn't mean that your changes to the AMD driver need my personal approval 
> > > or review ... and I'm already totally used to our driver not being 100% under the control of AMDers, but supposedly anyone from the community (including you) who intends to change AMD's driver needs at least to get approval from someone at AMD, e.g. AlexD or Christian. Isn't that reasonable?
> >
> > I fear that I'm just repeating what Alex said, but to make it clear 
> > once more: That is *not* necessary!
> >
> > The shared repository is owned by upstream maintainers and they are 
> > usually free to do restructuring work without getting an acknowledgment from 
> > every single driver maintainer.
> >
> > Anybody can of course technically object to upstream design decisions, 
> > but that means that you need to pay attention to the mailing lists in 
> > the first place.
> >
> > > just like we need your approval if we try to modify DRM-sched, or need panfrost's approval if we need to change panfrost code ...
> > >
> > > CCing only AMD's engineers does not look quite proper; how do you know whether your changes (to the AMD code) conflict with AMD's ongoing internal features/refactoring or not?
> >
> > Well, because AMD is supposed to work in public as much as possible and 
> > ask upstream before making changes to the code base.
> >
> > In addition to that, design decisions are supposed to be discussed on 
> > the mailing list and *not* internally.
> 
> Yeah I'm honestly really surprised about the course of this discussion here. With Alex, Christian and others, AMD has a lot of folks with years/decades of experience in how to collaborate in upstream, when to pull in others proactively and when that's not needed, and in general how to plan upstream work with the least amount of risk and surprises.
> 
> I think step zero here needs to be some training at amd and then re-planning this effort, before we get back to technical stuff.
> Otherwise we'll just get bogged down in pain because expectations about the process don't pan out.
> -Daniel
> 
> >
> > Regards,
> > Christian.
> >
> > >
> > > Thanks
> > >
> > > ------------------------------------------
> > > Monk Liu | Cloud-GPU Core team
> > > ------------------------------------------
> > >
> > > -----Original Message-----
> > > From: Dave Airlie <airlied at gmail.com>
> > > Sent: Thursday, September 2, 2021 2:51 AM
> > > To: Alex Deucher <alexdeucher at gmail.com>
> > > Cc: Liu, Monk <Monk.Liu at amd.com>; Daniel Vetter <daniel at ffwll.ch>; 
> > > Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky, Andrey 
> > > <Andrey.Grodzovsky at amd.com>; Chen, JingWen <JingWen.Chen2 at amd.com>; 
> > > DRI Development <dri-devel at lists.freedesktop.org>; 
> > > amd-gfx at lists.freedesktop.org
> > > Subject: Re: [diagnostic TDR mode patches] unify our solution 
> > > opinions/suggestions in one thread
> > >
> > > On Thu, 2 Sept 2021 at 01:20, Alex Deucher <alexdeucher at gmail.com> wrote:
> > >> On Wed, Sep 1, 2021 at 6:19 AM Liu, Monk <Monk.Liu at amd.com> wrote:
> > >>> [AMD Official Use Only]
> > >>>
> > >>> Daniel
> > >>>
> > >>>  From the link you shared, it looks like you (or someone else) have quite a bunch of patches that change DRM_SCHED or even amdgpu. In that case, I'm wondering whether any AMD developer reviewed them before they are merged into the kernel tree?
> > >>>
> > >>> They look to me like they somehow conflict with what we changed in our repo....
> > >>>
> > >>> It is really chaotic for AMDers if someone outside of AMD changes our kernel driver (and/or the scheduler) without review by an AMDer, just as we require your review if we intend to change the scheduler's logic here ....
> > >>>
> > >>> This one changes AMD's code:
> > >>> https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
> > >>> And I didn't see any reviewed-by from AMDers ...
> > >>>
> > >>> This one also touches AMD's code:
> > >>> https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
> > >>> This conflicts with one patch we submitted (in our repo 
> > >>> right now), and I don't see that any AMDer gave a Reviewed-by on this one 
> > >>> either (let me know if I missed it)
> > >>>
> > >> Monk, this is not how upstream works.  You need to participate.
> > >> That's how communities work.  There's a reason all these 
> > >> discussions happen on public mailing lists.  The patch author can't 
> > >> be expected to know every person on every vendor team to CC with a 
> > >> patch.  If you have concerns, you need to raise them when the 
> > >> patches are being discussed.
> > >>
> > > I'm not sure I can add much to help this along, I'm sure Alex has 
> > > some internal training,
> > >
> > > Once your driver is upstream, it belongs to upstream, you can maintain it, but you no longer control it 100%, it's a tradeoff, it's not one companies always understand.
> > >
> > > Usually people are fine developing away internally, but once interaction with other parts of the kernel/subsystem is required they have the realisation that they needed to work upstream 6 months earlier.
> > >
> > > The best time to interact with upstream was 6 months ago, the second best time is now.
> > >
> > > Dave.
> >
> 
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/

