It appears drm-next TTM cleanup broke something . . .

Kevin Brace kevinbrace at gmx.com
Mon Oct 19 16:20:42 UTC 2020


Hi Christian,

I looked into a few more things, and figured out why OpenChrome DRM was not booting X Server.
Now the situation is under control in my side of the world (OpenChrome development world), and I got X Server working again with drm-next 5.10 code and OpenChrome DRM.
Code will be committed into OpenChrome DRM upstream repository's drm-next-5.10 branch in the next few days.
    I can see that OpenChrome DRM "just happened to" work without ttm_tt_create/destroy callbacks being specified.
I can fix this issue easily.
The other issue I was facing was that commit 48e07c23cbeba2a2cda7ca73be0015e727818536 (drm/ttm: nuke memory type flags) discontinued TTM_PL_FLAG_* flags.
This caused incompatibility between OpenChrome DDX and the updated OpenChrome DRM that incorporated the changes made with commit 48e07c2.
OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.
I changed OpenChrome DDX code to use TTM_PL_* instead for allocating memory via TTM, and OpenChrome DRM based Linux kernel was able to boot X Server properly.
Again, the code change to the DDX will happen in a few days.
    Before I moved on from this issue, I will like to ask for one request regarding TTM and its callbacks.
Is it too much to ask for using more BUG_ON null pointer assertions for TTM callbacks?
Although I did not bring it up with you back then, I did get hit with an issue similar to this 3 years ago.

https://cgit.freedesktop.org/openchrome/drm-openchrome/commit/?h=drm-next-4.13&id=3c08ec601bb1ccd5ff58a9101317b728aa7204bd
https://cgit.freedesktop.org/openchrome/drm-openchrome/commit/?h=drm-next-4.13&id=2064175f977d9859831be653df16e3ea10415a8a


I did not send you an e-mail about it, I do recall complaining to Alex Deucher about the above io_mem_pfn null pointer bug at XDC2017 (I did attend it, and I presented about the state of OpenChrome Project.).
I brought this up with Alex because I stuck for more than 2 weeks to figure out the issue (the commit for overcoming io_mem_pfn callback issue was made on 2017-09-13, but the commit prior to that was on 2017-08-28).

https://cgit.freedesktop.org/openchrome/drm-openchrome/log/?h=drm-next-4.13&ofs=50


Anyway, someone else got hit by the same issue a month or two later

https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
https://lists.freedesktop.org/archives/dri-devel/2017-December/161168.html


These are the commits that fixed the issue.

https://cgit.freedesktop.org/drm/drm/commit/?id=ea642c3216cb2a60d1c0e760ae47ee85c9c16447
https://cgit.freedesktop.org/drm/drm/commit/?id=c67fa6edc8b11afe22c88a23963170bf5f151acf


If you compare the commit date, I figured it out io_mem_pfn null pointer bug 2 to 3 months earlier.
I think the problem we got with the latest code change to the TTM is, many TTM callbacks do not check for a null pointer for mandatory callbacks.
Also, the ttm_bo_driver struct comment section inside ttm_bo_driver.h does not make it 100% clear that what is mandatory and what is not.
If these were done, I am sure I would have left ttm_tt_create/destroy callbacks intact.

https://cgit.freedesktop.org/openchrome/drm-openchrome/commit/?h=drm-next-5.10&id=64947142a1cf8b83faa73da7aa35a17f6a24568a
(scroll down to openchrome/openchrome_ttm.c)


    Regarding your suggestion that I should abandon OpenChrome DRM's current GEM / TTM implementation and instead I should use the GEM VRAM helpers, since I was able to figure out the issues with some suggestions from you and Dave, I do not think it is worth switching to GEM VRAM helpers at this point.
Obviously, the current implementation does not support acceleration, hence, it appears to be a good candidate for switching to GEM VRAM helpers.
While that may be true, I do plan to implement acceleration later, and this is why I do not want to settle with the GEM VRAM helpers.

Regards,

Kevin Brace
Brace Computer Laboratory blog
https://bracecomputerlab.com


> Sent: Monday, October 19, 2020 at 3:13 AM
> From: "Christian König" <ckoenig.leichtzumerken at gmail.com>
> To: "Kevin Brace" <kevinbrace at gmx.com>, "Dave Airlie" <airlied at gmail.com>, "dri-devel" <dri-devel at lists.freedesktop.org>
> Subject: Re: It appears drm-next TTM cleanup broke something . . .
>
> Hi Kevin,
> 
> the basic problem you are facing is that ttm_tt_create/destroy is 
> mandatory (It always was). You need an implementation or otherwise you 
> won't be able to use the system domain (additional to the optional GTT 
> domain).
> 
> My best guess is that the difference is that we now force to initiate 
> the system domain for all drivers.
> 
> If that is correct you just that you never ran into because you never 
> correctly initialized TTM to support buffer moves.
> 
> I'm not sure what exactly the OpenChrome DRM driver is doing, but I 
> strongly suggest to just drop TTM support completely and use the GEM 
> VRAM helper layer instead.
> 
> Regards,
> Christian.
> 
> Am 19.10.20 um 09:23 schrieb Kevin Brace:
> > Hi Dave,
> >
> > Yeah, with the workaround I mentioned in my previous e-mail, OpenChrome DRM does not crash for "ttm_tt_create" member being null.
> > It is still not able to boot X Server due to some other TTM related memory allocation issue it is suffering from.
> > I think making huge changes to TTM during this development cycle broke OpenChrome DRM.
> >      Following up on the question I raised during the previous e-mail.
> > Shouldn't "use_tt" parameter being "false" for ttm_range_man_init() disable TTM TT functionality?
> > I feel like that should be the expected behavior.
> > Again, there is only 5 to 6 more days left until Linux 5.10-rc2, so I decided to contact you on Sunday (I consider this bug to be urgent.).
> > Assuming what I am asserting is correct, I think the reason why this was not discovered earlier was due to the following reasons.
> >
> > 1) nouveau, radeon, and amdgpu already use TTM TT functionality.
> > 2) ast uses GEM VRAM helper that internally uses TTM. It populates "ttm_tt_create" and "ttm_tt_destroy" members, hence, the developers did not notice the breakage.
> > 3) OpenChrome DRM is still not in the mainline tree, so no one other than myself noticed the problem until now.
> >
> >
> > Regarding the TTM TT functionality, OpenChrome DRM currently does not support acceleration, hence, I did not believe it was necessary to populate "ttm_tt_create" and "ttm_tt_destroy" members.
> > That implementation worked fine until the previous development cycle code.
> > Of course, I will eventually add support for acceleration, hence, TTM TT functionality will be utilized at some point.
> >
> > Regards,
> >
> > Kevin Brace
> > Brace Computer Laboratory blog
> > https://bracecomputerlab.com
> >
> >
> >> Sent: Sunday, October 18, 2020 at 12:50 PM
> >> From: "Dave Airlie" <airlied at gmail.com>
> >> To: "Kevin Brace" <kevinbrace at gmx.com>, "Christian König" <ckoenig.leichtzumerken at gmail.com>
> >> Cc: "dri-devel" <dri-devel at lists.freedesktop.org>, "Dave Airlie" <airlied at redhat.com>
> >> Subject: Re: It appears drm-next TTM cleanup broke something . . .
> >>
> >> On Mon, 19 Oct 2020 at 05:15, Kevin Brace <kevinbrace at gmx.com> wrote:
> >>> Hi Dave,
> >>>
> >>> It is a little urgent, so I am writing this right now.
> >>> As usual, I pulled in DRM repository code for an out of tree OpenChrome DRM repository a few days ago.
> >>> While going through the changes I need to make to OpenChrome DRM to compile with the latest Linux kernel, I noticed that ttm_bo_init_mm() was discontinued, and it was replaced with ttm_range_man_init().
> >>> ttm_range_man_init() has a parameter called "bool use_tt", but honestly, I do not think it is functioning correctly.
> >>> If I keep "ttm_tt_create" member of ttm_bo_driver struct null by not specifying it, TTM still tries to call it, and crashes due to a null pointer access.
> >>> The workaround I found so far is to specify the "ttm_tt_create" member by copying bo_driver_ttm_tt_create() from drm/drm_gem_vram_helper.c.
> >>> This is what the call trace looks like without specifying the "ttm_tt_create" member (i.e., this member is null).
> >> cc'ing Christian,
> >>
> >> I can't remember if we did this deliberately or if just worked by
> >> accident previously.
> >>
> >> Either way, you should probably need a ttm_tt_create going forward.
> >>
> >> Dave.
> >>
> >>> _______________________________________________
> >>> . . .
> >>> kernel: [   34.310674] [drm:openchrome_bo_create [openchrome]] Entered openchrome_bo_create.
> >>> kernel: [   34.310697] [drm:openchrome_ttm_domain_to_placement [openchrome]] Entered openchrome_ttm_domain_to_placement.
> >>> kernel: [   34.310706] [drm:openchrome_ttm_domain_to_placement [openchrome]] Exiting openchrome_ttm_domain_to_placement.
> >>> kernel: [   34.310737] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >>> kernel: [   34.310742] #PF: supervisor instruction fetch in kernel mode
> >>> kernel: [   34.310745] #PF: error_code(0x0010) - not-present page
> >>> . . .
> >>> kernel: [   34.310807] Call Trace:
> >>> kernel: [   34.310827]  ttm_tt_create+0x5f/0xa0 [ttm]
> >>> kernel: [   34.310839]  ttm_bo_validate+0xb8/0x140 [ttm]
> >>> kernel: [   34.310886]  ? drm_vma_offset_add+0x56/0x70 [drm]
> >>> kernel: [   34.310897]  ? openchrome_gem_create_ioctl+0x150/0x150 [openchrome]
> >>> . . .
> >>> _______________________________________________
> >>>
> >>> The erroneous call to  "ttm_tt_create" member happens right after TTM placement is performed (openchrome_ttm_domain_to_placement()).
> >>> Currently, OpenChrome DRM's TTM implementation does not use "ttm_tt_create" member, and this arrangement worked fine until Linux 5.9's drm-next code.
> >>> It appears that Linux 5.10's drm-next code broke the code.
> >>>
> >>> Regards,
> >>>
> >>> Kevin Brace
> >>> Brace Computer Laboratory blog
> >>> https://bracecomputerlab.com
> >>>
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel at lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
>


More information about the dri-devel mailing list