[Intel-gfx] Regression on linux-next (next-20231107)

Borah, Chaitanya Kumar chaitanya.kumar.borah at intel.com
Thu Nov 9 17:00:09 UTC 2023


Hello Krister,
 
Hope you are doing well. I am Chaitanya from the Linux graphics team at Intel.
 
This mail is regarding a regression we are seeing in our CI runs [1] on some machines (dg2 and adl-p) with the linux-next repository.

Since version next-20231107 [2], we are seeing the following error:
```````````````````````````````````````````````````````````````````````````````
<4>[   32.015910] stack segment: 0000 [#1] PREEMPT SMP NOPTI
<4>[   32.021048] CPU: 15 PID: 766 Comm: fusermount Not tainted 6.6.0-next-20231107-next-20231107-g5cd631a52568+ #1
<4>[   32.031135] Hardware name: Intel Corporation Raptor Lake Client Platform/RPL-S ADP-S DDR5 UDIMM CRB, BIOS RPLSFWI1.R00.4221.A00.2305271351 05/27/2023
<4>[   32.044657] RIP: 0010:fuse_evict_inode+0x61/0x150 [fuse]
`````````````````````````````````````````````````````````````````````````````````

The detailed log can be found in [3].

After bisecting the tree, the following patch [4] appears to be the first "bad" commit:

 `````````````````````````````````````````````````````````````````````````````````````````````````````````
513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5 is the first bad commit
commit 513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5
Author: Krister Johansen kjlx at templeofstupid.com
Date:   Fri Nov 3 10:39:47 2023 -0700

    fuse: share lookup state between submount and its parent

    Fuse submounts do not perform a lookup for the nodeid that they inherit
    from their parent.  Instead, the code decrements the nlookup on the
    submount's fuse_inode when it is instantiated, and no forget is
    performed when a submount root is evicted.

    Trouble arises when the submount's parent is evicted despite the
    submount itself being in use.  In this author's case, the submount was
    in a container and detached from the initial mount namespace via a
    MNT_DETACH operation.  When memory pressure triggered the shrinker, the
    inode from the parent was evicted, which triggered enough forgets to
    render the submount's nodeid invalid.

    Since submounts should still function, even if their parent goes away,
    solve this problem by sharing refcounted state between the parent and
    its submount.  When all of the references on this shared state reach
    zero, it's safe to forget the final lookup of the fuse nodeid.

 `````````````````````````````````````````````````````````````````````````````````````````````````````````
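For readers unfamiliar with the approach the commit message describes, the idea of sharing refcounted lookup state can be sketched roughly as below. This is only a user-space illustration under stated assumptions: the names `shared_lookup`, `shared_lookup_new`, `shared_lookup_get`, `shared_lookup_put`, and the `forgets_sent` counter are hypothetical and are not taken from the patch or from the fuse code. The counter merely stands in for sending a forget for the nodeid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: state shared by a parent inode and its submount root. */
struct shared_lookup {
    atomic_int refs;   /* how many inodes currently share this state */
    uint64_t nodeid;   /* nodeid whose final lookup is still outstanding */
};

/* Stand-in for issuing a forget to the server (assumption for illustration). */
static int forgets_sent;

/* Create the shared state with one reference, held by the creating inode. */
static struct shared_lookup *shared_lookup_new(uint64_t nodeid)
{
    struct shared_lookup *s = malloc(sizeof(*s));
    if (!s)
        return NULL;
    atomic_init(&s->refs, 1);
    s->nodeid = nodeid;
    return s;
}

/* Take an extra reference, e.g. when a submount root inherits the nodeid. */
static struct shared_lookup *shared_lookup_get(struct shared_lookup *s)
{
    assert(s != NULL);
    atomic_fetch_add(&s->refs, 1);
    return s;
}

/*
 * Drop a reference on eviction. Only the last holder forgets the lookup
 * and frees the state. Returns 1 if this call was the final put, else 0.
 */
static int shared_lookup_put(struct shared_lookup *s)
{
    assert(s != NULL);
    if (atomic_fetch_sub(&s->refs, 1) != 1)
        return 0;          /* another inode still uses the nodeid */
    forgets_sent++;        /* safe to forget the final lookup now */
    free(s);
    return 1;
}
```

In this model, evicting the parent while the submount is still in use drops one reference but does not forget the nodeid; the forget happens only when the submount root also drops its reference.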
 
We also verified that the issue is no longer seen when the patch is reverted.

Could you please check why the patch causes this regression and provide a fix if necessary?

Thank you.

Regards

Chaitanya

[1] https://intel-gfx-ci.01.org/tree/linux-next/combined-alt.html?
[2] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231107
[3] http://gfx-ci.igk.intel.com/tree/linux-next/next-20231109/bat-dg2-14/boot0.txt
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20231107&id=513dfacefd712bcbfab64e1a9c9c3e0d51c2dca5

