[Intel-xe] [PATCH] drm/xe: fix pvc unload issue
Lucas De Marchi
lucas.demarchi at intel.com
Tue Apr 4 20:22:49 UTC 2023
On Tue, Apr 04, 2023 at 03:45:05PM -0400, Rodrigo Vivi wrote:
>On Tue, Apr 04, 2023 at 11:48:39AM -0700, Lucas De Marchi wrote:
>> On Mon, Apr 03, 2023 at 10:20:31PM +0000, Chang, Bruce wrote:
>> > Currently, unload pvc driver will generate a null dereference
>> > and the call stack is as below.
>> >
>> > [ 4850.618000] Call Trace:
>> > [ 4850.620740] <TASK>
>> > [ 4850.623134] ttm_bo_cleanup_memtype_use+0x3f/0x50 [ttm]
>> > [ 4850.628661] ttm_bo_release+0x154/0x2c0 [ttm]
>> > [ 4850.633317] ? drm_buddy_fini+0x62/0x80 [drm_buddy]
>> > [ 4850.638487] ? __kmem_cache_free+0x27d/0x2c0
>> > [ 4850.643054] ttm_bo_put+0x38/0x60 [ttm]
>> > [ 4850.647190] xe_gem_object_free+0x1f/0x30 [xe]
>> > [ 4850.651945] drm_gem_object_free+0x1e/0x30 [drm]
>> > [ 4850.656904] ggtt_fini_noalloc+0x9d/0xe0 [xe]
>> > [ 4850.661574] drm_managed_release+0xb5/0x150 [drm]
>> > [ 4850.666617] drm_dev_release+0x30/0x50 [drm]
>> > [ 4850.671209] devm_drm_dev_init_release+0x3c/0x60 [drm]
>> >
>> > There are a couple issues, but the main one is due to TTM has only
>> > one TTM_PL_TT region, but since pvc has 2 tiles and tries to setup
>> > 1 TTM_PL_TT each tile. The second will overwrite the first one.
>> >
>> > During unload time, the first tile will reset the TTM_PL_TT manger
>> > and when the second tile is trying to free Bo and it will generate
>> > the null reference since the TTM manage is already got reset to 0.
>> >
>> > The fix is to use one global TTM_PL_TT manager.
>> >
>> > v2: make gtt mgr global and change the name to sys_mgr
>> >
>> > Cc: Stuart Summers <stuart.summers at intel.com>
>> > Cc: Matthew Brost <matthew.brost at intel.com>
>> > Cc: Vivi, Rodrigo <rodrigo.vivi at intel.com>
>> > Signed-off-by: Bruce Chang <yu.bruce.chang at intel.com>
>> > ---
>> > drivers/gpu/drm/xe/Makefile | 2 +-
>> > drivers/gpu/drm/xe/xe_device.c | 2 +
>> > drivers/gpu/drm/xe/xe_device.h | 1 +
>> > drivers/gpu/drm/xe/xe_device_types.h | 2 +
>> > drivers/gpu/drm/xe/xe_gt.c | 18 -----
>> > drivers/gpu/drm/xe/xe_gt_types.h | 2 -
>> > drivers/gpu/drm/xe/xe_ttm_gtt_mgr.h | 16 -----
>> > drivers/gpu/drm/xe/xe_ttm_gtt_mgr_types.h | 18 -----
>> > .../xe/{xe_ttm_gtt_mgr.c => xe_ttm_sys_mgr.c} | 67 +++++++------------
>> > 9 files changed, 31 insertions(+), 97 deletions(-)
>> > delete mode 100644 drivers/gpu/drm/xe/xe_ttm_gtt_mgr.h
>> > delete mode 100644 drivers/gpu/drm/xe/xe_ttm_gtt_mgr_types.h
>> > rename drivers/gpu/drm/xe/{xe_ttm_gtt_mgr.c => xe_ttm_sys_mgr.c} (52%)
>> >
>> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> > index 669c7b6f552c..7e19a15af3dc 100644
>> > --- a/drivers/gpu/drm/xe/Makefile
>> > +++ b/drivers/gpu/drm/xe/Makefile
>> > @@ -83,7 +83,7 @@ xe-y += xe_bb.o \
>> > xe_step.o \
>> > xe_sync.o \
>> > xe_trace.o \
>> > - xe_ttm_gtt_mgr.o \
>> > + xe_ttm_sys_mgr.o \
>> > xe_ttm_stolen_mgr.o \
>> > xe_ttm_vram_mgr.o \
>> > xe_tuning.o \
>> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> > index a79f934e3d2d..4fb9aff27686 100644
>> > --- a/drivers/gpu/drm/xe/xe_device.c
>> > +++ b/drivers/gpu/drm/xe/xe_device.c
>> > @@ -279,6 +279,8 @@ int xe_device_probe(struct xe_device *xe)
>> > if (err)
>> > goto err_irq_shutdown;
>> >
>> > + xe_ttm_sys_mgr_init(xe);
>> > +
>> > for_each_gt(gt, xe, id) {
>> > err = xe_gt_init_noalloc(gt);
>> > if (err)
>> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
>> > index d277f8985f7b..d9d1b09a8e38 100644
>> > --- a/drivers/gpu/drm/xe/xe_device.h
>> > +++ b/drivers/gpu/drm/xe/xe_device.h
>> > @@ -116,4 +116,5 @@ static inline bool xe_device_has_flat_ccs(struct xe_device *xe)
>> > }
>> >
>> > u32 xe_device_ccs_bytes(struct xe_device *xe, u64 size);
>> > +int xe_ttm_sys_mgr_init(struct xe_device *xe);
>>
>> CC [M] drivers/gpu/drm/xe/tests/xe_bo_test.o
>> ../drivers/gpu/drm/xe/xe_ttm_sys_mgr.c:99:5: error: no previous prototype for ‘xe_ttm_sys_mgr_init’ [-Werror=missing-prototypes]
>> 99 | int xe_ttm_sys_mgr_init(struct xe_device *xe)
>> | ^~~~~~~~~~~~~~~~~~~
>>
>
>that's awkward... I didn't get issue here, not CI got it...
I enable W=1 every time I'm going to push so I get this kind of error.
We used to have it in the gitlab CI, but lost when migrated to the
mailing list. I talked with Ryszard to setup CI to run
"dev-provided-checks" so we can enable this back and probably run some
more. Hopefully we can add it back soon.
FWIW these are the commands I use:
alias m='make -j$(nproc)'
alias m64='m O=build64'
m64
./scripts/config --file build64/.config --disable CONFIG_DRM_XE_DISPLAY && \
m64 modules_prepare && m64 M=drivers/gpu/drm/xe W=1
./scripts/config --file build64/.config --enable CONFIG_DRM_XE_DISPLAY
which does the trick of "build everything, rebuild only xe.ko with
display disabled and W=1 (otherwise it fails), then move
CONFIG_DRM_XE_DISPLAY back on.
Lucas De Marchi
>
>>
>> couple of issues:
>>
>> 1) you moved the implementation to a new file and didn't include
>> xe_device.h, creating the build issue above
>
>I was wondering about gcc version, but this wouldn't be magic in older gcc...
>
>> 2) The decl shouldn't be here though. You need a xe_ttm_sys_mgr.h, just
>> like we have xe_ttm_stolen_mgr.h, xe_ttm_vram_mgr.h, xe_ttm_vram_mgr_types.h
>
>yeap, that sounds the way to go...
>
>>
>> Lucas De Marchi
More information about the Intel-xe
mailing list