[Intel-gfx] [PATCH 1/5] drm/i915: audit bo->resource usage v3

Wed Jan 25 12:53:15 UTC 2023

On Wed, 25 Jan 2023 at 11:35, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 25.01.23 um 11:21 schrieb Matthew Auld:
> > On Wed, 25 Jan 2023 at 10:07, Christian König
> > <ckoenig.leichtzumerken at gmail.com> wrote:
> >> Am 25.01.23 um 10:56 schrieb Matthew Auld:
> >>> On Tue, 24 Jan 2023 at 17:15, Matthew Auld
> >>> <matthew.william.auld at gmail.com> wrote:
> >>>> On Tue, 24 Jan 2023 at 13:48, Matthew Auld
> >>>> <matthew.william.auld at gmail.com> wrote:
> >>>>> On Tue, 24 Jan 2023 at 12:57, Christian König
> >>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
> >>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> >>>>>>
> >>>>>> Make sure we can at least move and alloc TT objects without backing store.
> >>>>>>
> >>>>>> v2: clear the tt object even when no resource is allocated.
> >>>>>> v3: add Matthew's changes for i915 as well.
> >>>>>>
> >>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
> >>>>> Reviewed-by: Matthew Auld <matthew.auld at intel.com>
> >>>> Ofc that assumes intel-gfx CI is now happy with the series.
> >>> There are still some nasty failures it seems (in the extended test
> >>> list). But it looks like the series is already merged. Can we quickly
> >>> revert and try again?
> >> Ah, crap. I thought everything would be fine after the CI gave its go.
> >>
> >> Which patch is causing the fallout?
> > I'm not sure. I think all of the patches kind of interact with each
> > other, but for sure there is an issue with the first patch. There is
> > one splat like:
>
> Well, I would rather revert as little as possible.
>
> Are you sure that this isn't only on some i915-specific branch with
> changes that are not yet upstream?

Yeah, that splat is taken directly from the CI results reported with
this series. So it's just your series applied on top of drm-tip.

Can you take a look at the first patch here:
https://patchwork.freedesktop.org/series/113332/

Maybe you have a better idea? For reference, the IGTs that we have for
verifying userspace object clearing are now failing, so I'm hoping that
patch fixes them. I'm hoping the other two patches will fix the splat.

>
> I can't even find the i915_gem_obj_copy_ttm function in drm-misc-next
> nor drm-next.
>
> Regards,
> Christian.
>
> >
> > <1>[  109.735148] BUG: kernel NULL pointer dereference, address:
> > 0000000000000010
> > <1>[  109.735151] #PF: supervisor read access in kernel mode
> > <1>[  109.735152] #PF: error_code(0x0000) - not-present page
> > <6>[  109.735153] PGD 0 P4D 0
> > <4>[  109.735155] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > <4>[  109.735157] CPU: 1 PID: 92 Comm: kworker/u12:6 Not tainted
> > 6.2.0-rc5-Patchwork_113269v1-gc4d436608c4e+ #1
> > <4>[  109.735159] Hardware name: Gigabyte Technology Co., Ltd. GB-Z390
> > Garuda/GB-Z390 Garuda-CF, BIOS IG1c 11/19/2019
> > <4>[  109.735160] Workqueue: events_unbound async_run_entry_fn
> > <4>[  109.735163] RIP: 0010:i915_ttm_resource_mappable+0x4/0x30 [i915]
> > <4>[  109.735286] Code: b8 f9 ff ff ff eb c2 e8 aa 5e 52 e1 e9 4f 0f
> > 18 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > 66 0f 1f 00 <8b> 57 10 b8 01 00 00 00 85 d2 74 15 48 8b 47 08 48 05 ff
> > 0f 00 00
> > <4>[  109.735288] RSP: 0018:ffffc90000f339a8 EFLAGS: 00010246
> > <4>[  109.735289] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> > ffff88810cea3a00
> > <4>[  109.735290] RDX: 0000000000000000 RSI: ffffc90000f33af0 RDI:
> > 0000000000000000
> > <4>[  109.735292] RBP: ffff88811645d7c0 R08: 0000000000000000 R09:
> > ffff888123afa940
> > <4>[  109.735292] R10: 0000000000000001 R11: ffff888104b70040 R12:
> > 0000000000000000
> > <4>[  109.735293] R13: 0000000000000000 R14: ffffc90000f33b08 R15:
> > ffffc90000f33af0
> > <4>[  109.735294] FS:  0000000000000000(0000)
> > GS:ffff8884ad680000(0000) knlGS:0000000000000000
> > <4>[  109.735295] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4>[  109.735296] CR2: 0000000000000010 CR3: 000000011f9c6003 CR4:
> > 00000000003706e0
> > <4>[  109.735297] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > <4>[  109.735298] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > <4>[  109.735299] Call Trace:
> > <4>[  109.735300]  <TASK>
> > <4>[  109.735301]  __i915_ttm_move+0x128/0x940 [i915]
> > <4>[  109.735408]  ? dma_resv_iter_next+0x91/0xb0
> > <4>[  109.735412]  ? dma_resv_iter_first+0x42/0xb0
> > <4>[  109.735414]  ? i915_deps_add_resv+0x4c/0xc0 [i915]
> > <4>[  109.735520]  i915_gem_obj_copy_ttm+0x12f/0x250 [i915]
> > <4>[  109.735625]  i915_ttm_restore+0x167/0x250 [i915]
> > <4>[  109.735759]  i915_gem_process_region+0x27a/0x3b0 [i915]
> > <4>[  109.735881]  i915_ttm_restore_region+0x4b/0x70 [i915]
> > <4>[  109.735999]  lmem_restore+0x3a/0x60 [i915]
> > <4>[  109.736101]  i915_gem_resume+0x4c/0x100 [i915]
> > <4>[  109.736202]  i915_drm_resume+0xc2/0x170 [i915]
> >
> > Plus some other less obvious issue(s) with some tests failing.
> >
> >> Christian.
>
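
For readers decoding the splat quoted above: CR2 is 0x10 and RDI is 0,
and the faulting bytes `<8b> 57 10` decode to a 32-bit load from
0x10(%rdi). That is consistent with i915_ttm_resource_mappable()
reading a field at offset 0x10 through a NULL resource pointer. The
following is only a minimal userspace sketch of that pattern and of a
NULL-tolerant check. The struct, its layout, the fake_* names and the
choice to treat a missing resource as mappable are all assumptions made
for illustration; this is not the kernel's code and not the proposed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for illustration; NOT the real struct ttm_resource. */
struct fake_resource {
	uint64_t start;        /* offset 0x00 */
	uint64_t size;         /* offset 0x08 */
	uint32_t mem_type;     /* offset 0x10: a load via NULL faults at 0x10 */
	uint64_t visible_size; /* made-up field, only used by this sketch */
};

static_assert(offsetof(struct fake_resource, mem_type) == 0x10,
	      "sketch assumes mem_type sits at offset 0x10");

#define FAKE_PL_SYSTEM 0u /* assumed id for system memory */

/* NULL-tolerant: an object without a backing resource maps no iomem. */
static bool fake_cpu_maps_iomem(const struct fake_resource *res)
{
	return res && res->mem_type != FAKE_PL_SYSTEM;
}

/* Only dereferences res after the helper has ruled out NULL. */
static bool fake_resource_mappable(const struct fake_resource *res)
{
	if (!fake_cpu_maps_iomem(res))
		return true;
	return res->visible_size == res->size;
}
```

Without the `res &&` test in the helper, calling
fake_resource_mappable(NULL) would perform the same kind of read at
offset 0x10 that the oops reports.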