[Intel-xe] [PATCH] drm/xe: Allow atomic operations in pages for Xe2

Thu Oct 5 23:34:44 UTC 2023

On Thu, Oct 05, 2023 at 07:59:00AM -0700, Matt Roper wrote:
>On Wed, Oct 04, 2023 at 11:29:49PM -0500, Lucas De Marchi wrote:
>> On Wed, Oct 04, 2023 at 02:06:24PM -0700, Jose Souza wrote:
>> > Without this if a atomic operations is executed it causes engine
>> > memory catastrophic error.
>> >
>> > This bit was already being added in the pte_encode_vma() path but not
>> > in the pde_encode_bo() path.
>> >
>> > This fixes at least the 3 failures in piglit sanity and 2 failures in
>> > crucible.
>> >
>> > Signed-off-by: José Roberto de Souza <jose.souza at intel.com>
>> > Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> > ---
>> > drivers/gpu/drm/xe/xe_vm.c | 3 +++
>> > 1 file changed, 3 insertions(+)
>> >
>> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>> > index ea17d026546cd..6be5d92706932 100644
>> > --- a/drivers/gpu/drm/xe/xe_vm.c
>> > +++ b/drivers/gpu/drm/xe/xe_vm.c
>> > @@ -1261,6 +1261,9 @@ static u64 xelp_pte_encode_bo(struct xe_bo *bo, u64 bo_offset,
>> > 	pte |= pte_encode_cache(xe, cache);
>> > 	pte |= pte_encode_ps(pt_level);
>> >
>> > +	if (GRAPHICS_VERx100(xe) >= 20)
>> > +		pte |= XE_USM_PPGTT_PTE_AE;
>>
>> I think there's a clash with doing this here and doing it where it was
>> done in commit 6193a2ba4704 ("drm/xe: set PTE_AE for all platforms supporting it")
>>
>> There it's adding the flag  in the gpuva.flags which is later added to
>> the pte. However, in xe_pt_stage_bind() that checks for XE_VMA_ATOMIC_PTE_BIT
>> is only inside doing that for devmem. I was thinking that I might had screwed it
>> with my recent refactors, but looking at the log it was always like
>> that and the fix in the commit above didn't fix it for *all* the
>> platforms.
>>
>> Does it fix it if you simply change that function to consider system
>> memory? Otherwise we should probably change it only in the pte encode
>> instead of doing it in both places.
>
>If we set this bit on system memory aren't we going to stop getting page
>faults?  My understanding was that we need to migrate from smem->vram
>before performing GPU atomic operations, which is why we leave that bit
>clear on sram pages today.

I don't see a limitation on it being devmem only. See bspec 68148,
"Atomic Destination Memory" - Destination memory can be Shared Local
Memory (SLM), local device memory, a remote device memory or System
memory.

Looking at previous implementations I'm not sure the handling for "this
only works in devmem" was correct and I'd think it's because the feature
started with PVC.

Lucas De Marchi