Multilevel page tables broken for high addresses
Deucher, Alexander
Alexander.Deucher at amd.com
Tue Mar 28 20:25:10 UTC 2017
> -----Original Message-----
> From: Kuehling, Felix
> Sent: Tuesday, March 28, 2017 4:15 PM
> To: amd-gfx at lists.freedesktop.org; Koenig, Christian; Zhou,
> David(ChunMing); Deucher, Alexander
> Cc: Russell, Kent
> Subject: Multilevel page tables broken for high addresses
>
> It looks like the multi-level page table changes have been submitted.
> They're causing problems when we're trying to integrate them into our
> KFD branch.
>
> We resolved the obvious changes and it's working on older ASICs without
> problems. But we're getting hangs on Vega10. With my patch to enable
> UTCL2 interrupts, I'm seeing lots of VM faults (see below). The
> VM_L2_PROTECTION_FAULT_STATUS indicates a WALKER_ERROR (3 = PDE1
> value).
>
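The WALKER_ERROR reading above can be reproduced from the raw status word in the log. A minimal sketch in Python, assuming WALKER_ERROR occupies bits 3:1 and VMID bits 23:20 of VM_L2_PROTECTION_FAULT_STATUS. That layout is an assumption not stated in this thread (check the register headers), and decode_fault_status is a hypothetical helper, not a driver function:

```python
# Assumed field layout for VM_L2_PROTECTION_FAULT_STATUS (not from this thread).
WALKER_ERROR_SHIFT, WALKER_ERROR_MASK = 1, 0x7   # assumed bits 3:1
VMID_SHIFT, VMID_MASK = 20, 0xF                  # assumed bits 23:20


def decode_fault_status(status: int) -> dict:
    """Extract the two fields discussed in the thread from a raw status word."""
    return {
        "walker_error": (status >> WALKER_ERROR_SHIFT) & WALKER_ERROR_MASK,
        "vmid": (status >> VMID_SHIFT) & VMID_MASK,
    }


if __name__ == "__main__":
    # Status value copied from the fault log in this message.
    print(decode_fault_status(0x00841157))
```

Under that assumed layout, 0x00841157 decodes to WALKER_ERROR 3 and VMID 8, which agrees with the "3 = PDE1 value" reading and with vm_id:8 in the fault message.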
> If I set adev->vm_manager.num_level = 1 in gmc_v9_0_vm_init, the problem
> goes away (basically reverting b98e6b5 drm/amdgpu: enable four level
> VMPT for gmc9).
>
> I suspect an issue that's exposed by how the KFD Thunk library manages
> shared virtual address space? We typically start at fairly high virtual
> addresses and reserve the lower 1/4 of our address space for coherent
> mappings (aperture-based scheme for pre-gfx9). The address in the fault
> below is 0x0000001000d80000, so a bit above 64GB, near the start of our
> non-coherent range.
>
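To make the "a bit above 64GB" figure concrete, the sketch below splits the faulting address into per-level table indices. It assumes 4 KiB pages and 9 index bits at each of four levels (12 + 4*9 = 48 bits, which matches the 256TB maximum VM size quoted from the series). The real block size is configurable, so treat the constants and the split_address helper as hypothetical. This is an illustration only, not a diagnosis:

```python
PAGE_SHIFT = 12        # assumed 4 KiB pages
BITS_PER_LEVEL = 9     # assumed 512 entries per table
NUM_LEVELS = 4         # four-level VMPT, as enabled for gmc9 in the series


def split_address(addr: int) -> list:
    """Return table indices, top level first: [PDE2, PDE1, PDE0, PTE]."""
    pfn = addr >> PAGE_SHIFT
    mask = (1 << BITS_PER_LEVEL) - 1
    indices = [(pfn >> (level * BITS_PER_LEVEL)) & mask
               for level in range(NUM_LEVELS)]
    return indices[::-1]


if __name__ == "__main__":
    addr = 0x0000001000d80000          # faulting address from the log
    gib, rest = divmod(addr, 1 << 30)
    print(f"{gib} GiB + {rest / (1 << 20)} MiB")
    print(split_address(addr))
```

Under these assumptions the address is 64 GiB plus 13.5 MiB and the indices are [0, 64, 6, 384]. Each PDE1 entry would then span 1 GiB, so any mapping below 1 GiB uses PDE1 index 0, while this address uses PDE1 index 64.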
> Simple KFD tests that don't use the non-coherent (high) address range
> seem to be working fine. That tells me that the multi-level page table
> code has a problem with high addresses.
>
> I'll keep digging ...
Do you have multiple GPUs in the system? There might be issues, since some of the VM-related settings come from global variables.
Alex
>
> Regards,
> Felix
>
> [ 24.768477] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.777361] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.784204] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.791418] amdgpu 0000:03:00.0: IH ring buffer overflow (0x00083E00,
> 0x00000740, 0x00003E20)
> [ 24.791421] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.800299] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.807154] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.814370] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.823251] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.830098] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.837312] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.846190] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.853056] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.860273] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.869151] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.875994] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.883209] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.892087] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.898933] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.906170] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.915059] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.921910] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.929143] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.938021] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.944874] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 24.952089] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 24.960967] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 24.967810] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00841157
> [ 29.610925] gmc_v9_0_process_interrupt: 3402060 callbacks suppressed
> [ 29.610926] amdgpu 0000:03:00.0: [gfxhub] VMC page fault (src_id:0 ring:0
> vm_id:8 pas_id:1)
> [ 29.628202] amdgpu 0000:03:00.0: at page 0x0000001000d80000 from 27
> [ 29.641520] amdgpu 0000:03:00.0:
> VM_L2_PROTECTION_FAULT_STATUS:0x00000000
>
>
> On 17-03-27 01:53 AM, Chunming Zhou wrote:
> > Starting with Vega, ASICs support multi-level VMPT; this series implements it.
> >
> > Tested successfully with 2/3/4 levels.
> >
> > V2: address Christian comments.
> >
> > Max vm size 256TB tested ok.
> >
> >
> > Christian König (10):
> > drm/amdgpu: rename page_directory_fence to last_dir_update
> > drm/amdgpu: add the VM pointer to the amdgpu_pte_update_params as well
> > drm/amdgpu: add num_level to the VM manager
> > drm/amdgpu: generalize page table level
> > drm/amdgpu: handle multi level PD size calculation
> > drm/amdgpu: handle multi level PD during validation
> > drm/amdgpu: handle multi level PD in the LRU
> > drm/amdgpu: handle multi level PD updates V2
> > drm/amdgpu: handle multi level PD during PT updates
> > drm/amdgpu: add alloc/free for multi level PDs V2
> >
> > Chunming Zhou (5):
> > drm/amdgpu: abstract block size to one function
> > drm/amdgpu: limit block size to one page
> > drm/amdgpu: adapt vm size for multi vmpt
> > drm/amdgpu: set page table depth by num_level
> > drm/amdgpu: enable four level VMPT for gmc9
> >
> > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 6 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 67 ++--
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 474 +++++++++++++++++++----------
> > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 16 +-
> > drivers/gpu/drm/amd/amdgpu/gfxhub_v1_0.c | 3 +-
> > drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 1 +
> > drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 1 +
> > drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 1 +
> > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 7 +
> > drivers/gpu/drm/amd/amdgpu/mmhub_v1_0.c | 2 +-
> > 11 files changed, 380 insertions(+), 200 deletions(-)
> >