[PATCH] drm/radeon: fix VM page table setup on SI

Fri Jun 29 11:07:19 PDT 2012

On Fri, Jun 29, 2012 at 12:14 PM, Michel Dänzer <michel at daenzer.net> wrote:
> On Fri, 2012-06-29 at 11:28 -0400, Jerome Glisse wrote:
>> On Fri, Jun 29, 2012 at 11:23 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
>> > On Fri, Jun 29, 2012 at 10:49 AM, Michel Dänzer <michel at daenzer.net> wrote:
>> >> On Thu, 2012-06-28 at 17:53 -0400, alexdeucher at gmail.com wrote:
>> >>> From: Alex Deucher <alexander.deucher at amd.com>
>> >>>
>> >>> Cayman and trinity allow for variable sized VM page
>> >>> tables, but SI requires that all page tables be the
>> >>> same size.  The current code assumes variably sized
>> >>> VM page tables so SI may end up with part of each page
>> >>> table overlapping with other memory which could end
>> >>> up being interpreted by the VM hw as garbage.
>> >>>
>> >>> Change the code to better accommodate SI.  Allocate enough
>> >>> space for at least 2 full page tables and always set
>> >>> last_pfn to max_pfn on SI so each VM is backed by a full
>> >>> page table.  This limits us to only 2 VMs active at any
>> >>> given time on SI.  This will be rectified and the code can
>> >>> be reunified once we move to two level page tables.
>> >>>
>> >>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>> >>
>> >> This change breaks the radeonsi driver for me. egltri_screen (the
>> >> 'golden' test for radeonsi at least basically working) locks up the
>> >> GPU.
>> >>
>> >> I don't have any details about the lockup yet, as the GPU reset attempt
>> >> hangs the machine. Any ideas offhand what radeonsi might be doing wrong?
>> >
>> > Maybe trying to access an unmapped page that happened to work by
>> > accident before and now causes a fault in the VM which halts the MC?
>
> Indeed, looks like it:
>
> radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000FF01B
> radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0202400C
>
> Oddly, while I have seen similar errors before (so at
> least some access to unmapped pages was caught even before your patch),
> I hadn't noticed them for a while with egltri_screen...
>
>
> Anyway, some more experimentation shows that it doesn't happen if I skip
> the clear, and it still happens when doing only a clear. I'll look into
> what might be wrong with the clears next week.
>
>
>> Yeah, that's the only thing I can think of. Can you get a dump of the
>> various MC fault registers after the lockup?
>
> Did you have any particular registers in mind?
>

I am guessing it's related to the default page behavior. Before this
patch you would likely have ended up writing to or reading from the
dummy page, and thus not getting the segfault you deserved. With this
patch you get the segfault you deserve ;)
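
To make the difference concrete, here is a rough toy model in C. It is
not the radeon code: every name in it (toy_vm, TOY_DUMMY_PAGE, the two
policies) is made up for illustration, and the real hardware behavior
is only assumed to follow the same idea. An access to an unmapped page
is either silently redirected to a dummy page or reported as a fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is a toy, not the radeon driver. */
#define TOY_NUM_PAGES  8u          /* tiny virtual address space, in pages */
#define TOY_INVALID    UINT32_MAX  /* marks an unmapped page table entry */
#define TOY_DUMMY_PAGE 0u          /* physical page that absorbs stray accesses */

enum toy_policy {
    TOY_REDIRECT_TO_DUMMY,  /* unmapped access silently hits the dummy page */
    TOY_FAULT               /* unmapped access is recorded as a fault */
};

struct toy_vm {
    uint32_t pte[TOY_NUM_PAGES];  /* virtual page -> physical page */
    enum toy_policy policy;
    bool faulted;                 /* set once an access has faulted */
    uint32_t fault_pfn;           /* virtual page of the last fault */
};

static void toy_vm_init(struct toy_vm *vm, enum toy_policy policy)
{
    for (size_t i = 0; i < TOY_NUM_PAGES; i++)
        vm->pte[i] = TOY_INVALID;
    vm->policy = policy;
    vm->faulted = false;
    vm->fault_pfn = 0;
}

static void toy_vm_map(struct toy_vm *vm, uint32_t pfn, uint32_t phys)
{
    assert(pfn < TOY_NUM_PAGES);
    vm->pte[pfn] = phys;
}

/* Returns the physical page for pfn, or TOY_INVALID if the access faults. */
static uint32_t toy_vm_translate(struct toy_vm *vm, uint32_t pfn)
{
    if (pfn < TOY_NUM_PAGES && vm->pte[pfn] != TOY_INVALID)
        return vm->pte[pfn];

    if (vm->policy == TOY_REDIRECT_TO_DUMMY)
        return TOY_DUMMY_PAGE;    /* the bug goes unnoticed */

    vm->faulted = true;           /* the bug becomes visible */
    vm->fault_pfn = pfn;
    return TOY_INVALID;
}
```

With TOY_REDIRECT_TO_DUMMY a stray access to page 5 quietly returns the
dummy page; with TOY_FAULT the same access sets faulted and records
page 5 in fault_pfn, which is roughly the role the
VM_CONTEXT1_PROTECTION_FAULT_* registers play in the log above.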

Cheers,
Jerome