Raciness with page table shadows being swapped out

Christian König deathsimple at vodafone.de
Tue Dec 13 09:48:49 UTC 2016


Am 13.12.2016 um 00:35 schrieb Nicolai Hähnle:
> On 12.12.2016 19:22, Christian König wrote:
>> Am 12.12.2016 um 16:20 schrieb Nicolai Hähnle:
>>> Hi all,
>>>
>>> I just sent out two patches that hopefully make the kernel module more
>>> robust in the face of page table shadows being swapped out.
>>>
>>> However, even with those patches, I can still fairly reliably
>>> reproduce crashes with a backtrace of the shape
>>>
>>> amdgpu_cs_ioctl
>>>  -> amdgpu_vm_update_page_directory
>>>  -> amdgpu_ttm_bind
>>>  -> amdgpu_gtt_mgr_alloc
>>>
>>> The plausible reason for these crashes is that nothing seems to
>>> prevent the shadow BOs from being moved between the calls to
>>> amdgpu_cs_validate in amdgpu_cs_parser_bos and the calls to
>>> amdgpu_ttm_bind.
>>
>> The shadow BOs use the same reservation object than the real BOs. So as
>> long as the real BOs can't be evicted the shadows can't be evicted 
>> either.
>
> Ah okay. I don't see where the page table BOs are added to the buffer 
> list. Do they also share the reservation object with the page directory?

Yes, exactly. All use the same reservation object, so when you hold the 
lock on the PD nothing in the VM will be evicted any more.

>
>
>>> The attached patch has fixed these crashes for me so far, but it's
>>> very heavy-handed: it collects all page table shadows and the page
>>> directory shadow and adds them all to the reservations for the callers
>>> of amdgpu_vm_update_page_directory.
>>
>> That is most likely just a timing change, cause the shadows should end
>> up in the duplicates list anyway. So the patch shouldn't have any 
>> effect.
>
> Okay, so the reason for the remaining crash is still unclear at least 
> for me.

Yeah, that's a really good question. Can you share the call stack of the 
problem once more?

Thanks,
Christian.

>
> Cheers,
> Nicolai
>
>>
>>>
>>> I feel like there should be a better way. In part, I wonder why the
>>> shadows are needed in the first place. I vaguely recall the
>>> discussions about GPU reset and such, but I don't remember why the
>>> page tables can't just be rebuilt in some other way.
>>
>> It's just the simplest and fastest way to keep a copy of the page tables
>> around.
>>
>> The problem with rebuilding the page tables from the mappings is that
>> the housekeeping structures already have the future state when a reset
>> happens, not the state we need to rebuild the tables.
>>
>> We could obviously change the housekeeping a bit to keep both states,
>> but that would complicate mapping and unmapping of BOs significantly.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>



More information about the amd-gfx mailing list