Raciness with page table shadows being swapped out

Nicolai Hähnle nhaehnle at gmail.com
Mon Dec 12 23:35:59 UTC 2016


On 12.12.2016 19:22, Christian König wrote:
> Am 12.12.2016 um 16:20 schrieb Nicolai Hähnle:
>> Hi all,
>>
>> I just sent out two patches that hopefully make the kernel module more
>> robust in the face of page table shadows being swapped out.
>>
>> However, even with those patches, I can still fairly reliably
>> reproduce crashes with a backtrace of the shape
>>
>> amdgpu_cs_ioctl
>>  -> amdgpu_vm_update_page_directory
>>  -> amdgpu_ttm_bind
>>  -> amdgpu_gtt_mgr_alloc
>>
>> The plausible reason for these crashes is that nothing seems to
>> prevent the shadow BOs from being moved between the calls to
>> amdgpu_cs_validate in amdgpu_cs_parser_bos and the calls to
>> amdgpu_ttm_bind.
>
> The shadow BOs use the same reservation object than the real BOs. So as
> long as the real BOs can't be evicted the shadows can't be evicted either.

Ah okay. I don't see where the page table BOs are added to the buffer 
list. Do they also share the reservation object with the page directory?


>> The attached patch has fixed these crashes for me so far, but it's
>> very heavy-handed: it collects all page table shadows and the page
>> directory shadow and adds them all to the reservations for the callers
>> of amdgpu_vm_update_page_directory.
>
> That is most likely just a timing change, cause the shadows should end
> up in the duplicates list anyway. So the patch shouldn't have any effect.

Okay, so the reason for the remaining crash is still unclear at least 
for me.

Cheers,
Nicolai

>
>>
>> I feel like there should be a better way. In part, I wonder why the
>> shadows are needed in the first place. I vaguely recall the
>> discussions about GPU reset and such, but I don't remember why the
>> page tables can't just be rebuilt in some other way.
>
> It's just the simplest and fastest way to keep a copy of the page tables
> around.
>
> The problem with rebuilding the page tables from the mappings is that
> the housekeeping structures already have the future state when a reset
> happens, not the state we need to rebuild the tables.
>
> We could obviously change the housekeeping a bit to keep both states,
> but that would complicate mapping and unmapping of BOs significantly.
>
> Regards,
> Christian.
>
>>
>> Cheers,
>> Nicolai
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>


More information about the amd-gfx mailing list