[PATCH] drm/[amdgpu|radeon]: fix memset on io mem

Fri Dec 18 14:17:35 UTC 2020

On 2020-12-17 14:02, Christian König wrote:
> Am 17.12.20 um 14:45 schrieb Robin Murphy:
>> On 2020-12-17 10:25, Christian König wrote:
>>> Am 17.12.20 um 02:07 schrieb Chen Li:
>>>> On Wed, 16 Dec 2020 22:19:11 +0800,
>>>> Christian König wrote:
>>>>> Am 16.12.20 um 14:48 schrieb Chen Li:
>>>>>> On Wed, 16 Dec 2020 15:59:37 +0800,
>>>>>> Christian König wrote:
>>>>>>> [SNIP]
>>>>>> Hi, Christian. I'm not sure why this change is a hack here. I
>>>>>> cannot see the problem and would be grateful if you could give more
>>>>>> explanation.
>>>>> __memset is supposed to work on those addresses, otherwise you 
>>>>> can't use the
>>>>> e8860 on your arm64 system.
>>>> If __memset is supposed to work on those addresses, why is this
>>>> commit (https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
>>>> needed? (I also notice drm/radeon didn't take this change, though.)
>>>> Just out of curiosity.
>>>
>>> We generally accept those patches as cleanup in the kernel with the 
>>> hope that we can find a way to work around the userspace restrictions.
>>>
>>> But when you also have this issue in userspace then there isn't much 
>>> we can do for you.
>>>
>>>>> Replacing the direct write in the kernel with calls to writel() or
>>>>> memset_io() will fix that temporarily, but you have a more general
>>>>> problem here.
>>>> I cannot see what the more general problem is here :( Do you mean
>>>> performance?
>>>
>>> No, not performance. Standards like OpenGL and Vulkan, as well as
>>> VA-API and VDPAU, require that you can mmap() device memory and
>>> execute memset/memcpy on that memory from userspace.
>>>
>>> If your ARM-based board can't do that for some reason, then you can't
>>> use the hardware with that board.
>>
>> If the VRAM lives in a prefetchable PCI BAR, then on most sane
>> Arm-based systems I believe it should be possible to mmap() it to
>> userspace with the Normal memory type, where unaligned accesses and
>> such are allowed, as opposed to the Device memory type intended for
>> MMIO mappings, which has more restrictions but stricter ordering
>> guarantees.
> 
> Do you have some background on why some ARM boards fail with that?
> 
> We had a couple of reports that memset/memcpy fail in userspace (usually
> the system just spontaneously reboots or becomes unresponsive), but so far
> nobody has been able to tell us why that happens.
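As a rough illustration of the writel()/memset_io() workaround mentioned in the quoted thread, here is a userspace model of what an arm64 `__memset_io()`-style fill does differently from an optimized memset(): every store goes through a volatile pointer, is naturally aligned, and is at most 64 bits wide. The name `sketch_memset_io` is hypothetical and this is only a sketch; the kernel routine operates on an `__iomem` pointer using `__raw_writeb()`/`__raw_writeq()` rather than plain volatile stores to a buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of an arm64 __memset_io()-style fill.
 * Volatile stores stop the compiler from merging or widening them, or
 * from replacing the loops with a call to an optimized memset(). */
void sketch_memset_io(volatile void *dst, int c, size_t count)
{
    volatile uint8_t *p = dst;
    uint64_t qc = (uint8_t)c;

    /* Replicate the fill byte into all eight bytes of a 64-bit word. */
    qc |= qc << 8;
    qc |= qc << 16;
    qc |= qc << 32;

    /* Byte stores until the destination is 8-byte aligned. */
    while (count && ((uintptr_t)p & 7)) {
        *p++ = (uint8_t)c;
        count--;
    }

    /* Naturally aligned 64-bit stores for the bulk of the range. */
    while (count >= 8) {
        *(volatile uint64_t *)p = qc;
        p += 8;
        count -= 8;
    }

    /* Byte stores for whatever tail is left. */
    while (count) {
        *p++ = (uint8_t)c;
        count--;
    }
}
```

This only controls how the kernel itself writes to the BAR; as Christian points out, userspace that mmap()s the same memory still runs ordinary memset()/memcpy().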

Part of it is that Arm doesn't really have an ideal memory type for 
mapping RAM behind PCI (much like we also struggle with the vague 
expectations of what write-combine might mean beyond x86). Device memory 
can be relaxed to allow gathering, reordering and write-buffering, but 
is still a bit too restrictive in other ways - aligned, non-speculative, 
etc. - for something that's really just RAM and expected to be usable as 
such. Thus to map PCI memory as "write-combine" we use Normal 
non-cacheable, which means the CPU MMU is going to allow software to do 
all the things it might expect of RAM, but we're now at the mercy of the 
menagerie of interconnects and PCI implementations out there.

Atomic operations, for example, *might* be resolved by the CPU coherency 
mechanism or in the interconnect, such that the PCI host bridge only 
sees regular loads and stores, but more often than not they'll just 
result in an atomic transaction going all the way to the host bridge. A 
super-duper-clever host bridge implementation might even support that, 
but the vast majority are likely to just reject it as invalid.

Similarly, unaligned accesses, cache line fills/evictions, and such will 
often work, since they're essentially just larger read/write bursts, but 
some host bridges can be picky and might reject access sizes they don't 
like (there's at least one where even 64-bit accesses don't work. On a 
64-bit system...)
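Where a host bridge rejects some access sizes, one kernel-side mitigation is to cap the width of every access. The following is a minimal sketch under that assumption: `sketch_fill_capped()` and its `max_width` parameter are hypothetical, not an existing kernel API, and the "bus" is modelled with ordinary volatile stores to a buffer so that it runs in userspace. It only ever issues naturally aligned accesses no wider than `max_width` bytes, and returns how many stores it issued.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill that never issues an access wider than max_width
 * bytes (must be 1, 2, 4 or 8) and only issues naturally aligned
 * accesses, e.g. for a host bridge that rejects 64-bit transactions.
 * Returns the number of stores issued. */
size_t sketch_fill_capped(volatile void *dst, uint8_t c, size_t count,
                          size_t max_width)
{
    volatile uint8_t *p = dst;
    size_t stores = 0;
    uint64_t pattern = c * 0x0101010101010101ULL; /* c in every byte */

    while (count) {
        /* Pick the widest naturally aligned access that still fits. */
        size_t w = max_width;
        while (w > 1 && (((uintptr_t)p & (w - 1)) || count < w))
            w >>= 1;

        switch (w) {
        case 8: *(volatile uint64_t *)p = pattern; break;
        case 4: *(volatile uint32_t *)p = (uint32_t)pattern; break;
        case 2: *(volatile uint16_t *)p = (uint16_t)pattern; break;
        default: *p = c; break;
        }
        p += w;
        count -= w;
        stores++;
    }
    return stores;
}
```

For example, filling 14 bytes starting at offset 1 of an 8-byte-aligned buffer with `max_width = 4` issues six stores of 1, 2, 4, 4, 2 and 1 bytes. Like memset_io(), this does nothing for userspace mappings, where the C library's memset()/memcpy() chooses the access sizes.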

If an invalid transaction does reach the host bridge, it's going to come 
back to the CPU as an external abort. If we're really lucky that could 
be taken synchronously, attributable to a specific instruction, and just 
oops/SIGBUS the relevant kernel/userspace thread. Often, though
(particularly with big out-of-order CPUs), it's likely to be asynchronous
and no longer attributable, and thus taken as an SError event, which in
general roughly translates to "part of the SoC has fallen off". The only 
reasonable response we have to that is to panic the system.

Robin.