[Mesa-dev] [PATCH 1/2] radeonsi: set a per-buffer flag that disables inter-process sharing (v4)

Fri Sep 8 08:28:23 UTC 2017

On 07/09/17 07:24 PM, Christian König wrote:
> Am 07.09.2017 um 12:14 schrieb Marek Olšák:
>> On Sep 7, 2017 12:08 PM, "Christian König" <deathsimple at vodafone.de> wrote:
>>     Am 07.09.2017 um 11:23 schrieb Michel Dänzer:
>>         On 01/09/17 07:40 PM, Christian König wrote:
>>             Am 01.09.2017 um 12:28 schrieb Michel Dänzer:
>>                 On 01/09/17 07:23 PM, Nicolai Hähnle wrote:
>>                     On 01.09.2017 11:58, Michel Dänzer wrote:
>>                         On 29/08/17 11:47 PM, Christian König wrote:
>>
>>                             From: Marek Olšák <marek.olsak at amd.com>
>>
>>                             For lower overhead in the CS ioctl.
>>                             Winsys allocators are not used with
>>                             interprocess-sharable resources.
>>
>>                             v2: It shouldn't crash anymore, but the kernel
>>                             will reject the new flag.
>>                             v3 (christian): Rename the flag, avoid sending
>>                             those buffers in the BO list.
>>                             v4 (christian): Remove setting the kernel flag
>>                             for now
>>
>>                         This change seems to have caused a GPU hang
>>                         when running piglit on my
>>                         Kaveri with the radeon kernel driver.
>>
>>         I think we can remove "seems to have". I'm still reliably getting
>>         the GPUVM fault and hang with current master, but not if I revert
>>         this commit (and the one after it).
>>
>>                         Haven't been able to isolate it to a specific
>>                         test, seems to only
>>                         happen when running multiple tests concurrently.
>>
>>         I reproduced the problem with piglit process separation enabled as
>>         well, and all four tests running when it hung were textureGather
>>         tests. Before, reproducing the problem twice with piglit process
>>         separation disabled, three textureGather tests were running when it
>>         hung both times as well. I've been unable to reproduce the problem
>>         by manually running the same textureGather tests in parallel though.
>>
>>
>>                         There's a GPUVM fault before the hang, I
>>                         suspect it's related:
>>
>>                         radeon 0000:00:01.0: GPU fault detected: 146 0x0ae6760c
>>                         radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000001D7
>>                         radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0607600C
>>                         VM fault (0x0c, vmid 3) at page 471, read from 'CPF' (0x43504600) (118)
>>
>>
>>                         Any ideas?
>>
>>             Not the slightest, but I'm still investigating problems with
>>             that on amdgpu.
>>
>>             If we can't find the root cause till Monday it might be a good
>>             idea to revert the patches for now.
>>
>>         What's the status on that?
>>
>>
>>
>>     I've found and fixed the remaining kernel bugs over the last
>>     weekend/beginning of this week.
>>
>>     Still need to commit the fix for UVD/VCE, but that one shouldn't
>>     affect GFX at all.
>>
>>
>> Michel is seeing hangs on the radeon KMD, which should be unaffected
>> by your kernel work, I think.
>>
>> We could revert this to unbreak Michel's Kaveri,

FWIW, there's no need to do anything for my Kaveri development system in
particular; it's going out of service soon, and in the meantime I can
revert these changes locally.

My concern is that the underlying issue might cause other problems in
real world scenarios.

>> but I think it shouldn't be so difficult to find the culprit in this
>> patch if there is one.
> 
> The only crux is that the userspace patch shouldn't affect radeon at
> all. So the real question is what the heck is going on here?

Maybe some buffers that were previously allocated directly are now
sub-allocated or re-used from the BO cache, or vice versa, or something
like that?
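
As a cross-check of the fault lines quoted above, the decoded "VM fault"
line can be reproduced from the raw register values. The following is only a
minimal Python sketch: the bit positions are assumptions chosen so that the
result matches the quoted log, not copied from the radeon kernel headers, and
decode_vm_fault is a hypothetical helper name.

```python
def decode_vm_fault(addr, status, mc_client):
    """Rebuild the human-readable fault line from raw register values.

    Field layout is an assumption inferred from the quoted log:
      bits 0-3   protection flags
      bits 12-19 memory client id
      bit  24    0 = read, 1 = write
      bits 25-28 VMID
    """
    protections = status & 0xF
    mc_id = (status >> 12) & 0xFF
    is_write = (status >> 24) & 0x1
    vmid = (status >> 25) & 0xF

    # The client tag is four ASCII bytes, most significant byte first,
    # padded with NUL bytes.
    block = mc_client.to_bytes(4, "big").rstrip(b"\0").decode("ascii")

    return "VM fault (0x%02x, vmid %d) at page %d, %s from '%s' (0x%08x) (%d)" % (
        protections,
        vmid,
        addr,
        "write" if is_write else "read",
        block,
        mc_client,
        mc_id,
    )


if __name__ == "__main__":
    # Values taken from the log quoted earlier in this thread.
    print(decode_vm_fault(0x000001D7, 0x0607600C, 0x43504600))
```

With those inputs the sketch gives page 471 (0x1D7), VMID 3, and a read by
the 'CPF' client, which agrees with the quoted log.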

-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer