SDMA out-of-bounds write access of tiled surface (was: Re: [amd-gfx] AMD Carrizo - GPU fault detected: 146 0x0842b714)
Marek Olšák
maraeo at gmail.com
Wed Jun 22 12:21:03 UTC 2016
I don't think so.
The VM faults can only occur when accessing the linear texture, and
the Mesa code should use the correct workarounds already.
The tiled texture is just a collection of 1D tiles (8x8 pixels) and
SDMA operates on those 1D tiles. It doesn't access memory outside the
boundaries of the 1D tiles it's supposed to access. 2D tiling is just a
different ordering of 1D tiles with greater alignment requirements.
The 2D tile parameters such as bank_height and macro_tile_aspect only
affect that ordering. 1D tiles are always the same regardless of the
higher tile mode. Given that, I don't see how SDMA can behave
differently here.
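
To illustrate the argument, here is a rough sketch in Python. It is a
deliberately simplified model, not the real addrlib swizzle: the row-major
layout inside a tile and the idea of modelling 2D tiling as a plain
permutation of tile slots (tile_order) are assumptions made for
illustration only. It shows that reordering whole 8x8 tiles changes where
each tile lands, but not the set of bytes the surface covers.

```python
TILE_W = TILE_H = 8  # 1D tile size in pixels, as stated above


def tiled_offset(x, y, pitch_tiles, bpp, tile_order=None):
    """Byte offset of pixel (x, y) in a simplified tiled surface.

    tile_order is a hypothetical permutation of tile slots that stands in
    for 2D tiling; None means tiles are stored in plain row-major order.
    """
    tile_index = (y // TILE_H) * pitch_tiles + (x // TILE_W)
    if tile_order is not None:
        tile_index = tile_order[tile_index]
    tile_bytes = TILE_W * TILE_H * bpp
    # Assumed row-major layout of pixels inside one tile.
    in_tile = ((y % TILE_H) * TILE_W + (x % TILE_W)) * bpp
    return tile_index * tile_bytes + in_tile


def touched_offsets(width, height, bpp, tile_order=None):
    """Set of byte offsets written when every pixel is accessed once."""
    pitch_tiles = width // TILE_W
    return {
        tiled_offset(x, y, pitch_tiles, bpp, tile_order)
        for y in range(height)
        for x in range(width)
    }


# 16x16 surface, 4 bytes per pixel: 2x2 tiles, 1024 bytes in total.
plain = touched_offsets(16, 16, 4)
reordered = touched_offsets(16, 16, 4, tile_order=[3, 1, 0, 2])
print(plain == reordered, max(reordered) + 4 <= 16 * 16 * 4)
```

In this model every access stays inside the surface for any permutation,
which is why a fault would point at mismatched parameters or a hardware
bug rather than at the tiling scheme itself.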
There are 2 possible explanations for VM faults from tiled access:
- The tile parameters passed to SDMA don't agree with the parameters
determined by addrlib (or there is a bug in passing them between
processes).
- Unknown or undiscovered SDMA bug.
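
The first explanation could be checked mechanically by diffing the two
parameter sets. A minimal sketch, assuming both sides can be dumped as
plain dictionaries: only bank_height and macro_tile_aspect are named in
this thread, while num_banks, the function name, and all values below are
made up for illustration.

```python
def diff_tile_params(sdma_params, addrlib_params):
    """Return {name: (sdma_value, addrlib_value)} for every mismatch.

    A parameter that is missing on one side is reported with None.
    """
    names = sorted(set(sdma_params) | set(addrlib_params))
    return {
        name: (sdma_params.get(name), addrlib_params.get(name))
        for name in names
        if sdma_params.get(name) != addrlib_params.get(name)
    }


# Hypothetical values, not taken from the IB dump.
from_sdma_packet = {"bank_height": 2, "macro_tile_aspect": 4, "num_banks": 8}
from_addrlib = {"bank_height": 4, "macro_tile_aspect": 4, "num_banks": 8}

for name, (sdma, addrlib) in diff_tile_params(from_sdma_packet, from_addrlib).items():
    print(f"{name}: SDMA packet has {sdma}, addrlib computed {addrlib}")
```

An empty result would rule out the first explanation and leave the
second one.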
Note that no docs describe the VM fault bug from linear access.
If you both have Carrizo, you should get the same 2D tile parameters.
If you don't, it's weird.
Marek
On Wed, Jun 22, 2016 at 9:50 AM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> Hi Mads,
>
> setting R600_DEBUG=nodma in the X server should work around your problem for
> now.
>
> Marek, perhaps an out-of-bounds check for tiled texture memory access
> similar to the linear access check is necessary? I wonder if you've seen
> something about that in the docs.
>
> I've annotated the sDMA IB dump. It's a linear-to-display-tiled copy on
> Carrizo. I tried to reproduce with the attached patch, but failed to do so
> even with amdgpu.vm_debug=1. With the patch, I get DMA copies that are
> identical to the one that causes the VM fault except for a different
> bank_height and macro_tile_aspect, so the issue is likely related to those.
>
> Nicolai
>
> On 21.06.2016 19:32, Nicolai Hähnle wrote:
>>
>> On 21.06.2016 19:16, Mads wrote:
>>>
>>> I sent this 1.5 hours ago, but since it hasn't arrived on the
>>> mailing list yet, I'll try again...
>>
>>
>> It arrived, no worries :)
>>
>> I'll take a look later.
>>
>> Nicolai
>>
>>>
>>> On 2016-06-21 17:48, Mads wrote:
>>>
>>>> On 2016-06-21 10:12, Mads wrote:
>>>>
>>>> On 2016-06-21 09:39, Nicolai Hähnle wrote:
>>>>
>>>> Thanks. However, I still don't think this is going to help. Your
>>>> earlier trace experiments showed that the problematic SDMA commands
>>>> came from the X server, _not_ from plasmashell.
>>>>
>>>> So what we see here is likely just the first set of GPU commands sent
>>>> by plasmashell after the VM fault occurred. Since the plasmashell
>>>> process is unable to tell who caused the VM fault, it takes the blame
>>>> incorrectly. Are you sure the X server is using your self-compiled
>>>> radeonsi_dri.so and has the environment variable set? If it creates a
>>>> ddebug_dump, it might be somewhere else (it's based off the HOME
>>>> environment variable, which may be different).
>>>> I'll take a second look to see if there's an X dump there too, but
>>>> unfortunately it'll be about 8 hours before I have the machine at
>>>> hand again.
>>>>
>>>> And yes, I'm sure, everything is built through portage, so there is no
>>>> "self-compiled" on the system per se. There's always just one lib
>>>> available at any time :)
>>>
>>>
>>> You were right! X didn't have R600_DEBUG=check_vm in its environment
>>> (no login shell/sourcing of /etc/profile).
>>>
>>> Here's what I ran:
>>>
>>>> $ XAUTHORITY=.Xauthority DISPLAY=:0 LIBGL_DEBUG=verbose dolphin
>>>> libGL: pci id for fd 9: 1002:9874, driver radeonsi
>>>> libGL: OpenDriver: trying /usr/lib64/dri/tls/radeonsi_dri.so
>>>> libGL: OpenDriver: trying /usr/lib64/dri/radeonsi_dri.so
>>>> si_vm_fault_occured: failed to parse line ' Either
>>>> enable ECC checking or force module loading by setting
>>>> 'ecc_enable_override'.
>>>> '
>>>> libGL: Using DRI3 for screen 0
>>>> Trying to convert empty KLocalizedString to QString.
>>>> Cannot creat accessible child interface for object:
>>>> PlacesView(0x118d670) index: 5
>>>> QPixmap::scaled: Pixmap is a null pixmap
>>>> QPixmap::scaled: Pixmap is a null pixmap
>>>> (... etc ...)
>>>> The X11 connection broke (error 1). Did the X11 server die?
>>>
>>>
>>> Attaching dmesg and ddebug_dump.
>>>
>>> - Mads
>
>