GPU lockup dumping

Jerome Glisse j.glisse at gmail.com
Wed May 23 10:02:35 PDT 2012


On Wed, May 23, 2012 at 12:41 PM, Dave Airlie <airlied at gmail.com> wrote:
> On Wed, May 23, 2012 at 5:26 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>> On Wed, May 23, 2012 at 12:08 PM, Dave Airlie <airlied at gmail.com> wrote:
>>> On Wed, May 23, 2012 at 3:48 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
>>>> On Wed, May 23, 2012 at 8:34 AM, Christian König
>>>> <deathsimple at vodafone.de> wrote:
>>>>> On 23.05.2012 11:27, Dave Airlie wrote:
>>>>>>
>>>>>> On Thu, May 17, 2012 at 7:28 PM, <j.glisse at gmail.com> wrote:
>>>>>>>
>>>>>>> So here is the improved patchset, where I split the ground work
>>>>>>> necessary for the dumping into separate patches. The debugfs
>>>>>>> improvement could probably be useful to intel instead of i915
>>>>>>> having its own debugfs file code.
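
[A minimal sketch of what that generic debugfs piece amounts to, written
against the stock kernel debugfs API rather than the helpers actually
added by the patchset; struct my_dump and the init function below are
made up for illustration, only debugfs_create_file, simple_open,
simple_read_from_buffer and default_llseek are real kernel calls.]

#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/module.h>

struct my_dump {                        /* hypothetical container */
        void *data;
        size_t size;
};

static ssize_t lockup_dump_read(struct file *filp, char __user *buf,
                                size_t count, loff_t *ppos)
{
        struct my_dump *dump = filp->private_data;

        /* hand the raw dump blob to userspace */
        return simple_read_from_buffer(buf, count, ppos,
                                       dump->data, dump->size);
}

static const struct file_operations lockup_dump_fops = {
        .owner  = THIS_MODULE,
        .open   = simple_open,          /* private_data = inode->i_private */
        .read   = lockup_dump_read,
        .llseek = default_llseek,
};

/* called once from the driver's debugfs setup */
static void lockup_dump_debugfs_init(struct dentry *root,
                                     struct my_dump *dump)
{
        debugfs_create_file("lockup_dump", 0444, root, dump,
                            &lockup_dump_fops);
}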
>>>>>>>
>>>>>>> The lockup dumping public API has been moved into radeon_drm.h
>>>>>>>
>>>>>>> Stressing the fact again that dumps are self-contained, i.e. they
>>>>>>> have all the data needed to be replayed (vertices, indices,
>>>>>>> shaders, textures, ...).
>>>>>>>
>>>>>>> I would really like to get this into 3.5; the new API is pretty
>>>>>>> much straightforward and userspace tools can easily be made to
>>>>>>> convert it to other formats. The change to the driver is
>>>>>>> self-contained.
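
[To make the "self-contained" point concrete: conceptually a dump is a
small header, the IB itself, and a copy of every buffer object the IB
references, so a replay tool needs nothing beyond the file. The layout
below is only an illustrative sketch of that idea; the field names are
not the actual definitions from radeon_drm.h in the patchset.]

/* Illustrative layout only; the real structures live in radeon_drm.h
 * in the patchset and may differ. */
struct dump_header {
        uint32_t version;       /* format revision */
        uint32_t ring;          /* ring the IB was scheduled on */
        uint32_t ib_dw;         /* size of the IB in dwords */
        uint32_t nbos;          /* number of buffer objects that follow */
};

struct dump_bo {
        uint64_t gpu_addr;      /* GPU address the IB expects the BO at */
        uint32_t domain;        /* VRAM/GTT placement at lockup time */
        uint32_t size;          /* length of the raw data that follows */
        /* followed by 'size' bytes of BO contents: vertices, indices,
         * shaders, textures, ... */
};

/* file = dump_header | IB dwords | dump_bo + data | dump_bo + data | ... */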
>>>>>>
>>>>>> I really don't like introducing this into 3.5 at this stage.
>>>>>>
>>>>>> I'd really like a good review of the API and what information we provide
>>>>>> along with how extensible it is.
>>>>>>
>>>>>> I'm still not convinced replay is what we want in the field; I know
>>>>>> it's what *you* want, but I think the apitrace stuff in userspace
>>>>>> pretty much covers the replaying situation. So I'd have to look at
>>>>>> this and see how easy it makes dissecting command streams etc.
>>>>>>
>>>>>> Dave.
>>>>>
>>>>>
>>>>> I agree that it might not be a good idea to push that into 3.5, since
>>>>> at least I (and I think Alex as well) didn't have time to look into it
>>>>> yet. On the other hand, the patches look quite reasonable.
>>>>>
>>>>> But I still wanted to throw in a requirement from my day-to-day work;
>>>>> maybe that helps in finding a more general solution:
>>>>> When we start to work with more parts of the chip it might be necessary
>>>>> to dump everything that is currently "in flight". For example, I had a
>>>>> whole bunch of problems where copying data around with a 3D blit and
>>>>> then missing a sync between this job and a job on another ring caused a
>>>>> "hiccup" in the hardware.
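
[The shape of that bug, for reference. Everything in the sketch below is
hypothetical (struct gpu_ctx, struct fence_handle, emit_blit,
emit_fence_wait, emit_consumer_job, RING_GFX, RING_DMA are made-up names,
not the radeon API); it only illustrates the missing inter-ring sync
Christian describes.]

/* Pseudo-code: the blit and its consumer run on different rings, and
 * things only go wrong when the wait in between is missing. */
static void copy_then_consume(struct gpu_ctx *gpu, uint64_t src,
                              uint64_t dst, uint32_t size)
{
        struct fence_handle *f;

        /* copy data around with a 3D blit on the gfx ring */
        f = emit_blit(gpu, RING_GFX, src, dst, size);

        /* the consumer runs on another ring; without this wait the
         * second job can start before the blit has landed and reads
         * stale data -- the "hiccup" */
        emit_fence_wait(gpu, RING_DMA, f);

        emit_consumer_job(gpu, RING_DMA, dst);
}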
>>>>>
>>>>> I know that this isn't your focus and that is absolutely OK with me,
>>>>> because the format you are introducing is just used in debugfs and so
>>>>> is not part of any stable API (at least not in my understanding), but
>>>>> you should still keep in mind that we might need to extend it in that
>>>>> direction in the future.
>>>>>
>>>>> Christian.
>>>>
>>>> Note that my format is also designed with that in mind; it can capture
>>>> IBs from all rings. The only thing I don't think is worth capturing is
>>>> the rings themselves, because there would be no way to replay them
>>>> without adding some new special API.
>>>
>>> I'd like to dump the rings as well; as I said, I'd rather we didn't
>>> limit this to replay, but make it useful for getting as much info out
>>> as possible.
>>>
>>> Dave.
>>
>> The rings will contain very little, mostly IB scheduling and fences; I
>> don't see how useful that can be.
>>
>
> In case we have a bug in our IB scheduling or fencing :-0
>
> Dave.

Well, I think we have several kinds of lockup. The most basic one is
userspace sending a broken shader, vertex buffer, or something along
those lines. The more complex one is timing related, like a bo move or
some cache invalidation that didn't happen properly, so the GPU ends up
reading either the wrong data or stale cached data. I don't see how to
capture useful information for this second case, besides doing a
snapshot of memory.

For multi-ring, I agree that dumping the rings might prove useful to
spot an inter-ring semaphore deadlock, or possibly a missing inter-ring
synchronization (but that would be a bad kernel bug).
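
A ring snapshot itself would be cheap in any case; roughly something
like the sketch below. The struct radeon_ring field names (ring[],
rptr, wptr, ptr_mask) are an assumption here, so treat it as
pseudo-code rather than the actual patch.

/* Sketch only: copy the not-yet-consumed part of a ring (between rptr
 * and wptr) into a buffer. The captured dwords would mostly be IB
 * dispatches, fence writes and semaphore signal/wait packets, which is
 * what you need to see which ring is stuck waiting on which. */
static unsigned snapshot_ring(struct radeon_ring *ring,
                              uint32_t *out, unsigned max_dw)
{
        unsigned i, n = 0;

        for (i = ring->rptr;
             i != ring->wptr && n < max_dw;
             i = (i + 1) & ring->ptr_mask)
                out[n++] = ring->ring[i];

        return n;
}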

Cheers,
Jerome

