Reworking of GPU reset logic

Wed Apr 25 06:46:44 PDT 2012

2012/4/25 Dave Airlie <airlied at gmail.com>:
> 2012/4/25 Christian König <deathsimple at vodafone.de>:
>> On 21.04.2012 16:14, Jerome Glisse wrote:
>>>
>>> 2012/4/21 Christian König<deathsimple at vodafone.de>:
>>>>
>>>> On 20.04.2012 01:47, Jerome Glisse wrote:
>>>>>
>>>>> 2012/4/19 Christian König<deathsimple at vodafone.de>:
>>>>>>
>>>>>> This includes mostly fixes for multi ring lockups and GPU resets, but
>>>>>> it should generally improve the behavior of the kernel mode driver in
>>>>>> case something goes badly wrong.
>>>>>>
>>>>>> On the other hand it completely rewrites the IB pool and semaphore
>>>>>> handling, so I think there are still a couple of problems in it.
>>>>>>
>>>>>> The first four patches were already sent to the list, but the current
>>>>>> set depends on them, so I am resending them.
>>>>>>
>>>>>> Cheers,
>>>>>> Christian.
>>>>>
>>>>> I did a quick review and it looks mostly good, but as it's sensitive
>>>>> code I would like to spend some time on it, probably next week. Note
>>>>> that I had some work in this area too: I mostly want to drop all the
>>>>> debugfs files related to this and add a new, more useful one (basically
>>>>> something that allows you to read all the data needed to replay an IB
>>>>> that locks up). I was also looking into Dave's reset thread, and your
>>>>> solution of moving the reset into the ioctl return path sounds good
>>>>> too, but I need to convince myself that it covers all possible cases.
>>>>>
>>>>> Cheers,
>>>>> Jerome
>>>>>
>>>> After sleeping on it for a night I already reworked the patch for
>>>> improving the SA performance, so please wait at least for v2 before
>>>> taking a look at it :)
>>>>
>>>> Regarding the debugging of lockups I had the following on my "in mind
>>>> todo" list:
>>>> 1. Rework the chip specific lockup detection code a bit more and
>>>> probably clean it up a bit.
>>>> 2. Make the timeout a module parameter, because compute tasks sometimes
>>>> block a ring for more than 10 seconds.
>>>> 3. Keep track of the actual RPTR offset a fence is emitted to.
>>>> 4. Keep track of all the BOs an IB is touching.
>>>> 5. Now if a lockup happens, start with the last successfully signaled
>>>> fence and dump the ring content after that RPTR offset until the first
>>>> unsignaled fence.
>>>> 6. Then if this fence references an IB, dump its content and the BOs it
>>>> is touching.
>>>> 7. Dump everything on the ring after that fence until you reach the RPTR
>>>> of the next fence or the WPTR of the ring.
>>>> 8. If there is a next fence, repeat the whole thing from number 6.
>>>>
>>>> If I'm not completely wrong that should give you practically all the
>>>> information available, and we probably should put that behind another
>>>> module option, because we are going to spam syslog quite a lot here.
>>>> Feel free to add to or modify the ideas on this list.
>>>>
>>>> Christian.
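
The dump walk in the list above (from the last signaled fence to the WPTR) can be modelled in a few lines of Python. This is only a userspace sketch under stated assumptions, not code from the patch set: the `Fence` and `IB` records, `ring_slice` and `dump_lockup` are hypothetical names, the ring is a plain list of dwords, and fences are assumed to signal in emission order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IB:
    words: List[int]                               # IB content (dwords)
    bos: List[str] = field(default_factory=list)   # names of the BOs it touches


@dataclass
class Fence:
    seq: int
    ring_offset: int          # ring offset the fence was emitted at
    signaled: bool
    ib: Optional[IB] = None   # IB scheduled in front of this fence, if any


def ring_slice(ring, start, end):
    """Return the dwords from start up to (not including) end, with wraparound."""
    size = len(ring)
    out = []
    i, end = start % size, end % size
    while i != end:
        out.append(ring[i])
        i = (i + 1) % size
    return out


def dump_lockup(ring, fences, wptr):
    """Collect ring, IB and BO records for every unsignaled fence."""
    signaled = [f for f in fences if f.signaled]
    pending = [f for f in fences if not f.signaled]
    if not pending:
        return []
    # Start after the last signaled fence; with none, start at the first pending one.
    cursor = signaled[-1].ring_offset if signaled else pending[0].ring_offset
    report = []
    for fence in pending:
        # Ring content leading up to this fence.
        report.append(("ring", fence.seq, ring_slice(ring, cursor, fence.ring_offset)))
        if fence.ib is not None:
            # The fence references an IB: dump its content and its BO list.
            report.append(("ib", fence.seq, list(fence.ib.words)))
            report.append(("bos", fence.seq, list(fence.ib.bos)))
        cursor = fence.ring_offset
    # No further fence: dump the rest of the ring up to the WPTR.
    report.append(("ring", None, ring_slice(ring, cursor, wptr)))
    return report
```

A real implementation would copy BO contents rather than names and would have to cope with a fence whose ring offset was never recorded.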
>>>
>>> What I have is similar. I am assuming only IBs trigger lockups. Before
>>> each IB, emit the IB offset in the SA and the IB size to a scratch reg.
>>> For each IB keep a BO list. On lockup, allocate a big chunk of memory to
>>> copy the whole IB and all the BOs referenced by the IB (I am using my bof
>>> format as I already have userspace tools).
>>>
>>> Remove all the debugfs files. Just add a new one that gives you the
>>> first faulty IB. On read of this file the kernel frees the memory. The
>>> kernel should also free the memory after a while, or better, the lockup
>>> copy would only be enabled if some kernel radeon option is set.
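
The capture-once, free-on-read behaviour described above can be illustrated with a small Python model. It is a sketch only: `LockupCapture` and its methods are hypothetical names, the `enabled` flag stands in for the proposed kernel option, and the timed free mentioned above is left out.

```python
class LockupCapture:
    """Holds at most one snapshot: the first faulty IB and the BOs it references."""

    def __init__(self, enabled=False):
        self.enabled = enabled     # stands in for the proposed radeon option
        self._snapshot = None

    def on_lockup(self, ib_words, bos):
        """Copy the IB and its BOs; return True if a snapshot was stored."""
        if not self.enabled or self._snapshot is not None:
            return False           # disabled, or an earlier lockup is still stored
        self._snapshot = {
            "ib": list(ib_words),
            "bos": {name: bytes(data) for name, data in bos.items()},
        }
        return True

    def read(self):
        """Hand out the snapshot and drop it, like a read of the debugfs file."""
        snapshot, self._snapshot = self._snapshot, None
        return snapshot
```

Keeping only the first snapshot bounds memory use, and reading it back clears the slot so a later lockup can be captured.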
>>
>>
>> Just resent my current patchset to the mailing list. It's not as complete
>> as your solution, but seems to be a step in the right direction, so please
>> take a look at it.
>>
>> Being able to generate something like a "GPU crash dump" on lockup sounds
>> very valuable to me, but I'm not sure if debugfs files are the right
>> direction to go. Maybe something more like a module parameter containing
>> a directory, and if set we dump all the information available (including
>> BO content) in binary form (instead of the current human readable form of
>> the debugfs files).
>
> Do what the intel driver does: create a versioned binary debugfs file with
> all the error state in it for a lockup, store only one of these at a time,
> and run a userspace tool to dump it out into something you can upload, or
> just cat the file and upload it.
>
> You don't want the kernel writing to dirs on disk under any circumstances.
>
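
To make the "versioned binary file plus userspace tool" idea concrete, here is a minimal Python sketch of the decoding side. The header layout (magic, version, payload length) and the names `MAGIC`, `VERSION`, `encode_dump` and `decode_dump` are assumptions made up for illustration; this is neither the intel driver's format nor any existing radeon format.

```python
import struct

# Hypothetical header: 4-byte magic, format version, payload length (little endian).
HEADER = struct.Struct("<4sII")
MAGIC = b"GPUD"
VERSION = 1


def encode_dump(payload: bytes, version: int = VERSION) -> bytes:
    """Build a dump blob the way the kernel side might lay it out."""
    return HEADER.pack(MAGIC, version, len(payload)) + payload


def decode_dump(blob: bytes) -> bytes:
    """Validate the header and return the error-state payload."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a dump file")
    if version != VERSION:
        raise ValueError(f"unsupported dump version {version}")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload
```

The version field lets the userspace tool refuse a layout it does not understand instead of misparsing it.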

We have an internal binary format for dumping command streams and
associated buffers; we should probably use that so that we can better
take advantage of existing internal tools.

Alex

> Dave.