[Nouveau] Debugging INVALID_OPCODE / MULTIPLE_WARP_ERRORS ?

Hans de Goede hdegoede at redhat.com
Fri Dec 18 04:57:50 PST 2015


Hi,

On 16-12-15 18:34, Ilia Mirkin wrote:
> BTW, you may be interested in
> https://github.com/imirkin/mesa/commits/atomic3 which has working
> ARB_shader_atomic_counters and ARB_shader_storage_buffer_object
> support (while ripping out things like TGSI_FILE_RESOURCE).

Interesting, good to see progress on this.

> Still
> working on proper memory qualifier support, and obviously need to do
> some cleanup before upstreaming. Should be getting into a pushable
> state probably early January.

I'm looking forward to seeing this upstream, and I'll keep this in
mind during my own work.

Regards,

Hans





>
> Cheers,
>
>    -ilia
>
> On Wed, Dec 16, 2015 at 12:24 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>> I believe that your problem is this:
>>
>>          /*01a0*/                   LD R8, [R8];
>>             /* 0x8000000000821c85 */
>>
>> That needs to be LD.E (and your STs need to be ST.E). You're using a
>> 32-bit gmem address, but you need to be using a 64-bit one. I believe
>> the 32-bit ones work on Fermi, but afaik not on Kepler.
>>
>> Cheers,
>>
>>    -ilia
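
The LD versus LD.E point above can be checked mechanically. What follows is a minimal Python sketch, not part of the original thread, that scans nvdisasm-style text for LD/ST instructions lacking the .E modifier. The function name and the regular expression are assumptions made for illustration, and it only understands listing lines shaped like the ones quoted in this thread.

```python
import re

# Matches listing lines such as "/*01a0*/   LD R8, [R8];" and captures the
# offset and the opcode with its dotted modifiers. Encoding comments like
# "/* 0x8000000000821c85 */" do not match because of the inner spaces.
INSN_RE = re.compile(
    r"/\*([0-9a-fA-F]{4,})\*/\s+(?:@!?P\d+\s+)?([A-Z][A-Z0-9_.]*)\b"
)


def find_short_address_mem_ops(disasm_text):
    """Return (offset, opcode) pairs for LD/ST instructions without '.E'.

    Hypothetical helper: it only flags candidates for manual review, it
    does not decode the instruction encoding itself.
    """
    hits = []
    for line in disasm_text.splitlines():
        match = INSN_RE.search(line)
        if not match:
            continue
        offset, opcode = match.group(1), match.group(2)
        base, *modifiers = opcode.split(".")
        # LDC, MOV etc. are ignored; only plain global loads/stores count.
        if base in ("LD", "ST") and "E" not in modifiers:
            hits.append((int(offset, 16), opcode))
    return hits
```

Feeding it the text of nbody.disasm should list the offset of every load or store that still uses the 32-bit address form, assuming the file follows the same layout as the excerpt above.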
>>
>>
>>
>> On Wed, Dec 16, 2015 at 12:06 PM, Hans de Goede <hdegoede at redhat.com> wrote:
>>> Hi,
>>>
>>> On 15-12-15 20:04, Ilia Mirkin wrote:
>>>>
>>>> Also, where's the exit op? Perhaps what's happening is that you don't
>>>> have an exit and it just goes off executing into the ether?
>>>
>>>
>>> Sorry I only included a small bit of the program in my original mail
>>> because I found the use of "MOV" instructions to load constants
>>> suspicious, is that normal ?
>>>
>>> I've put a log with NV50_PROG_DEBUG=1 output here:
>>>
>>> https://fedorapeople.org/~jwrdegoede/nbody.log
>>>
>>> nvdisasm -b SM30 for the generated binary code is here:
>>>
>>> https://fedorapeople.org/~jwrdegoede/nbody.disasm
>>>
>>> There are already .tgsi, .hex and .bin files there if
>>> you find those easier to use than the
>>> NV50_PROG_DEBUG=1 output.
>>>
>>>
>>>>
>>>> On Tue, Dec 15, 2015 at 12:00 PM, Ilia Mirkin <imirkin at alum.mit.edu>
>>>> wrote:
>>>>>
>>>>> A few things that stand out:
>>>>>
>>>>>     0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)
>>>>>
>>>>> wtf is that 0x0000000000000 thing doing there? Was it a %rX which got
>>>>> constant-folded into 0? That indirectness should have then been
>>>>> removed... that said, the final encoding looks fine.
>>>
>>>
>>> I don't know, maybe there is a hint in the log file?
>>>
>>> Regards,
>>>
>>> Hans
>>>
>>>
>>>
>>>>>
>>>>> I believe that kepler has this launch descriptor thing too... is that
>>>>> being set correctly? Please generate a mmt trace, and we can see if
>>>>> anything stands out compared to a blob trace that also does compute.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>     -ilia
>>>>>
>>>>> On Tue, Dec 15, 2015 at 9:15 AM, Hans de Goede <hdegoede at redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> As part of my compute work I'm trying to get some TGSI compute
>>>>>> code to work. The code from mesa/src/gallium/tests/trivial.c
>>>>>> works.
>>>>>>
>>>>>> So now I'm trying to get a "native" tgsi kernel to run via
>>>>>> clover. I'm using Francisco's nbody.c example for this:
>>>>>>
>>>>>> https://fedorapeople.org/~jwrdegoede/nbody.c
>>>>>>
>>>>>> This does not work. At first I thought there was an issue
>>>>>> with the setup of the input / output buffers, but that seems to
>>>>>> work fine. I then finally had the idea to look in dmesg,
>>>>>> which says:
>>>>>>
>>>>>> [ 9920.802435] nouveau 0000:01:00.0: gr: TRAP ch 6 [007f7fa000 nbody[31881]]
>>>>>> [ 9920.802449] nouveau 0000:01:00.0: gr: GPC0/TPC0/MP trap: global 00000000 [] warp 10009 [INVALID_OPCODE]
>>>>>> [ 9920.802456] nouveau 0000:01:00.0: gr: GPC0/TPC1/MP trap: global 00000004 [MULTIPLE_WARP_ERRORS] warp 20009 [INVALID_OPCODE]
>>>>>>
>>>>>> and this repeats for every "step" of the nbody simulation. This is
>>>>>> on a gk107 card.
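
When the trap repeats on every step, it can help to pull the fields out of the log rather than read them by eye. Below is a small Python sketch, not from the original thread, that extracts the GPC/TPC numbers, status words and bracketed flags from "MP trap" messages. The regular expression is derived only from the two lines quoted above, and the dictionary keys are labels chosen here for illustration, not names used by the driver.

```python
import re

# Pattern based solely on the trap lines quoted in this thread.
TRAP_RE = re.compile(
    r"GPC(\d+)/TPC(\d+)/MP trap: "
    r"global ([0-9a-f]+) \[([^\]]*)\] "
    r"warp ([0-9a-f]+) \[([^\]]*)\]"
)


def parse_mp_traps(log_text):
    """Return one dict per MP trap message found in log_text."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    traps = []
    for m in TRAP_RE.finditer(flat):
        traps.append({
            "gpc": int(m.group(1)),
            "tpc": int(m.group(2)),
            "global_status": m.group(3),  # kept as text, not interpreted
            "global_flags": m.group(4),   # empty string for "[]"
            "warp_status": m.group(5),
            "warp_flags": m.group(6),
        })
    return traps
```

Counting the results per (gpc, tpc) pair is one way to see whether every multiprocessor hits the same error, which would point at the shader binary rather than at a single unit.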
>>>>>>
>>>>>> So that seems to be the real problem. Since the
>>>>>> error says "INVALID_OPCODE", I've put the tgsi code from nbody.c
>>>>>> through "nouveau_compiler -a e4" and then run "nvdisasm -b SM30"
>>>>>> on it, but the output looks ok. There is an 8-byte sequence every
>>>>>> 64 bytes which does not get decoded, but AFAIK that is the
>>>>>> scheduling info, so that should be fine.
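
To look at those undecoded 8-byte sequences separately from the real instructions, the binary can be split into 64-byte groups. The Python sketch below is an illustration, not something from the original thread. It assumes that the first 8-byte word of each 64-byte group is the scheduling word (which matches the listing starting at offset 0008) and that words are stored little-endian; both are assumptions, as is the function name.

```python
GROUP_SIZE = 64  # bytes per group, as described in the mail
WORD_SIZE = 8    # bytes per instruction or scheduling word


def split_sched_groups(code):
    """Split shader code into (base_offset, sched_word, instructions).

    'instructions' is a list of (offset, word) pairs for the seven words
    that follow the assumed scheduling word in each 64-byte group.
    """
    if len(code) % GROUP_SIZE != 0:
        raise ValueError("code length is not a multiple of 64 bytes")
    groups = []
    for base in range(0, len(code), GROUP_SIZE):
        words = [
            int.from_bytes(code[off:off + WORD_SIZE], "little")
            for off in range(base, base + GROUP_SIZE, WORD_SIZE)
        ]
        # Assumption: word 0 of the group is scheduling info, not an opcode.
        instructions = [
            (base + i * WORD_SIZE, word)
            for i, word in enumerate(words)
            if i > 0
        ]
        groups.append((base, words[0], instructions))
    return groups
```

Printing each instruction word with its offset makes it easy to line the raw encoding up against the nvdisasm listing and confirm that only the first word of each group is skipped by the disassembler.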
>>>>>>
>>>>>> One thing which does stand out is that this:
>>>>>>
>>>>>>     0: ld u32 %r219 c0[0x0000000000000000+0x0] (0)
>>>>>>     1: ld u32 %r222 c0[0x4] (0)
>>>>>>     2: ld u64 { %r225 %r228 } c0[0x8] (0)
>>>>>>     3: ld u32 %r234 c0[0x10] (0)
>>>>>>
>>>>>> Gets translated into (nvdisasm output) :
>>>>>>
>>>>>>     /*0008*/  LDC R4, c[0x0][0x0];     /* 0x1400000003f11c86 */
>>>>>>     /*0010*/  MOV R2, c[0x0][0x4];     /* 0x2800400010009de4 */
>>>>>>     /*0018*/  LDC.64 R0, c[0x0][0x8];  /* 0x1400000023f01ca6 */
>>>>>>     /*0020*/  MOV R3, c[0x0][0x10];    /* 0x280040004000dde4 */
>>>>>>
>>>>>> Where I would expect four LDC instructions. Could that be the problem?
>>>>>>
>>>>>> If that is not the problem, then hints on how to debug this further
>>>>>> would be greatly appreciated.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Hans
>>>>>> _______________________________________________
>>>>>> Nouveau mailing list
>>>>>> Nouveau at lists.freedesktop.org
>>>>>> http://lists.freedesktop.org/mailman/listinfo/nouveau
