[Intel-gfx] GPU hang with high media workload on BSW

Fri Sep 2 03:21:55 UTC 2016

One more thing to add: if the ringbuffer is allocated from normal memory instead of stolen memory, the issue is gone. This is the change I tested (the stolen-memory path is disabled with #if 0):

static int intel_alloc_ringbuffer_obj(struct drm_device *dev,
                                      struct intel_ringbuffer *ringbuf)
{
        struct drm_i915_gem_object *obj;

        obj = NULL;
#if 0
        if (!HAS_LLC(dev))
                obj = i915_gem_object_create_stolen(dev, ringbuf->size);
#endif
        if (obj == NULL)
                obj = i915_gem_alloc_object(dev, ringbuf->size);
        if (obj == NULL)
                return -ENOMEM;

        /* mark ring buffers as read-only from GPU side by default */
        obj->gt_ro = 1;

        ringbuf->obj = obj;

        return 0;
}

Can anyone suggest what to check next? Thanks!

-James

-----Original Message-----
From: Tang, Jun 
Sent: Friday, July 1, 2016 1:14 PM
To: 'intel-gfx at lists.freedesktop.org' <intel-gfx at lists.freedesktop.org>
Subject: GPU hang with high media workload on BSW

Hi Guys,

Thanks for the help in advance!

I'm encountering a GPU hang while running multi-channel H264 video decoding + VPP composition and display, plus one channel of H264 encoding, on BSW.
The render ring gets stuck, as shown below:
[58503.223700] [drm] stuck on render ring
[58503.246340] [drm] GPU HANG: ecode 8:0:0x7f1d7e3d, in Challenge [3259], reason: Ring hung, action: reset

Below is part of /sys/class/drm/card0/error. I suspect the hang is caused by incorrect render ring buffer content.
In the lines marked 'where I suspect', the ring buffer value is 18800001 (MI_BATCH_BUFFER_START_GEN8), but the next DWORD is 00100002.
Since MI_BATCH_BUFFER_START_GEN8 should be followed by the batch buffer address, I think the ring buffer content is incorrect.

==========part of the /sys/class/drm/card0/error=========
render ring --- 3 requests
  seqno 0x020dc83a, emitted 4353167966, tail 0x00000070
  seqno 0x020dc83b, emitted 4353167969, tail 0x000000f0
  seqno 0x020dc83e, emitted 4353167982, tail 0x00000170
render ring --- ringbuffer = 0x00015000
00000000 :  18800001 // where I suspect
00000004 :  00100002 // where I suspect
00000008 :  00000000
0000000c :  00000000
00000010 :  00000000
00000014 :  00000000
00000018 :  7a000004
0000001c :  01144c1c
00000020 :  00036080
00000024 :  00000000
00000028 :  00000000
0000002c :  00000000
00000030 :  04000000
00000034 :  00000000
00000038 :  0c000000
0000003c :  1382c10c
==========part of the /sys/class/drm/card0/error=========

To identify when the ring buffer is programmed incorrectly, I added code in gen8_emit_pipe_control that reads back the first DWORD of the ring buffer after intel_ring_emit whenever the ring buffer tail is zero.
The result: when tail is 0, the first DWORD read back sometimes differs from the data intel_ring_emit just wrote. A GPU hang may follow shortly afterwards.

Here is the output of my print:
[ 3409.067402] rcs b:0x18800001 d:0x7a000004 t:0

'b' - ioread32 (ringbuf->virtual_start)
'd' - intel_ring_emit wants to write
't' - the value of tail
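
For reference, the two header values in this print and the 00100002 DWORD from the error dump can be decoded with a small user-space sketch. The macro encodings below are reproduced from memory of the i915 driver headers, so treat them as assumptions to check against i915_reg.h; ring_header_name, ring_header_dwords and print_header are hypothetical helpers, and the "length field in bits 7:0 plus 2" rule is assumed for these two commands only. Under those assumptions, 0x7a000004 is a 6-DWORD PIPE_CONTROL header, 0x18800001 is MI_BATCH_BUFFER_START_GEN8, and 00100002 equals CS_STALL | STALL_AT_SCOREBOARD, which would fit a PIPE_CONTROL flags DWORD rather than a batch address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Encodings as recalled from the i915 headers (assumption: verify locally). */
#define MI_INSTR(opcode, flags)     (((uint32_t)(opcode) << 23) | (flags))
#define MI_BATCH_BUFFER_START_GEN8  MI_INSTR(0x31, 1)
#define GFX_OP_PIPE_CONTROL(len) \
        ((0x3u << 29) | (0x3u << 27) | (0x2u << 24) | ((len) - 2))
#define PIPE_CONTROL_CS_STALL             (1u << 20)
#define PIPE_CONTROL_STALL_AT_SCOREBOARD  (1u << 1)

/* Hypothetical helper: total command length in DWORDs, assuming the
 * length field sits in bits 7:0 and is biased by 2. */
static unsigned ring_header_dwords(uint32_t header)
{
        return (header & 0xffu) + 2;
}

/* Hypothetical helper: name the command once the length bits are masked off. */
static const char *ring_header_name(uint32_t header)
{
        switch (header & ~0xffu) {
        case MI_INSTR(0x31, 0):
                return "MI_BATCH_BUFFER_START";
        case GFX_OP_PIPE_CONTROL(2):
                return "PIPE_CONTROL";
        default:
                return "unknown";
        }
}

/* Hypothetical helper: print one decoded header. */
static void print_header(uint32_t header)
{
        const char *name = ring_header_name(header);

        assert(name != NULL);
        printf("0x%08x: %s, %u dwords\n",
               (unsigned)header, name, ring_header_dwords(header));
}
```

Calling print_header(0x18800001) and print_header(0x7a000004) gives the names for the 'b' and 'd' values. If the decode is right, only the first DWORD at offset 0 is stale and the DWORD after it already belongs to the PIPE_CONTROL that was being emitted.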

I'm aware that ringbuf->virtual_start is mapped write-combined, so the read may force a write-combine buffer flush and be slow. But I don't know why the value read back differs from the one intel_ring_emit just wrote.

In another test, when the value read back was incorrect, I wrote it again and read it back again; most of the time it then became correct.
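
The write, read-back and rewrite experiment has roughly the following shape. This is only a user-space stand-in, not the code that ran in the driver: emit_verified is a hypothetical name, and the plain array accesses take the place of the iowrite32()/ioread32() calls that would be used on the write-combined ringbuf->virtual_start mapping. On ordinary memory the first write always sticks, so the sketch shows the control flow of the diagnostic rather than reproducing the fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: write `value` to ring[idx], read it back, and
 * rewrite on mismatch, up to `max_tries` writes in total.
 * Returns the number of writes it took, or -1 if the value never stuck.
 */
static int emit_verified(volatile uint32_t *ring, size_t idx,
                         uint32_t value, int max_tries)
{
        assert(ring != NULL);

        for (int tries = 1; tries <= max_tries; tries++) {
                ring[idx] = value;          /* stands in for iowrite32() */
                if (ring[idx] == value)     /* stands in for ioread32()  */
                        return tries;
        }
        return -1;
}
```

A return value greater than 1 corresponds to the case described above, where the first read-back was wrong and a rewrite corrected it. Logging that count together with the tail value would show how often the first write is lost.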

Thanks a lot!
-James