[Intel-gfx] GPU hang with high media workload on BSW
Tang, Jun
jun.tang at intel.com
Fri Sep 2 03:21:55 UTC 2016
One more thing to add: if the ringbuffer is allocated from normal memory instead of stolen memory, the issue goes away.
static int intel_alloc_ringbuffer_obj(struct drm_device *dev,
				      struct intel_ringbuffer *ringbuf)
{
	struct drm_i915_gem_object *obj;

	obj = NULL;
#if 0
	if (!HAS_LLC(dev))
		obj = i915_gem_object_create_stolen(dev, ringbuf->size);
#endif
	if (obj == NULL)
		obj = i915_gem_alloc_object(dev, ringbuf->size);
	if (obj == NULL)
		return -ENOMEM;

	/* mark ring buffers as read-only from GPU side by default */
	obj->gt_ro = 1;

	ringbuf->obj = obj;

	return 0;
}
Can anyone suggest what to check next? Thanks!
-James
-----Original Message-----
From: Tang, Jun
Sent: Friday, July 1, 2016 1:14 PM
To: 'intel-gfx at lists.freedesktop.org' <intel-gfx at lists.freedesktop.org>
Subject: GPU hang with high media workload on BSW
Hi Guys,
Thanks in advance for the help!
I'm encountering a GPU hang on BSW while running multi-channel H264 video decoding with VPP composition and display, plus one channel of H264 encoding.
The render ring gets stuck, as shown below:
[58503.223700] [drm] stuck on render ring
[58503.246340] [drm] GPU HANG: ecode 8:0:0x7f1d7e3d, in Challenge [3259], reason: Ring hung, action: reset
Below is part of /sys/class/drm/card0/error. I suspect the hang is caused by incorrect render ring buffer content.
In the lines marked 'where I suspect', the ring buffer holds 18800001 (MI_BATCH_BUFFER_START_GEN8), but the next DWORD is 00100002.
Since MI_BATCH_BUFFER_START_GEN8 should be followed by the batch buffer address, I think the ring buffer content is incorrect.
==========part of the /sys/class/drm/card0/error=========
render ring --- 3 requests
seqno 0x020dc83a, emitted 4353167966, tail 0x00000070
seqno 0x020dc83b, emitted 4353167969, tail 0x000000f0
seqno 0x020dc83e, emitted 4353167982, tail 0x00000170
render ring --- ringbuffer = 0x00015000
00000000 : 18800001 // where I suspect
00000004 : 00100002 // where I suspect
00000008 : 00000000
0000000c : 00000000
00000010 : 00000000
00000014 : 00000000
00000018 : 7a000004
0000001c : 01144c1c
00000020 : 00036080
00000024 : 00000000
00000028 : 00000000
0000002c : 00000000
00000030 : 04000000
00000034 : 00000000
00000038 : 0c000000
0000003c : 1382c10c
==========part of the /sys/class/drm/card0/error=========
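As a cross-check of the dump above, here is a minimal sketch that classifies the suspicious DWORDs. The macro values are reproduced from memory of the i915 driver's i915_reg.h of that era and should be verified against the tree in use; the helper names (ring_dword_kind, is_stall_only_pipe_control_flags, self_check) are hypothetical and exist only for this illustration. One possible reading, which is an assumption and not something the dump proves: 00100002 equals PIPE_CONTROL_CS_STALL | PIPE_CONTROL_STALL_AT_SCOREBOARD, so offset 4 may be the flags DWORD of a PIPE_CONTROL whose header at offset 0 never replaced a stale 18800001.

```c
#include <assert.h>
#include <stdint.h>

/* Values as remembered from i915_reg.h; verify against your kernel tree. */
#define MI_INSTR(opcode, flags)          (((uint32_t)(opcode) << 23) | (flags))
#define MI_BATCH_BUFFER_START_GEN8       MI_INSTR(0x31, 1)
#define GFX_OP_PIPE_CONTROL(len) \
	((0x3u << 29) | (0x3u << 27) | (0x2u << 24) | ((len) - 2))
#define PIPE_CONTROL_CS_STALL            (1u << 20)
#define PIPE_CONTROL_STALL_AT_SCOREBOARD (1u << 1)

/* Hypothetical classification used only in this sketch. */
enum ring_dword_kind {
	RING_DW_UNKNOWN = 0,
	RING_DW_BATCH_START_GEN8,
	RING_DW_PIPE_CONTROL_6,
};

/* Hypothetical helper: recognise the two command headers seen in this thread. */
static inline enum ring_dword_kind ring_dword_kind(uint32_t dw)
{
	if (dw == MI_BATCH_BUFFER_START_GEN8)
		return RING_DW_BATCH_START_GEN8;
	if (dw == GFX_OP_PIPE_CONTROL(6))
		return RING_DW_PIPE_CONTROL_6;
	return RING_DW_UNKNOWN;
}

/* Hypothetical helper: does dw equal exactly "CS stall + stall at scoreboard"? */
static inline int is_stall_only_pipe_control_flags(uint32_t dw)
{
	return dw == (PIPE_CONTROL_CS_STALL | PIPE_CONTROL_STALL_AT_SCOREBOARD);
}

/* Sanity-check the macros against the raw values quoted in the dump. */
static inline void self_check(void)
{
	assert(MI_BATCH_BUFFER_START_GEN8 == 0x18800001u);
	assert(GFX_OP_PIPE_CONTROL(6) == 0x7a000004u);
}
```

Calling ring_dword_kind(0x18800001) and is_stall_only_pipe_control_flags(0x00100002) on the first two DWORDs of the dump shows the mismatch between header and payload that the report points at.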
To identify when the ring buffer gets incorrectly programmed, I added code in gen8_emit_pipe_control that reads back the first DWORD of the ring buffer after intel_ring_emit whenever the ring buffer tail is zero.
The result: when tail is 0, the first DWORD read back sometimes differs from the data intel_ring_emit just wrote, and a GPU hang may follow shortly afterwards.
Here is the output of my print:
[ 3409.067402] rcs b:0x18800001 d:0x7a000004 t:0
'b' - ioread32 (ringbuf->virtual_start)
'd' - the value intel_ring_emit wants to write
't' - the value of tail
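The read-back check described above can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the actual debug patch: struct model_ring and both functions are hypothetical stand-ins (the real code uses struct intel_ringbuffer, intel_ring_emit and ioread32 on a write-combining mapping), and plain memory is used here, so the mismatch will not reproduce.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space stand-in; NOT the kernel's struct intel_ringbuffer. */
struct model_ring {
	volatile uint32_t *virtual_start; /* CPU view of the ring */
	uint32_t tail;                    /* byte offset of the next write */
	uint32_t size;                    /* ring size in bytes, power of two */
};

/* Write one DWORD at tail and advance tail, wrapping at size. */
static inline void model_ring_emit(struct model_ring *r, uint32_t data)
{
	assert(r->tail % 4 == 0 && r->tail < r->size);
	r->virtual_start[r->tail / 4] = data;
	r->tail = (r->tail + 4) & (r->size - 1);
}

/*
 * Emit one DWORD; if it landed at offset 0, read it back and log a line in
 * the same "b/d/t" format as the report on mismatch.
 * Returns 1 if the value matched (or was not checked), 0 on mismatch.
 */
static inline int model_ring_emit_checked(struct model_ring *r, uint32_t data)
{
	uint32_t at = r->tail;
	uint32_t back;

	model_ring_emit(r, data);
	if (at != 0)
		return 1;

	back = r->virtual_start[0];
	if (back != data) {
		fprintf(stderr, "rcs b:0x%08x d:0x%08x t:%u\n",
			(unsigned)back, (unsigned)data, (unsigned)at);
		return 0;
	}
	return 1;
}
```

Only the write at offset 0 is checked, matching the report; checking every emit would add an uncached read per DWORD on real hardware.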
I'm aware that ringbuf->virtual_start is a write-combining mapping, so the read may force a write-combine buffer flush and will be slow. But I don't know why the value read back differs from what intel_ring_emit just wrote.
In another test, when the value read back was incorrect, I wrote it again and then read it back again; most of the time it then became correct.
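The rewrite-and-reread test can be sketched like this. It is purely a diagnostic illustration: write_verified is a hypothetical helper, it operates on ordinary memory here instead of the write-combining ring mapping, and retrying would at best mask the underlying problem, not fix it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: write data to *slot, read it back, and rewrite up to
 * max_retries more times while the read-back value differs.
 * Returns the number of rewrites that were needed (0 if the first write
 * stuck), or -1 if the value never matched.
 */
static inline int write_verified(volatile uint32_t *slot, uint32_t data,
				 int max_retries)
{
	int retries;

	assert(max_retries >= 0);
	for (retries = 0; retries <= max_retries; retries++) {
		*slot = data;
		if (*slot == data)
			return retries;
	}
	return -1;
}
```

Logging the returned retry count alongside the tail value would show how often a second write is needed and whether a single rewrite is always enough.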
Thanks a lot!
-James