[Nouveau] H.264 engine differences between fermi and tesla cards

Ilia Mirkin imirkin at alum.mit.edu
Fri Dec 6 17:06:33 PST 2013


On Fri, Dec 6, 2013 at 7:36 PM, Benjamin Morris <bmorris at nvidia.com> wrote:
> I've gathered a few hints regarding H264 video decoding on our hardware.  Hopefully some of them will be useful.

Very useful!

>
> First off, regarding naming in general.  Our internal names for our video engines differ from the names you've been using.  Below is a translation map between the names on http://nouveau.freedesktop.org/wiki/VideoAcceleration/ and our internal names.  This is more of an FYI than anything else, to help translation; I don't expect it to help with this particular H264 hang.
>
> VP2 (same)
> VP3   -> MSDEC
> VP4.0 -> MSDEC2
> VP4.2 -> MSDEC3
> VP5   -> MSDEC4
>
> Looking at your code, it seems that you're instantiating all 3 engines (VLD, PDEC, PPP) on the same channel.  This probably isn't causing the hang, but it's bad practice in general, as it prevents the engines from running in parallel.  It's also impossible to use multiple engines on the same channel like this on MSDEC4 (VP5) GPUs, so the same separate channel usage that you need to have for MSDEC4 should also be used for everything down to G84.

First off, thank you so much for diving into our code! Hope it wasn't
_too_ dirty :)

Yeah, for Kepler, we put things on separate channels. When I did the
VP2 implementation, I also put them on separate channels since it
seemed like that was what was being done from the traces (and I knew
much less about all these things back then). But when I was doing the
initial pass to get MSDEC[12] working based on existing MSDEC[34]
code, I just left it alone, and it had them on the same channel for
pre-kepler. There's some limitation in nouveau that prevents multiple
channels from being active anyways (or something like that, it was
explained to me, but I don't quite remember right now), so it won't
matter either way for now. But in the future the plan is definitely to
move to separate channels.

>
> Regarding "non-obvious" values for H264 decoding, looking at nouveau_vp3_video_vp.c, it looks like there are several unknown values in the H264 picture parameter structure, especially for the DPB reference table.  This seems like a potential cause for your MSDEC[12]-specific hangs; incorrect DPB state can be difficult to figure out from picture parameter dumps, and the PDEC response to incorrect DPB state is generally to simply hang.  There are no significant differences between MSDEC[12] (your VP3/VP4.0) and MSDEC3 (your VP4.2) regarding DPB state, but improvements in error resilience/concealment may simply be masking the problem on MSDEC3.  Below I've filled in our names for unnamed fields in that structure.  Hopefully this allows you to make some quick progress; you can apply the same logic you already have for G84 to your G98 code path.

Great. There are few acronyms in there that I'm not familiar with, but
I suspect sufficient documentation-reading will fix the problem.

One additional question is whether you have any comments on our
inter-engine buffer sizing/usage (inter_data). Specifically the 0x720
method for VP (PDEC?) and 0x414 method on BSP (VLD?). Right now we set
that to the same address/size, but I noticed that you offset them by
0x2100 (iirc). However when I did that, it just caused the engine to
hang faster. (But then I noticed that on 331.20, which is what I used
for my latest traces, the "kernel" fuc code had been updated from the
one that we're extracting with my fw-cutter script, so perhaps the ABI
has changed. Or perhaps the fw-cutter has an insufficiently-precise
signature... I'll check it out later.)

I will look at your comments on our picparm data structure and will
adjust our code. Hopefully that's all that's needed.

Thanks again!

  -ilia

>
> Thanks,
> Ben
>
> struct h264_picparm_vp { // 700..a00
>         uint16_t width, height;
>         uint32_t stride1, stride2; // 04 08
>         uint32_t ofs[6]; // 0c..24 in-image offset
>
>         uint32_t u24; // nfi ac8 ? -> ColocBufferSize
>         uint32_t bucket_size; // 28 bucket size
>         uint32_t inter_ring_data_size; // 2c
>
>         unsigned f0 : 1; // 0 0x01: into 640 shifted by 3, 540 shifted by 5, half size something? -> MbaffFrameFlag
>         unsigned f1 : 1; // 1 0x02: into vuc ofs 56 -> direct_8x8_inference_flag
>         unsigned weighted_pred_flag : 1; // 2 0x04
>         unsigned f3 : 1; // 3 0x08: into vuc ofs 68 -> constrained_intra_pred_flag
>         unsigned is_reference : 1; // 4
>         unsigned interlace : 1; // 5 field_pic_flag
>         unsigned bottom_field_flag : 1; // 6
>         unsigned f7 : 1; // 7 0x80: nfi yet -> second_field (second field of complementary reference field)
>
>         signed log2_max_frame_num_minus4 : 4; // 31 0..3
>         unsigned u31_45 : 2; // 31 4..5 -> chroma_format_idc
>         unsigned pic_order_cnt_type : 2; // 31 6..7
>         signed pic_init_qp_minus26 : 6; // 32 0..5
>         signed chroma_qp_index_offset : 5; // 32 6..10
>         signed second_chroma_qp_index_offset : 5; // 32 11..15
>
>         unsigned weighted_bipred_idc : 2; // 34 0..1
>         unsigned fifo_dec_index : 7; // 34 2..8
>         unsigned tmp_idx : 5; // 34 9..13 -> CurrColIdx (index of associated co-located motion data buffer)
>         unsigned frame_number : 16; // 34 14..29
>         unsigned u34_3030 : 1; // 34 30..30 pp.u34[30:30]
>         unsigned u34_3131 : 1; // 34 31..31 pad?
>
>         uint32_t field_order_cnt[2]; // 38, 3c
>
>         struct { // 40
>                 // 0x00223102
>                 // nfi (needs: top_is_reference, bottom_is_reference, is_long_term, maybe some other state that was saved..
>                 unsigned fifo_idx : 7; // 00 0..6 -> buffer id
>                 // tmp_idx is the index of the associated co-located motion data buffer
>                 // for simplest management, ensure that this is always equal to the buffer id
>                 unsigned tmp_idx : 5; // 00 7..11
>                 unsigned unk12 : 1; // 00 12 not seen yet, but set, maybe top_is_reference -> top_is_reference
>                 unsigned unk13 : 1; // 00 13 not seen yet, but set, maybe bottom_is_reference? -> bottom_is_reference
>                 unsigned unk14 : 1; // 00 14 skipped? -> is_long_term
>                 unsigned notseenyet : 1; // 00 15 pad? -> not_existing
>                 unsigned unk16 : 1; // 00 16 -> is_field_pair
>                 unsigned unk17 : 4; // 00 17..20 -> top_field_marking (top_is_reference ? 1+is_long_term : 0)
>                 unsigned unk21 : 4; // 00 21..24 -> bottom_field_marking
>                 unsigned pad : 7; // 00 d25..31
>
>                 uint32_t field_order_cnt[2]; // 04,08
>                 uint32_t frame_idx; // 0c
>         } refs[0x10];
>
>         uint8_t m4x4[6][16]; // 140
>         uint8_t m8x8[2][64]; // 1a0
>         // most of the remaining is MVC or SVC setup info, filled zero if not MVC or SVC
>         uint32_t u220; // 220 number of extra reorder_list to append?
>         uint8_t u224[0x20]; // 224..244 reorder_list append ?
>         uint8_t nfi244[0xb0]; // add some pad to make sure nulls are read
> };


More information about the Nouveau mailing list