[Mesa-dev] hardware xvmc video decoding with nouveau

Fri Jul 29 16:38:02 PDT 2011

On 07/30/2011 12:48 AM, Jimmy Rentz wrote:
> On Fri, 29 Jul 2011 15:37:19 +0200
> Maarten Lankhorst <m.b.lankhorst at gmail.com> wrote:
>
>> Hi guys,
>>
>> With some help from the nouveau team I managed to get video
>> acceleration working for my nv96 card. The video buffer api works
>> well enough for nouveau, I added flags to vl_video_buffer_create_ex
>> so I could force a linear surface with a nouveau specific resource
>> flag, which I only specified when hardware that potentially supported
>> hardware decoding was found. With the video buffer API, I only needed
>> to specify that and I could get it to work. This made it easy for me:
>> I only had to write code to talk to the decoder.
>>
>> The api for implementing the decoder I'm less happy about. I know
>> this is because there is no real support yet for other decoders, but
>> I think pipe_video_decode_buffer api is wrong right now. It assumes
>> that the state tracker knows enough about how the decoder wants to
>> interpret the macroblocks.
>> The nouveau hardware decoder has to interpret it in its own way, so
>> that makes it need a different api. I think the best thing would be
>> to pass information about the macroblock with a pointer to the data
>> blocks, and then let the decoder buffer decide how to interpret it.
>> Also is it the intention to only start decoding when XvMCPutSurface
>> is called? If the reference surfaces are passed, I can start decoding
>> in XvMCRenderSurface. I'd also like it if flush_buffer is removed,
>> and instead the video buffers are passed to end_frame.
>>
>> Some of the methods to pipe_video_buffer also appear to be g3dvl
>> specific, so could it be split out?
>>
>> I was thinking of something like this for pipe_video_decode_buffer,
>> with flush_buffer in the decoder gone:
>>
>> struct pipe_video_decode_buffer
>> {
>>    struct pipe_video_decoder *decoder;
>>
>>    /* Should not leak even when begin_frame was called */
>>    void (*destroy)(struct pipe_video_decode_buffer *decbuf);
>>
>>    void (*begin_frame)(struct pipe_video_decode_buffer *decbuf);
>>
>>    /* *ONLY* called on bitstream acceleration, makes no sense to
>>     * call for XvMC, this allows it to be set to NULL */
>>    void (*set_quant_matrix)(struct pipe_video_decode_buffer *decbuf,
>>                             const uint8_t intra_matrix[64],
>>                             const uint8_t non_intra_matrix[64]);
>>    /* Same story here */
>>    void (*decode_bitstream)(struct pipe_video_decode_buffer *decbuf,
>>                             unsigned num_bytes, const void *data,
>>                             struct pipe_picture_desc *picture,
>>                             unsigned num_ycbcr_blocks[3]);
>>
>>    /* Can be NULL when bitstream acceleration is used.
>>     * Append a single macroblock to the list for decoding */
>>    void (*decode_macroblock)(struct pipe_video_decode_buffer *decbuf,
>>                              struct pipe_video_macroblock *mb,
>>                              short *datablocks);
>>
>>    /* If end_frame is not set, it means more macroblocks may be
>>     * queued after this, and this is just an intermediate render,
>>     * if it's beneficial to do so. Otherwise just return without
>>     * doing anything.
>>     */
>>    void (*render_frame)(struct pipe_video_decode_buffer *decbuf,
>>                         struct pipe_video_buffer *frames[3],
>>                         bool end_frame);
>> };
>>
>> Comments are welcome. The functions I removed should probably just be
>> moved to a g3dvl specific struct vl_mpeg12_video_decode_buffer.
>>
>> If you feel like testing xvmc with a capable card, I put my tree at
>> http://repo.or.cz/w/mesa/nouveau-pmpeg.git .
>>
>> I attached 2 patches, 1 is to clean up xvmc/test_rendering.c, the
>> other allows me to specify a custom flag to force a linear surface.
>> Should be mergeable right now.
>>
>> Special thanks to calim and mwk for their patience and help and to
>> jb17bsome for the original code which I based this on, even though
>> this code is significantly different from the original. :)
>>
>> Cheers,
>> Maarten
>>
> Nice job.  
> I believe the reason I did the xvmc that way was that I reverse-engineered
> each xvmc call.  I remember hacking together every combination of values
> even if they didn't make sense.  Sort of how renouveau worked originally. 
> That is why there were some odd subtractions, etc.  Another thing to
> understand is that I really knew nothing about mpeg2 video (or anything
> video related) back then so even if the combination of values made
> no sense I didn't care.  I just wanted fast hw accel mpeg2 video. 
> I wanted to work on this stuff again, but I really don't have time at
> the present. 
It took some time for me to understand the motion vectors part
of the original code, but those subtractions you call odd completely
make sense.

Say you have 2 macroblocks at -2 and -1.
The decoder groups them in pairs, so both belong at position -1.

Now try dividing (-2)/2 and (-1)/2 in C: you'll get -1 and 0. But
0/2 and 1/2 already map to 0.. oops ;) div_down handles that
for me by doing (x & ~(n-1))/n, which rounds down instead of
toward zero when n is a power of two.
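
To make the rounding issue concrete, here is a minimal sketch of such a
helper. Only the expression (x & ~(n-1))/n comes from the mail above; the
signature and the int types are my assumptions, and it only works when n
is a power of two on a two's complement machine.

```c
#include <assert.h>

/*
 * Hypothetical helper, reconstructed from the expression quoted above.
 * Masking with ~(n - 1) clears the low bits of x, which rounds x down
 * (toward negative infinity) to a multiple of n.  The division that
 * follows is then exact, so C's truncation toward zero no longer matters.
 * Assumes n is a power of two and two's complement integers.
 */
static int div_down(int x, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0); /* power of two only */
    return (x & ~(n - 1)) / n;
}
```

With n = 2, positions -2 and -1 both end up in group -1, and 0 and 1 both
end up in group 0, whereas plain (-1)/2 would put -1 into group 0 as well.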

Multiplying the vertical coordinate by 2 in the 2 mv case also
makes sense, since it's interlaced and only the even or odd
lines are counted. :)

Since chroma has half width and height, you div by 2 for them too.

These things allowed me to compact all the functions you used to
generate all the mvs to 1 small function instead.

> That idct portion is the most processor intensive from my
> oprofile traces.  You probably see that but that was the only part I
> optimized to make it match/beat the NV driver besides other tweaks.  
Yeah, the PMPEG channel uses a zigzag pattern which is unfortunate. While this
could be problematic, it's probably not the real bottleneck. The real one is
currently that too much time is spent on waiting. This is probably a fault of the
current hack which submits data too often, and waits for completion before returning,
but I hope to get the api fixed.
> The only annoying thing is that vdpau has sort of made using mpeg2 xvmc
> unnecessary for me.  It was fun to do the RE since I got to learn about
> low-level hw which I will never do in my real job.
>
It was. The channel mode is slightly different though. For me the hardest
part was figuring out how to get things initialized through the channel,
what the voodoo in the motion vectors was for, and how to get idct working
right. Motion compensation worked by just memcpy'ing all the blocks into
the data buffer, but IDCT has its own funny format which it wants in the
data channel.

Cheers,
Maarten