[Mesa-dev] reworking pipe_video_decoder / pipe_video_buffer

Fri Nov 25 17:06:42 PST 2011

Hey Christian,

On 11/21/2011 05:49 PM, Christian König wrote:
> On 16.11.2011 15:38, Maarten Lankhorst wrote:
>> Hey,
>>
>> On 11/16/2011 02:47 PM, Christian König wrote:
>>> On 15.11.2011 17:52, Maarten Lankhorst wrote:
>>>> Deleted:
>>>> - begin_frame/end_frame: Was only useful for XvMC, should be folded into flush..
>>> I'm not completely happy with the current interface also, but if you remove the state tracker ability to control how many buffers are used, which in turn lets the VDPAU and XvMC state tracker use the same buffering algorithm. This is quite bad, because for XvMC we are forced to use a buffer for each destination surface, while in the VDPAU/VAAPI case a much simpler method should be used.
>> I wasn't aware of that, the patch could easily be changed for entrypoint<= bitstream.
>> Only create a single decoder buffer, otherwise attach it to the surface.
>> I still think it should be removed from the state tracker though, and it would
>> still be possible..
> I would rather pass another parameter to the decoder create function, like "scatter_decode" or something like that. Younes originally had a parameter just to describe the possible different calling conventions for decode, but that covered only slice vs. frame based buffering and I didn't understand at what he tried todo with it at that time.
>
> I would also suggest to separate the detection the start of a new frame into a function and make this function a public interface in vl_mpeg12_decoder_h, so that other drivers can use this one for their handling as well.
Well, the vdpau decode api is only called once on a surface for progressive frame,
once for a frame coded interlaced field-pair (why vc1, WHY?!), or twice for separately
encoded fields. (All codecs including vc-1 have a mode for this, giving vc-1 3 modes..)   

>>>> - set_quant_matrix/set_reference_frames:
>>>>       they should become part of picture_desc,
>>>>       not all codecs deal with it in the same way,
>>>>       and some may not have all of the calls.
>>> That is true, but also intended. The idea behind it is that it is not necessary for all codecs to set all buffer types, but instead inform the driver that a specific buffer type has changed. This gives the driver the ability to know which parameters has changed between calls to decode, and so only perform the (probably expensive) update of those parameters on the hardware side.
>> This is why I used the flush call for XvMC as a way to signal parameter changes. If you looked at the
>> decode_macroblock for g3dvl, it checks if (dec->current_buffer == target->decode_buffer) in
>> begin_frame, so it doesn't flush state.
>>
>> For VA-API, I haven't had the .. pleasure(?) of playing around with it, and it seems
>> to be only stubs.
> That's correct, the VA-API implementation isn't completed, and I wouldn't call it a pleasure to work with it. Intels VA-API specification seems to be a bit in a flux, moving structure members around and sometimes even changing some calling conventions from one version to another.
>
> But they still have one very nice idea to set certain parameters once and then never change seem, for example:
>
> |/*
>  *  For application, e.g. set a new bitrate
>  *    VABufferID buf_id;
>  *    VAEncMiscParameterBuffer *misc_param;
>  *    VAEncMiscParameterRateControl *misc_rate_ctrl;
>  *
>  *    vaCreateBuffer(dpy, context, VAEncMiscParameterBufferType,
>  *              sizeof(VAEncMiscParameterBuffer) + sizeof(VAEncMiscParameterRateControl),
>  *              1, NULL,&buf_id);
>  *
>  *    vaMapBuffer(dpy,buf_id,(void **)&misc_param);
>  *    misc_param->type = VAEncMiscParameterTypeRateControl;
>  *    misc_rate_ctrl= (VAEncMiscParameterRateControl *)misc_param->data;
>  *    misc_rate_ctrl->bits_per_second = 6400000;
>  *    vaUnmapBuffer(dpy, buf_id);
>  *    vaRenderPicture(dpy, context,&buf_id, 1);
>  */
>
> |
>
> For me that means a big simplification, because I don't need to check every frame if certain parameters has changed that would otherwise lead to a whole bunch of overhead like flushes or even the need to reset the whole GPU block and load new parameters etc....
However, vdpau doesn't expose that information, so just assume everything has changed. :-)
>> If the decode_bitstream interface is changed to get all bitstream buffers at the same time,
>> there wouldn't be overhead to doing it like this. For a single picture it's supposed to stay constant,
>> so for vdpau the sane way would be: set picture parameters for hardware (includes EVERYTHING),
>> write all bitstream buffers to a hardware bo, wait until magic is done. Afaict, there isn't even a sane
>> way to only submit partial buffers, so it's just a bunch of overhead for me.
>>
>> nvidia doesn't support va-api, it handles the entire process from picture parameters
>> to a decoded buffer internally so it always convert the picture parameters into
>> something the hardware can understand, every frame.
> I'm not arguing against removing the scattered calls to decode_bitstream. I just don't want to lose information while passing the parameters from the state tracker down to the driver. But we can also add this information as a flag to the function later, so on a second thought that seems to be ok.
:-)
>>>> - set_picture_parameters: Can be passed to decode_bitstream/macroblock
>>> Same as above, take a look at the enum VABufferType  in the VAAPI specification (http://cgit.freedesktop.org/libva/tree/va/va.h).
>>>> - set_decode_target: Idem
>>>> - create_decode_buffer/set_decode_buffer/destroy_decode_buffer:
>>>>       Even if a decoder wants it, the state tracker has no business knowing
>>>>       about it.
>>> Unfortunately that isn't true, only VDPAU hasn't a concept of giving the application a certain control over the buffers. XvMC has the implicit association of internal decode buffers with destination surfaces, and VAAPI gives the application both an one shot copy and a map/unmap interface to provide buffer content.
>>>
>>> When we want to avoid unnecessary coping around of buffer content then this interface has to become even more complicated, like new buffer types, using create user buffer, mapping/unmapping of to be decoded data buffers etc.
>> There is no guarantee a user created video buffer can be used as reference frame, I honestly don't even know if it's possible.
> Oh, I'm not talking about reference frames here, I'm talking about ALL kind of buffers. As far as I know using a manually uploaded video buffer as reference frame isn't possible with any of the major hardware vendors, and I can tell that it definitely doesn't work for Radeon hardware.
Good, in my current version I keep a shadow copy of the unpostprocessed video buffer and just use the pointer as handle.
O(n) is not really an issue for n <= 16 I suppose. :-)

>> Furthermore I don't see any code in state_trackers/va except some stubs, so I cant discuss an interface that's not even there to begin with.
> The VA-API state tracker is only a stub, but the specification is fairly complete, and it was a merge requirement to support at least XvMC,VDPAU,VA-API and even DXVA with it.
>
>>>> flush is changed to only flush a single pipe_video_buffer,
>>>> this should reduce the state that needs to be maintained for XvMC otherwise.
>>> I can't really see how is that improving the situation, all it does is reducing the interface to what is needed for VDPAU, but totally ignoring requirements for VAAPI for example.
>> See above.
>>
>>>> Note: internally you can still use those calls, as long as the *public* api
>>>> would be reduced to this, pipe_video_buffer is specific to each driver,
>>>> so you can put in a lot of state there if you choose to do so. For example,
>>>> quant matrix is not used in vc-1, and h264 will need a different
>>>> way for set_reference_frames, see struct VdpReferenceFrameH264. This is why
>>>> I want to remove those calls, and put them into picture_desc..
>>> Correct, those calls need to be extended to support more codecs, but just stuffing everything into a single structure also doesn't seems to be a solution that will work for a wider view of different state trackers or hardware.
>> All hardware has its own unique approach with their own api, so it would make sense if there is a separate call for decode_va too,
>> with maybe some bits that tell what changed, so it only has to upload those specific parts..
> Hui? Are you kidding? Gallium3Ds goal is to provide an unified and abstracted view of todays video hardware, and by introducing just a specific call to implement a single API we definitely don't work into that direction.
>
>>>> struct pipe_video_buffer I'm less sure about, especially wrt interlacing
>>>> and alignment it needs more thought. height and width are aligned to 64
>>>> on nvidia, and requires a special tiling flag. The compositor will need
>>>> to become aware of interlacing, to be able to play back interlaced videos.
>>> Alignment isn't so much of an issue. Take a look at vl_video_buffer_create, it is already aligning the width/height to a power of two if the hardware needs that. So it is perfectly legal for a video buffer to be larger than the requested size, this is already tested with the current color conversion code and should work fine with both XvMC and VDPAU.
>>>
>>>> I honestly haven't read up on interlacing yet, and haven't found a video
>>>> to test it with, so that part is just speculative crap, and might change
>>>> when I find out more about it.
>>> Regarding deinterlacing I'm also not really sure what we should do about it, but all interfaces (XvMC/VDPAU/VAAPI/DXVA) seems to treat surfaces as progressive. So I think the output of a decode call should already be in a progressive format (or can be at least handled as one for sampling).
>> No, they present the surfaces as progressive, but that doesn't necessarily mean they are,
>> nothing would prevent it from just seeing GetBitsYCbCr as a special (slow) case.
> Yeah, exactly that was what I had in mind, but since we doesn't have a consensus about how the internal memory layout of the buffers really are don't win anything by making them a public interface.
>
>> Considering how different the approach for vdpau and va-api is,
>> wouldn't it just make sense to handle va-api separately then?
> No I don't think so, the rest of the gallium interfaces are just made around mapping/unmapping of buffers with the help of transfer objects. This interface has proven to be flexible to implement a couple of different apis, so I don't really see a need to create another interface just for VDPAU.
>
>>>> Testing interlaced  videos that decode correctly with nvidia vdpau would help
>>>> a lot to figure out what the proper way to handle interlacing would be, so if
>>>> someone has a bunch that play with nvidia accelerated vdpau&   mplayer correctly,
>>>> could you please link them? ;)
>>> I will try to look around some more, but at least for MPEG2 I haven't found any interlaced videos for testing also. I will leave you a note if I find something.
>>>> /**
>>>>    * output for decoding / input for displaying
>>>>    */
>>>> struct pipe_video_buffer
>>>> {
>>>>      struct pipe_context *context;
>>>>
>>>>      enum pipe_format buffer_format;
>>>>      // Note: buffer_format may change as a result of put_bits, or call to decode_bitstream
>>>>      // afaict there is no guarantee a buffer filled with put_bits can be used as reference
>>>>      // frame to decode_bitstream
>>> Altering the video buffer format as a result to an application putting video content manually into it is something very VDPAU specific. So I think it would be better to let the state tracker check if the new content matches what is in the buffer currently and if the need arise recreate the buffer with a new format.
>> This is only an optimization, I did it for g3dvl because it's faster in the case where PutBitsYCbCr is used when
>> mplayer is not using vdpau. Eventually all buffers will have the correct layout, and it ends up being a few
>> memcpy's. For nvidia the format will have to stay NV12, so I can't change format in that case..
>>
>>>>      enum pipe_video_chroma_format chroma_format;
>>>>      unsigned width;
>>>>      unsigned height;
>>>>
>>>>      enum pipe_video_interlace_type layout;
>>>>      // progressive
>>>>      // even and odd lines are split
>>>>      // interlaced, top field valid only (half height)
>>>>      // interlaced, bottom field valid only
>>>>      // I'm really drawing a blank what would be sane here, since interlacing has a ton of
>>>>      // codec specific information, and current design doesn't handle it at all..
>>> The concrete memory layout of the buffer is implementation specific, so I think things like interlacing and/or special tilling modes are something that the driver needs to handle and should not be exposed to the state tracker.
>> I don't think so..
>>
>>   * \defgroup VdpVideoMixer VdpVideoMixer; Video Post-processing \
>>   *           and Compositing object
>>   *
>>   * VdpVideoMixer can perform some subset of the following
>>   * post-processing steps on video:
>>   * - De-interlacing
>>   *   - Various types, with or without inverse telecine
>>   * - Noise-reduction
>>   * - Sharpness adjustment
>>   * - Color space conversion to RGB
>>   * - Chroma format upscaling to 4:4:4
>>
>> Regardless of being va-api/vdpau/xvmc, those steps need to be performed, for all hardware.
>> Deinterlacing shouldn't be done in hardware specific bits, and this step could also easily
>> be changed to include frames that are split up too..
> Wait a moment, you are mixing two different things here:
>
> It's one thing to provide a top/bottom/planar view to use a video buffer as a source for a shader, that's what get_sampler_view_planes is good for (we still have to add that parameter to that function).
>
> And it is a complete different thing to have a internal memory layout of separated top and bottom lines, because this layout is hardware specific, for example you could have all top lines first (luma and chroma) and the the bottom lines, or you can have luma top, then luma bottom, then chroma top then chroma bottom etc... Paired with different tilling modes that gives you probably a couple of hundred different memory layouts. It doesn't make sense to export this to a state tracker.
Well, it depends. I read up a bit more, vdpau interlaced mode requires both fields to
be decoded to the same surface. I've researched interlaced a bit more, so what I said
about top/bottom field earlier can be ignored. I do think it makes sense to allow support
for separate top/bottom fields though, if only because of deinterlacing algorithms and
handling inverse telecine in a sane way.. to handle those you can't think of progressive
frames any more.
I think if you want to handle things like deinterlacing and inverse telecine sanely, you
would have to change your approach from handling in full frames to separate fields.
You're right that memory layouts shouldn't be part of the state tracker. I think having
support for separate field views that get merged in the video mixer renderer would
be required to handle deinterlacing sanely, though. Nouveau could then just ask for
simple weave deinterlacing on a progressive frame, while other deinterlacing
algorithms could be used for interlaced videos. For example bobbing could just be
done by stretching a field, and that algorithm is required to have for vdpau,
according to the vdpau header.

Cheers,
Maarten