[gst-devel] Text rendering

Sun Feb 20 09:19:33 CET 2005

On Sun, 20 Feb 2005 15:09:34 +0100, Maciej Katafiasz <ml at mathrick.org> wrote:
> On Sun, 20-02-2005 at 13:26 +0100, Gergely Nagy wrote:
> > > > Anyway, for fancy effects (think not only in colors, but textures too),
> > > > the approach outlined here will work fine, I guess.
> > >
> > > In a word, no. It's not gonna work for subtitles at all (even simple
> > > colouring of different lines in different colours will be difficult, and
> > > individually coloured letters will be downright impossible).
> >
> > For this, I imagined that the text description would include color markup
> > and such. The text renderer would ignore that, and there would be another
> > element, which would set up the coloring layer.
> 
> Ugh. Now, totally trivial (not at all, but it gives very good idea about
> what's needed) and really basic (ie, if we don't have it, we can as
> well have no formatted subtitles at all) example: karaoke text. In the
> simplest case it looks like this:
> 
> Some more or less interesting lyrics here
> ----------------------^
> [Already sung colour] | [Yet to be sung colour]

For karaoke text, one renders the text char-by-char. Then we know the
size of all chars. Then, put that together to form the complete text
rendering. Somewhere else, we have two images, with the exact same
dimensions as the lyrics text. One with the already sung colour, one
with the yet to be sung colour. Then, merge the yet-to-be-sung image
onto the text, so we have our text rendered in yet-to-be-sung colour.
We'll use this buffer in each iteration from now on.

Then, we take the already-sung image, and merge it onto the text
buffer (after marking that buffer read-only, so imagemixer will make a
copy) at such a position that the colouring ends at the right place
(ie, we start at a very negative xpos, and end at 0). Since we know
the size of each char (we rendered the lyrics char-by-char because we
need this info) and we know where the song currently is, we can
calculate the merge position.
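
A minimal Python sketch of that merge-position arithmetic, just to
illustrate the idea. The function name and parameters are made up for
this example, and it assumes the already-sung image is exactly as wide
as the full rendered text, so its right edge has to land on the cursor.

```python
def sung_overlay_xpos(char_widths, chars_done, fraction_of_current=0.0):
    """Return the x position at which to merge the already-sung image.

    char_widths: pixel width of each rendered char, in text order.
    chars_done: number of chars that are completely sung.
    fraction_of_current: how far (0.0 to 1.0) the cursor is into the
        next char, so the colour can change in the middle of a char.
    """
    total = sum(char_widths)
    cursor = sum(char_widths[:chars_done])
    if chars_done < len(char_widths):
        cursor += char_widths[chars_done] * fraction_of_current
    # The overlay's right edge must sit on the cursor, so the left
    # edge is at cursor - total: very negative at first, 0 at the end.
    return cursor - total
```

With widths of 10, 20 and 30 pixels, nothing sung gives an xpos of
-60, and everything sung gives 0.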

This way, one does not need an element to parse an image description
over and over again, nor does one need an element that understands a
complex rendering protocol.

> ^ represents "cursor" -- point where already sung and yet-to-be-sung
> colours meet. Note it's not between chars, but in the middle of char --
> if there's "I looooooooooooooove you" sung, you'll end with "o" in
> "love" slowly shifting left-to-right from one colour into another.
> That's basic case, but already fairly complicated with your design.

If you know the width of each char, and have a description of when a
part of the lyrics starts, and when it ends, it's pretty easy to code
a karaoke app (or element).
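
Roughly like this, as a sketch only. The segment format (start time,
end time, first char index, one-past-last char index) is an assumption
made for this example, not something any existing lyrics format is
known to use, and the cursor is simply interpolated linearly inside
each segment. Segments are assumed to be sorted and non-overlapping.

```python
def cursor_x(t, segments, char_widths):
    """Return the cursor's x position (in pixels) at time t.

    segments: list of (start, end, first_char, last_char) tuples,
        sorted by time, with end > start; last_char is exclusive.
    char_widths: pixel width of each rendered char, in text order.
    """
    # prefix[i] is the x offset of the left edge of char i.
    prefix = [0]
    for width in char_widths:
        prefix.append(prefix[-1] + width)

    x = 0.0
    for start, end, first, last in segments:
        if t >= end:
            # Segment fully sung: cursor is at least at its right edge.
            x = prefix[last]
        elif t >= start:
            # Inside the segment: move linearly across its pixels,
            # which may leave the cursor in the middle of a char.
            progress = (t - start) / (end - start)
            x = prefix[first] + progress * (prefix[last] - prefix[first])
            break
        else:
            # Segment not started yet: cursor waits where it is.
            break
    return x
```

A long-held "o" would just be a segment covering a single char with a
long duration, so the colour creeps across it slowly.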

> And now for a bit of real life: the characters will jump, twirl, shrink,
> enlarge, flash and pounce as cursor passes by them. I'm not making this
> up, that's just a small sample of effects you'll see in any random anime
> fansub. I can't imagine any way to do that with text renderer "ignoring"
> colour markup.

You render by char.. Though, you're right when you say that is doomed
to be dog slow.

> > > That's because whilst presented model is rather flexible when it works,
> > > it's also extremely static (you'd need to replug pipeline *each* time you
> > > wanted to have differently coloured line added, yuck!), and makes
> > > complex things even more complex, losing lots of info in the process.
> >
> > Don't think so... just fiddle some properties of the coloring element,
> > and you're fine.
> 
> Umm, no. What I mean by losing info is "you get text + some formatting
> (size, rotation, position), and now you have to transform formatting
> info from *stream* into *pipeline*". Because coloured text needs
> additional element, you need to replug when coloured text is introduced
> for first time, etc. It's going to be *hard*, and (IMHO) inherently
> limited. You can't really express (very dynamic) information from text
> stream by (static) pipeline. You can replug pipeline, but it doesn't
> make it dynamic, only static in discrete time spans.

Ah! Now I see your point! Thanks!

> > > Now, application/x-gst-subtitles is a protocol that would support:
> > >
> > > - creating objects with unique ID
> > > - manipulating (move, resize, rotate, colourise, maybe arbitrary
> > > transformation) objects with given ID
> > > - rendering (it should be operation separate from creation) objects
> > > - destroying objects
> >
> > Now, this is something I don't completely agree with. This sounds like
> > subtitle rendering would be performed onto a canvas that has the size
> > of the video, while the subtitles themselves might only be a small
> > fraction of the whole thing. Now, blending a 720x576 image onto another
> > is much more costly than blending a 720x48 image onto a 720x576 at
> > position (0,650) (for example).
> 
> Is it really going to be expensive if most of that image will be 100%
> alpha anyway?

Yes, unless you do some RLE, in which case you're overdoing stuff,
methinks. If the image you generate has empty spaces, you could just
skip generating those, and tell the mixer where to merge the image,
instead of positioning it yourself. (This way, the user can have
subtitles on the top of the video if he so wants, and the renderer
does not need to know about it at all.)
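
To put numbers on it, here is a small Python sketch. It assumes the
rendered subtitle is available as rows of per-pixel opacity values
(0 meaning fully transparent), which is a simplification made for this
example; the function names are made up too. The renderer drops the
empty rows and reports where the remaining strip belongs, and the
mixer only has to touch that strip.

```python
def crop_to_content(opacity_rows):
    """Drop fully transparent rows above and below the content.

    Returns (y_offset, rows): the strip that actually contains pixels,
    and the row at which the mixer should merge it.
    """
    used = [y for y, row in enumerate(opacity_rows) if any(row)]
    if not used:
        return 0, []
    top, bottom = used[0], used[-1] + 1
    return top, opacity_rows[top:bottom]


def blend_cost(width, height):
    """Number of pixels a naive blend has to visit."""
    return width * height


full_frame = blend_cost(720, 576)  # 414720 pixels
strip_only = blend_cost(720, 48)   # 34560 pixels
```

So a naive blend of the full-size canvas visits 12 times as many
pixels as a blend of the 720x48 strip.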

> > Some moving might have a place in the x-gst-subtitles protocol, but..
> > support for scrolling should not be there, imho. What has its place
> > there in my opinion, is line alignment or the like..
> >
> > Hrm.. I think I'll think a bit more about this, and send another
> > reply again, later today. I hope to be able to have some nice
> > pipeline ideas by then, to illustrate what I have in mind.
> 
> Honestly, I thought a little about a situation when it'd be "renderbin"
> rather than single element, with textrenderer as I outlined above, and
> additional video effects elements like ones you'd like to see. There
> would be something like application/x-gst-subtitles protocol parsing
> element which would read entire input stream, and then dispatch relevant
> bits of it to appropriate elements inside bin. This is a bit dodgy, and
> will no doubt require heavy thinking to get it done sanely, but might be
> doable, who knows.

I'd be much happier with a renderbin than with a renderelement. On
the other hand, I'm quite clueless when it comes to multimedia, so...
dunno.
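
To make the create / manipulate / render / destroy protocol quoted
earlier a bit more concrete, here is a minimal Python sketch of the
bookkeeping a parsing element would need before it could dispatch
anything to the elements inside the bin. The class name, command names
and property names are all hypothetical, made up for illustration;
none of this is an existing GStreamer API.

```python
class SubtitleObjects:
    """Tracks subtitle objects by unique ID (hypothetical commands)."""

    def __init__(self):
        self.objects = {}     # object ID -> current properties
        self.render_log = []  # snapshots handed on for rendering

    def handle(self, command, obj_id, **props):
        if command == "create":
            if obj_id in self.objects:
                raise ValueError(f"duplicate object ID: {obj_id}")
            self.objects[obj_id] = dict(props)
        elif command == "manipulate":
            # Move, resize, rotate, colourise: all just property updates.
            self.objects[obj_id].update(props)
        elif command == "render":
            # Rendering is separate from creation: snapshot the state.
            self.render_log.append((obj_id, dict(self.objects[obj_id])))
        elif command == "destroy":
            del self.objects[obj_id]
        else:
            raise ValueError(f"unknown command: {command}")


store = SubtitleObjects()
store.handle("create", 1, text="la la", colour="white")
store.handle("manipulate", 1, colour="red")
store.handle("render", 1)
store.handle("destroy", 1)
```

In a renderbin, the "render" branch is where the snapshot would be
passed on to the text renderer and the effect elements.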

After reading your mail, it seems to me that many things mentioned in
this discussion can be done in various ways, and they all have their
pros and cons. Eg, the very simple text renderer I have is 12k lines
(with comments and everything), and if I strip out some stuff that is
obsolete, I can get it down to 9-10k I guess. With some clever
programming tricks, it can be used in many cases discussed here. On
the other hand, it really is not fit for some more fancy stuff like
the anime fan-subs.

I had a few things in mind against a generic cairo renderer, but..
most of that can be argued, and I don't know enough to explain them
anyway. I guess I'll see what comes out of this discussion, and see if
I can use the result :] (if I can't, then either the thing can be
fixed so I will be able to use it, or I can continue using my
pangotextsrc :)