[poppler] Vertical or horizontal writing?

Sat Aug 14 13:18:56 PDT 2010

On Saturday, 31 July 2010, mpsuzuki at hiroshima-u.ac.jp wrote:
> Hi,
> 
> Sorry for the long silence. Checking the source,
> I found the following points.
> 
> 1) poppler-qt4 page object issue
> 
> In Page::getText() method, poppler's TextOutputDev
> object is created, and its getText() method is invoked.
> In the creation of TextOutputDev, we can tune its
> configuration to enable/disable physical layout,
> enable/disable raw order mode, etc. I think that when
> vertical text is re-laid out for a horizontal text
> renderer, the result comes out in a logically broken
> order for MS Office's tricky vertical text.
> 
> If I test the TextOutputDev::displayPageSlice() method,
> especially with the rawOrder option, the text is not
> re-laid out. For MS Office's tricky vertical text,
> this is slightly better. However, displayPageSlice()
> is designed for a FILE stream. If we could pass
> a memory buffer to be filled by displayPageSlice(),
> it would be useful, but such a change requires many modifications,
> because displayPageSlice() is a method shared by all devices.
> 
> # changing TextOutputDev.cc is insufficient, I
> # have to change SplashOutputDev.cc, PSOutputDev.cc,
> # CairoOutputDev.cc, ArthurOutputDev.cc, ABWOutputDev.cc...
> # I cannot test all of them.
> 
> On the other hand, getText() is a device-specific method
> that exists only in TextOutputDev.cc, so changing getText() is
> easier.
> 
> 2) TextOutputDev::getText() issue
> 
> Because most PDF generators do not draw spaces with a font
> glyph but simply move the current point, the task of TextOutputDev
> is not limited to the objects drawn with fonts. It also tracks
> the movement of the current point to insert a space character
> (U+0020) at the appropriate position. Thus, TextOutputDev is
> a layout-aware device, like the other output devices.
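
To make the point about synthesized spaces concrete, here is a minimal C++ sketch of gap-based space insertion. It is not poppler's code: the Glyph record, the joinWithSpaces() name and the minGap threshold are all assumptions made for illustration, and characters are single bytes only to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record (not a poppler type).
struct Glyph {
    double x;      // where the glyph starts on the baseline
    double width;  // advance width of the glyph
    char ch;       // character the glyph maps to
};

// Rebuild text from glyphs in drawing order. A space (U+0020) is added
// whenever the current point jumped by more than minGap between the end
// of one glyph and the start of the next one.
std::string joinWithSpaces(const std::vector<Glyph> &glyphs, double minGap) {
    assert(minGap > 0.0);
    std::string out;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        if (i > 0) {
            const Glyph &prev = glyphs[i - 1];
            double gap = glyphs[i].x - (prev.x + prev.width);
            if (gap > minGap) {
                out += ' ';
            }
        }
        out += glyphs[i].ch;
    }
    return out;
}
```

With glyphs that only measure horizontal gaps like this, text placed one glyph at a time down a column never gets a sensible result, which is the layout-awareness problem described in the quoted paragraph.
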
> 
> TextOutputDev has optional switches for "force physical
> layout" and "force raw order" of the internal text processing.
> The results of "pdftotext -layout msword2007-vert.pdf -"
> and "pdftotext -raw msword2007-vert.pdf -" show the existence
> of layout-aware routines in TextOutputDev very clearly.
> 
> I think raw-ordered text from MS Office's tricky vertical
> text is usable for text search, but physically laid-out
> text is not.
> 
> 2-a) Is re-layout in vertical writing mode required?
> 
> We can find several interesting "TODO" comments in
> TextOutputDev.cc:
> 
>    2342 void TextPage::coalesce(GBool physLayout, GBool doHTML) {
>         ...
>    2535   //----- assemble the blocks
>    2536
>    2537   //~ add an outer loop for writing mode (vertical text)
>    2538
>    2539   // build blocks for each rotation value
>    2540   for (rot = 0; rot < 4; ++rot) {
>         ...
>    2830       //~ need to compute the primary writing mode (horiz/vert) in
>    2831       //~ addition to primary rotation
>         ...
>    3316   // build the flows
>    3317   //~ this needs to be adjusted for writing mode (vertical text)
>    3318   //~ this also needs to account for right-to-left column ordering
>    3319   flow = NULL;
>    3320   while (flows) {
>    3321     flow = flows;
>    3322     flows = flows->next;
>    3323     delete flow;
>    3324   }
>    3325   flows = lastFlow = NULL;
>    3326   // assume blocks are already in reading order,
>    3327   // and construct flows accordingly.
> 
>    ...
> 
>    3589 GooString *TextPage::getText(double xMin, double yMin,
>    3590                            double xMax, double yMax) {
>         ...
>    3632   //~ writing mode (horiz/vert)
>    3633
>    3634   // collect the line fragments that are in the rectangle
> 
>    ...
> 
>    4651 void TextPage::dump(void *outputStream, TextOutputFunc outputFunc,
>    4652                     GBool physLayout) {
> 
>    ...
> 
>    4689   //~ writing mode (horiz/vert)
>    4690
>    4691   // output the page in raw (content stream) order
>    4692   if (rawOrder) {
>         ...
> 
> From the comments, the authors of TextOutputDev.cc seem to
> be aware that the current layout analysis is specific to
> horizontal text. I think it's homework for CJK people, but
> right now I don't have enough time to work on this issue fully.
> 
> # also we can find a few comments for right-to-left script.
> 
> But, if we restrict our scope to the text search on PDF,
> I think raw-ordered extraction can work for most cases.
> 
> 2-b) getText() for rawOrder TextOutputDev?
> 
> As I wrote above, the default or rawOrder mode
> of pdftotext is useful for MS Office's tricky vertical text.
> The rawOrder mode can be specified when the TextOutputDev object
> is created. But when I create a TextOutputDev object in
> poppler-qt4 to extract raw-ordered text, TextOutputDev::getText()
> returns empty text. Oops. This is the designed behaviour of
> TextOutputDev::getText(). You can find the following lines in
> TextOutputDev.cc.
> 
>    3589 GooString *TextPage::getText(double xMin, double yMin,
>    3590                            double xMax, double yMax) {
> 
>         ...
> 
>    3605
>    3606   s = new GooString();
>    3607
>    3608   if (rawOrder) {
>    3609     return s;
>    3610   }
> 
> I'm still not sure why the rawOrder case is discarded. As an
> experiment, I wrote rawOrder text extraction code like this:
> 
> diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
> index f244639..1803629 100644
> --- a/poppler/TextOutputDev.cc
> +++ b/poppler/TextOutputDev.cc
> @@ -3702,10 +3702,6 @@ GooString *TextPage::getText(double xMin, double yMin,
> 
>    s = new GooString();
> 
> -  if (rawOrder) {
> -    return s;
> -  }
> -
>    // get the output encoding
>    if (!(uMap = globalParams->getTextEncoding())) {
>      return s;
> @@ -3726,6 +3722,23 @@ GooString *TextPage::getText(double xMin, double yMin,
>      break;
>    }
> 
> +  if (rawOrder) {
> +    TextWord*  word;
> +    for (word = rawWords; word && word <= rawLastWord; word = word->next) {
> +      for (j = 0; j < word->getLength(); ++j) {
> +        double gXMin, gXMax, gYMin, gYMax;
> +        word->getCharBBox(j, &gXMin, &gYMin, &gXMax, &gYMax);
> +        if (xMin <= gXMin && gXMax <= xMax && yMin <= gYMin && gYMax <= yMax)
> +        {
> +          char mbc[16]; /* XXX: uMap should know the limit !*/
> +          int  mbc_len = uMap->mapUnicode( *(word->getChar(j)), mbc, sizeof(mbc) );
> +          s->append(mbc, mbc_len);
> +        }
> +      }
> +    }
> +    return s;
> +  }
> +
>    //~ writing mode (horiz/vert)
> 
>    // collect the line fragments that are in the rectangle
> 
> Now TextOutputDev::getText() can extract the text from
> a TextOutputDev object in rawOrder mode.
> 
> 2-c) Line-joining issue in TextOutputDev::getText()
> 
> The raw text in a rawOrder TextOutputDev object has no spaces
> between words. Here, "word" means a group of glyphs drawn with
> a font without any external shift of the current point. My experimental
> patch above inserts spaces between words. Inserting
> spaces between words improves English text, but has
> bad effects on MS Office's tricky vertical text. In MS Office's
> tricky vertical text, each glyph is drawn after a vertical shift
> of the current point, so every word consists of a single glyph.
> 
> At present, I have 2 ideas to prevent such unwanted insertion of
> spaces in tricky vertical text.
> 
> idea i:
> Track the current point and the distance between glyphs,
> and determine whether 2 glyphs belong to the same vertical or
> horizontal line.
> 
> idea ii:
> Refer to the Unicode line breaking algorithm and determine
> whether a space should be inserted between the glyphs.
> - If the codepoints are Latin, the space is inserted.
> - If the codepoints are CJK Ideographs, the space is NOT inserted.
> - ...
> 
> I think idea ii is simple and good enough to start an experiment with,
> although it can be acceptable for poppler.
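
A rough C++ sketch of idea ii, under stated assumptions: isCjk() only checks a handful of well-known Unicode blocks instead of the real line breaking data (UAX #14), and the handling of mixed Latin/CJK pairs (no space) is my guess, since the list above leaves it open. isCjk(), wantsSpace() and joinWords() are hypothetical names, not poppler functions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Very rough CJK test covering only a few blocks; real code would consult
// the Unicode line breaking classes instead of hard-coded ranges.
bool isCjk(char32_t c) {
    return (c >= 0x3040 && c <= 0x30FF)   // Hiragana and Katakana
        || (c >= 0x3400 && c <= 0x4DBF)   // CJK Unified Ideographs Extension A
        || (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
        || (c >= 0xF900 && c <= 0xFAFF);  // CJK Compatibility Ideographs
}

// Assumption: a space is wanted only when neither neighbour is CJK.
bool wantsSpace(char32_t left, char32_t right) {
    return !isCjk(left) && !isCjk(right);
}

// Join raw-order "words" (runs of glyphs), deciding per boundary whether
// a space belongs there. Empty words are skipped.
std::u32string joinWords(const std::vector<std::u32string> &words) {
    std::u32string out;
    for (const std::u32string &w : words) {
        if (w.empty()) {
            continue;
        }
        if (!out.empty() && wantsSpace(out.back(), w.front())) {
            out += U' ';
        }
        out += w;
    }
    return out;
}
```

With this rule, one-glyph CJK "words" from the tricky vertical text are concatenated without spaces, while Latin words still get separated.
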

WoW, that's a huge mail :D

So my understanding is that "proper" CJK searching is a lot of work and you 
advocate for just exposing the raw text to the upper layers (users of poppler-
qt4) so they can do the work if they need it?

Albert

> 
> Regards,
> mpsuzuki
> 
> P.S.
> I've attached a patch "20100801a.diff" to extend
> 1) TextOutputDev::getText() to support rawOrder mode.
> 2) Qt4 Page::text() to take extra flag for rawOrder boolean.
> 3) a test program for poppler-qt's text extraction.
> 
> On Wed, 28 Jul 2010 16:32:20 +0900
> 
> mpsuzuki at hiroshima-u.ac.jp wrote:
> >Hi,
> >
> >On Wed, 28 Jul 2010 15:04:53 +0800 (CST)
> >
> >"cobra.yu" <cobra.yu at hyweb.com.tw> wrote:
> >>    Of course, such fake vertical writing mode is unacceptable.
> >
> >Thanks.
> >
> >>So, it shows that we can't rely only on the wMode in the font
> >>information, but must also take the real arrangement of text words
> >>on pages into consideration?
> >
> >Yes, WMode is insufficient. As Deri analyzed, the MS Office addin
> >draws vertical text by repeating "draw a glyph, move the current
> >point vertically, draw a glyph...". So, it might be possible
> >to detect the text flow direction by tracking the movement of
> >the current point. But if our interest is only text search,
> >tracking the current point won't be essential, I think. Maybe
> >collecting all glyphs in drawing order is sufficient for text
> >search. I will check the poppler-qt4 binding in more detail.
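
For reference, tracking the current point to guess the flow direction could look roughly like the C++ sketch below. It is only an illustration: Point, Flow and guessFlow() are hypothetical names, and a simple majority vote over the moves between consecutive glyph origins is an assumption, not something poppler does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class Flow { Horizontal, Vertical, Unknown };

// Hypothetical glyph origin in page coordinates.
struct Point {
    double x;
    double y;
};

// Vote over the moves between consecutive glyph origins: a move that is
// mostly along x counts as horizontal, mostly along y as vertical.
Flow guessFlow(const std::vector<Point> &origins) {
    int horizontal = 0;
    int vertical = 0;
    for (std::size_t i = 1; i < origins.size(); ++i) {
        double dx = std::fabs(origins[i].x - origins[i - 1].x);
        double dy = std::fabs(origins[i].y - origins[i - 1].y);
        if (dx > dy) {
            ++horizontal;
        } else if (dy > dx) {
            ++vertical;
        }
    }
    if (horizontal == vertical) {
        return Flow::Unknown;
    }
    return horizontal > vertical ? Flow::Horizontal : Flow::Vertical;
}
```

A column of glyphs placed one below the other, as the MS Office addin does, would be reported as Flow::Vertical even though the font's WMode says horizontal.
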
> >
> >Regards,
> >mpsuzuki
> >
> >>-----Original message-----
> >>From:suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
> >>To:cobra.yu at hyweb.com.tw
> >>Cc:poppler <poppler at lists.freedesktop.org>
> >>Date:Wed, 28 Jul 2010 15:18:58 +0900
> >>Subject:Re: [poppler] Vertical or horizontal writing?
> >>
> >>
> >>Hi,
> >>
> >>Please find attached fake vertical text produced by MS Excel
> >>2007. Is it acceptable for you to exclude such fake vertical
> >>text from your target?
> >>
> >>If you try to select the text in Adobe Reader, you will find
> >>that the glyphs are drawn in horizontal order; it is a stupid
> >>fake from the viewpoint of the page rendering language.
> >>
> >>Regards,
> >>mpsuzuki
> >>
> >>cobra.yu wrote:
> >>> Hi,
> >>> 
> >>>      The original requirement to detect the direction of text flow is
> >>>      for "searching". The present "search" function of Poppler::Page
> >>>      is searching horizontally only. So, for CJK users, I must add one
> >>>      vertical search function for the vertical writing mode. I could
> >>>      sort out all the textboxes in every page by (x,y) of the bounding
> >>>      box to make a vertical-like textbox list, but I encountered a
> >>>      fundamental problem: If I can't know the exact direction of text
> >>>      flow first, how could I know when to use vertical or horizontal
> >>>      search? BTW, I've accomplished the vertical text selection by the
> >>>      same way as my vertical search right now, but it's rather simpler
> >>>      than searching indeed.
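
The "sort the textboxes by (x,y)" approach described above might be sketched in C++ as follows. This is not Cobra's code: Box and verticalOrder() are hypothetical, the sketch assumes y grows downward and columns are read right to left, and it groups boxes into columns by simply dividing x by an assumed columnWidth (so a column straddling a bucket boundary would be split).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical text box: top-left corner plus the character it holds.
struct Box {
    double x;
    double y;
    char ch;
};

// Order boxes for vertical reading: rightmost column first, and top to
// bottom inside each column. Columns are x buckets of width columnWidth.
std::vector<Box> verticalOrder(std::vector<Box> boxes, double columnWidth) {
    assert(columnWidth > 0.0);
    auto column = [columnWidth](const Box &b) {
        return std::floor(b.x / columnWidth);
    };
    std::stable_sort(boxes.begin(), boxes.end(),
                     [&](const Box &a, const Box &b) {
                         double ca = column(a);
                         double cb = column(b);
                         if (ca != cb) {
                             return ca > cb;  // right column is read first
                         }
                         return a.y < b.y;    // then top to bottom
                     });
    return boxes;
}
```

The sort itself is the easy part; as the message says, the open question is knowing when this ordering should be used instead of the horizontal one.
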
> >>>      
> >>>           Cobra
> >>> 
> >>> -----Original message-----
> >>> From:mpsuzuki at hiroshima-u.ac.jp
> >>> To:Deri James <deri at chuzzlewit.demon.co.uk>
> >>> Cc:poppler at lists.freedesktop.org,cobra.yu at hyweb.com.tw
> >>> Date:Wed, 28 Jul 2010 01:59:40 +0900
> >>> Subject:Re: [poppler] Vertical or horizontal writing?
> >>> 
> >>> Dear Deri,
> >>> 
> >>> On Tue, 27 Jul 2010 17:22:14 +0100
> >>> 
> >>> Deri James <deri at chuzzlewit.demon.co.uk> wrote:
> >>>> When looking at the two PDFs you are using with acroread using the
> >>>> text selection tool:-
> >>>> 
> >>>> P1 of 'vert-horiz-ipa-std.pdf' selection caret is drawn horizontally.
> >>>> 'msword2010-vert2.pdf' selection caret is drawn vertically.
> >>>> 
> >>>> So, it seems acroread can't detect the vertical text in this file,
> >>>> i.e. it is actually horizontal text placed one glyph at a time (apart
> >>>> from 'MS Word 2010' which is horizontal text rotated 90 degrees).
> >>>> 
> >>>> The contents of the stream confirms this:-
> >>>> 
> >>>> stream
> >>>> /P <</MCID 0/Lang (en-US)>> BDC BT
> >>>> /F1 10.56 Tf
> >>>> 0.000000001 -1 1 0.000000001 496.54 756.84 Tm
> >>>> 0 g
> >>>> 0 G
> >>>> [(MS)6( )5(W)61(ord)-4( )5(20)10(10)] TJ
> >>>> ET
> >>>> EMC  /P <</MCID 1>> BDC BT
> >>>> /F2 10.56 Tf
> >>>> 1 0.000000017 -0.000000017 1 495.29 673.7 Tm
> >>>> <085B>Tj
> >>>> ET
> >>>> EMC  /P <</MCID 2>> BDC BT
> >>>> 1 0.000000017 -0.000000017 1 495.29 663.14 Tm
> >>>> <29AA>Tj
> >>>> 
> >>>> 
> >>>> 
> >>>> ...
> >>>> 
> >>>> So this PDF does not have any true vertical text.
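
The Tm operands in the stream above can be read mechanically. In a PDF text matrix [a b c d e f], the baseline direction is (a, b), so a small C++ sketch can classify the rotation; baselineQuarterTurns() is a hypothetical helper, not a poppler function.

```cpp
#include <cassert>
#include <cmath>

// Rotation of the text baseline in counter-clockwise quarter turns (0..3),
// computed from the a and b entries of a PDF text matrix [a b c d e f].
int baselineQuarterTurns(double a, double b) {
    assert(a != 0.0 || b != 0.0);
    const double kPi = 3.14159265358979323846;
    double angle = std::atan2(b, a);            // radians in [-pi, pi]
    long q = std::lround(angle / (kPi / 2.0));  // nearest quarter turn, -2..2
    return static_cast<int>((q % 4 + 4) % 4);   // normalize to 0..3
}
```

For the first matrix (0.000000001 -1 ...) this gives 3 quarter turns counter-clockwise, i.e. the "MS Word 2010" run rotated 90 degrees clockwise. For the second matrix (1 0.000000017 ...) it gives 0, i.e. upright horizontal glyphs. The y operand then drops from 673.7 to 663.14 between the two single-glyph runs, which is exactly the 10.56 font size: one glyph per line, stacked downward.
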
> >>> 
> >>> Yes, yes, I've just reached exactly the same conclusion.
> >>> Thank you for checking the content of the PDF.
> >>> 
> >>> The PDF generated by the MS Office addin uses the font object
> >>> for horizontal writing mode, at least as far as PDF design goes.
> >>> So text flow detection at the PDF font level does not work
> >>> with such a PDF. Higher-level recognition is needed.
> >>> 
> >>> It brings up a philosophical question: what is vertical text?
> >>> Some people make a vertical series of CJK glyphs by using
> >>> a very, very narrow text box; is this wrong vertical text?
> >>> If it is not vertical text, why should we distinguish it?
> >>> The invalid shape of the punctuation & arrows? Or...
> >>> 
> >>> I have to ask Cobra what the original requirement is that
> >>> makes it necessary to detect the text direction. Cobra, could
> >>> you describe why you needed to detect the direction of
> >>> text flow?
> >>> 
> >>> Regards,
> >>> mpsuzuki
> >