[poppler] Vertical or horizontal writing?
Albert Astals Cid
aacid at kde.org
Sat Aug 14 13:18:56 PDT 2010
On Saturday, 31 July 2010, mpsuzuki at hiroshima-u.ac.jp wrote:
> Hi,
>
> Sorry for the long silence. Checking the source,
> I found the following points.
>
> 1) poppler-qt4 page object issue
>
> In the Page::getText() method, poppler's TextOutputDev
> object is created, and its getText() method is invoked.
> When creating the TextOutputDev, we can tune its
> configuration to enable/disable physical layout,
> enable/disable raw order mode, etc. I think that when
> vertical text is re-laid out for a horizontal text
> renderer, the result comes out in a logically broken
> order for MS Office's tricky vertical text.
>
> If I test the TextOutputDev::displayPageSlice() method,
> especially with the rawOrder option, the text is not
> re-laid out. For MS Office's tricky vertical text,
> this is slightly better. However, the displayPageSlice()
> method is designed for a FILE stream. If we could pass
> a memory buffer to be filled by displayPageSlice(),
> it would be useful, but such a change requires many
> modifications, because displayPageSlice() is a pan-device method.
>
> # changing TextOutputDev.cc is insufficient, I
> # have to change SplashOutputDev.cc, PSOutputDev.cc,
> # CairoOutputDev.cc, ArthurOutputDev.cc, ABWOutputDev.cc...
> # I cannot test all of them.
>
> On the other hand, getText() is a device-specific method,
> found only in TextOutputDev.cc, so changing getText() is
> easier.
>
> 2) TextOutputDev::getText() issue
>
> Because most PDF generators do not draw spaces with a font
> glyph but simply move the current point, the task of TextOutputDev
> is not limited to the objects drawn by fonts. It also tracks
> the movement of the current point to insert a space character
> (U+0020) at the appropriate position. Thus, TextOutputDev is
> also a layout-aware device, like the other output devices.
>
> TextOutputDev has optional switches for "force physical
> layout" and "force raw order" of the internal text processing.
> The results of "pdftotext -layout msword2007-vert.pdf -"
> and "pdftotext -raw msword2007-vert.pdf -" show the existence
> of layout-aware routines in TextOutputDev very clearly.
>
> I think raw-ordered text from MS Office's tricky vertical
> text can be used for text search, but physically
> laid-out text cannot.
>
> 2-a) Is re-layout in vertical writing mode required?
>
> We can find several interesting "TODO" comments in
> TextOutputDev.cc:
>
> 2342 void TextPage::coalesce(GBool physLayout, GBool doHTML) {
> ...
> 2535 //----- assemble the blocks
> 2536
> 2537 //~ add an outer loop for writing mode (vertical text)
> 2538
> 2539 // build blocks for each rotation value
> 2540 for (rot = 0; rot < 4; ++rot) {
> ...
> 2830 //~ need to compute the primary writing mode (horiz/vert) in
> 2831 //~ addition to primary rotation
> ...
> 3316 // build the flows
> 3317 //~ this needs to be adjusted for writing mode (vertical text)
> 3318 //~ this also needs to account for right-to-left column ordering
> 3319 flow = NULL;
> 3320 while (flows) {
> 3321 flow = flows;
> 3322 flows = flows->next;
> 3323 delete flow;
> 3324 }
> 3325 flows = lastFlow = NULL;
> 3326 // assume blocks are already in reading order,
> 3327 // and construct flows accordingly.
>
> ...
>
> 3589 GooString *TextPage::getText(double xMin, double yMin,
> 3590 double xMax, double yMax) {
> ...
> 3632 //~ writing mode (horiz/vert)
> 3633
> 3634 // collect the line fragments that are in the rectangle
>
> ...
>
> 4651 void TextPage::dump(void *outputStream, TextOutputFunc outputFunc,
> 4652 GBool physLayout) {
>
> ...
>
> 4689 //~ writing mode (horiz/vert)
> 4690
> 4691 // output the page in raw (content stream) order
> 4692 if (rawOrder) {
> ...
>
> From the comments, the authors of TextOutputDev.cc seem to
> be aware that the current layout analysis is specific to
> horizontal text. I think this is homework for CJK people,
> but right now I don't have sufficient time to work on this issue fully.
>
> # also we can find a few comments for right-to-left script.
>
> But, if we restrict our scope to the text search on PDF,
> I think raw-ordered extraction can work for most cases.
>
> 2-b) getText() for rawOrder TextOutputDev?
>
> As I wrote above, the default mode or the rawOrder mode
> of pdftotext is useful for MS Office's tricky vertical text.
> The rawOrder mode can be specified when the TextOutputDev object
> is created. But... when I create a TextOutputDev object in
> poppler-qt4 to extract raw-ordered text, TextOutputDev::getText()
> returns empty text. Oops. This is the designed behaviour of
> TextOutputDev::getText(). You can find the following lines in
> TextOutputDev.cc.
>
> 3589 GooString *TextPage::getText(double xMin, double yMin,
> 3590 double xMax, double yMax) {
>
> ...
>
> 3605
> 3606 s = new GooString();
> 3607
> 3608 if (rawOrder) {
> 3609 return s;
> 3610 }
>
> I'm not yet sure why the rawOrder case is discarded. As an
> experiment, I wrote rawOrder text extraction code like this:
>
> diff --git a/poppler/TextOutputDev.cc b/poppler/TextOutputDev.cc
> index f244639..1803629 100644
> --- a/poppler/TextOutputDev.cc
> +++ b/poppler/TextOutputDev.cc
> @@ -3702,10 +3702,6 @@ GooString *TextPage::getText(double xMin, double yMin,
>
> s = new GooString();
>
> - if (rawOrder) {
> - return s;
> - }
> -
> // get the output encoding
> if (!(uMap = globalParams->getTextEncoding())) {
> return s;
> @@ -3726,6 +3722,23 @@ GooString *TextPage::getText(double xMin, double yMin,
> break;
> }
>
> + if (rawOrder) {
> + TextWord* word;
> + for (word = rawWords; word && word <= rawLastWord; word = word->next) {
> + for (j = 0; j < word->getLength(); ++j) {
> + double gXMin, gXMax, gYMin, gYMax;
> + word->getCharBBox(j, &gXMin, &gYMin, &gXMax, &gYMax);
> + if (xMin <= gXMin && gXMax <= xMax && yMin <= gYMin && gYMax <= yMax)
> + {
> + char mbc[16]; /* XXX: uMap should know the limit !*/
> + int mbc_len = uMap->mapUnicode( *(word->getChar(j)), mbc, sizeof(mbc) );
> + s->append(mbc, mbc_len);
> + }
> + }
> + }
> + return s;
> + }
> +
> //~ writing mode (horiz/vert)
>
> // collect the line fragments that are in the rectangle
>
> Now TextOutputDev::getText() can extract the text from
> TextOutputDev object in rawOrdered mode.
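The rectangle test in the patch can be tried outside poppler with a small standalone sketch. It assumes a hypothetical `Glyph` struct in place of poppler's `TextWord`, and a hand-written UTF-8 encoder in place of `uMap->mapUnicode()`. Neither `Glyph`, `appendUtf8` nor `extractRawOrder` is poppler API. Only the "keep a glyph when its box lies inside the query box, in drawing order" logic is taken from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one glyph of a TextWord: code point plus bounding box.
struct Glyph {
    char32_t cp;
    double xMin, yMin, xMax, yMax;
};

// Append one Unicode code point to `out` as UTF-8 (stand-in for uMap->mapUnicode()).
static void appendUtf8(std::string &out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // must be a valid Unicode code point
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Collect, in drawing order, every glyph whose box lies fully inside the query box.
// No spaces are inserted and no re-layout is attempted.
std::string extractRawOrder(const std::vector<Glyph> &glyphs,
                            double xMin, double yMin, double xMax, double yMax) {
    std::string out;
    for (const Glyph &g : glyphs) {
        if (xMin <= g.xMin && g.xMax <= xMax && yMin <= g.yMin && g.yMax <= yMax) {
            appendUtf8(out, g.cp);
        }
    }
    return out;
}
```

Because the glyphs are visited in content-stream order, text drawn one glyph at a time down a column comes out as one contiguous string, which is what a search needs.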
>
> 2-c) Line-joining issue in TextOutputDev::getText()
>
> The raw text in a rawOrder TextOutputDev object has no spaces
> between words. Here, "word" means a group of glyphs drawn by
> fonts without an external shift of the current point. My experimental
> patch above inserts spaces between words. Inserting
> spaces between words improves English text, but has
> a bad effect on MS Office's tricky vertical text. In MS Office's
> tricky vertical text, each glyph is drawn after a vertical shift
> of the current point, so every word consists of a single glyph.
>
> At present, I have 2 ideas to prevent such unwanted insertion of
> spaces in tricky vertical text.
>
> idea i:
> Track the current point and the distance between glyphs,
> and determine whether 2 glyphs belong to the same vertical
> or horizontal line.
>
> idea ii:
> Refer to the Unicode line breaking algorithm and determine
> whether a space should be inserted between the glyphs.
> - If the codepoints are Latin, the space is inserted.
> - If the codepoints are CJK Ideographs, the space is NOT inserted.
> - ...
>
> I think idea ii is simple and a good place to start an experiment,
> although I am not sure whether it would be acceptable for poppler.
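Idea ii could look roughly like the following. This is only a sketch under strong assumptions: it checks a handful of well-known Unicode blocks rather than implementing the Unicode line breaking algorithm (UAX #14), and the names `isCjkLike`, `needsSpaceBetween` and `joinWords` are made up for illustration and do not exist in poppler.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Rough test for "CJK-like" code points, based on a few well-known Unicode blocks.
// This is a simplification, not the full Unicode line breaking algorithm.
bool isCjkLike(char32_t cp) {
    assert(cp <= 0x10FFFF);  // must be a valid Unicode code point
    return (cp >= 0x3000 && cp <= 0x303F)     // CJK Symbols and Punctuation
        || (cp >= 0x3040 && cp <= 0x30FF)     // Hiragana and Katakana
        || (cp >= 0x3400 && cp <= 0x4DBF)     // CJK Unified Ideographs Extension A
        || (cp >= 0x4E00 && cp <= 0x9FFF)     // CJK Unified Ideographs
        || (cp >= 0xF900 && cp <= 0xFAFF)     // CJK Compatibility Ideographs
        || (cp >= 0xFF00 && cp <= 0xFFEF)     // Halfwidth and Fullwidth Forms
        || (cp >= 0x20000 && cp <= 0x2FFFF);  // Supplementary Ideographic Plane
}

// Insert a space only when neither neighbouring code point is CJK-like.
bool needsSpaceBetween(char32_t prev, char32_t next) {
    return !isCjkLike(prev) && !isCjkLike(next);
}

// Join raw-order "words" (runs of glyphs), adding spaces only where the rule allows.
std::u32string joinWords(const std::vector<std::u32string> &words) {
    std::u32string out;
    for (const std::u32string &w : words) {
        if (w.empty()) {
            continue;
        }
        if (!out.empty() && needsSpaceBetween(out.back(), w.front())) {
            out += U' ';
        }
        out += w;
    }
    return out;
}
```

With this rule, Latin runs such as "MS" and "Word" are joined with a space, while one-glyph CJK "words" produced by the MS Office addin are concatenated without one.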
WoW, that's a huge mail :D
So my understanding is that "proper" CJK searching is a lot of work and you
advocate for just exposing the raw text to the upper layers (users of
poppler-qt4) so they can do the work if they need it?
Albert
>
> Regards,
> mpsuzuki
>
> P.S.
> I've attached a patch "20100801a.diff" to extend
> 1) TextOutputDev::getText() to support rawOrder mode.
> 2) Qt4 Page::text() to take extra flag for rawOrder boolean.
> 3) a test program for poppler-qt's text extraction.
>
> On Wed, 28 Jul 2010 16:32:20 +0900
>
> mpsuzuki at hiroshima-u.ac.jp wrote:
> >Hi,
> >
> >On Wed, 28 Jul 2010 15:04:53 +0800 (CST)
> >
> >"cobra.yu" <cobra.yu at hyweb.com.tw> wrote:
> >> Of course, such fake vertical writing mode is unacceptable.
> >
> >Thanks.
> >
> >>So, it shows that we can't count only on the wMode of the font
> >>information, but must also take the real arrangement of text words on
> >>pages into consideration?
> >
> >Yes, WMode is insufficient. As Deri analyzed, the MS Office addin
> >draws vertical text by repeating "draw a glyph, move the current
> >point vertically, draw a glyph...". So, it might be possible
> >to detect the text flow direction by tracking the movement of the
> >current point. But if our interest is only text search,
> >tracking the current point won't be essential, I think. Maybe
> >collecting all glyphs in drawing order is sufficient for text
> >search. I will check the poppler-qt4 binding in more detail.
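Detecting flow direction from the movement of the current point could be sketched as below. The `Point` and `Flow` types, the `classifyFlow` function and the `dominance` threshold are all assumptions made for illustration; nothing here is poppler API. The idea is simply to sum the horizontal and vertical displacement between successive glyph origins and see which one dominates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical glyph origin, e.g. the translation part of each text matrix.
struct Point {
    double x, y;
};

enum class Flow { Unknown, Horizontal, Vertical };

// Classify the flow by comparing the accumulated |dx| and |dy| between
// successive glyph origins. `dominance` is an assumed tuning knob: one axis
// must exceed the other by this factor before a direction is reported.
Flow classifyFlow(const std::vector<Point> &origins, double dominance = 2.0) {
    assert(dominance >= 1.0);
    double sumDx = 0.0;
    double sumDy = 0.0;
    for (std::size_t i = 1; i < origins.size(); ++i) {
        sumDx += std::fabs(origins[i].x - origins[i - 1].x);
        sumDy += std::fabs(origins[i].y - origins[i - 1].y);
    }
    if (sumDy > dominance * sumDx) {
        return Flow::Vertical;
    }
    if (sumDx > dominance * sumDy) {
        return Flow::Horizontal;
    }
    return Flow::Unknown;  // too few glyphs, or no clearly dominant axis
}
```

Fed with the two glyph positions from Deri's stream dump quoted further down, (495.29, 673.7) and (495.29, 663.14), this reports a vertical flow even though the font itself is a horizontal-mode font.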
> >
> >Regards,
> >mpsuzuki
> >
> >>-----Original message-----
> >>From:suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
> >>To:cobra.yu at hyweb.com.tw
> >>Cc:poppler <poppler at lists.freedesktop.org>
> >>Date:Wed, 28 Jul 2010 15:18:58 +0900
> >>Subject:Re: [poppler] Vertical or horizontal writing?
> >>
> >>
> >>Hi,
> >>
> >>Please find attached fake vertical text produced by MS Excel
> >>2007. Is it acceptable for you to exclude such fake vertical
> >>text from your target?
> >>
> >>If you try to select the text in Adobe Reader, you can see
> >>that the order of glyph drawing is horizontal; it is a stupid
> >>fake from the viewpoint of the page rendering language.
> >>
> >>Regards,
> >>mpsuzuki
> >>
> >>cobra.yu wrote:
> >>> Hi,
> >>>
> >>> The original requirement to detect the direction of text flow is
> >>> for "searching". The present "search" function of Poppler::Page
> >>> is searching horizontally only. So, for CJK users, I must add one
> >>> vertical search function for the vertical writing mode. I could
> >>> sort out all the textboxes in every page by (x,y) of the bounding
> >>> box to make a vertical-like textbox list, but I encountered a
> >>> fundamental problem: If I can't know the exact direction of text
> >>> flow first, how could I know when to use vertical or horizontal
> >>> search? BTW, I've accomplished the vertical text selection by the
> >>> same way as my vertical search right now, but it's rather simpler
> >>> than searching indeed.
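Sorting text boxes into a vertical-like list, as described above, might be sketched as follows. The assumptions are stated loudly: columns are read right to left, y grows downward, and boxes whose horizontal centres lie within `columnTolerance` of the first box of a column belong to that column. `TextBox` here is a hypothetical struct, not `Poppler::TextBox`, and `verticalReadingOrder` is an illustrative name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical text box: top-left corner, size, and the character it holds.
struct TextBox {
    double x, y, width, height;
    char32_t cp;
};

static double centerX(const TextBox &b) {
    return b.x + b.width / 2.0;
}

// Order boxes for vertical reading: columns from right to left, and within a
// column from top to bottom (assuming y grows downward).
std::vector<TextBox> verticalReadingOrder(std::vector<TextBox> boxes,
                                          double columnTolerance) {
    assert(columnTolerance >= 0.0);
    // Rightmost boxes first.
    std::stable_sort(boxes.begin(), boxes.end(),
                     [](const TextBox &a, const TextBox &b) {
                         return centerX(a) > centerX(b);
                     });
    std::size_t start = 0;
    while (start < boxes.size()) {
        // Grow the column while boxes stay close to its first (rightmost) box.
        const double ref = centerX(boxes[start]);
        std::size_t end = start + 1;
        while (end < boxes.size() && ref - centerX(boxes[end]) <= columnTolerance) {
            ++end;
        }
        // Inside one column, read from top to bottom.
        std::stable_sort(boxes.begin() + static_cast<std::ptrdiff_t>(start),
                         boxes.begin() + static_cast<std::ptrdiff_t>(end),
                         [](const TextBox &a, const TextBox &b) { return a.y < b.y; });
        start = end;
    }
    return boxes;
}
```

This only produces the vertical ordering. It does not answer the question raised above of when to choose it over the horizontal one.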
> >>>
> >>> Cobra
> >>>
> >>> -----Original message-----
> >>> From:mpsuzuki at hiroshima-u.ac.jp
> >>> To:Deri James <deri at chuzzlewit.demon.co.uk>
> >>> Cc:poppler at lists.freedesktop.org,cobra.yu at hyweb.com.tw
> >>> Date:Wed, 28 Jul 2010 01:59:40 +0900
> >>> Subject:Re: [poppler] Vertical or horizontal writing?
> >>>
> >>> Dear Deri,
> >>>
> >>> On Tue, 27 Jul 2010 17:22:14 +0100
> >>>
> >>> Deri James <deri at chuzzlewit.demon.co.uk> wrote:
> >>>> When looking at the two PDFs you are using with acroread using the
> >>>> text selection tool:-
> >>>>
> >>>> P1 of 'vert-horiz-ipa-std.pdf' selection caret is drawn horizontally.
> >>>> 'msword2010-vert2.pdf' selection caret is drawn vertically.
> >>>>
> >>>> So, it seems acroread can't detect the vertical text in this file,
> >>>> i.e. it is actually horizontal text placed one glyph at a time (apart
> >>>> from 'MS Word 2010' which is horizontal text rotated 90 degrees).
> >>>>
> >>>> The contents of the stream confirms this:-
> >>>>
> >>>> stream
> >>>> /P <</MCID 0/Lang (en-US)>> BDC BT
> >>>> /F1 10.56 Tf
> >>>> 0.000000001 -1 1 0.000000001 496.54 756.84 Tm
> >>>> 0 g
> >>>> 0 G
> >>>> [(MS)6( )5(W)61(ord)-4( )5(20)10(10)] TJ
> >>>> ET
> >>>> EMC /P <</MCID 1>> BDC BT
> >>>> /F2 10.56 Tf
> >>>> 1 0.000000017 -0.000000017 1 495.29 673.7 Tm
> >>>> <085B>Tj
> >>>> ET
> >>>> EMC /P <</MCID 2>> BDC BT
> >>>> 1 0.000000017 -0.000000017 1 495.29 663.14 Tm
> >>>> <29AA>Tj
> >>>>
> >>>>
> >>>>
> >>>> ...
> >>>>
> >>>> So this PDF does not have any true vertical text.
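The two kinds of `Tm` operands in the stream can be told apart numerically. In a PDF text matrix `a b c d e f`, the pair (a, b) is the image of the unit x vector, so it gives the baseline direction. The sketch below turns that pair into an angle; the function names `baselineAngleDegrees` and `nearestQuarterTurn` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Angle of the text baseline in degrees, from the first two operands (a, b)
// of a text matrix "a b c d e f Tm". Counter-clockwise is positive.
double baselineAngleDegrees(double a, double b) {
    assert(a != 0.0 || b != 0.0);  // a degenerate matrix has no direction
    const double pi = std::acos(-1.0);
    return std::atan2(b, a) * 180.0 / pi;
}

// Snap an angle to the nearest multiple of 90 degrees, reported as 0, 90, 180 or 270.
int nearestQuarterTurn(double degrees) {
    const int q = static_cast<int>(std::lround(degrees / 90.0));
    return (((q % 4) + 4) % 4) * 90;
}
```

For the "MS Word 2010" run, (a, b) = (0.000000001, -1) gives about -90 degrees, i.e. a quarter turn reported as 270. For the single-glyph runs, (a, b) = (1, 0.000000017) gives about 0 degrees, i.e. ordinary horizontal text placed glyph by glyph.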
> >>>
> >>> Yes, yes, I've just reached exactly the same conclusion.
> >>> Thank you for checking the content of the PDF.
> >>>
> >>> The PDF generated by the MS Office addin uses a font object
> >>> for horizontal writing mode, in PDF design, at least. So
> >>> text flow detection at the PDF font level does not work
> >>> with such a PDF. Higher-level recognition is needed.
> >>>
> >>> It raises a philosophical question: what is vertical text?
> >>> Some people make a vertical series of CJK glyphs by using
> >>> a very, very narrow text box; is this wrong vertical text?
> >>> If it is not vertical text, why should we distinguish it?
> >>> The invalid shape of the punctuation & arrows? Or...
> >>>
> >>> I have to ask Cobra what the original requirement is, and
> >>> why the text direction should be detected. Cobra, could
> >>> you describe why you needed to detect the direction of
> >>> text flow?
> >>>
> >>> Regards,
> >>> mpsuzuki
> >
> >_______________________________________________
> >poppler mailing list
> >poppler at lists.freedesktop.org
> >http://lists.freedesktop.org/mailman/listinfo/poppler