[poppler] Vertical or horizontal writing?

mpsuzuki at hiroshima-u.ac.jp mpsuzuki at hiroshima-u.ac.jp
Sat Jul 31 10:07:42 PDT 2010


Hi,

Sorry for the long silence. Checking the source,
I found the following points.

1) poppler-qt4 page object issue

In the Page::getText() method, poppler's TextOutputDev
object is created and its getText() method is invoked.
When creating the TextOutputDev, we can tune its
configuration to enable/disable physical layout,
enable/disable raw order mode, etc. I think that when
vertical text is re-laid out for a horizontal text
renderer, the logical order of the result is broken
in the case of MS Office's tricky vertical text.

If I test the TextOutputDev::displayPageSlice() method,
especially with the rawOrder option, the text is not
re-laid out. For MS Office's tricky vertical text,
this is slightly better. However, the displayPageSlice()
method is designed for a FILE stream. If we could pass
a memory buffer to be filled by displayPageSlice(),
it would be useful, but such a change requires many
modifications, because displayPageSlice() is a method
shared by all output devices.

# Changing TextOutputDev.cc is insufficient; I would also
# have to change SplashOutputDev.cc, PSOutputDev.cc,
# CairoOutputDev.cc, ArthurOutputDev.cc, ABWOutputDev.cc...
# I cannot test all of them.

On the other hand, getText() is a device-specific method
that exists only in TextOutputDev.cc, so changing getText()
is easier.
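
To illustrate the memory-buffer idea above: a callback in the
style of TextOutputFunc can append each chunk of text to a
string instead of writing to a FILE. This is only a rough,
self-contained sketch. The typedef below is a local stand-in
(the exact signature in poppler's TextOutputDev.h is an
assumption here), and appendToString() and dumpChunks() are
hypothetical names, not poppler functions.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Local stand-in for a TextOutputFunc-style callback type.
// Assumption: the real typedef in TextOutputDev.h may differ in detail.
typedef void (*TextOutputFunc)(void *stream, const char *text, int len);

// Hypothetical callback: append each chunk to a std::string
// instead of writing it to a FILE stream.
void appendToString(void *stream, const char *text, int len) {
  assert(stream != nullptr && len >= 0);
  static_cast<std::string *>(stream)->append(
      text, static_cast<std::string::size_type>(len));
}

// Hypothetical stand-in for a dump routine: it only sees the callback
// and an opaque stream pointer, never the concrete buffer type.
void dumpChunks(void *outputStream, TextOutputFunc outputFunc) {
  const char *chunks[] = {"vertical", " ", "text", "\n"};
  for (const char *c : chunks) {
    outputFunc(outputStream, c, static_cast<int>(std::strlen(c)));
  }
}
```

With this shape the caller decides where the text goes: passing
a std::string and appendToString() fills a memory buffer, and
the dump routine itself does not need to change.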

2) TextOutputDev::getText() issue

Because most PDF generators do not draw spaces with a font
glyph but simply move the current point, the task of
TextOutputDev is not limited to the objects drawn with fonts.
It also tracks the movement of the current point in order to
insert a space character (U+0020) at the appropriate position.
Thus, TextOutputDev is a layout-aware device, like the other
output devices.
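
The space insertion described above can be sketched roughly as
follows. This is a toy model, not the code in TextOutputDev.cc:
the Glyph struct, joinWithSpaces() and the gapRatio threshold
are all hypothetical, and real text extraction has to deal with
font sizes, rotation and more.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One drawn glyph: its character, the x position where it starts,
// and its advance width. Hypothetical type, not poppler's.
struct Glyph {
  char ch;
  double x;
  double width;
};

// Rebuild a line of text, emitting a space (U+0020) whenever the
// current point jumped forward by more than gapRatio times the
// width of the previous glyph.
std::string joinWithSpaces(const std::vector<Glyph> &glyphs,
                           double gapRatio) {
  assert(gapRatio > 0.0);
  std::string out;
  for (std::size_t i = 0; i < glyphs.size(); ++i) {
    if (i > 0) {
      const Glyph &prev = glyphs[i - 1];
      double gap = glyphs[i].x - (prev.x + prev.width);
      if (gap > gapRatio * prev.width) {
        out += ' ';  // no glyph was drawn here, only a move
      }
    }
    out += glyphs[i].ch;
  }
  return out;
}
```

For example, glyphs 'a' and 'b' placed back to back, followed
by 'c' and 'd' after a jump of more than the threshold, come
out as "ab cd" although no space glyph was ever drawn.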

TextOutputDev has optional switches for "force physical
layout" and "force raw order" in its internal text processing.
The results of "pdftotext -layout msword2007-vert.pdf -"
and "pdftotext -raw msword2007-vert.pdf -" show the existence
of layout-aware routines in TextOutputDev very clearly.

I think raw-ordered text from MS Office's tricky vertical
text is usable for text search, but physically laid-out
text is not.
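
A toy model may make the difference clearer. It assumes two
vertical columns whose glyphs are drawn one by one, the right
column first, and compares content stream order with a naive
"sort into horizontal rows" layout. All names here (Glyph,
rawOrder(), physicalRows(), found()) are hypothetical, and the
row sorting is a simplification, not poppler's actual layout
algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A glyph with its position; y grows downward. Hypothetical type.
struct Glyph {
  char ch;
  int x;
  int y;
};

// Raw order: keep the glyphs in content stream order.
std::string rawOrder(const std::vector<Glyph> &glyphs) {
  std::string out;
  for (const Glyph &g : glyphs) out += g.ch;
  return out;
}

// Naive physical layout for a horizontal renderer: sort into rows
// (by y, then by x) and put a newline between rows.
std::string physicalRows(std::vector<Glyph> glyphs) {
  std::stable_sort(glyphs.begin(), glyphs.end(),
                   [](const Glyph &a, const Glyph &b) {
                     return a.y != b.y ? a.y < b.y : a.x < b.x;
                   });
  std::string out;
  for (std::size_t i = 0; i < glyphs.size(); ++i) {
    if (i > 0 && glyphs[i].y != glyphs[i - 1].y) out += '\n';
    out += glyphs[i].ch;
  }
  return out;
}

// Two vertical columns; the right one (x = 20) is drawn first.
std::vector<Glyph> sampleVerticalPage() {
  return {{'A', 20, 0}, {'B', 20, 1}, {'C', 20, 2},
          {'D', 10, 0}, {'E', 10, 1}, {'F', 10, 2}};
}

// Plain substring search, as a text search feature would do.
bool found(const std::string &text, const std::string &word) {
  assert(!word.empty());
  return text.find(word) != std::string::npos;
}
```

In this model the raw order gives "ABCDEF", so a search for
"BC" succeeds, while the row layout gives "DA", "EB", "FC" on
separate lines and the same search fails.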

2-a) Is re-layout in vertical writing mode required?

We can find several interesting "TODO" comments in
TextOutputDev.cc:

   2342 void TextPage::coalesce(GBool physLayout, GBool doHTML) {
        ...
   2535   //----- assemble the blocks
   2536 
   2537   //~ add an outer loop for writing mode (vertical text)
   2538 
   2539   // build blocks for each rotation value
   2540   for (rot = 0; rot < 4; ++rot) {
        ...
   2830       //~ need to compute the primary writing mode (horiz/vert) in
   2831       //~ addition to primary rotation
        ...
   3316   // build the flows
   3317   //~ this needs to be adjusted for writing mode (vertical text)
   3318   //~ this also needs to account for right-to-left column ordering
   3319   flow = NULL;
   3320   while (flows) {
   3321     flow = flows;
   3322     flows = flows->next;
   3323     delete flow;
   3324   }
   3325   flows = lastFlow = NULL;
   3326   // assume blocks are already in reading order,
   3327   // and construct flows accordingly.

   ...

   3589 GooString *TextPage::getText(double xMin, double yMin,
   3590                            double xMax, double yMax) {
        ...
   3632   //~ writing mode (horiz/vert)
   3633 
   3634   // collect the line fragments that are in the rectangle

   ...

   4651 void TextPage::dump(void *outputStream, TextOutputFunc outputFunc,
   4652                     GBool physLayout) {

   ...

   4689   //~ writing mode (horiz/vert)
   4690 
   4691   // output the page in raw (content stream) order
   4692   if (rawOrder) {
        ...
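
As one possible reading of the "need to compute the primary
writing mode (horiz/vert)" comment above, the mode could be
guessed from the direction in which the current point moves
between successive glyphs. The following is only a sketch under
that assumption. WritingMode, Point and guessWritingMode() are
hypothetical names, and nothing like this is claimed to exist
in TextOutputDev.cc.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class WritingMode { Horizontal, Vertical };

// Origin of one glyph, in content stream order. Hypothetical type.
struct Point {
  double x;
  double y;
};

// Majority vote over the steps between successive glyph origins:
// a step that moves more in y than in x counts as "vertical".
WritingMode guessWritingMode(const std::vector<Point> &origins) {
  assert(origins.size() >= 2);
  int horiz = 0;
  int vert = 0;
  for (std::size_t i = 1; i < origins.size(); ++i) {
    double dx = std::fabs(origins[i].x - origins[i - 1].x);
    double dy = std::fabs(origins[i].y - origins[i - 1].y);
    if (dy > dx) {
      ++vert;
    } else {
      ++horiz;
    }
  }
  return vert > horiz ? WritingMode::Vertical : WritingMode::Horizontal;
}
```

A heuristic like this depends only on glyph positions, so it
would not rely on the font declaring a vertical writing mode.
Whether that is enough for real documents is untested.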


