[poppler] Vertical or horizontal writing?
mpsuzuki at hiroshima-u.ac.jp
Sat Jul 31 10:07:42 PDT 2010
Hi,
Sorry for the long silence. Checking the source,
I found the following points.
1) poppler-qt4 page object issue
In Page::getText() method, poppler's TextOutputDev
object is created, and its getText() method is invoked.
In the creation of TextOutputDev, we can tune its
configuration to enable/disable physical layout,
enable/disable raw order mode, etc. I think that when
vertical text is re-laid out for a horizontal text
renderer, the logical order of the result is broken
in the case of MS Office's tricky vertical text.
If I test the TextOutputDev::displayPageSlice() method,
especially with the rawOrder option, the text is not
re-laid out. For MS Office's tricky vertical text,
this is slightly better. However, the displayPageSlice()
method is designed for a FILE stream. If we could pass
a memory buffer to be filled by displayPageSlice(),
it would be useful, but such a change requires many
modifications, because displayPageSlice() is a method
shared by all output devices.
# changing TextOutputDev.cc is insufficient, I
# have to change SplashOutputDev.cc, PSOutputDev.cc,
# CairoOutputDev.cc, ArthurOutputDev.cc, ABWOutputDev.cc...
# I cannot test all of them.
On the other hand, getText() is a device-specific method
found only in TextOutputDev.cc, so changing getText() is
easier.
2) TextOutputDev::getText() issue
Because most PDF generators do not draw spaces with a font
glyph but simply move the current point, the task of
TextOutputDev is not limited to the objects drawn by fonts.
It also tracks the movement of the current point to insert
a space character (U+0020) at the appropriate position.
Thus, TextOutputDev is also a layout-aware device, like
the other output devices.
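
To make the space-insertion idea concrete, here is a minimal
sketch in C++. It is not poppler's code: the Glyph struct, the
rebuildLine() function and the 0.1 gap ratio are all made up for
illustration, and the real TextOutputDev logic is more involved.
It only shows how a space can be inferred from a jump of the
current point along one horizontal line.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical glyph record; not poppler's TextWord or any real poppler type.
struct Glyph {
    char ch;       // character drawn by the font
    double x;      // left edge of the glyph on the baseline
    double width;  // advance width of the glyph
};

// Rebuild one horizontal line of text. A space (U+0020) is inserted
// wherever the next glyph starts more than minGapRatio * fontSize past
// the end of the previous glyph. The ratio is an illustrative
// assumption, not the threshold poppler actually uses.
std::string rebuildLine(const std::vector<Glyph> &glyphs, double fontSize,
                        double minGapRatio = 0.1) {
    std::string out;
    for (std::size_t i = 0; i < glyphs.size(); ++i) {
        if (i > 0) {
            double prevEnd = glyphs[i - 1].x + glyphs[i - 1].width;
            if (glyphs[i].x - prevEnd > minGapRatio * fontSize) {
                out += ' ';  // the pen jumped: treat the gap as a space
            }
        }
        out += glyphs[i].ch;
    }
    return out;
}
```

For vertical writing the same test would have to look at the gap
along the y axis instead, which is one reason the writing mode
matters to this device.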
TextOutputDev has optional switches for "force physical
layout" and "force raw order" of the internal text processing.
The results of "pdftotext -layout msword2007-vert.pdf -"
and "pdftotext -raw msword2007-vert.pdf -" show the
existence of layout-aware routines in TextOutputDev very
clearly.
I think raw-ordered text from MS Office's tricky vertical
text is usable for text search, but physically laid-out
text is not.
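
The difference can be sketched with a toy model in C++. This is
not poppler's coalesce(): PlacedGlyph, rawOrderText() and
horizontalLayoutText() are hypothetical names, and the "layout"
here is just a sort into horizontal rows. It assumes the PDF
generator emits the glyphs of each vertical column in reading
order, which may not hold for every file.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical glyph record; y grows downward in this toy model.
struct PlacedGlyph {
    char ch;
    double x;
    double y;
};

// Raw order: keep the glyphs exactly as the content stream emitted them.
std::string rawOrderText(const std::vector<PlacedGlyph> &glyphs) {
    std::string out;
    for (const PlacedGlyph &g : glyphs) {
        out += g.ch;
    }
    return out;
}

// Naive horizontal layout: order glyphs row by row (top to bottom),
// then left to right inside a row. A simplified stand-in for a
// physical layout pass that assumes horizontal writing.
std::string horizontalLayoutText(std::vector<PlacedGlyph> glyphs) {
    std::stable_sort(glyphs.begin(), glyphs.end(),
                     [](const PlacedGlyph &a, const PlacedGlyph &b) {
                         if (a.y != b.y) return a.y < b.y;
                         return a.x < b.x;
                     });
    std::string out;
    for (const PlacedGlyph &g : glyphs) {
        out += g.ch;
    }
    return out;
}
```

With two vertical columns written right to left, "ABC" at x=100
and "DEF" at x=80, the raw order keeps "ABCDEF" while the
horizontal pass interleaves the columns, so a search for "ABC"
only succeeds on the raw-ordered text.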
2-a) Is re-layout in vertical writing mode required?
We can find several interesting "TODO" comments in
TextOutputDev.cc:
2342 void TextPage::coalesce(GBool physLayout, GBool doHTML) {
...
2535 //----- assemble the blocks
2536
2537 //~ add an outer loop for writing mode (vertical text)
2538
2539 // build blocks for each rotation value
2540 for (rot = 0; rot < 4; ++rot) {
...
2830 //~ need to compute the primary writing mode (horiz/vert) in
2831 //~ addition to primary rotation
...
3316 // build the flows
3317 //~ this needs to be adjusted for writing mode (vertical text)
3318 //~ this also needs to account for right-to-left column ordering
3319 flow = NULL;
3320 while (flows) {
3321 flow = flows;
3322 flows = flows->next;
3323 delete flow;
3324 }
3325 flows = lastFlow = NULL;
3326 // assume blocks are already in reading order,
3327 // and construct flows accordingly.
...
3589 GooString *TextPage::getText(double xMin, double yMin,
3590 double xMax, double yMax) {
...
3632 //~ writing mode (horiz/vert)
3633
3634 // collect the line fragments that are in the rectangle
...
4651 void TextPage::dump(void *outputStream, TextOutputFunc outputFunc,
4652 GBool physLayout) {
...
4689 //~ writing mode (horiz/vert)
4690
4691 // output the page in raw (content stream) order
4692 if (rawOrder) {
...