[poppler] pdftohtml patch: restore old "raw" command-line option
Warren Toomey
poppler at tuhs.org
Sun Oct 5 05:49:20 PDT 2008
pdftohtml used to have a "raw" mode, which has since been removed. In "raw"
mode, text from a PDF document is processed in the order in which it occurs
in the content stream. The current version of pdftohtml instead reorders the
text by increasing y-value, i.e. from the top of a page down to the bottom.
This text reordering plays merry havoc with multi-column pages, as the text
from the columns becomes interleaved instead of remaining separate.
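
To make the interleaving concrete, here is a minimal sketch of the effect. It is not poppler code: the Fragment struct, the coordinates and the helper names are all made up for illustration, and it only assumes that the default mode behaves like a sort on y.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a positioned string; not a poppler type.
struct Fragment {
    double x;
    double y;          // distance from the top of the page
    std::string text;
};

// A two-column page in content stream order: the whole left column
// first, then the whole right column.
std::vector<Fragment> twoColumnPage() {
    return {
        {50.0, 100.0, "L1"},  {50.0, 120.0, "L2"},  {50.0, 140.0, "L3"},
        {300.0, 100.0, "R1"}, {300.0, 120.0, "R2"}, {300.0, 140.0, "R3"},
    };
}

// Reorder by increasing y (top to bottom). stable_sort keeps fragments
// that share a y-value in their original relative order.
std::vector<Fragment> sortByY(std::vector<Fragment> frags) {
    std::stable_sort(frags.begin(), frags.end(),
                     [](const Fragment &a, const Fragment &b) {
                         return a.y < b.y;
                     });
    return frags;
}

// Join the fragment texts with single spaces.
std::string joinText(const std::vector<Fragment> &frags) {
    std::string out;
    for (const Fragment &f : frags) {
        if (!out.empty())
            out += ' ';
        out += f.text;
    }
    return out;
}

// Compare the two orderings on the sample page.
void checkExample() {
    // Content stream order keeps each column together.
    assert(joinText(twoColumnPage()) == "L1 L2 L3 R1 R2 R3");
    // Sorting on y alternates between the left and right columns.
    assert(joinText(sortByY(twoColumnPage())) == "L1 R1 L2 R2 L3 R3");
}
```

Calling checkExample() from a small main() exercises both orderings. The y-sorted result alternates between the two columns, which is the interleaving described above.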
The attached patch restores the -raw command-line option to pdftohtml. The
program retains its current behaviour if the -raw option is not used, but
reverts to the "text as it appears" behaviour with the -raw option enabled.
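
As a rough summary of how the patched code arrives at the final rawOrder value, here is a small sketch. effectiveRawOrder is a hypothetical helper and does not exist in pdftohtml.cc; rawFlag stands for the -raw option and complexMode for complex output mode, which still forces raw order as before.

```cpp
#include <cassert>

// Hypothetical helper, not part of pdftohtml.cc. It mirrors the patched
// logic: rawOrder now defaults to false, -raw sets it, and complex
// output mode still forces it on.
bool effectiveRawOrder(bool rawFlag, bool complexMode) {
    bool rawOrder = rawFlag;
    if (complexMode)
        rawOrder = complexMode;
    return rawOrder;
}

// Walk through all four combinations of the two inputs.
void checkRawOrderTable() {
    assert(!effectiveRawOrder(false, false)); // default: text sorted by y
    assert(effectiveRawOrder(true, false));   // -raw: content stream order
    assert(effectiveRawOrder(false, true));   // complex mode: unchanged
    assert(effectiveRawOrder(true, true));
}
```

In other words, the result is simply "rawFlag or complexMode", so output without -raw is the same as before the patch.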
Cheers,
Warren
-------------- next part --------------
--- pdftohtml.cc 2008/10/01 05:47:31 1.5
+++ pdftohtml.cc 2008/10/05 12:37:58
@@ -39,7 +39,7 @@
static int firstPage = 1;
static int lastPage = 0;
-static GBool rawOrder = gTrue;
+static GBool rawOrder = gFalse;
GBool printCommands = gTrue;
static GBool printHelp = gFalse;
GBool printHtml = gFalse;
@@ -71,8 +71,8 @@
"first page to convert"},
{"-l", argInt, &lastPage, 0,
"last page to convert"},
- /*{"-raw", argFlag, &rawOrder, 0,
- "keep strings in content stream order"},*/
+ {"-raw", argFlag, &rawOrder, 0,
+ "keep strings in content stream order"},
{"-q", argFlag, &errQuiet, 0,
"don't print any messages or errors"},
{"-h", argFlag, &printHelp, 0,
@@ -270,7 +270,8 @@
}
}}
- rawOrder = complexMode; // todo: figure out what exactly rawOrder do :)
+ if (complexMode)
+ rawOrder = complexMode; // todo: figure out what exactly rawOrder do :)
// write text file
htmlOut = new HtmlOutputDev(htmlFileName->getCString(),
--- pdftohtml.1 2008/09/30 10:03:25 1.3
+++ pdftohtml.1 2008/10/05 12:40:31
@@ -46,6 +46,10 @@
.B \-noframes
generate no frames. Not supported in complex output mode.
.TP
+.B \-raw
+process text as it occurs in the document. Without this option, text is
+processed in increasing y-value, i.e. from top to bottom of each page.
+.TP
.B \-stdout
use standard output
.TP