[poppler] pdftohtml patch: restore old "raw" command-line option

Warren Toomey poppler at tuhs.org
Sun Oct 5 05:49:20 PDT 2008


pdftohtml used to have a "raw" mode, which has since been removed. In "raw"
mode, text from a PDF document is processed in the order in which it occurs in
the content stream. The current version of pdftohtml instead reorders the text
by increasing y-value, i.e. from the top of a page going down to the bottom.
 
This text reordering plays merry havoc with multi-column pages, as the text
from the columns becomes interleaved instead of remaining separate.

The attached patch restores the -raw command-line option to pdftohtml. The
program retains its current behaviour if the -raw option is not used, but
reverts to the "text as it appears" behaviour with the -raw option enabled.
 
Cheers,
        Warren
-------------- next part --------------
--- pdftohtml.cc	2008/10/01 05:47:31	1.5
+++ pdftohtml.cc	2008/10/05 12:37:58
@@ -39,7 +39,7 @@
 
 static int firstPage = 1;
 static int lastPage = 0;
-static GBool rawOrder = gTrue;
+static GBool rawOrder = gFalse;
 GBool printCommands = gTrue;
 static GBool printHelp = gFalse;
 GBool printHtml = gFalse;
@@ -71,8 +71,8 @@
    "first page to convert"},
   {"-l",      argInt,      &lastPage,      0,
    "last page to convert"},
-  /*{"-raw",    argFlag,     &rawOrder,      0,
-    "keep strings in content stream order"},*/
+  {"-raw",    argFlag,     &rawOrder,      0,
+    "keep strings in content stream order"},
   {"-q",      argFlag,     &errQuiet,      0,
    "don't print any messages or errors"},
   {"-h",      argFlag,     &printHelp,     0,
@@ -270,7 +270,8 @@
 	  }
   }}
 
-  rawOrder = complexMode; // todo: figure out what exactly rawOrder do :)
+  if (complexMode)
+    rawOrder = complexMode; // todo: figure out what exactly rawOrder do :)
 
   // write text file
   htmlOut = new HtmlOutputDev(htmlFileName->getCString(), 
--- pdftohtml.1	2008/09/30 10:03:25	1.3
+++ pdftohtml.1	2008/10/05 12:40:31
@@ -46,6 +46,10 @@
 .B \-noframes
 generate no frames. Not supported in complex output mode.
 .TP
+.B \-raw
+process text as it occurs in the document. Without this option, text is
+processed in increasing y-value, i.e. from top to bottom of each page.
+.TP
 .B \-stdout
 use standard output
 .TP 

