[poppler] pdftohtml patch: restore old "raw" command-line option

Ross Moore ross at ics.mq.edu.au
Sun Oct 5 18:44:31 PDT 2008


Hi Warren, and others,

On 05/10/2008, at 11:49 PM, Warren Toomey wrote:

> pdftohtml used to have a "raw" mode which has been removed. In "raw" mode,
> text from a PDF document is processed in the order that it occurs. However,
> the current version of pdftohtml reorders the text to be in increasing y-value,
> i.e. from the top of a page going down to the bottom.
>
> This text reordering plays merry havoc with multi-column pages, as the text
> from the columns becomes interleaved instead of remaining separate.
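
To make the interleaving concrete, here is a toy sketch in Python. It is
not Poppler's code: the fragment coordinates, the strings and the two
helper functions are all made up for illustration, and the sort is only a
stand-in for "reorder by increasing y-value".

```python
# Toy model only: hypothetical text fragments for a two-column page,
# listed in the order they occur in the content stream.
# Each fragment is (x, y, text); y increases from the top of the page down.
fragments = [
    (50, 100, "L1"),   # left column, first line
    (50, 120, "L2"),   # left column, second line
    (300, 100, "R1"),  # right column, first line
    (300, 120, "R2"),  # right column, second line
]


def raw_order(frags):
    """Return the text in stream order, as a "raw" mode would."""
    return [text for _, _, text in frags]


def y_sorted_order(frags):
    """Return the text sorted by y, then x (a stand-in for y-reordering)."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


print(raw_order(fragments))       # columns stay separate
print(y_sorted_order(fragments))  # columns become interleaved
```

Stream order keeps each column together, whereas the y-sort alternates
between the left and right columns line by line.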

The 'raw' option is also needed when extracting the text from PDFs
using pdftotext, where the PDF was produced using LaTeX or Plain TeX,
especially when there is mathematical content.

Here are just a few of the inline structures that are dependent upon
following the correct order of the letters/symbols within the PDF streams:

   1.  2-character accented letters (accent first, then the letter):
       when sorted by y-coord, an upper accent moves to the beginning
       of the line, while lower accents follow the whole line;

   2.  superscripts and subscripts on math-symbols, which similarly migrate;
       also, superscripted markers indicating footnotes or citations;

   3.  fractions, integrals with limits, summations, products, etc.;

   4.  large delimiters made from 2 or more pieces;

   5.  other math constructions involving 2-dimensional layout,
       e.g., square-roots, surds, labelled relations, ...
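
Items 1 and 2 can be mimicked with a small Python sketch. Again this is
only a toy model under my own assumptions, not what Poppler does
internally: the glyph positions are invented, an apostrophe stands in for
an acute accent, and a plain sort on (y, x) stands in for the reordering.

```python
# Toy model only: one line of TeX-like output, in content-stream order.
# Each glyph is (x, y, text); the baseline is y = 200 and smaller y is higher.
glyphs = [
    (100, 200, "x"),  # math symbol on the baseline
    (106, 194, "2"),  # its superscript, raised above the baseline
    (115, 200, "+"),
    (125, 195, "'"),  # accent, emitted before the letter it belongs to
    (125, 200, "e"),  # the accented letter, on the baseline
]


def stream_text(items):
    """Join the glyphs in stream order."""
    return "".join(text for _, _, text in items)


def y_sorted_text(items):
    """Join the glyphs after sorting by y, then x."""
    return "".join(text for _, _, text in sorted(items, key=lambda g: (g[1], g[0])))


print(stream_text(glyphs))    # superscript and accent stay next to their bases
print(y_sorted_text(glyphs))  # superscript and accent move to the front
```

In stream order the superscript follows its base and the accent precedes
its letter; after the sort both raised glyphs end up ahead of everything
that sits on the baseline.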


I've mentioned this kind of problem before:

http://lists.freedesktop.org/archives/poppler/2008-June/003877.html
http://lists.freedesktop.org/archives/poppler/2008-June/003888.html
http://lists.freedesktop.org/archives/poppler/2008-May/003839.html

viz.

>> > Thus there are several issues that need to be handled to get the
>> > "correct" text extraction from such PDFs.
>>
>> That is, both the layout and the original stream order
>> must be considered, perhaps also using extra knowledge
>> of how the PDF was generated.


I'm interested in collaborating with anyone who wants to work
on this aspect of the Poppler library.


> The attached patch restores the -raw command-line option to pdftohtml. The
> program retains its current behaviour if the -raw option is not used, but
> reverts to the "text as it appears" behaviour with the -raw option enabled.
>
> Cheers,
>         Warren


Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------



