[poppler] pdftohtml patch: restore old "raw" command-line option
Ross Moore
ross at ics.mq.edu.au
Sun Oct 5 18:44:31 PDT 2008
Hi Warren, and others,
On 05/10/2008, at 11:49 PM, Warren Toomey wrote:
> pdftohtml used to have a "raw" mode which has been removed. In "raw" mode,
> text from a PDF document is processed in the order that it occurs. However,
> the current version of pdftohtml reorders the text to be in increasing
> y-value, i.e. from the top of a page going down to the bottom.
>
> This text reordering plays merry havoc with multi-column pages, as the text
> from the columns becomes interleaved instead of remaining separate.
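
The interleaving described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the fragment coordinates, the two-column layout and the function names `raw_order` and `y_sorted_order` are invented here, and the sort is a plain sort on y (then x) as the quoted text describes, not pdftohtml's actual code.

```python
# Each fragment is (x, y, text); y increases down the page.
# Stream order (assumed): the whole left column, then the whole right column.
fragments = [
    (50, 100, "L1"), (50, 120, "L2"), (50, 140, "L3"),
    (300, 100, "R1"), (300, 120, "R2"), (300, 140, "R3"),
]

def raw_order(frags):
    """Keep fragments in the order they occur in the content stream."""
    return [text for _, _, text in frags]

def y_sorted_order(frags):
    """Reorder by increasing y (top to bottom), then x, ignoring columns."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]

print(raw_order(fragments))       # ['L1', 'L2', 'L3', 'R1', 'R2', 'R3']
print(y_sorted_order(fragments))  # ['L1', 'R1', 'L2', 'R2', 'L3', 'R3']
```

In stream order the two columns stay separate; after the y-sort each line of the left column is followed by the line at the same height in the right column.
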
The 'raw' option is also needed when extracting text with pdftotext
from PDFs that were produced using LaTeX or Plain TeX,
especially when there is mathematical content.
Here are just a few of the inline structures that depend upon
following the correct order of the letters/symbols within the PDF
streams:
1. 2-character accented letters (accent first, then the letter):
   when sorted by y-coordinate, an upper accent moves to the beginning
   of the line, while lower accents follow the whole line;
2. superscripts and subscripts on math symbols, which migrate similarly;
   also superscripted markers indicating footnotes or citations;
3. fractions, integrals with limits, summations, products, etc.;
4. large delimiters made from 2 or more pieces;
5. other math constructions involving 2-dimensional layout,
   e.g., square roots, surds, labelled relations, ...
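
As a rough sketch of the accent case in item 1, the following Python models one line of text as glyphs with invented coordinates. Everything here is an assumption made for illustration: the y offsets of the accents, the placeholder tokens `<acute>` and `<cedilla>`, and the helper names `stream_text` and `y_sorted_text`. The sort is a strict sort on y with no line tolerance, which may be cruder than what the tools actually do.

```python
# Glyphs for the word "decu" with an acute over the "e" and a cedilla under
# the "c", in TeX-style stream order: accent first, then the letter.
# Each glyph is (x, y, text); y increases down the page, baseline at y=100.
glyphs = [
    (10, 100, "d"),
    (16, 97, "<acute>"),     # upper accent sits slightly above the baseline
    (16, 100, "e"),
    (22, 103, "<cedilla>"),  # lower accent sits slightly below the baseline
    (22, 100, "c"),
    (28, 100, "u"),
]

def stream_text(items):
    """Join glyphs in content-stream order."""
    return "".join(text for _, _, text in items)

def y_sorted_text(items):
    """Join glyphs after a strict sort on y, then x."""
    return "".join(text for _, _, text in sorted(items, key=lambda g: (g[1], g[0])))

print(stream_text(glyphs))    # d<acute>e<cedilla>cu
print(y_sorted_text(glyphs))  # <acute>decu<cedilla>
```

In stream order each accent stays next to its letter, so it can still be recombined; after the y-sort the upper accent leads the line and the lower accent trails it.
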
I've mentioned this kind of problem before:
http://lists.freedesktop.org/archives/poppler/2008-June/003877.html
http://lists.freedesktop.org/archives/poppler/2008-June/003888.html
http://lists.freedesktop.org/archives/poppler/2008-May/003839.html
viz.
>> > Thus there are several issues that need to be handled to get the
>> > "correct" text extraction from such PDFs.
>>
>> That is, both the layout and the original stream order
>> must be considered, perhaps also using extra knowledge
>> of how the PDF was generated.
I'd be glad to collaborate with anyone interested in working
on this aspect of the Poppler library.
> The attached patch restores the -raw command-line option to pdftohtml. The
> program retains its current behaviour if the -raw option is not used, but
> reverts to the "text as it appears" behaviour with the -raw option enabled.
>
> Cheers,
> Warren
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------