[poppler] pdftotext feature request: user-specified toUnicode-like tables
Jeff Lerman
jclerman at jefflerman.net
Tue Jun 11 16:40:43 PDT 2013
On 6/11/2013 3:34 PM, Ihar `Philips` Filipau wrote:
> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>> Yes, indicating words is an advantage - but failing to indicate that a
>> given character in a word is in a given font is a bug.
> This is about the right time to tell you: the focus of poppler is
> the on-screen representation of the PDF, not helping to extract
> information from PDFs. Otherwise, a year ago I would have flooded
> the place with patches. :D
On the one hand, fair point - on the other hand, pdftotext is included
in poppler, and as Albert has pointed out, things get better when they
get worked on.
> To paint a character, one doesn't need to know its Unicode value - the
> raw code point is just an index into the font's glyphs for that
> character. The Unicode value of a character is only needed for
> copy-pasting. (Some PDF software intentionally strips the Unicode
> mapping tables to make copy-paste/text extraction unusable.)
>
> Otherwise, it is worth googling "pdf2htmlEX" and/or "pdftohtmlEX".
> Search for the precise terms. There are several projects on the net
> (one of them is definitely based on poppler) focused on extracting
> text/etc. from PDFs into HTML with a high level of fidelity. That would
> probably be more helpful to you than forcing poppler to do something it
> is not designed to do.
I'm looking up those packages now, and have checked out pdf2htmlEX.
However, HTML is an unnecessary intermediate for our purposes. Also, I
don't see evidence (happy to be corrected here) that pdf2htmlEX contains
a table of mappings for obscure fonts, or a way for users to specify
their own mapping tables. Without that, it doesn't help me.
>> Now, when I use pdftohtml (I'll include the actual command below too), I
>> get a file that includes:
>> .....
>> as you can see, the font "MathematicalPi-One" is not noted as being the
>> correct one for that numeral "1". There is no way to find out the
>> actual fonts being used, on a per-character basis, for the text in the
>> PDF file.
> That's what I meant by saying that pdftohtml erroneously merges some
> fonts. But this is not per se a bug. Conversion of PDF into HTML is at
> best an approximate process, primarily optimized to display an average
> PDF in a readable fashion.
I would say that "erroneously merges" is indeed a bug. (Unless the
documentation specifies that behavior - in which case it might be a
feature, but not a great one, I'd suggest). If the mis-assignment of
fonts is done in pdftotext as well, and if it is done before conversion
to Unicode, then my request (even if I choose to try coding it myself)
will be blocked, since the per-character font information is lost
(discarded) by pdftotext. Does anyone have any information on whether
that is the case?
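One way I can think of to check: if I'm reading the API right, poppler's
OutputDev::drawChar hook receives both the current graphics state and the
raw character code, so a tiny output device could dump the active font
name next to every code it sees, which would show directly whether the
per-character font information is still intact at that stage. A rough
sketch only - FontDumpDev is my own name, and the exact signatures may
differ between poppler versions:

    #include <cstdio>
    #include "OutputDev.h"
    #include "GfxState.h"
    #include "GfxFont.h"
    #include "CharTypes.h"
    #include "goo/GooString.h"

    // Minimal output device that logs, for each character drawn, the name
    // of the font selected in the graphics state, the raw character code,
    // and how many Unicode code points poppler mapped it to.
    // Illustrative only; not part of poppler.
    class FontDumpDev: public OutputDev {
    public:
      GBool upsideDown() { return gTrue; }
      GBool useDrawChar() { return gTrue; }
      GBool interpretType3Chars() { return gFalse; }

      void drawChar(GfxState *state, double x, double y,
                    double dx, double dy, double originX, double originY,
                    CharCode code, int nBytes, Unicode *u, int uLen) {
        GfxFont *font = state->getFont();
        const char *name = (font && font->getName())
                               ? font->getName()->getCString()
                               : "(unnamed font)";
        printf("font=%s code=%u uLen=%d\n", name, (unsigned)code, uLen);
      }
    };

Driving a page of one of our problem PDFs through something like that (via
PDFDoc::displayPage) should tell us whether MathematicalPi-One is still the
font in effect when that numeral's code comes through, i.e. whether the
information is discarded before or after this point.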
> ... In fact, modulo the "Tagged PDF" feature, PDF is not designed to
> represent text per se. The most common PDF is just a container of
> vector graphics, and some of that graphics happens to be drawings of
> text. Text extraction literally works by intercepting a text-drawing
> operation and, instead of drawing the text, dumping it into a file/etc.
> @Leonard, please don't hit me. /me *cowers*. :D
Yes, I know that PDF was not originally designed to represent
machine-readable text, and is instead essentially optimized for human
readability only. However, the fact is that in today's world some of us
must extract text from PDFs anyway. I stipulate that I am aware of the
problem, and that I am forced to face it regardless.
>> Hmm, OK. I'm a little concerned, looking at the code, that assumptions
>> about how to map a character from a given font are made on a whole-font
>> basis, not per-character.
>> I'm not sure if there is support for fallback
>> mechanisms in the algorithms that convert a PDF character to Unicode for
>> pdftotext. For example, if a document has font X and I know that
>> character A in that font should be remapped to Z, but I have no
>> information on some other character B, I want to be able to specify the
>> A->Z remapping without affecting whatever default is used to show the B
>> character. I'm not sure if the code simply looks for the existence of a
>> certain kind of translation table for each font and then assumes that
>> the table is always complete - that would be sub-optimal for my
>> use-case. Can someone shed light on that question?
> The toUnicode table is per-font. But, for example, normal, bold,
> italic and bold+italic are four different fonts. That is why the merge
> is needed for HTML.
>
> There should already be a place to hook in the Unicode mapping table,
> because there is already a place in the code (I've seen it once) which
> extracts the font-specific Unicode mapping table from the PDF.
>
OK, here I will begin to delve into the code itself, at least to scope
the problem. This is important to our team and we're willing to devote
resources to solving the problem and contributing code if people are
willing to answer questions about aspects of the existing Poppler code.
Happy to take some questions offline if there are particular folks who
have the necessary expertise.
I think my feature request could be broken down like this:
1. Pass a file containing a custom set of font-specific "ToUnicode"
mappings to pdftotext.
2. Ensure that pdftotext correctly parses and preserves, for each
character in the PDF, the name of the font used to represent it, as well
as the raw character code.
3. Edit the appropriate code (in GfxFont.cc?) to "patch" the
character-to-Unicode mapping using the table supplied in step 1. The
mapping should probably be treated as if it came from a "ToUnicode" table
supplied in the PDF itself (see the sketch after this list).
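To make steps 1 and 3 concrete, and the partial-override behaviour I
described earlier (remap A to Z in font X without disturbing B), here is a
minimal sketch of the data structure and lookup I have in mind. Everything
in it is hypothetical - the file format and the lookupOverride name are my
own inventions - and it is deliberately plain C++ rather than poppler code:

    #include <map>
    #include <string>

    // Stand-ins for poppler's CharCode/Unicode typedefs, so the sketch
    // is self-contained.
    typedef unsigned int CharCode;
    typedef unsigned int Unicode;

    // Hypothetical per-font override table: font name -> (code -> Unicode).
    // It would be populated from the user-supplied file of step 1, e.g.
    //   SomeObscureFont 0x41 0x005A    (the "A -> Z" case; format invented)
    // Entries are deliberately sparse: codes absent from the table keep
    // whatever mapping poppler derives from the PDF itself.
    typedef std::map<CharCode, Unicode> CodeMap;
    typedef std::map<std::string, CodeMap> FontOverrides;

    // Returns true and fills *out only if an override exists for this exact
    // font/code pair; otherwise the caller falls back to the default
    // (encoding/ToUnicode-derived) mapping, leaving characters like B alone.
    bool lookupOverride(const FontOverrides &overrides,
                        const std::string &fontName,
                        CharCode code, Unicode *out) {
      FontOverrides::const_iterator f = overrides.find(fontName);
      if (f == overrides.end()) return false;
      CodeMap::const_iterator c = f->second.find(code);
      if (c == f->second.end()) return false;
      *out = c->second;
      return true;
    }

The two-level lookup is the whole point: an override for A in font X never
touches B in font X, or anything in font Y.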
At line 1182 in GfxFont.cc, I see the comment "// merge differences into
encoding" (preceding a code block), which seems to be where a "patch"
table such as the one I'm proposing should be applied. However, I have
questions (a sketch of the layering I have in mind follows them):
a. It looks like that code block might only be used for certain fonts
(it's in Gfx8BitFont::Gfx8BitFont). Is that true? If so, is there an
analogous block for each of the other possible font types?
b. Is there any danger of the "patches" being ignored for certain fonts
based on the "choose a cmap" logic outlined later? In my copy there is
a block of comments that begins:
// To match up with the Adobe-defined behaviour, we choose a cmap
// like this:
// 1. If the PDF font has an encoding:
...
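Whatever the answers to (a) and (b), the behaviour I'm hoping for is
simply that the user-supplied patch is applied last, after the base
encoding and after the PDF's own Differences/ToUnicode data. A generic
illustration of that layering follows - it is not poppler's actual code,
and it is simplified to a single code-to-Unicode map (the real code goes
through glyph names first, if I'm reading it right):

    #include <map>

    typedef unsigned int CharCode;
    typedef unsigned int Unicode;
    typedef std::map<CharCode, Unicode> CodeMap;

    // Later layers win, but only for the codes they actually mention.
    void applyLayer(CodeMap &mapping, const CodeMap &layer) {
      for (CodeMap::const_iterator it = layer.begin(); it != layer.end(); ++it)
        mapping[it->first] = it->second;
    }

    CodeMap buildMapping(const CodeMap &baseEncoding,
                         const CodeMap &pdfOverrides,
                         const CodeMap &userPatch) {
      CodeMap mapping = baseEncoding;
      applyLayer(mapping, pdfOverrides); // analogous to "merge differences into encoding"
      applyLayer(mapping, userPatch);    // the new step I'm proposing
      return mapping;
    }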
Any takers for any of the above steps 1-3?
Finally: I think allowing users to specify this kind of
character-mapping table manually at runtime would contribute enormously
to the usefulness of pdftotext. At the moment, the character-mapping
issue is the most prominent problem with using an otherwise very useful
program (THANK YOU to all who have been building/maintaining this
suite!), and it's frustrating to have the correct mappings already
in-hand, but not have a way to tell pdftotext to incorporate them. If
pdftotext *could* accept such a table, there would be strong incentive
to crowdsource additions to the table, since they are of broad
interest. Since collecting the data for such a table is quite
labor-intensive, making the resulting mappings easy to put to use would
go a long way toward motivating that work.
Thanks very much,
--Jeff
> Regards.
>
>>> N.B. PDFs might have attachments. I once came across a PDF without
>>> the font encodings - but with the source WinWord document attached.
>>> Worth checking.