[poppler] Graphical caracter filtering
Ross Moore
ross.moore at mq.edu.au
Fri Jun 21 16:49:22 PDT 2013
Hi Leonard, Stefano, and others,
On 21/06/2013, at 10:53 PM, Leonard Rosenthol wrote:
> That sounds like a problem with the software that created the PDF, not with Poppler.
It's a bit of a stretch to blame other software for
using a technique that has been around since before PDF
was invented, and even from before Adobe was formed
as a company.
Besides, the construction characters are supported in
Unicode, within the range U+239B – U+23B9
(and U+2320, U+2321 for large integral signs).
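As a rough illustration of that range, here is a minimal Python sketch that tests whether an extracted character is one of these construction pieces. It is not part of Poppler; the names BRACKET_PIECES and is_construction_piece are made up for this example, and the code assumes the PDF really does map the pieces to these Unicode code points.

```python
import unicodedata

# Code points of the construction pieces: U+239B..U+23B9,
# plus U+2320 and U+2321 for the halves of a large integral sign.
BRACKET_PIECES = set(range(0x239B, 0x23B9 + 1)) | {0x2320, 0x2321}


def is_construction_piece(ch: str) -> bool:
    """Return True if ch is a bracket/brace/parenthesis/integral piece."""
    return ord(ch) in BRACKET_PIECES


if __name__ == "__main__":
    # Print code point, Unicode name and classification for a few characters.
    for ch in "\u239b\u23a1\u2320A(":
        name = unicodedata.name(ch, "?")
        print(f"U+{ord(ch):04X} {name}: {is_construction_piece(ch)}")
```

An ordinary "(" or "[" is deliberately not matched, so only the multi-piece glyphs are singled out.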
For extraction of text from PDFs, there are (at least)
2 aspects to consider:
1. does the PDF map the character pieces to Unicode,
via a /ToUnicode CMap for the font?
Recent LaTeX-generated PDFs should do this;
whereas older ones (or recent ones using the
older methods & packages) do not.
2. finding all the pieces, and perhaps converting
to a single bracket/brace/parenthesis/integral sign,
"when that is the best way to express the output".
It is far from obvious that converting to a single character
will always be the right thing to do, since this loses all
size information, and the position of the pieces within
the text stream can give good clues as to layout of the
rows of the matrix that the brackets enclose.
Accepting what you get and doing further post-processing
would seem to be the right way to go.
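To make the post-processing idea concrete, here is a small Python sketch that uses the bracket pieces as row markers rather than throwing them away first. It assumes the extracted text places each matrix row on its own line with the pieces mapped to Unicode; real extraction output may well be laid out differently. The function name matrix_rows and the sample string are invented for the example.

```python
# Construction pieces: U+239B..U+23B9, plus the integral halves U+2320, U+2321.
PIECES = {chr(c) for c in range(0x239B, 0x23B9 + 1)} | {"\u2320", "\u2321"}


def matrix_rows(text: str) -> list[list[str]]:
    """Collect the rows of bracketed constructs found in extracted text.

    A line counts as a row only if it contains at least one piece;
    the pieces themselves are dropped from the returned cells.
    """
    rows = []
    for line in text.splitlines():
        if not any(ch in PIECES for ch in line):
            continue  # ordinary prose, not part of a bracketed construct
        cleaned = "".join(" " if ch in PIECES else ch for ch in line)
        cells = cleaned.split()
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # Hypothetical extraction output for a 3x2 matrix followed by prose.
    sample = (
        "\u239b 1 2 \u239e\n"
        "\u239c 3 4 \u239f\n"
        "\u239d 5 6 \u23a0\n"
        "some prose"
    )
    print(matrix_rows(sample))  # [['1', '2'], ['3', '4'], ['5', '6']]
```

Keeping the pieces until this step is what lets the rows be recovered; collapsing them to a single bracket beforehand would discard that information.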
Of course if the PDF were to be properly tagged,
including for the structure of the mathematics being
represented, then the story is quite different.
But that is an ongoing research/development project
--- on which I'll be presenting a paper in a few weeks' time.
http://www.cicm-conference.org/2013/cicm.php
The attached image shows part of a fully-tagged PDF,
which validates for PDF/UA (i.e., PDF/A-2u).
In the attached .txt file, I give the results of 5
different methods of extraction, for the highlighted text.
One of these gives the appropriate portion of XML for the matrix,
which is properly encoded into this PDF.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2013-06-22 at 9.24.01 AM.png
Type: image/png
Size: 103245 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20130622/b05d52c7/attachment-0002.png>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Text-Output-formats.txt
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20130622/b05d52c7/attachment-0001.txt>
-------------- next part --------------
The full PDF is publicly available:
http://rutherglen.science.mq.edu.au/dmth237S113/sols/2013-Assign2-soln.pdf
> What happens if you open that PDF up in Adobe Reader and copy/paste the data? Do you get the same thing?
>
> From: "stefano.rubino" <stefano.rubino at laposte.net>
> Reply-To: "stefano.rubino" <stefano.rubino at laposte.net>
> Date: Thursday, June 20, 2013 11:54 PM
> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> Subject: [poppler] Graphical caracter filtering
>
>
> Hello,
>
>
> I've tested the poppler library.
> I wrote a piece of code, and perhaps I missed something,
> but I haven't found a way to deactivate or filter out the graphical characters.
> For instance, for a PDF document containing matrix brackets,
> I get the pieces of the graphical brackets as many single characters.
> Is there a way to filter them?
> Is there a poppler function, or poppler configuration (tag, flag, ...) or ... ?
>
> Thanks for your help
>
> S.R.
There was some discussion here recently about Poppler
support for Tagged PDFs, and a throw-away comment was that
these are "rare".
From: Ihar `Philips` Filipau <thephilips at gmail.com>
Subject: Re: [poppler] pdftotext feature request: user-specified toUnicode-like tables
Date: 12 June 2013 6:43:42 AM AEST
My work is aimed at trying to change this; especially with
mathematical content.
Hope this helps,
Ross
------------------------------------------------------------------------
Ross Moore ross.moore at mq.edu.au
Mathematics Department office: E7A-206
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------