[poppler] Graphical caracter filtering

Ross Moore ross.moore at mq.edu.au
Fri Jun 21 16:49:22 PDT 2013


Hi Leonard, Stefano, and others,

On 21/06/2013, at 10:53 PM, Leonard Rosenthol wrote:

> That sounds like a problem with the software that created the PDF, not with Poppler.  

It's a bit of a stretch to blame other software for 
using a technique that has been around since before PDF
was invented, and even from before Adobe was formed
as a company.


Besides, the construction characters are supported in
Unicode, within the range U+239B – U+23B9
(and U+2320, U+2321 for large integral signs).
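
For anyone who simply wants the pieces gone from extracted
text, here is a minimal Python sketch of a filter built on
the code point ranges given above. It assumes the pieces
have actually been mapped to those Unicode code points; the
helper names are hypothetical and not part of Poppler.

```python
# Code point ranges as quoted in this message; the names below are
# illustrative only and do not correspond to any Poppler API.
CONSTRUCTION_RANGES = (
    (0x239B, 0x23B9),  # bracket, brace, parenthesis and related pieces
    (0x2320, 0x2321),  # top and bottom halves of a large integral sign
)


def is_construction_piece(ch: str) -> bool:
    """Return True if ch falls inside one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CONSTRUCTION_RANGES)


def strip_construction_pieces(text: str) -> str:
    """Drop construction pieces from already-extracted text."""
    return "".join(ch for ch in text if not is_construction_piece(ch))
```

This works on the text after extraction, so it is of no use
when the font carries no Unicode mapping for the pieces.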

For extraction of text from PDFs, there are (at least)
two aspects to consider:

 1.  Does the PDF map the character pieces to Unicode,
     via a /ToUnicode CMap for the font?

     Recent LaTeX-generated PDFs should do this,
     whereas older ones (or recent ones using the
     older methods & packages) do not.

 2.  Finding all the pieces, and perhaps converting them
     to a single bracket/brace/parenthesis/integral sign,
     "when that is the best way to express the output".
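
One rough way to check the first aspect is to look at the
"uni" column printed by Poppler's pdffonts tool. The sketch
below is an assumption-laden illustration: it assumes the
usual pdffonts table layout (two header lines, then one font
per line ending in "emb sub uni object ID"), and the
function names are hypothetical.

```python
import subprocess


def run_pdffonts(pdf_path: str) -> str:
    """Run the pdffonts command-line tool and return its text output."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def fonts_without_tounicode(pdffonts_output: str) -> list[str]:
    """List font names whose 'uni' column reads 'no'."""
    missing = []
    # Skip the column-name line and the dashed ruler line.
    for line in pdffonts_output.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 8:
            continue  # not a complete font row
        # Counting from the right: object ID (two fields), uni, sub, emb.
        if fields[-3] == "no":
            missing.append(fields[0])
    return missing
```

A "no" only says that the font has no /ToUnicode CMap; the
extractor may still fall back on the encoding or glyph names.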

It is far from obvious that converting to a single character 
will always be the right thing to do, since this loses all
size information, and the position of the pieces within
the text stream can give good clues as to layout of the
rows of the matrix that the brackets enclose.

Accepting what you get and doing further post-processing
would seem to be the right way to go.
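
As an illustration of such post-processing, the following
Python sketch folds parenthesis and square-bracket pieces
into ordinary delimiters. It rests on two assumptions: the
pieces arrive as Unicode code points, and the extractor
emits them top to bottom in reading order. The opening
delimiter is emitted at the upper-left piece and the closing
one at the lower-right piece, so line breaks between matrix
rows survive. The names are hypothetical.

```python
# Map each piece to the text that replaces it. Only the piece that
# should "carry" the delimiter maps to a visible character; the other
# pieces map to the empty string and so disappear.
PIECE_TO_DELIMITER = {
    0x239B: "(", 0x239C: "", 0x239D: "",   # left parenthesis: upper, extension, lower
    0x239E: "", 0x239F: "", 0x23A0: ")",   # right parenthesis: upper, extension, lower
    0x23A1: "[", 0x23A2: "", 0x23A3: "",   # left square bracket: upper, extension, lower
    0x23A4: "", 0x23A5: "", 0x23A6: "]",   # right square bracket: upper, extension, lower
}


def merge_delimiter_pieces(text: str) -> str:
    """Replace bracket construction pieces with single delimiters."""
    return text.translate(PIECE_TO_DELIMITER)
```

The size of the original delimiter is lost, as noted above,
but the row structure of the enclosed matrix is kept.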



Of course if the PDF were to be properly tagged,
including for the structure of the mathematics being
represented, then the story is quite different.
But that is an ongoing research/development project 
--- on which I'll be presenting a paper in a few weeks' time.

  http://www.cicm-conference.org/2013/cicm.php


The attached image shows part of a fully-tagged PDF,
which validates for PDF/UA  (i.e., PDF/A-2u ).
In the attached  .txt file, I give the results of 5 
different methods of extraction, for the highlighted text.
One of these gives the appropriate portion of XML for the matrix,
which is properly encoded into this PDF.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2013-06-22 at 9.24.01 AM.png
Type: image/png
Size: 103245 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20130622/b05d52c7/attachment-0002.png>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Text-Output-formats.txt
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20130622/b05d52c7/attachment-0001.txt>
-------------- next part --------------


The full PDF is publicly available:

   http://rutherglen.science.mq.edu.au/dmth237S113/sols/2013-Assign2-soln.pdf



> What happens if you open that PDF up in Adobe Reader and copy/paste the data?  Do you get the same thing?
> 
> From: "stefano.rubino" <stefano.rubino at laposte.net>
> Reply-To: "stefano.rubino" <stefano.rubino at laposte.net>
> Date: Thursday, June 20, 2013 11:54 PM
> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> Subject: [poppler] Graphical caracter filtering
> 
> 
> Hello,
> 
> 
> I've tested the poppler library.
> I wrote a piece of code, and perhaps I missed something,
> but I haven't found a way of deactivating/filtering the graphical characters.
> For instance, for a PDF document containing matrix brackets,
> I get the pieces of the graphical brackets as many single characters.
> Is there a way of filtering them?
> Is there a poppler function, or poppler configuration (tag, flag, ...) or ... ?
> 
> Thanks for your help
> 
> S.R.


There was some discussion here recently about Poppler
support for Tagged PDFs, and a throw-away comment was that
these are "rare".

	From: 	Ihar `Philips` Filipau <thephilips at gmail.com>
	Subject: 	Re: [poppler] pdftotext feature request: user-specified toUnicode-like tables
	Date: 	12 June 2013 6:43:42 AM AEST


My work is aimed at trying to change this; especially with
mathematical content.



Hope this helps,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-206      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------
