[poppler] Solving 8 chars maximum limit on a glyph representation

Sun May 11 15:02:28 PDT 2008

Hi Albert,

On 04/05/2008, at 9:07 AM, Albert Astals Cid wrote:
> On Sunday 04 May 2008, Albert Astals Cid wrote:
>> As Ross's PDF showed, we have a maximum limit of 8 chars for the
>> representation of a glyph, so even if there's a char that identifies
>> itself as \rightarrow, pdftotext only gives \rightar
>>
>> I'm fixing this hardcoded limit with the attached patch. As side effects
>> we get a speed boost, as I stop copying things when calling
>> CharCodeToUnicode::mapToUnicode, and lower memory usage, as each
>> CharCodeToUnicodeString now uses only the exact memory needed, not a
>> fixed 8 like before.
>>
>> I'm attaching the patch for further review. If no one disagrees I'll
>> commit on Sunday 11.

No disagreement from me.
I've applied the patch, and the earlier ones related to Annotations, etc.

All the utils/pdfto* tools work much better (no Bus Error) with my
example PDFs, except for pdfimages (which generates image files of
size 8 bytes!)

Thanks very much for your work on this.

However, there are still some problems with the actual text strings
extracted using  pdftotext .

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 5019-e-c4.txt
Url: http://lists.freedesktop.org/archives/poppler/attachments/20080512/d63d1e5f/attachment-0004.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 5019-e-c18.txt
Url: http://lists.freedesktop.org/archives/poppler/attachments/20080512/d63d1e5f/attachment-0005.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 5019-e-c1.txt
Url: http://lists.freedesktop.org/archives/poppler/attachments/20080512/d63d1e5f/attachment-0006.txt 
-------------- next part --------------

Here's an example of one thing that is not working quite right yet.
The file resulting from the following command is attached.

    pdftotext -f 1 -l 1 5019-e-cmap.pdf 5019-e-c1.txt

It contains a string:

  pp. 17?30 : Rainer L?wen and Burkard Polster o Linear geometries  
on the Moebius strip: a theorem of Skornyakov type.

When the same string is extracted from the PDF using the text selection
and copy of Adobe Reader 8 (AR8), then pasted directly into this email
message, one gets:

  pp. 17?30 : Rainer L?owen and Burkard Polster
Linear geometries on the Moebius strip: a theorem of
Skornyakov type.

Note where pdftotext has positioned the 'o' that should
have followed the umlaut accent.

The same kind of thing happens throughout the document.
e.g.
    pdftotext -f 18 5019-e-cmap.pdf 5019-e-c18.txt

contains

    Institut f?r Analysis und Algebra u Technische Universit?t a  
Pockelsstr. 14 38106 Braunschweig Germany e-mail: r.loewen at tu-bs.de

in which both accented vowels are shifted.
(The file  5019-e-c18.txt  is attached.)

A similar effect is evident using the other PDF:  5019-e-mmap.pdf .

pdftotext gives an acceptable result here:

pp. 17?30 : Rainer L\" owen and Burkard Polster Linear geometries on  
the Moebius strip: a theorem of Skornyakov type.

but AR8 does it a bit better with:

pp. 17?30 : Rainer L\"owen and Burkard Polster
Linear geometries on the Moebius strip: a theorem of
Skornyakov type.

But here  pdftotext  shifts part of the word:

Institut f\" Analysis und Algebra ur Technische Universit\" at  
Pockelsstr. 14 38106 Braunschweig Germany e-mail: r.loewen at tu-bs.de

while AR8 gives:

Institut f\"ur Analysis und Algebra
Technische Universit\"at
Pockelsstr. 14
38106 Braunschweig
Germany
e-mail: r.loewen at tu-bs.de

The same kind of issue occurs with acute accents (\' in TeX), viz.

  Yong-Gao Chen, Andr\' S\' ozy, Vera T. S\' and MinTang as ark\"  
os .. .. .. Generation of diagonal acts of some semigroups of  
transformations and relations Peter Gallagher and Nik Ru?kuc  
s .. .. .. .. .. .. .. Subalgebras of free restricted Lie algebras  
R.M. Bryant, L.G. Kov\' and Ralph St\" acs ohr

This suggests that there may be a regular-expression problem
when the extracted text contains  \" or \' or the bare accent
characters for umlaut and acute accents.
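To make the expected result concrete, here is a minimal Python sketch of
the pairing one would want: each TeX-style accent command attaches to the
letter that immediately follows it. The function name compose_tex_accents
and the two-entry accent table are assumptions made for illustration only;
this is not a description of how pdftotext or AR8 work internally.

```python
import re
import unicodedata

# Hypothetical table: TeX accent character -> Unicode combining mark.
COMBINING = {
    '"': "\u0308",  # umlaut (combining diaeresis)
    "'": "\u0301",  # acute accent
}

# A backslash, an accent character, an optional space, then one letter.
ACCENT_RE = re.compile(r"""\\(["'])\s?([A-Za-z])""")

def compose_tex_accents(text):
    """Replace \\"o or \\'a (optionally with one space) by the accented letter."""
    def repl(match):
        accent, letter = match.group(1), match.group(2)
        # NFC turns letter + combining mark into one precomposed character.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return ACCENT_RE.sub(repl, text)
```

Applied to the correctly ordered output, e.g. 'L\" owen', this gives the
name with a proper o-umlaut. It cannot repair the shifted cases such as
'f\" Analysis und Algebra ur', where the letter is no longer adjacent to
its accent.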

The second problem is as follows.
The file resulting from

    pdftotext 5019-e-cmap.pdf

does not appear to be correctly encoded as UTF-8.
At least, Apple's  TextEdit  application will not open it
with UTF-8 encoding, whereas it will open as 8-bit MacRoman
(which of course doesn't correctly show the multi-byte UTF-8
characters).

Some of the output is valid UTF-8, but not all of it.
The first three pages are OK, extracted via:

     pdftotext -f 1 -l 3 5019-e-cmap.pdf 5019-e-c3.txt

But the 4th page is not valid:

     pdftotext -f 4 -l 4 5019-e-cmap.pdf 5019-e-c4.txt

(the file  5019-e-c4.txt  is attached).

This 4th page is the first place within  5019-e-cmap.pdf
that real mathematics occurs, using Unicode Plane-1 characters.
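For anyone wanting to reproduce the validity check without relying on
TextEdit, a small Python sketch along these lines should do. The helper
name first_invalid_utf8 is hypothetical; it only reports where strict
UTF-8 decoding first fails and says nothing about the cause.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError as err:
        return err.start
    return None

# Possible usage on one of the extracted files:
#   with open("5019-e-c4.txt", "rb") as f:
#       print(first_invalid_utf8(f.read()))
```

A correctly encoded Plane-1 character occupies four bytes in UTF-8, so a
valid file containing such characters passes this check.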

For example, the title of one article contains the following
(extracted using AR8):

    method for the p(x)-Laplacian equation

where the $p(x)$ is styled for mathematics.
pdftotext extracts this snippet as:

   method for the ??????(??????)-Laplacian equation

Notice that AR8 has mapped the Plane-1 mathematics symbols
to their simple alphabetic counterparts in the ASCII range.
This is not necessarily the most desirable thing to do,
but it may be the most practical.
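As an illustration of that kind of folding, Unicode compatibility
normalization (NFKC) maps the Plane-1 mathematical alphanumerics to their
plain counterparts. The Python sketch below is only an assumption about
how such a mapping could be done; nothing here claims that AR8 works this
way, and the function name is hypothetical.

```python
import unicodedata

def fold_math_alphanumerics(text):
    """Fold styled mathematical letters to their plain equivalents via NFKC."""
    # Caveat: NFKC also rewrites other compatibility characters,
    # for example ligatures such as U+FB02 become the two letters "fl".
    return unicodedata.normalize("NFKC", text)

# U+1D45D is MATHEMATICAL ITALIC SMALL P, U+1D465 is MATHEMATICAL ITALIC SMALL X.
styled = "method for the \U0001D45D(\U0001D465)-Laplacian equation"
plain = fold_math_alphanumerics(styled)
```

The design trade-off is the one noted above: the folded text is easy to
search and paste, but the distinction between styled and plain letters is
lost.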

Here is the result of extracting some more complicated mathematics:

  using  pdftotext 5019-e-cmap.pdf

Introduction A ???at stable plane (??????, ???)  
consists of a point space ??????, which is a surface  
(topological 2manifold), and a system ??? of lines, which are  
closed subsets of ??????, such that any two points are joined  
by a unique line and that the operations of join and intersection are  
continuous. Moreover, it is required that intersection is stable,  
that is, the set of pairs of distinct intersecting lines is open.

  using  pdftotext 5019-e-mmap.pdf

Introduction A flat stable plane (E, \mathcal{L} ) consists of a  
point space E, which is a surface (topological 2manifold), and a  
system \mathcal{L}  of lines, which are closed subsets of E, such  
that any two points are joined by a unique line and that the  
operations of join and intersection are continuous. Moreover, it is  
required that intersection is stable, that is, the set of pairs of  
distinct intersecting lines is open.

The attached image shows the result, in a web-browser, of this
portion of the document processed with:

    pdftohtml  5019-e-cmap.pdf

and viewed with UTF-8 encoding. Notice the missing characters
corresponding to Plane-1 mathematics symbols.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Picture 20.png
Type: image/png
Size: 85309 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080512/d63d1e5f/attachment-0001.png 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 5019-e-c5.txt
Url: http://lists.freedesktop.org/archives/poppler/attachments/20080512/d63d1e5f/attachment-0007.txt 
-------------- next part --------------

A file  5019-e-c5.txt  is attached, containing the result of

     pdftotext -f 5 -l 5 5019-e-cmap.pdf 5019-e-c5.txt

which includes the above snippet.

Here's another thing where the text is extracted incorrectly,
from page 18 (physical page 6) of  5019-e-mmap.pdf :

and x =  - x \bigl\{  \bigr\}  if | x|  = 1. The point \infty  will  
always be represented by the pair (0, 1), (0,  - 1) , as in Figure 1.

AR8 extracts that bit as:

and x = x
if | x|  = 1. The point \infty  will always be represented by the pair
\bigl\{
(0, 1), (0, 1)
\bigr\}
, as in
Figure 1.

(which has lost a '-' sign!).

> Albert

Hopefully these encoding and text-extraction issues will be easy
to resolve.

Your latest changes to  OptionalContent.cc  just came through.
Do those affect the above issues in any way?

  Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------