[poppler] Microsoft Word Save-as PDF Accents

Mon Mar 28 05:23:12 PDT 2011

On Fri, 2011-03-25 at 20:43 +0000, Albert Astals Cid wrote:
> On Friday, 25 March 2011, you wrote:
> > On Fri, 25 Mar 2011 19:02:46 +0000, Albert Astals Cid <aacid at kde.org>
> > 
> > NB I just tried extracting from a Word-generated PDF and TextOutputDev
> > didn't see the line with the diacritic at all.
> 
> And are you sure it's not a Word fault?

(What tool do you use to de-compress/analyse PDFs?)

Here's the PDF file generated with Word 2010:
http://users.ecs.soton.ac.uk/tdb2/ms_word_accents.pdf

The bad text is this (it contains only two diacritics, but Word has
chewed the whole paragraph):
[<005A03580003003E>-4<0182>5<01C1011E>-3<0176>3<0003>9<01020176>4<011A>3
<00030011035800030057>4<017D>-5<016F0190019A>10<011E018C>] TJ

There's a CMap included:
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
17 beginbfchar
<0003> <0020>
<0011> <0020>
... repeated for all chars above mapping to 0020
<0358> <0020>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop

Do I understand correctly that Word encodes every paragraph containing
diacritics using a custom font table with a Unicode CMap that maps every
character code to a space (which is exactly the behaviour shown by
Chrome copy-and-paste)?
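
To check that reading, here is a rough Python sketch (not poppler code; the
names tj_codes and to_text are made up for illustration). It splits the hex
strings of the TJ array above into 2-byte codes, as the <0000> <FFFF>
codespace range suggests, and applies a bfchar table. Only three bfchar
entries are quoted above, so the table below assumes, per the "repeated for
all chars" note, that every code used in the line maps to U+0020.

```python
import re

# TJ operand copied from the content stream quoted above.
TJ = ("[<005A03580003003E>-4<0182>5<01C1011E>-3<0176>3<0003>9<01020176>4<011A>3"
      "<00030011035800030057>4<017D>-5<016F0190019A>10<011E018C>]")


def tj_codes(tj):
    """Return the 2-byte character codes of every hex string in a TJ array.

    The numbers between the strings are kerning adjustments and are ignored.
    """
    codes = []
    for hexstr in re.findall(r"<([0-9A-Fa-f\s]*)>", tj):
        digits = re.sub(r"\s+", "", hexstr)  # whitespace is legal inside <...>
        codes.extend(int(digits[i:i + 4], 16) for i in range(0, len(digits), 4))
    return codes


def to_text(codes, bfchar, missing="?"):
    """Map character codes to Unicode text through a bfchar-style table."""
    return "".join(chr(bfchar[c]) if c in bfchar else missing for c in codes)


codes = tj_codes(TJ)

# ASSUMPTION: the mail elides most bfchar entries; following its description,
# every code that occurs in the line is mapped to U+0020 (space).
bfchar = {code: 0x0020 for code in set(codes)}

text = to_text(codes, bfchar)
print(len(codes), "codes,", len(set(codes)), "distinct")
print(repr(text))
```

The line uses 23 codes of which 17 are distinct, which matches the
"17 beginbfchar" count, so under the assumption above the whole line comes
out as nothing but spaces.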

All the best,
Tim.