[poppler] Bug 69485

Mon Jan 6 23:04:25 PST 2014

Hi Alex,

On 07/01/2014, at 4:35 PM, Alex Korobkin wrote:

>> Hi Ross, 
>> 
>> 2014/1/5 Ross Moore <ross.moore at mq.edu.au>
>> 
>> >> While we're on this subject, maybe you could have a look at the PS output produced by pdftops, when processing the same file?
>> >> The resulting level 3 PostScript cannot be parsed by Distiller either, the error is
>> >>
>> >> %%[ Error: undefined; OffendingCommand: xyshow ]%%
>> 
>> OK. I can reproduce this.
>> 
>> Again  ps2pdf  has no problem with it, but Apple's  pstopdf
>> also fails to do the conversion.
>> 
>> 
>> This is most perplexing as the  xyshow  command is handled
>> correctly 10 times, but fails on the 11th usage.
>> 
>> 
> Just to be sure I understand this correctly: I only see xyshow being used once in the document, when defining the Tj macro. 

That's correct.

> Do you refer to the 11th invocation of the Tj macro? 

Yes.
I think this is called for each syllable or group of letters in each word.
In particular, I think it is called for each individual Chinese character,
or for a group of characters.

>>  
>> It seems that the difficulty is first encountered
>> when handling the Chinese characters in the heading of
>> the Form.
>> (Preceding this are the characters of:  "Form V.2013"
>> at top-right of page 1. These are done OK. )
>> 
>> Here's how I can check this:
>> 
>> >> % text string operators
>> >> /xyshow where {
>> >>   pop
>> >>   /xyshow2 {
>> >>     dup length array
>> >>     0 2 2 index length 1 sub {
>> >>       2 index 1 index 2 copy get 3 1 roll 1 add get
>> >>       pdfTextMat dtransform
>> >>       4 2 roll 2 copy 6 5 roll put 1 add 3 1 roll dup 4 2 roll put
>> >>     } for
>> >>     exch pop
>>   mark /xyshow where pstack cleartomark  %<--- insert this line
>> >>     xyshow
>> >>   } def
>> >> }{
>> >>   /xyshow2 {
>> >>     currentfont /FontType get 0 eq {
>>     ... etc ...
>> 
>> This causes the contents of the stack to be written to the Log
>> immediately before  xyshow  is called.
>> The indication is that  xyshow  is indeed well-defined,
>> yet something still goes wrong.

> 
> Thank you for the hint, I will try to do more debugging using this technique. 

PostScript has a few neat constructions that help with debugging.
Another, which might come in handy here, is the 'stopped' operator.
With it, you can cause messages to be printed to the log
only when an error has been encountered. Combine it with 'mark',
'pstack' and 'cleartomark' if you want to see what was on the stack
when the error occurred.

>> By commenting out groups of lines like this, I can process
>> further and further into the file.
>> 
>> 
> Does it mean that the error is not caused by any particular call to xyshow, but more likely by the number of such calls made consecutively? 
> Maybe it is some kind of nesting issue, or an issue with the stack not being freed properly?

Not the number of calls.
There can be a large number of instances of  xyshow  between those
which fail, or they can be more or less adjacent.

I suspect that the character strings that  xyshow  is trying to set
are faulty in some way; that is, they correspond to non-existent
code-points or glyphs within the subsetted font.
But I'm no expert on this, so it is pretty much guesswork. 

>>  
>> It seems that the errors occur with Chinese characters,
>> always when coming from the font referenced as:  /F243_0
>> which is:
>> 
>> /F243_0 /ZJWNJQ+SimSun 0 pdfMakeFont16L3
> 
> 
>> Thus it would seem that there could be something badly wrong
>> with this font, or with the way it is being used in this document.
>> 
>> Note carefully what I am saying here.
>> Not every instance of this font's usage causes an error,
>> but all the errors that I have found are associated with
>> an instance of this font's use.
>> 
>> 
>> 
>> >>
>> >> The PS file can be retrieved from here, it is 18Mb in size. (Unlike pdftocairo, pdftops generates huge PS files. This particular one gets 10x larger when I provide licensed fonts to pdftops.)
>> 
>> Yes.
>> Almost all of the first 87% of the file is devoted to the fonts.
>> 
> 
> I kind of wonder why pdftops embeds so many instances of the font, while both pdftocairo and pdf2ps somehow avoid this problem and create smaller PS documents. But, that's a subject for another discussion. 

A lot of large fonts are being included.
Some are not even used, I'd guess.
viz.

[GlenMorangie:] rossmoor% grep -n font china-visa-application-without-fonts.ps | grep BeginResource
509:%%BeginResource: font ZJWNJQ+SimSun
11688:%%BeginResource: font AASELS+TimesNewRoman,Bold
14028:%%BeginResource: font JEIVZQ+SimSun
25207:%%BeginResource: font HRUUFF+SimSun
36454:%%BeginResource: font AdobeSongStd-Light
61390:%%BeginResource: font SimHei
86326:%%BeginResource: font SimSun
111296:%%BeginResource: font MicrosoftYaHei
136232:%%BeginResource: font MicrosoftYaHei,Bold
160033:%%BeginResource: font NSimSun
185037:%%BeginResource: font AdobeSongStd#20Light
209973:%%BeginResource: font FF487_0_ZJWNJQ+SimSun
221186:%%BeginResource: font FF589_0_ZJWNJQ+SimSun
232365:%%BeginResource: font QGJLNI+CambriaMath
234968:%%BeginResource: font FF520_0_AdobeSongStd-Light
259938:%%BeginResource: font KozMinPr6N-Regular

The subset  ZJWNJQ+SimSun  isn't too large, at roughly 11000 lines,
whereas the un-subset  SimSun  is roughly 25000 lines.
But then there seem to be 2 more subsets:
  FF487_0_ZJWNJQ+SimSun   and   FF589_0_ZJWNJQ+SimSun .
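
Those line counts can be cross-checked from the grep output alone.
Here is a minimal Python sketch, assuming that each font resource runs
from its %%BeginResource line up to the next one (so the figures are
upper bounds, and the last font has no estimate). The helper name
resource_spans is made up for this illustration.

```python
# Output of the grep command above, copied verbatim.
GREP_OUTPUT = """\
509:%%BeginResource: font ZJWNJQ+SimSun
11688:%%BeginResource: font AASELS+TimesNewRoman,Bold
14028:%%BeginResource: font JEIVZQ+SimSun
25207:%%BeginResource: font HRUUFF+SimSun
36454:%%BeginResource: font AdobeSongStd-Light
61390:%%BeginResource: font SimHei
86326:%%BeginResource: font SimSun
111296:%%BeginResource: font MicrosoftYaHei
136232:%%BeginResource: font MicrosoftYaHei,Bold
160033:%%BeginResource: font NSimSun
185037:%%BeginResource: font AdobeSongStd#20Light
209973:%%BeginResource: font FF487_0_ZJWNJQ+SimSun
221186:%%BeginResource: font FF589_0_ZJWNJQ+SimSun
232365:%%BeginResource: font QGJLNI+CambriaMath
234968:%%BeginResource: font FF520_0_AdobeSongStd-Light
259938:%%BeginResource: font KozMinPr6N-Regular
"""


def resource_spans(grep_output):
    """Return (font_name, approx_line_count) for every resource but the last.

    Assumption: a resource extends to the next %%BeginResource line.
    """
    entries = []
    for line in grep_output.splitlines():
        lineno, _, rest = line.partition(":")
        name = rest.rsplit(" ", 1)[-1]  # font names contain no spaces
        entries.append((int(lineno), name))
    return [
        (name, next_start - start)
        for (start, name), (next_start, _) in zip(entries, entries[1:])
    ]


for name, count in resource_spans(GREP_OUTPUT):
    print(f"{count:6d}  {name}")
```

This gives 11179 lines for  ZJWNJQ+SimSun  and 24970 for  SimSun ,
which agrees with the rough figures quoted above.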

> 
> I only point to this location because it's the first mention of the [1.447 0 1.447 0 1.447 0 1.447 0] sequence, referred to by Distiller error message. Perhaps I misinterpret Distiller's message. 

Arrays with these numbers occur quite a lot.
Each one determines the spacing between successive characters or glyphs.

e.g. at line 284996 we get a failing block:

(A\361+cB'>\230)
 [1.447
 0
 1.447
 0
 1.447
 0
 1.447
 0] Tj

There are 4 instances of 16-bit character codes or glyph-ids here:
  "A\361", "+c", "B'", ">\230"
where each character or octal escape (\xxx) represents one 8-bit byte.
If I'm converting these into Hex correctly, they should correspond
to the unicode characters:

Ux0041F1 : 䇱
Ux002B63 :  ???
Ux004227 : 䈧
Ux003E98 : 㺘

That 2nd one looks suspicious.
So maybe these codes do not map directly to Unicode.
We would have to look more closely at the font itself,
which is not so easy to do --- at least not for me.
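
To take the hand arithmetic out of this, here is a minimal Python
sketch of the conversion. It assumes the string is read as big-endian
2-byte codes and it handles only octal and single-character escapes
(no backslash-newline continuations). The names ps_string_bytes and
codes16 are made up for this illustration. It says nothing about
whether the codes are Unicode, CIDs or glyph-ids.

```python
import re

# Single-character escapes in a PostScript (...) string literal.
_SIMPLE = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
           "\\": 92, "(": 40, ")": 41}


def ps_string_bytes(body):
    """Decode the body of a PostScript (...) string literal into bytes.

    Simplification: the input is assumed to be ASCII, and
    backslash-newline continuations are not handled.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        octal = re.match(r"[0-7]{1,3}", body[i + 1:])
        if octal:
            # \ddd: one to three octal digits give one byte.
            out.append(int(octal.group(), 8) & 0xFF)
            i += 1 + len(octal.group())
        else:
            nxt = body[i + 1]
            out.append(_SIMPLE.get(nxt, ord(nxt)))
            i += 2
    return bytes(out)


def codes16(body):
    """Split the decoded bytes into big-endian 16-bit codes."""
    data = ps_string_bytes(body)
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data) - 1, 2)]


failing = r"A\361+cB'>\230"  # the failing string quoted above
print([f"{code:04X}" for code in codes16(failing)])
```

For the failing string this prints ['41F1', '2B63', '4227', '3E98'],
the same four codes as listed above.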

Also in this vein, this string works OK, from line 284976 :
 (\004]\011~\004\352"A)
 Ux00045D : ѝ
 Ux00097E : ॾ
 Ux0004EA : Ӫ
 Ux002241 : ≁
but the symbols are not all Chinese.

So this probably isn't the correct interpretation.

> 
> Hope this helps,
> 
> It helps greatly, thanks again. 
> 
> 
> -Alex

I don't think that there is anything more that I can do.
Let me know if you get anywhere further with this.

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross.moore at mq.edu.au 
Mathematics Department                           office: E7A-206      
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------
