[poppler] Bug 69485
Ross Moore
ross.moore at mq.edu.au
Mon Jan 6 23:04:25 PST 2014
Hi Alex,
On 07/01/2014, at 4:35 PM, Alex Korobkin wrote:
>> Hi Ross,
>>
>> 2014/1/5 Ross Moore <ross.moore at mq.edu.au>
>>
>> >> While we're on this subject, maybe you could have a look at the PS output produced by pdftops, when processing the same file?
>> >> The resulting level 3 PostScript cannot be parsed by Distiller either, the error is
>> >>
>> >> %%[ Error: undefined; OffendingCommand: xyshow ]%%
>>
>> OK. I can reproduce this.
>>
>> Again ps2pdf has no problem with it, but Apple's pstopdf
>> also fails to do the conversion.
>>
>>
>> This is most perplexing as the xyshow command is handled
>> correctly 10 times, but fails on the 11th usage.
>>
>>
> Just to be sure I understand this correctly: I only see xyshow being used once in the document, when defining the Tj macro.
That's correct.
> Do you refer to the 11th invocation of the Tj macro?
Yes.
I think this is called for each syllable or group of letters in each word.
In particular, I think it is called for each individual Chinese character,
or for a group of characters.
>>
>> It seems that the difficulty is first encountered
>> when handling the Chinese characters in the heading of
>> the Form.
>> (Preceding this are the characters of: "Form V.2013"
>> at top-right of page 1. These are done OK. )
>>
>> Here's how I can check this:
>>
>> >> % text string operators
>> >> /xyshow where {
>> >> pop
>> >> /xyshow2 {
>> >> dup length array
>> >> 0 2 2 index length 1 sub {
>> >> 2 index 1 index 2 copy get 3 1 roll 1 add get
>> >> pdfTextMat dtransform
>> >> 4 2 roll 2 copy 6 5 roll put 1 add 3 1 roll dup 4 2 roll put
>> >> } for
>> >> exch pop
>> mark /xyshow where pstack cleartomark %<--- insert this line
>> >> xyshow
>> >> } def
>> >> }{
>> >> /xyshow2 {
>> >> currentfont /FontType get 0 eq {
>> ... etc ...
>>
>> This causes the contents of the stack to be written to the Log
>> immediately before xyshow is called.
>> The indication is that xyshow is indeed well-defined,
>> yet something still goes wrong.
>
> Thank you for the hint, I will try to do more debugging using this technique.
PostScript has a few neat constructions that help with debugging.
Another, which might be helpful here, is the 'stopped'
operator.
With this, you can cause messages to be printed to the log,
only when an error has been encountered. This kind of debugging
would most likely be using 'pstack', 'mark' and 'cleartomark'
if you want to see what was on the stack when the error occurs.
>> By commenting out groups of lines like this, I can process
>> further and further into the file.
>>
>>
>> Does it mean that the error is not caused by any particular call to xyshow, but more likely by the number of such calls made consecutively?
> Maybe it is some kind of nesting issue, or an issue with the stack not being freed properly?
Not the number of calls.
There can be a large number of instances of xyshow between those
which fail, or they can be more or less adjacent.
I suspect that the character strings that xyshow is trying to set
are faulty in some way; that is, correspond to non-existent code-points
or glyphs within the subsetted font.
But I'm no expert on this, so it is pretty much guesswork.
>>
>> It seems that the errors occur with Chinese characters,
>> always when coming from the font referenced as: /F243_0
>> which is:
>>
>> /F243_0 /ZJWNJQ+SimSun 0 pdfMakeFont16L3
>
>
>> Thus it would seem that there could be something badly wrong
>> with this font, or with the way it is being used in this document.
>>
>> Note carefully what I am saying here.
>> Not every instance of this font's usage causes an error,
>> but all the errors that I have found are associated with
>> an instance of this font's use.
>>
>>
>>
>> >>
>> >> The PS file can be retrieved from here, it is 18Mb in size. (Unlike pdftocairo, pdftops generates huge PS files. This particular one gets 10x larger when I provide licensed fonts to pdftops.)
>>
>> Yes.
>> Almost all of the first 87% of the file is devoted to the fonts.
>>
>
> I kind of wonder why pdftops embeds so many instances of the font, while both pdftocairo and pdf2ps somehow avoid this problem and create smaller PS documents. But, that's a subject for another discussion.
A lot of large fonts are being included.
Some are not even used, I'd guess.
viz.
[GlenMorangie:] rossmoor% grep -n font china-visa-application-without-fonts.ps | grep BeginResource
509:%%BeginResource: font ZJWNJQ+SimSun
11688:%%BeginResource: font AASELS+TimesNewRoman,Bold
14028:%%BeginResource: font JEIVZQ+SimSun
25207:%%BeginResource: font HRUUFF+SimSun
36454:%%BeginResource: font AdobeSongStd-Light
61390:%%BeginResource: font SimHei
86326:%%BeginResource: font SimSun
111296:%%BeginResource: font MicrosoftYaHei
136232:%%BeginResource: font MicrosoftYaHei,Bold
160033:%%BeginResource: font NSimSun
185037:%%BeginResource: font AdobeSongStd#20Light
209973:%%BeginResource: font FF487_0_ZJWNJQ+SimSun
221186:%%BeginResource: font FF589_0_ZJWNJQ+SimSun
232365:%%BeginResource: font QGJLNI+CambriaMath
234968:%%BeginResource: font FF520_0_AdobeSongStd-Light
259938:%%BeginResource: font KozMinPr6N-Regular
The subset: ZJWNJQ+SimSun isn't too large, at roughly 11000 lines.
Whereas the un-subset SimSun is roughly 25000 lines.
But then there seem to be 2 more subsets:
FF487_0_ZJWNJQ+SimSun and FF589_0_ZJWNJQ+SimSun .
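The sizes quoted above can be read off the grep output by subtracting successive line numbers. A minimal Python sketch of that arithmetic follows. The function name `resource_spans` is made up for illustration, and each difference is only an upper bound on a font's size, since it also counts any lines between the end of one resource and the start of the next.

```python
# grep -n output copied from the listing above (line number, then DSC comment).
GREP_OUTPUT = """\
509:%%BeginResource: font ZJWNJQ+SimSun
11688:%%BeginResource: font AASELS+TimesNewRoman,Bold
14028:%%BeginResource: font JEIVZQ+SimSun
25207:%%BeginResource: font HRUUFF+SimSun
36454:%%BeginResource: font AdobeSongStd-Light
61390:%%BeginResource: font SimHei
86326:%%BeginResource: font SimSun
111296:%%BeginResource: font MicrosoftYaHei
136232:%%BeginResource: font MicrosoftYaHei,Bold
160033:%%BeginResource: font NSimSun
185037:%%BeginResource: font AdobeSongStd#20Light
209973:%%BeginResource: font FF487_0_ZJWNJQ+SimSun
221186:%%BeginResource: font FF589_0_ZJWNJQ+SimSun
232365:%%BeginResource: font QGJLNI+CambriaMath
234968:%%BeginResource: font FF520_0_AdobeSongStd-Light
259938:%%BeginResource: font KozMinPr6N-Regular
"""


def resource_spans(grep_output):
    """Map each font name to the distance, in lines, to the next BeginResource."""
    entries = []
    for line in grep_output.splitlines():
        lineno, _, comment = line.partition(":")
        name = comment.split(": font ", 1)[1].strip()
        entries.append((int(lineno), name))
    # The last entry has no successor in the listing, so it is left out.
    return {name: nxt - start
            for (start, name), (nxt, _) in zip(entries, entries[1:])}


spans = resource_spans(GREP_OUTPUT)
print(spans["ZJWNJQ+SimSun"])  # 11179
print(spans["SimSun"])         # 24970
```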
>
> I only point to this location because it's the first mention of the [1.447 0 1.447 0 1.447 0 1.447 0] sequence, referred to by Distiller error message. Perhaps I misinterpret Distiller's message.
Arrays with these numbers occur quite a lot.
They determine the spacing between successive characters or glyphs.
e.g. at line 284996 we get a failing block:
(A\361+cB'>\230)
[1.447
0
1.447
0
1.447
0
1.447
0] Tj
There are 4 instances of 16-bit character or glyph-ids here:
"A\361", "+c", "B'", ">\230"
where each character or octal code (\xxx) represents 8 binary bits.
If I'm converting these into hex correctly, they should correspond
to the Unicode characters:
Ux0041F1 : 䇱
Ux002B63 : ???
Ux004227 : 䈧
Ux003E98 : 㺘
That 2nd one looks suspicious.
So maybe these codes do not map directly to Unicode.
We would have to look more closely at the font itself,
which is not so easy to do --- at least not for me.
Also in this vein, this string works OK, from line 284976 :
(\004]\011~\004\352"A)
Ux00045D : ѝ
Ux00097E : ॾ
Ux0004EA : Ӫ
Ux002241 : ≁
but none of these symbols is Chinese.
So this probably isn't the correct interpretation.
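Converting these strings by hand is error-prone, because octal escapes and plain ASCII characters have to be mixed into the same hex value. Here is a minimal Python sketch that does the conversion mechanically, assuming the text between the parentheses is pure ASCII and that the bytes pair up big-endian into 16-bit codes. The helper names `ps_string_bytes` and `pair_codes` are invented for this illustration. Whether the resulting codes are Unicode values, CIDs or glyph ids depends on the font's encoding, which this sketch does not examine.

```python
import re

# Single-character escapes of a PostScript string literal.
SIMPLE_ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
                  "\\": 92, "(": 40, ")": 41}


def ps_string_bytes(body):
    """Decode the text between ( and ) of a PostScript string into bytes.

    Assumes the input is ASCII; \\ddd octal escapes supply the other bytes.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ord(ch))
            continue
        if i >= len(body):
            break  # lone trailing backslash: nothing left to escape
        octal = re.match(r"[0-7]{1,3}", body[i:])
        if octal:
            out.append(int(octal.group(), 8) & 0xFF)
            i += len(octal.group())
        elif body[i] in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[body[i]])
            i += 1
        elif body[i] == "\n":
            i += 1  # backslash-newline is a line continuation
        else:
            out.append(ord(body[i]))  # unknown escape: backslash is dropped
            i += 1
    return bytes(out)


def pair_codes(data):
    """Group bytes into big-endian 16-bit codes (a trailing odd byte is dropped)."""
    return [(data[k] << 8) | data[k + 1] for k in range(0, len(data) - 1, 2)]


failing = pair_codes(ps_string_bytes(r"A\361+cB'>\230"))
working = pair_codes(ps_string_bytes(r'\004]\011~\004\352"A'))
print(" ".join(f"{c:04X}" for c in failing))  # 41F1 2B63 4227 3E98
print(" ".join(f"{c:04X}" for c in working))
```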
>
> Hope this helps,
>
> It helps greatly, thanks again.
>
>
> -Alex
I don't think that there is anything more that I can do.
Let me know if you get anywhere further with this.
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross.moore at mq.edu.au
Mathematics Department office: E7A-206
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------