[poppler] Toward JBIG2 support in CairoOutputDev
mpsuzuki at hiroshima-u.ac.jp
Wed Dec 31 00:29:19 PST 2014
This week I have been trying to make CairoOutputDev emit CCITT
G4 or JBIG2 data directly, to reduce the file size (my impression
is that transcoding from CCITT G4/JBIG2 to Deflate at least
doubles the data size). Cairo does not support CCITT emission
yet, but it already supports JBIG2 emission (to the PDF surface).
However, because JBIG2-coded data is sometimes not
self-contained, there are a few points I need to consider. So I
would like to ask for your comments on an appropriate design.
What is JBIG2Globals?
The problem is the handling of "Globals". Some JBIG2 streams
in a PDF may refer to another binary data stream, "Globals", that
is shared by multiple JBIG2 images (by storing the common content
as an external, shared resource, the PDF can reduce its file size).
Here is a quote from the PDF spec (PDF 32000-1:2008), p.33:
5 0 obj
<< /Type /XObject
/Filter [/ASCIIHexDecode /JBIG2Decode]
/DecodeParms [null << /JBIG2Globals 6 0 R >>]
JBIG2Globals is a shared data stream stored outside the JBIG2
image stream.
Cairo interface to manage JBIG2Globals
In cairo, we can pass 3 kinds of JBIG2-related data
via the cairo_surface_set_mime_data() API:
1) The JBIG2 data itself (the stream in "5 0 obj" itself, in
the above example),
2) The JBIG2 global data (the stream in "6 0 R" in the above example),
3) A unique ID specifying which JBIG2 global data should be
used in the decoding process.
Although I do not fully understand the official design in cairo,
it seems that the unique-id (3) is passed first, the JBIG2
image (1) next, and finally the JBIG2 global data (2). When
the JBIG2 image is passed, cairo binds it to the most recently
declared unique-id; likewise, when the JBIG2 global data (2)
is passed, cairo binds it to the most recently declared
unique-id. Therefore, even if we send the same JBIG2 global
data (2) repeatedly, as long as we do not change the
unique-id (3), only one copy of the JBIG2 global data is
emitted to the PDF output.
The problem is how to determine the unique-id for the
JBIG2 global data.
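To make the role of the unique-id concrete, here is a minimal
toy model in C++ of the behaviour described above. It is not
cairo's implementation and calls no cairo API; ToyPdfWriter and
its methods are hypothetical names invented for illustration. It
only shows that the number of emitted global streams follows the
number of distinct ids, not the number of images.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a PDF writer (hypothetical, NOT cairo's code).
class ToyPdfWriter {
public:
    // Register one JBIG2 image together with its globals and the
    // unique-id naming those globals. The globals are counted as
    // written only the first time a given id is seen.
    void addJbig2Image(const std::string &globalsId,
                       const std::vector<unsigned char> &globals,
                       const std::vector<unsigned char> &image)
    {
        assert(!globalsId.empty());
        if (emittedIds_.insert(globalsId).second) {
            globalsBytes_ += globals.size();
        }
        imageBytes_ += image.size();
        ++imageCount_;
    }

    std::size_t globalsStreamCount() const { return emittedIds_.size(); }
    std::size_t globalsBytes() const { return globalsBytes_; }
    std::size_t imageBytes() const { return imageBytes_; }
    std::size_t imageCount() const { return imageCount_; }

private:
    std::set<std::string> emittedIds_; // ids whose globals were written
    std::size_t globalsBytes_ = 0;
    std::size_t imageBytes_ = 0;
    std::size_t imageCount_ = 0;
};
```

In this model, two images registered with the same id cause one
global stream to be written, while the same global bytes
registered under two different ids are written twice. That is
why the choice of id matters.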
Problem of making a unique-id for JBIG2Globals in PDF
The easiest and most straightforward idea would be to use the
object number and generation number of the reference (to the
JBIG2 global data) to form a unique-id. In the above example,
we could declare it as "pdf-jbig2-globals-6-0".
But it seems that, in the current design, JBIG2Stream holds
the stream itself, not the indirect object referring
to the stream (in the above example, the JBIG2Stream class
can access the content of the "6 0 R" stream, but
cannot know how it was referred to: the object number
(=6) and generation number (=0)).
Furthermore, we can imagine a worse case, where differently
chained references lead to the same object:
1 0 obj
<< /Length 100 >>
2 0 obj
1 0 R
3 0 obj
/DecodeParms << /JBIG2Globals 1 0 R >>
4 0 obj
/DecodeParms << /JBIG2Globals 2 0 R >>
I'm not sure whether such chained indirect objects are
prohibited (I could not find such a statement in PDF
32000-1:2008, p.21-22). If they are not prohibited, then
using "1 0 R" and "2 0 R" to make the unique-id would
duplicate the global data.
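The duplication risk can be sketched with a toy object table in
C++. Nothing here is poppler code: ToyRef, ToyObject, ToyXRef,
resolveToStream() and makeGlobalsId() are hypothetical names, and
the table models only "stream" versus "bare reference to another
object". The sketch assumes that such chains are legal and that
following them until the stream object is reached is acceptable.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical indirect reference: (object number, generation number).
using ToyRef = std::pair<int, int>;

// Toy object: either a stream, or a bare reference to another object.
struct ToyObject {
    bool isStream;
    ToyRef target; // meaningful only when isStream is false
};

using ToyXRef = std::map<ToyRef, ToyObject>;

// Follow bare references until the object holding the stream is found.
// Returns false on a dangling reference, or when maxDepth hops are
// exceeded (a guard against reference loops).
bool resolveToStream(const ToyXRef &xref, ToyRef ref, ToyRef *out,
                     int maxDepth = 32)
{
    assert(out != nullptr);
    for (int depth = 0; depth < maxDepth; ++depth) {
        ToyXRef::const_iterator it = xref.find(ref);
        if (it == xref.end()) {
            return false; // dangling reference
        }
        if (it->second.isStream) {
            *out = ref; // last reference before the stream itself
            return true;
        }
        ref = it->second.target;
    }
    return false; // chain too long, probably a loop
}

// Build an id in the "pdf-jbig2-globals-<num>-<gen>" form.
std::string makeGlobalsId(ToyRef ref)
{
    return "pdf-jbig2-globals-" + std::to_string(ref.first) + "-" +
           std::to_string(ref.second);
}
```

With object 1 0 as the stream and object 2 0 as a bare "1 0 R",
ids built directly from "1 0 R" and "2 0 R" differ, while ids
built after resolveToStream() are both "pdf-jbig2-globals-1-0".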
There might be 2 ideas to solve this problem:
A) Copy the global data content to a temporary buffer
(this is not wasted work, because we have to pass it
to cairo anyway), calculate some hash value,
and use it as the unique-id.
B) Track the chained references to the stream object,
and use the last referring object before the stream
to make the unique-id. However, we may have to extend
the JBIG2Stream class to hold the referring object (or
its object number and generation number).
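Idea A could look roughly like the following C++ sketch. The
choice of 64-bit FNV-1a is an assumption made only to keep the
example self-contained; any hash would do, and a non-cryptographic
hash can collide, so the buffer length is appended to the id as a
cheap extra discriminator. The function names and the id format
are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// 64-bit FNV-1a hash over a byte buffer.
std::uint64_t fnv1a64(const unsigned char *data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    std::uint64_t h = 0xcbf29ce484222325ULL; // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL; // FNV prime
    }
    return h;
}

// Build a unique-id from the content of the global data, so the
// result does not depend on how the stream was referenced.
std::string makeGlobalsIdFromContent(const std::vector<unsigned char> &globals)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "pdf-jbig2-globals-%016llx-%zu",
                  static_cast<unsigned long long>(
                      fnv1a64(globals.data(), globals.size())),
                  globals.size());
    return std::string(buf);
}
```

Two buffers with identical bytes get the same id however they
were reached, which covers the chained-reference case without
changing JBIG2Stream. The cost is one pass over the global data
per image.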
Which is better? Or is there any other good idea for making
a unique-id for the JBIG2 global data?