[poppler] [patch] Add ability to extract embedded files.
Leonard Rosenthol
leonardr at pdfsages.com
Sun Aug 28 18:48:34 PDT 2005
At 08:07 AM 8/27/2005, Brad Hards wrote:
>As a by-product of my flight to Akademy (yeah, thanks BA - I would like my
>clothes, back )-:, I worked up this little patch. It provides the capability
>to extract an "attached" or embedded file.
Cool!
>The change to the core poppler code is as shown below. The change is actually
>pretty small - I did have to expand the API for the NameTree class a little.
OK - so a bit of history on embedded files in PDF...
The method for embedded files that you have supported is the
modern (PDF 1.5 and later) method for doing so - it provides a more
logical grouping & organization. PDF 1.6 added some additional
metadata (file description) and also support for encryption of the
enclosures w/o encrypting the entire doc.
Prior to that, going back to PDF 1.2 is the "EmbeddedFile"
annotation. It is a Subtype of Annotation which refers to the same
data structures that you are working with from the Names tree - but
just living elsewhere.
Acrobat 7's Attachment pane displays both types to the user, and
Poppler should too.
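To make the "both types" point concrete, here is a minimal standalone sketch of merging the two sources into one list for display. Every name in it (Attachment, collectAttachments, streamRef) is hypothetical and only stands in for whatever the real walk over the Names tree and the page annotations would produce; none of it is poppler or Xpdf API. It also assumes that two entries pointing at the same embedded stream should be listed only once, which is a design choice rather than anything the format requires.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical record for one embedded file, whichever structure it came from.
struct Attachment {
    std::string name;     // file name taken from the file specification
    int streamRef;        // stand-in for the object number of the embedded file stream
    bool fromAnnotation;  // true if found via an annotation, false if via the Names tree
};

// Merge the attachments found in the Names tree with those found in
// annotations. An entry whose stream was already seen is skipped, so the
// Names tree entry wins when both structures point at the same data.
std::vector<Attachment> collectAttachments(
    const std::vector<Attachment>& fromNamesTree,
    const std::vector<Attachment>& fromAnnotations)
{
    std::vector<Attachment> result;
    std::set<int> seen;

    auto addAll = [&](const std::vector<Attachment>& source) {
        for (const Attachment& a : source) {
            // insert() reports whether the stream reference was new
            if (seen.insert(a.streamRef).second) {
                result.push_back(a);
            }
        }
    };

    addAll(fromNamesTree);
    addAll(fromAnnotations);
    return result;
}
```

A viewer could then build its attachment list from the single returned vector and use fromAnnotation only if it wants to mark where an entry lives.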
>I'm not sure I'm walking down the datastructures in a reliable way, because I
>forgot to generate a test file with Acrobat before leaving home, and resorted
>to creating one with pdftk. See the attached document to see the proposed Qt4
>API / test application. I've also checked in that pdftk-created test example
>(into test/unittestcases/, as WithAttachments.pdf)
Here are two additional attachment test files.
A6EmbeddedFiles.pdf was (obviously) created by Acrobat 6 with
the new Names tree feature. Shapes+attachments.pdf was created with
Acro7 (alpha, I think) but uses the new features.
>There is potentially other metadata that could be extracted. At this stage
>there is a description that shows up in Acrobat Reader that I can't find in
>the file. I guess that Acrobat 7 will put more than pdftk-1.12, based on the
>column headers in Acrobat 7 Reader.
Not too much...
Here is a snippet from our Xpdf-based PDFspy program. This is
the function that dumps info about an embedded file Names entry
to our XML grammar.
void attrs2xml::EFNames2XML( Object* inEFObj, domLite::nodePtr inNode,
                             UnicodeMap *inUnicodeMap )
{
    if ( inEFObj->isDict() ) {
        // "F" - the file specification string, used as the node contents
        Object fnObj;
        inEFObj->dictLookup( "F", &fnObj );
        if ( fnObj.isString() ) {
            inNode->contents( std::string( fnObj.getString()->getCString() ) );
        }
        fnObj.free();

        // PDF 1.6 - file description
        Object desc;
        inEFObj->dictLookup( "Description", &desc );
        if ( desc.isString() ) {
            String2XML( desc.getString(), std::string( "description" ),
                        false, inNode, inUnicodeMap );
        }
        desc.free();
    }
}
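For anyone who wants to play with the idea without an Xpdf tree at hand, here is a rough standalone sketch of the same two lookups. The std::map stand-in for the dictionary, the names Dict, xmlEscape and efNamesToXml, and the XML shape it emits are all made up for illustration and are not our real grammar. The description key is a parameter that defaults to the one used in the snippet above; check it against the PDF Reference for the version you target.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a PDF dictionary whose values are all strings.
using Dict = std::map<std::string, std::string>;

// Escape the characters that are special in XML text and attribute values.
std::string xmlEscape(const std::string& in)
{
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Emit <file description="...">name</file>. A missing key is simply
// skipped, mirroring the isString() checks in the Xpdf-based version.
std::string efNamesToXml(const Dict& fileSpec,
                         const std::string& descKey = "Description")
{
    std::string xml = "<file";

    auto desc = fileSpec.find(descKey);
    if (desc != fileSpec.end()) {
        xml += " description=\"" + xmlEscape(desc->second) + "\"";
    }
    xml += ">";

    auto name = fileSpec.find("F");
    if (name != fileSpec.end()) {
        xml += xmlEscape(name->second);
    }

    xml += "</file>";
    return xml;
}
```

Escaping is done here because file names and descriptions are user-supplied text and can contain characters such as & or <.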
Hope this helps...
Leonard
---------------------------------------------------------------------------
Leonard Rosenthol <mailto:leonardr at pdfsages.com>
Chief Technical Officer <http://www.pdfsages.com>
PDF Sages, Inc. 215-938-7080 (voice)
215-938-0880 (fax)