[poppler] [patch] Add ability to extract embedded files.

Sun Aug 28 18:48:34 PDT 2005

At 08:07 AM 8/27/2005, Brad Hards wrote:
>As a by-product of my flight to Akademy (yeah, thanks BA - I would like my
>clothes, back )-:, I worked up this little patch. It provides the capability
>to extract an "attached" or embedded file.

         Cool!

>The change to the core poppler code is as shown below. The change is actually
>pretty small - I did have to expand the API for the NameTree class a little.

         OK - so a bit of history on embedded files in PDF...

         The method for embedded files that you have supported is the 
modern method (the EmbeddedFiles name tree, PDF 1.4 and later) - it 
provides a more logical grouping & organization.  PDF 1.6 added some 
additional metadata (a file description) and also support for 
encrypting the attachments w/o encrypting the entire doc.

         Prior to that, going back to PDF 1.3, is the file attachment 
annotation.  It is an annotation with Subtype FileAttachment, and its 
FS entry refers to the same data structures (file specifications and 
embedded file streams) that you are working with from the Names 
tree - they just live elsewhere.

         Acrobat 7's Attachment pane displays both types to the 
user...Poppler should too.
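
         As a rough sketch of what handling both types could look 
like: walk the EmbeddedFiles name tree, then scan the annotations for 
FileAttachment ones, and put everything into one list.  This is a toy 
model only, not poppler or Xpdf code.  FileSpec, NameTreeNode, 
Annotation, collectFromNameTree and collectAttachments are all names 
made up for illustration, and it assumes a C++17 compiler.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PDF file specification dictionary.
struct FileSpec {
    std::string fileName;     // the /F entry
    std::string description;  // the /Desc entry, may be empty
};

// Hypothetical stand-in for a name tree node: leaf nodes carry
// (name, value) pairs, intermediate nodes carry kids.
struct NameTreeNode {
    std::vector<std::pair<std::string, FileSpec>> names;
    std::vector<NameTreeNode> kids;
};

// Hypothetical stand-in for an annotation dictionary.
struct Annotation {
    std::string subtype;  // e.g. "FileAttachment", "Text", "Link"
    FileSpec fs;          // the /FS entry, only meaningful for FileAttachment
};

// Depth-first walk: appends every file spec found at or below this node.
void collectFromNameTree(const NameTreeNode &node, std::vector<FileSpec> &out) {
    for (const auto &entry : node.names)
        out.push_back(entry.second);
    for (const auto &kid : node.kids)
        collectFromNameTree(kid, out);
}

// Name tree entries first, then any FileAttachment annotations.
std::vector<FileSpec> collectAttachments(const NameTreeNode &embeddedFiles,
                                         const std::vector<Annotation> &annots) {
    std::vector<FileSpec> out;
    collectFromNameTree(embeddedFiles, out);
    for (const auto &a : annots)
        if (a.subtype == "FileAttachment")
            out.push_back(a.fs);
    return out;
}
```

         In a real document the annotations hang off each page, so 
the second loop would have to run once per page.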

>I'm not sure I'm walking down the datastructures in a reliable way, because I
>forgot to generate a test file with Acrobat before leaving home, and resorted
>to creating one with pdftk. See the attached document to see the proposed Qt4
>API / test application. I've also checked in that pdftk-created test example
>(into test/unittestcases/, as WithAttachments.pdf)

         Here are two additional attachment test 
files...A6EmbeddedFiles.pdf was (obviously) created by Acrobat 6 with 
the new Names tree feature.  Shapes+attachments.pdf was created with 
Acro7 (alpha, I think) but uses the new features.

>There is potentially other metadata that could be extracted. At this stage
>there is a description that shows up in Acrobat Reader that I can't find in
>the file. I guess that Acrobat 7 will put more that pdftk-1.12, based on the
>column headers in Acrobat 7 Reader.

         Not too much...

         Here is the snippet from our PDFspy program that is 
Xpdf-based.  This is the function that dumps info about an embedded 
file Names entry to our XML grammar.

void attrs2xml::EFNames2XML( Object* inEFObj, domLite::nodePtr inNode,
                             UnicodeMap *inUnicodeMap )
{
         if ( inEFObj->isDict() ) {
                 Object  fnObj;
                 inEFObj->dictLookup( "F", &fnObj );
                 if ( fnObj.isString() ) {
                         inNode->contents(
                                 std::string( fnObj.getString()->getCString() ) );
                 }
                 fnObj.free();

                 // PDF 1.6 - file description (the "Desc" key of the file spec)
                 Object  desc;
                 inEFObj->dictLookup( "Desc", &desc );
                 if ( desc.isString() ) {
                         String2XML( desc.getString(), std::string( "description" ),
                                     false, inNode, inUnicodeMap );
                 }
                 desc.free();
         }
}
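
Since the description is optional (a pdftk-made file may not have one 
at all), the caller needs to cope with it being absent.  Here is a 
minimal standalone sketch of that lookup, assuming the description 
lives under the file specification's Desc key as in the PDF 1.6 
reference.  Dict, AttachmentInfo and readFileSpec are made-up names, 
and the std::map only stands in for a real PDF dictionary.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a PDF dictionary with string values only.
using Dict = std::map<std::string, std::string>;

struct AttachmentInfo {
    std::string fileName;
    std::string description;
    bool hasDescription;  // false when the file spec has no Desc entry
};

// Reads the two entries of interest; missing keys are not an error.
AttachmentInfo readFileSpec(const Dict &fileSpec) {
    AttachmentInfo info{"", "", false};

    auto f = fileSpec.find("F");
    if (f != fileSpec.end())
        info.fileName = f->second;

    // Assumption: PDF 1.6 stores the description under "Desc".
    auto d = fileSpec.find("Desc");
    if (d != fileSpec.end()) {
        info.description = d->second;
        info.hasDescription = true;
    }
    return info;
}
```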

Hope this helps...

Leonard

---------------------------------------------------------------------------
Leonard Rosenthol                            <mailto:leonardr at pdfsages.com>
Chief Technical Officer                      <http://www.pdfsages.com>
PDF Sages, Inc.                              215-938-7080 (voice)
                                              215-938-0880 (fax)