Chapter 14: Document Stitching, Metadata, PDF/A
14.1 Document Stitching
AspPDF is capable of joining together two or more PDFs to form a new document. This process is often referred to as document stitching.
14.1.1 AppendDocument Method
Document stitching is performed via the AppendDocument method provided by the PdfDocument object. This method expects a single argument: an instance of the PdfDocument object representing another document to be appended to the current document. The AppendDocument method can be called more than once to append multiple documents to the current one.
The PdfDocument object to which other documents are appended (the master) can be either a new or an existing document. The PdfDocument objects that get appended must all be existing documents. A document cannot be appended to itself.
The master document determines the general and security properties of the resultant document.
The following code sample appends the file doc2.pdf to the end of the document doc1.pdf:
' Open document 1
Set Doc1 = Pdf.OpenDocument( Server.MapPath("doc1.pdf") )
' Open document 2
Set Doc2 = Pdf.OpenDocument( Server.MapPath("doc2.pdf") )
' Append doc2 to doc1
Doc1.AppendDocument Doc2
' Save document, the Save method returns generated file name
Filename = Doc1.Save( Server.MapPath("stitch.pdf"), False )
// Open Document 1
IPdfDocument objDoc1 = objPdf.OpenDocument(Server.MapPath("doc1.pdf"), Missing.Value);
// Open Document 2
IPdfDocument objDoc2 = objPdf.OpenDocument(Server.MapPath("doc2.pdf"), Missing.Value);
// Append doc2 to doc1
objDoc1.AppendDocument(objDoc2);
// Save document, the Save method returns generated file name
String strFilename = objDoc1.Save( Server.MapPath("stitch.pdf"), false );
14.1.2 Applying and Removing Security
A cumulative document produced by appending one or more PDFs to a master document inherits the master document's security properties. For example, if a master document is encrypted and the documents appended to it are not, the resultant PDF will be encrypted with the same passwords and permission flags as the master document. Conversely, if the master document is unencrypted and encrypted documents are appended to it, the resultant document will be unencrypted.
This feature can be used to apply security to unsecure documents, as well as modify or remove security from encrypted documents. The idea is to create an empty document, call the Encrypt method on it if necessary, then append the PDF that needs security added or removed.
To comply with Adobe PDF licensing requirements, AspPDF performs security removal only if the document being appended is opened using the owner password. Otherwise, an exception is thrown.
The following code sample applies security to the file doc1.pdf. Note that various document properties are being copied from the original document (doc1.pdf) to the new one, because by default the resultant PDF would inherit document properties of the master PDF (in our case, an empty document) and the original document's properties would be lost.
' Create empty document
Set Doc = Pdf.CreateDocument
' Open document to apply security to
Set Doc1 = Pdf.OpenDocument( Server.MapPath("doc1.pdf") )
' Copy properties
Doc.Title = Doc1.Title
Doc.Creator = Doc1.Creator
Doc.Producer = Doc1.Producer
Doc.CreationDate = Doc1.CreationDate
Doc.ModDate = Doc1.ModDate
' Apply security to Doc
Doc.Encrypt "abc", "", 128
' Append doc1 to doc
Doc.AppendDocument Doc1
' Save document, the Save method returns generated file name
Filename = Doc.Save( Server.MapPath("apply.pdf"), False )
// Create empty document
IPdfDocument objDoc = objPdf.CreateDocument( Missing.Value );
// Open Document 1
IPdfDocument objDoc1 = objPdf.OpenDocument( Server.MapPath("doc1.pdf"), Missing.Value );
// Copy properties
objDoc.Title = objDoc1.Title;
objDoc.Creator = objDoc1.Creator;
objDoc.Producer = objDoc1.Producer;
objDoc.CreationDate = objDoc1.CreationDate;
objDoc.ModDate = objDoc1.ModDate;
// Apply security to Doc
objDoc.Encrypt( "abc", "", 128, Missing.Value );
// Append doc1 to doc
objDoc.AppendDocument( objDoc1 );
// Save document, the Save method returns generated file name
String strFilename = objDoc.Save( Server.MapPath("apply.pdf"), false );
14.1.3 Making Changes to Documents Being Appended
As mentioned earlier, a document being appended must be an existing document opened via OpenDocument or OpenDocumentBinary. Changes made to a document being appended will not propagate to the resultant compound document.
If you need to make changes to a document being appended, the following workaround is recommended:
Set Doc2 = Pdf.OpenDocument(...)
' Make changes to Doc2
Set Doc3 = Pdf.OpenDocumentBinary( Doc2.SaveToMemory )
Doc1.AppendDocument Doc3
This code fragment uses an intermediary memory-based document Doc3 to hold the modified version of Doc2.
14.1.4 Creating Multi-Page Documents Based on a Template
AppendDocument is not a very efficient way to create multi-page documents based on a single-page PDF template. We recommend that the method CreateGraphicsFromPage described in Section 9.6 be used for this task instead. For a code sample, see our KB Article PS130905190.
14.2 Metadata
All major Adobe products share a common technology that enables you to embed data describing a file, known as metadata, into the file itself. This technology, called Extensible Metadata Platform (XMP), uses XML as the syntax for metadata description. For more information on XMP, go to http://www.adobe.com/products/xmp.
XML tags used in an XMP data block are described by the Resource Description Framework (RDF) available at http://www.w3.org/RDF.
A typical metadata string may look as follows:
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description about='' xmlns='http://ns.adobe.com/pdf/1.3/'
xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
<pdf:CreationDate>2002-12-24T07:48:28Z</pdf:CreationDate>
<pdf:ModDate>2003-02-28T19:39:16+09:00</pdf:ModDate>
<pdf:Producer>Acrobat Distiller 5.0.1 for Macintosh</pdf:Producer>
<pdf:Title>Technical Specifications</pdf:Title>
<pdf:Author>John Smith</pdf:Author>
</rdf:Description>
<rdf:Description about=''
xmlns='http://ns.adobe.com/xap/1.0/'
xmlns:xap='http://ns.adobe.com/xap/1.0/'>
<xap:CreateDate>2002-12-24T07:48:28Z</xap:CreateDate>
<xap:ModifyDate>2003-02-28T19:39:16+09:00</xap:ModifyDate>
<xap:MetadataDate>2003-02-28T19:39:16+09:00</xap:MetadataDate>
<xap:Title>
<rdf:Alt>
<rdf:li xml:lang='x-default'>Technical Specifications</rdf:li>
</rdf:Alt>
</xap:Title>
<xap:Author>John Smith</xap:Author>
</rdf:Description>
<rdf:Description about=''
xmlns='http://purl.org/dc/elements/1.1/'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:title>Technical Specifications</dc:title>
<dc:creator>John Smith</dc:creator>
</rdf:Description>
</rdf:RDF>
AspPDF enables you to retrieve and specify metadata associated with a PDF document via the MetaData property of the PdfDocument object. The following code fragment extracts and prints out metadata from an existing PDF file:
Response.Write Doc.MetaData
AspPDF provides no functionality for parsing out individual metadata items. Any XML parser object can be used for that, such as Microsoft XML DOM.
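Because parsing is left to an external XML parser, a minimal C# sketch of extracting one item might look like the following. It uses .NET's System.Xml.Linq, which is not part of AspPDF. The names XmpReader and GetDefaultTitle are hypothetical, and the sketch assumes the metadata string is well-formed XML with a single root element and stores the title as a Dublin Core <dc:title> with an rdf:Alt list.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of AspPDF.
public static class XmpReader
{
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";
    static readonly XNamespace Rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns the text of the x-default <dc:title> entry,
    // or null if the metadata contains no such entry.
    public static string GetDefaultTitle(string metadata)
    {
        XDocument doc = XDocument.Parse(metadata);
        return doc.Descendants(Dc + "title")
                  .Descendants(Rdf + "li")
                  .Where(li => (string)li.Attribute(XNamespace.Xml + "lang") == "x-default")
                  .Select(li => li.Value)
                  .FirstOrDefault();
    }
}
```

A caller would pass the value of the MetaData property, for example XmpReader.GetDefaultTitle(objDoc.MetaData). Other items can be read the same way by changing the namespace and element name.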
14.3 PDF/A Support
14.3.1 PDF/A: PDF for Archiving
The PDF/A format is a subset of the regular PDF format with certain features, deemed incompatible with long-term archival and storage of documents, removed. PDF/A-compliant documents must be completely self-contained, with no reliance on external resources. The single most important requirement for PDF/A files is that all fonts must be embedded. Other requirements include:
- Encryption is not allowed;
- Documents must contain standards-based metadata;
- Links to other documents and URLs are not allowed;
- Use of device-dependent color spaces such as DeviceRGB is only allowed with some restrictions;
- Certain other PDF features, such as JavaScript, XML Forms Architecture (XFA), LZW compression, and others, are not allowed.
There are currently three levels of PDF/A conformance: PDF/A-1, PDF/A-2 and PDF/A-3, with Level 1 subdivided into sublevels A and B.
For more information on PDF/A, see http://www.pdfa.org.
14.3.2 AspPDF's Support for PDF/A
As of Version 3.3, AspPDF is capable of producing PDF documents compliant with PDF/A-1b, the basic conformance level which ensures reliable reproduction of the visual appearance of the document. Even prior to Version 3.3, AspPDF embedded all TrueType fonts and allowed metadata to be specified, thus meeting the most important PDF/A requirements. Version 3.3 bridges the remaining gap to full PDF/A-1b compliance by implementing the following features and enhancements:
- The new PdfDocument.AddOutputIntent method enables mapping from a device-dependent color space such as DeviceRGB to a device-independent color space via an International Color Consortium (ICC) profile, thus satisfying the media-independent visual color reproduction requirement.
- The entries /CIDToGIDMap and /CIDSet have been implemented for embedded TrueType fonts.
- A bug that caused certain stream objects to lack the required end-of-line character before the keyword endstream has been fixed.
The AddOutputIntent method expects 4 arguments: the output condition, the output condition identifier, the path to the .icc profile file, and the number of color components in the device-dependent color space used by the document (1 for DeviceGray, 3 for DeviceRGB and 4 for DeviceCMYK). The output condition is a string concisely identifying the intended output device or production condition in human-readable form. The output condition identifier is a string that identifies the output device or production condition as it appears in an industry-standard registry, and can be set to "Custom".
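To keep the fourth argument from being hard-coded as a bare number, the mapping described above can be captured in a small helper. This is only a sketch: OutputIntentHelper and ComponentCount are hypothetical names, not part of AspPDF.

```csharp
using System;

// Hypothetical helper, not part of AspPDF.
public static class OutputIntentHelper
{
    // Maps a device-dependent color space name to its number of
    // color components: 1 for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK.
    public static int ComponentCount(string colorSpace)
    {
        switch (colorSpace)
        {
            case "DeviceGray": return 1;
            case "DeviceRGB":  return 3;
            case "DeviceCMYK": return 4;
            default:
                throw new ArgumentException("Unsupported color space: " + colorSpace);
        }
    }
}
```

The result would then be passed as the last argument of AddOutputIntent, for example OutputIntentHelper.ComponentCount("DeviceRGB") for an RGB profile.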
The metadata format is XML-based and similar to that described in the previous section of this chapter, but must contain additional tags. The following example is a minimal metadata string required for PDF/A-1b compliance:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00" xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>1</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Persits Software AspPDF - www.persits.com</pdf:Producer>
<pdf:Keywords></pdf:Keywords>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
There are several components of this metadata string that are worth noting:
- The metadata must be enclosed within the <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?> and <?xpacket end="w"?> tags.
- The PDF/A level of conformance must be specified via the <pdfaid:part> and <pdfaid:conformance> tags (1 and B for AspPDF's level of conformance.)
- The Producer value must match the current value for the PdfDocument.Producer property which is set to "Persits Software AspPDF - www.persits.com" by default.
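The three points above can be combined into a small routine that assembles the minimal packet from its variable parts. The following is a sketch under stated assumptions: PdfaMetadata and BuildMinimal are hypothetical names, and the optional x:xmptk attribute on <x:xmpmeta> is left out on the assumption that validators do not require it.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of AspPDF.
public static class PdfaMetadata
{
    // Escapes the characters that are significant in XML element content.
    static string XmlEscape(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Builds a minimal PDF/A metadata packet. The producer argument should
    // match the document's Producer property.
    public static string BuildMinimal(string part, string conformance, string producer)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n");
        sb.Append("<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n");
        sb.Append("<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n");
        sb.Append("<rdf:Description rdf:about=\"\" xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">\n");
        sb.Append("<pdfaid:part>" + XmlEscape(part) + "</pdfaid:part>\n");
        sb.Append("<pdfaid:conformance>" + XmlEscape(conformance) + "</pdfaid:conformance>\n");
        sb.Append("</rdf:Description>\n");
        sb.Append("<rdf:Description rdf:about=\"\" xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\">\n");
        sb.Append("<pdf:Producer>" + XmlEscape(producer) + "</pdf:Producer>\n");
        sb.Append("<pdf:Keywords></pdf:Keywords>\n");
        sb.Append("</rdf:Description>\n");
        sb.Append("</rdf:RDF>\n");
        sb.Append("</x:xmpmeta>\n");
        sb.Append("<?xpacket end=\"w\"?>");
        return sb.ToString();
    }
}
```

For AspPDF's PDF/A-1b output the call would be PdfaMetadata.BuildMinimal("1", "B", objDoc.Producer), with the result assigned to the MetaData property.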
In addition to the tags shown above, PDF/A metadata almost always contains "Dublin Core" (DC) tags as well, such as <dc:title> and <dc:description>, for example:
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00" xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
...
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Sunset on the beach</rdf:li>
<rdf:li xml:lang="de-DE">Sonnenuntergang am Strand</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Hello, World</rdf:li>
<rdf:li xml:lang="de-DE">Hallo, Welt</rdf:li>
</rdf:Alt>
</dc:description>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
14.3.3 Code Sample
The following code sample creates a PDF/A-1b compliant document by importing the URL http://www.asppdf.com, attaching the metadata from the text file metadata.txt located in the same folder as the code sample, and adding an output intent based on the color profile AdobeRGB1998.icc located in the sibling folder manual_15 of the installation.
The content of the file metadata.txt is as follows:
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00" xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
<pdfaid:part>1</pdfaid:part>
<pdfaid:conformance>B</pdfaid:conformance>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Persits Software AspPDF - www.persits.com</pdf:Producer>
<pdf:Keywords></pdf:Keywords>
</rdf:Description>
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">@@@title@@@</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
Note that this metadata file is actually a template: it contains the placeholder @@@title@@@ where the actual title should be. The code sample replaces the placeholder with the value of the PdfDocument.Title property (which is not known in advance since ImportFromUrl sets it based on the HTML content it imports) to ensure that the document's Title entry and the value of the <dc:title> tag in the metadata match.
Set Doc = PDF.CreateDocument
' Convert HTML to PDF
Doc.ImportFromUrl "http://www.asppdf.com", "landscape=true; scale=0.75"
' Add metadata from a file
strMetadata = PDF.LoadTextFromFile( Server.MapPath("metadata.txt") )
' Replace placeholder with actual document title
strMetadata = Replace( strMetadata, "@@@title@@@", Doc.Title )
Doc.MetaData = strMetadata
' Add output intent using an RGB color profile. Borrow .icc file from Chapter 15
strProfilePath = Server.MapPath(".") & "\..\manual_15\AdobeRGB1998.icc"
Doc.AddOutputIntent "AdobeRGB", "Custom", strProfilePath, 3
'Save document
Path = Server.MapPath( "pdfa.pdf")
FileName = Doc.Save( Path, false)
// Create empty document
IPdfDocument objDoc = objPdf.CreateDocument(Missing.Value);
// Convert HTML to PDF
objDoc.ImportFromUrl( "http://www.asppdf.com", "landscape=true; scale=0.75", Missing.Value, Missing.Value );
// Add metadata from a file
string strMetadata = objPdf.LoadTextFromFile( Server.MapPath("metadata.txt") );
// Replace placeholder with actual document title
strMetadata = strMetadata.Replace( "@@@title@@@", objDoc.Title );
objDoc.MetaData = strMetadata;
// Add output intent using an RGB color profile. Borrow .icc file from Chapter 15
string strProfilePath = Server.MapPath(".") + @"\..\manual_15\AdobeRGB1998.icc";
objDoc.AddOutputIntent( "AdobeRGB", "Custom", strProfilePath, 3 );
// Save document
string strPath = Server.MapPath( "pdfa.pdf");
string strFileName = objDoc.Save( strPath, false);
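One caveat with plain placeholder substitution: if the imported page's title contains characters such as & or <, the resulting metadata is no longer well-formed XML. A minimal sketch of a safer substitution follows. MetadataTemplate, XmlEscape and FillTitle are hypothetical names, and the sketch assumes the placeholder sits in element content as in metadata.txt.

```csharp
using System;

// Hypothetical helper, not part of AspPDF.
public static class MetadataTemplate
{
    // Escapes the characters that are significant in XML element content.
    // The ampersand must be replaced first so that the entities produced
    // by the later replacements are not escaped a second time.
    public static string XmlEscape(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Substitutes the escaped title for the @@@title@@@ placeholder.
    public static string FillTitle(string template, string title)
    {
        return template.Replace("@@@title@@@", XmlEscape(title ?? ""));
    }
}
```

In the sample above, the Replace call on strMetadata could then be written as strMetadata = MetadataTemplate.FillTitle( strMetadata, objDoc.Title ). An XML parser unescapes the value when reading it back, so the <dc:title> value still matches the document's Title entry.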
14.3.4 PDF/A-3 Support
As of Version 3.7, AspPDF is capable of producing PDF documents with the highest, PDF/A-3, level of compliance. The changes involved small additions to, and modifications in, various internal PDF objects, as well as a major new feature.
14.3.4.1 The List of Changes
The following small changes have been made to AspPDF's PDF-generating engine to make PDF/A-3 validators happy:
- The document catalog now has the attribute MarkInfo and it is set to true.
- The value of the CIDSet attribute of the font object, supported since version 3.3, is now computed differently to be compatible with PDF/A-3 specifications.
- The font's ToUnicode character map has been modified to fit PDF/A-3 requirements.
- Link annotations internally created by ImportFromUrl and DrawText methods to display clickable URLs are now marked printable by default, which is apparently a PDF/A-3 requirement. In previous versions, link annotations had the print flag cleared by default.
- FileAttachment annotations now have the Subtype attribute set to the embedded file's MIME type such as "text/html". PDF/A-3 apparently requires this attribute.
- The ICC profile added by the AddOutputIntent method now has the Info attribute, which AspPDF sets arbitrarily to the profile's filename. This attribute is apparently yet another PDF/A-3 requirement.
- When creating a FileAttachment annotation, two new parameters, Names=true and AF=<value between 0 and 6>, need to be passed to the PdfAnnots.Add method. The Names parameter, when set to True, instructs AspPDF to put the new annotation object in the document catalog's Names array instead of the Annots array, where it would normally reside. The AF parameter, which stands for "associated files relationship", is responsible for adding the AFRelationship entry to the file annotation object, which is yet another PDF/A-3 requirement. The AF value can be one of the following: 0 (Source), 1 (Data), 2 (Alternative), 3 (Supplement), 4 (Unspecified), 5 (EncryptedPayload) and 6 (None).
- All FileAttachment annotations must be included in the logical structure tree by calling the method AddStructureElement introduced in Version 3.7. The tree's root is represented by the entry StructTreeRoot which is added to the document catalog. Even if there are no FileAttachment annotations, the document still must contain a logical structure tree (possibly with no nodes) to be PDF/A-3 compliant. AddStructureElement is described in detail below.
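The numeric AF values listed above are easy to mix up, so calling code may prefer named constants. The sketch below mirrors that list in an enum and builds the parameter string for a FileAttachment annotation. The names AFRelationship and AttachmentParams are hypothetical, not part of AspPDF, and the fixed 10-by-10 size is an arbitrary choice for illustration.

```csharp
using System;

// Hypothetical enum mirroring the AF values listed above.
public enum AFRelationship
{
    Source = 0,
    Data = 1,
    Alternative = 2,
    Supplement = 3,
    Unspecified = 4,
    EncryptedPayload = 5,
    None = 6
}

// Hypothetical helper, not part of AspPDF.
public static class AttachmentParams
{
    // Builds a parameter string for a FileAttachment annotation that is
    // placed in the Names array and carries the given AF relationship.
    public static string Build(int x, int y, AFRelationship af)
    {
        return string.Format(
            "x={0}; y={1}; width=10; height=10; Type=FileAttachment; Names=true; AF={2};",
            x, y, (int)af);
    }
}
```

The returned string would be passed as the second argument of the PdfAnnots.Add method, for example AttachmentParams.Build(1, 700, AFRelationship.Source).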
14.3.4.2 Logical Structure Tree
PDF's logical structure facilities provide a mechanism for incorporating structural information about a document's content into a PDF file. Such information might include, for example, the organization of the document into chapters and sections or the identification of special elements such as figures, tables, and footnotes.
As of Version 3.7, AspPDF provides limited support for logical structure trees via the methods BeginMarkedContent, EndMarkedContent, and AddStructureElement.
When a drawing operation such as a call to PdfCanvas.DrawText, or a group of such operations, is sandwiched between calls to PdfCanvas.BeginMarkedContent and PdfCanvas.EndMarkedContent, the content drawn by these operations becomes marked content with a unique ID. The ID is returned by the BeginMarkedContent method. This marked content ID can then be put in the logical structure tree via a subsequent call to AddStructureElement.
The PdfDocument.AddStructureElement method places a structure element onto the logical structure tree and returns a unique item ID of the newly placed item. The method expects 5 arguments:
- Item: an image, graphics or annotation object, or a marked content ID returned by BeginMarkedContent. Can also be set to Nothing (null) for an empty structure element.
- Type: a string describing the item. Although arbitrary names are generally allowed, PDF/A-3 compliance requires that a standard type be used. The standard types include: Document, Part, Art, Sect, Div, Caption, TOC (table of contents), TOCI (table of contents item), Index, NonStruct, Private, H, H1-H6 (headers), P (paragraph), L (list), LI (list item), Lbl (label), LBody (list body), Table, TR (table row), TH (table header), TD (table data cell), THead (table header row group), TBody (table body row group), TFoot (table footer row group), Span, Quote, Note, Reference, BibEntry (bibliography reference), Code, Link, Annot, Ruby, Warichu, and others.
- ParentID: the ID of a structure element added by a previous call to the AddStructureElement method. Must be set to 0 for the root element.
- Page: an instance of the PdfPage object associated with this structure element. Can be set to Nothing (null).
- Params: reserved for future use. Must be set to an empty string.
For PDF/A-3 compliance, the PDF document must at least have an empty root structure element. The following code sample generates a PDF/A-3 compliant document. It is almost identical to the previous code sample, but adds an empty root element and replaces the "1B" designation (PDF/A-1B) with "3A" (PDF/A-3) in the metadata:
' PDF/A-3 specific: specify 3A instead of 1B in metadata, add empty structure element
strMetadata = Replace(strMetadata, "<pdfaid:part>1</pdfaid:part>", "<pdfaid:part>3</pdfaid:part>")
strMetadata = Replace(strMetadata, "<pdfaid:conformance>B</pdfaid:conformance>", "<pdfaid:conformance>A</pdfaid:conformance>")
Doc.AddStructureElement Nothing, "Document", 0, Nothing, ""
...
// PDF/A-3 specific: specify 3A instead of 1B in metadata, add empty structure element
strMetadata = strMetadata.Replace("<pdfaid:part>1</pdfaid:part>", "<pdfaid:part>3</pdfaid:part>");
strMetadata = strMetadata.Replace("<pdfaid:conformance>B</pdfaid:conformance>", "<pdfaid:conformance>A</pdfaid:conformance>");
objDoc.AddStructureElement( null, "Document", 0, null, "" );
...
The following code snippet demonstrates the creation of a meaningful logical structure tree with the document element at the root, a section element beneath it, and a header and two paragraph elements within that section element. It also adds a FileAttachment annotation to the document and a structure element for this annotation, which is required for PDF/A-3 compliance:
Set Doc = PDF.CreateDocument
DocID = Doc.AddStructureElement(Nothing, "Document", 0, Nothing, "")
SectID = Doc.AddStructureElement(Nothing, "Sect", DocID, Nothing, "")
Set Page = Doc.Pages.Add
HeaderID = Page.Canvas.BeginMarkedContent
Page.Canvas.DrawText "Header", "x=10; y=700", Doc.Fonts("Arial")
Page.Canvas.EndMarkedContent
P1ID = Page.Canvas.BeginMarkedContent
Page.Canvas.DrawText "Paragraph 1", "x=10; y=500", Doc.Fonts("Arial")
Page.Canvas.EndMarkedContent
P2ID = Page.Canvas.BeginMarkedContent
Page.Canvas.DrawText "Paragraph 2", "x=10; y=300", Doc.Fonts("Arial")
Page.Canvas.EndMarkedContent
Doc.AddStructureElement HeaderID, "H", SectID, Page, ""
Doc.AddStructureElement P1ID, "P", SectID, Page, ""
Doc.AddStructureElement P2ID, "P", SectID, Page, ""
Set Annot = Page.Annots.Add("", "x=1,y=700;width=10;height=10; Type=FileAttachment; Names=true; AF=0;", "Paperclip", "c:\path\factur-x.xml")
Doc.AddStructureElement annot, "Annot", DocID, Page, ""
...
IPdfManager objPdf = new PdfManager();
IPdfDocument objDoc = objPdf.CreateDocument( Missing.Value );
int DocID = objDoc.AddStructureElement(null, "Document", 0, null, "");
int SectID = objDoc.AddStructureElement(null, "Sect", DocID, null, "");
IPdfPage objPage = objDoc.Pages.Add();
int HeaderID = objPage.Canvas.BeginMarkedContent();
objPage.Canvas.DrawText( "Header", "x=10; y=700", objDoc.Fonts["Arial"] );
objPage.Canvas.EndMarkedContent();
int P1ID = objPage.Canvas.BeginMarkedContent();
objPage.Canvas.DrawText( "Paragraph 1", "x=10; y=500", objDoc.Fonts["Arial"] );
objPage.Canvas.EndMarkedContent();
int P2ID = objPage.Canvas.BeginMarkedContent();
objPage.Canvas.DrawText( "Paragraph 2", "x=10; y=300", objDoc.Fonts["Arial"] );
objPage.Canvas.EndMarkedContent();
objDoc.AddStructureElement( HeaderID, "H", SectID, objPage, "" );
objDoc.AddStructureElement( P1ID, "P", SectID, objPage, "" );
objDoc.AddStructureElement( P2ID, "P", SectID, objPage, "" );
IPdfAnnot objAnnot = objPage.Annots.Add("", "x=1,y=700;width=10;height=10; Type=FileAttachment; names=true; AF=0;", "Paperclip", @"c:\path\factur-x.xml");
objDoc.AddStructureElement( objAnnot, "Annot", DocID, objPage, "" );
...