Some metadata stripped from PDF/A files

Forum for the PDF-XChange Editor - Free and Licensed Versions

DIV
User
Posts: 252
Joined: Fri Jun 23, 2017 1:47 am

Some metadata stripped from PDF/A files

Post by DIV »

Hello.

I am reporting what may be a bug with certain metadata fields being reproducibly stripped from PDF files when they are saved as a form of PDF/A format (specifically, PDF/A-Na, with N=1 or 2 or 3).

I annotated a scanned PDF with metadata in PDF-XChange Editor (originally in version 6.0, but here in version 7.0).

The metadata was entered via the "Additional Metadata" button in the Document Properties dialogue box, in the "Description" Category, as follows:
  1. Document Title — maps to dc:title; retained in PDF/A
  2. Author — maps to dc:creator; retained in PDF/A
  3. Author Title — maps to xmp:AuthorsPosition; stripped from PDF/A
  4. Description — maps to dc:description; retained in PDF/A
  5. Description Writer — maps to xmp:CaptionWriter; stripped from PDF/A
  6. Keywords — maps to dc:subject(!!); stripped from PDF/A
  7. Copyright Status — maps to xmpRights:Marked; retained in PDF/A
  8. Copyright Notice — maps to dc:rights; retained in PDF/A
  9. Copyright Info URL — maps to xmpRights:WebStatement; retained in PDF/A
When viewing the "Advanced" Category to see the XMP structure, it is notable that items 1, 2, 4, 6 and 8 are stored in the dc namespace; the mapping I did not expect is flagged with "(!!)" in the list above. Note that 'Description' is alternatively called 'Subject' in the "Document Info" display in the main Document Properties dialogue box, which would suggest dc:subject, yet it is item 6 (Keywords) that maps to dc:subject!
Other items map to the xmp or xmpRights namespaces.

Note further that item 6 does not correspond to 'Keywords' in the "Document Info" display in the main Document Properties dialogue box (so there's an inconsistency there). The latter instead maps to pdf:Keywords — which is retained in PDF/A.
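
For anyone who wants to tabulate the above, here is a minimal sketch in Python. The data is simply the list reported in this post (PDF-XChange Editor 7.0, build 328.1); the names DESCRIPTION_FIELDS, stripped_properties and properties_in_namespace are made up for illustration and are not part of any tool.

```python
# Mapping as reported in this post; not taken from any official documentation.
# Tuple layout: (GUI field, XMP property, retained after saving as PDF/A?)
DESCRIPTION_FIELDS = [
    ("Document Title", "dc:title", True),
    ("Author", "dc:creator", True),
    ("Author Title", "xmp:AuthorsPosition", False),
    ("Description", "dc:description", True),
    ("Description Writer", "xmp:CaptionWriter", False),
    ("Keywords", "dc:subject", False),
    ("Copyright Status", "xmpRights:Marked", True),
    ("Copyright Notice", "dc:rights", True),
    ("Copyright Info URL", "xmpRights:WebStatement", True),
]


def stripped_properties(fields):
    """Return the XMP properties reported as lost in the PDF/A copy."""
    return [prop for _label, prop, retained in fields if not retained]


def properties_in_namespace(fields, prefix):
    """Return the XMP properties whose namespace prefix equals `prefix`."""
    return [prop for _label, prop, _retained in fields
            if prop.split(":", 1)[0] == prefix]


print(stripped_properties(DESCRIPTION_FIELDS))
print(properties_in_namespace(DESCRIPTION_FIELDS, "dc"))
```

Laid out this way, it is easy to see that the stripped fields do not all belong to one namespace, and that other dc entries survive the conversion.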

dc:format is (rightly) set to "application/pdf", and cannot be amended — and, of course, is retained in PDF/A.
Likewise visible in the "Document Info" display in the main Document Properties dialogue box — for PDF files in general — are
  • PDF Producer — maps to pdf:Producer; retained in PDF/A
  • Application — maps to xmp:CreatorTool; retained in PDF/A
  • PDF Version — doesn't map to any stored metadata in the XMP structure
  • Created — maps to xmp:CreateDate; retained in PDF/A
  • Modified — maps to xmp:ModifyDate; retained in PDF/A
  • Page Count — doesn't map to any stored metadata in the XMP structure
  • Page Size — doesn't map to any stored metadata in the XMP structure
  • PDF-XChange — no data; not sure where this maps to, if anywhere
These items variously map to the xmp or pdf namespaces, although a few are unmapped (presumably they are either generated on-the-fly by the PDF viewer/editor, or else are stored in the 'header' of a PDF file, but not in xmp-structured metadata per se).

When exporting a file as PDF/A, some of this metadata is stripped out. I have used PDF-XChange Editor in both version 6.0 and version 7.0 (build 328.1), with apparently the same behaviour. Detailed testing was with version 7.0.
For Conformance I have chosen under Options variously PDF/A-1a, PDF/A-2a, and PDF/A-3a — with seemingly no difference.

As indicated above, three of the metadata fields are stripped out:
  • Author Title / xmp:AuthorsPosition
  • Description Writer / xmp:CaptionWriter
  • Keywords / dc:subject
At first I wondered whether this was due to some restriction in the PDF/A specification.
However, given that other fields are retained, I am more inclined to believe that it is a bug.
And, indeed, from a quick reading of the Technical Note linked to above, it seems that one may include any metadata one likes in a PDF file, provided it's correctly structured and follows a suitable schema.
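
One way to check which properties disappear, independently of any particular viewer, is to pull the XMP packet out of each file and compare the property names. The following is only a rough sketch under several assumptions: the packet is stored uncompressed and unencrypted, the first packet found is the document-level one, and nested rdf:Description elements are simply treated like top-level ones. The function names and the tiny sample byte strings are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: the XMP packet is stored as plain (uncompressed) text in the file.
XMPMETA_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def xmp_properties(pdf_bytes):
    """Return the '{namespace-uri}name' properties of the first XMP packet."""
    match = XMPMETA_RE.search(pdf_bytes)
    if match is None:
        return set()
    root = ET.fromstring(match.group(0))
    props = set()
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # Properties written as child elements of rdf:Description.
        for child in desc:
            props.add(child.tag)
        # Properties written as attributes (compact RDF form); skip rdf:about.
        for attr in desc.attrib:
            if attr.startswith("{") and not attr.startswith(f"{{{RDF_NS}}}"):
                props.add(attr)
    return props


def dropped_properties(original_bytes, converted_bytes):
    """Return the properties present in the original but not in the converted file."""
    return xmp_properties(original_bytes) - xmp_properties(converted_bytes)


# Invented sample data standing in for real files.
_HEAD = (b'%PDF-1.7\n<x:xmpmeta xmlns:x="adobe:ns:meta/">'
         b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
         b'<rdf:Description rdf:about="" '
         b'xmlns:dc="http://purl.org/dc/elements/1.1/" '
         b'xmlns:pdf="http://ns.adobe.com/pdf/1.3/" pdf:Keywords="k">'
         b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">T</rdf:li>'
         b'</rdf:Alt></dc:title>')
_TAIL = b'</rdf:Description></rdf:RDF></x:xmpmeta>\n%%EOF'
SAMPLE_BEFORE = (_HEAD
                 + b'<dc:subject><rdf:Bag><rdf:li>k</rdf:li></rdf:Bag></dc:subject>'
                 + _TAIL)
SAMPLE_AFTER = _HEAD + _TAIL

print(sorted(dropped_properties(SAMPLE_BEFORE, SAMPLE_AFTER)))
```

With real files one would read the bytes first, e.g. with pathlib.Path(...).read_bytes(), and pass the original and the PDF/A copy to dropped_properties.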

Yours sincerely,
DIV
Last edited by DIV on Sat Feb 23, 2019 2:19 am, edited 3 times in total.

Searching fora

Post by DIV »

P.S. As a side note, I found it was not possible to successfully search these fora for the term "PDF/A" (entered with or without quotation marks). I mostly got results that contained only "PDF".

More metadata stripped from PDF/A files

Post by DIV »

Metadata beyond that which can be directly entered within PDF-XChange Editor (version 7.0) may also (typically?) be removed when undertaking a "Save As" operation with "Save as type" set to "PDF/A document (*.pdf)" (with option e.g. PDF/A-1a).

For instance, the PDF specifying the PRISM version of DC contains two items of metadata in the pdfx namespace, and one new entry in the xmp namespace:
  • pdfx:SourceModified (entry is "D:20121212185446", which appears to be a date–time string for 18:54:46 on 12 December 2012) — dropped in PDF/A
  • pdfx:Company (entry is "Hewlett-Packard") — dropped in PDF/A
  • xmp:MetadataDate — retained in PDF/A
amongst some other things.
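
As a quick check on how to read the pdfx:SourceModified value, here is a small Python sketch that decodes the leading "D:YYYYMMDDHHmmSS" part of a PDF-style date string. It assumes all 14 digits are present and ignores any time-zone suffix; parse_pdf_date is a name made up for this example.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Parse the leading 'D:YYYYMMDDHHmmSS' part of a PDF date string.

    Simplification: requires all 14 digits and ignores any time-zone suffix.
    """
    if value.startswith("D:"):
        value = value[2:]
    return datetime.strptime(value[:14], "%Y%m%d%H%M%S")


stamp = parse_pdf_date("D:20121212185446")
print(stamp.isoformat())  # 2012-12-12T18:54:46
```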

When using PDF-XChange Editor to save this as a PDF/A-1a file, the two pdfx entries are dropped, but the xmp:MetadataDate entry is retained.
By the way, there is insufficient detail in the "Conversion Report" to diagnose why this might be happening.

Yours sincerely,
DIV

Sample files

Post by DIV »

Attached are a couple of pairs of PDF files to illustrate the above points.

Document one ("Doc1"):
  I. "original" created 'from scratch' in PDF-XChange Editor 7.0 (build 328.1). OS: Windows 8.1 x64. Attached.
  II. archival version produced by File > Save As etc. in PDF-XChange Editor, with conformance set to PDF/A-2b. OS: Windows 8.1 x64. Attached.
File I contains xmp:AuthorsPosition, xmp:CaptionWriter and dc:subject, but these have been removed in file II.


Document two ("Doc2"):
  I. 'temporary' PDF created by composing text in Microsoft Word, then producing PDF using Acrobat Distiller 9.0.0 / Acrobat PDFMaker 9.1 for Word. Not attached.
  II. 'final' PDF produced in PDF/A-1b archival format using Callas pdfaPilot 1.3 (080). OS: Windows XP SP3. creator: PScript5.dll Version 5.2.2. producer = Acrobat Distiller 7.0.5 (Windows). Not attached.
  III. "resaved original" = all pages deleted except one, and content of some metadata entries adjusted in Adobe Acrobat 7.0.0 Standard; file then simply saved. (Not "Save As".) Conformance to PDF/A-1b is still claimed. OS: Windows 8.1 x64. Attached.
  IV. opened above file in PDF-XChange Editor 7.0 (build 328.1) and produced archival version through File > Save As etc. OS: Windows 8.1 x64. Attached.
File II contains copious amounts of metadata, including 30 separate entries under the pdfx namespace (schema ns.adobe.com/pdfx/1.1/). These entries are all retained in file III, i.e. after deleting all but one page in Acrobat Standard and simply saving. However, every one of those pdfx:* entries has been removed in file IV, along with dc:subject (xmp:AuthorsPosition & xmp:CaptionWriter were never present).
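
To summarise this kind of loss per namespace, something along the following lines could be used. It is only a sketch: dropped_by_prefix is a hypothetical helper, and the two example sets are illustrative property names mentioned in this thread, not the actual contents of the Doc2 files.

```python
from collections import Counter


def dropped_by_prefix(before, after):
    """Count, per namespace prefix, the properties in `before` missing from `after`."""
    dropped = set(before) - set(after)
    return Counter(name.split(":", 1)[0] for name in dropped)


# Illustrative property lists only; not the real contents of files III and IV.
example_before = {"dc:title", "dc:subject", "pdfx:Company",
                  "pdfx:SourceModified", "xmp:MetadataDate"}
example_after = {"dc:title", "xmp:MetadataDate"}

for prefix, count in sorted(dropped_by_prefix(example_before, example_after).items()):
    print(prefix, count)
```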

—DIV

P.S. As a very minor point, it is interesting to note that Acrobat 7.0.0 associates "Description Writer" in the GUI with photoshop:CaptionWriter, rather than xmp:CaptionWriter.
Attachments
V2008_print—PDFA1bRGB9HR_pageB_PDF-XChangePDFA-1b.pdf
Doc2_singlePage (IV) — PDF-XChange PDFA-1b
(216.38 KiB) Downloaded 60 times
V2008_print—PDFA1bRGB9HR_pageB.pdf
Doc2_singlePage (III) — resaved original PDFA-1b
(131.51 KiB) Downloaded 51 times
DocumentWithMetadata.PDF-XChange_A-2b.pdf
Doc1 (II) — PDFA-2b
(34.46 KiB) Downloaded 60 times
DocumentWithMetadata.pdf
Doc1 (I) — original
(6.5 KiB) Downloaded 55 times
Last edited by DIV on Fri Mar 15, 2019 12:43 pm, edited 2 times in total.
Tracker Supp-Stefan
Site Admin
Posts: 17893
Joined: Mon Jan 12, 2009 8:07 am
Location: London

Re: Some metadata stripped from PDF/A files

Post by Tracker Supp-Stefan »

Hello DIV,

Thanks for the sample files and the report. I've asked colleagues in the dev team to take a look, and we will post an update here as soon as we have any news on this case!

Regards,
Stefan