Structured PDF or Tagged PDF

‘Tagged PDF’ and ‘Structured PDF’ are both terms that describe flavours of PDF that not only allow a digital document to be displayed and/or printed, but also allow meaningful content to be reliably extracted. The details of the two definitions are similar in that they both require (amongst other things) that fonts are embedded, characters are properly encoded, text reading order is preserved and the document’s logical structure is defined.
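As a rough illustration of what "the logical structure is defined" means at file level: the PDF specification has a Tagged PDF declare itself in the document catalog, through a /MarkInfo dictionary containing /Marked true and a /StructTreeRoot entry that points to the structure tree. The Python sketch below is a heuristic only, and the function name looks_tagged is hypothetical. It scans the raw bytes for those two markers and assumes the catalog is stored uncompressed, so it says nothing about the quality of the tags and will miss files that keep the catalog in a compressed object stream.

```python
import re

# Markers defined by the PDF specification: a Tagged PDF's document catalog
# carries a /MarkInfo dictionary with /Marked true, plus a /StructTreeRoot
# entry that references the logical structure tree.
MARKED_RE = re.compile(rb"/MarkInfo\s*<<[^>]*?/Marked\s+true", re.DOTALL)
STRUCT_ROOT_RE = re.compile(rb"/StructTreeRoot\s+\d+\s+\d+\s+R")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes contain both Tagged PDF markers.

    Assumption: the catalog is stored uncompressed. Files that keep it in a
    compressed object stream (PDF 1.5+) are reported as untagged, and a
    /MarkInfo entry given as an indirect reference is not followed.
    """
    has_mark_info = MARKED_RE.search(pdf_bytes) is not None
    has_struct_root = STRUCT_ROOT_RE.search(pdf_bytes) is not None
    return has_mark_info and has_struct_root


# Two hand-written catalog fragments used purely for illustration.
tagged_catalog = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R "
    b"/MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\nendobj\n"
)
plain_catalog = b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(looks_tagged(tagged_catalog))  # True
print(looks_tagged(plain_catalog))   # False
```

For real files a full PDF parser would be the safer choice. The byte scan is used here only to keep the example free of dependencies.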

Tagged PDF is a formal requirement of the PDF/UA (Universal Accessibility) standard and of the level A conformance variants of the PDF/A (Archiving) standard, such as PDF/A-1a, and is becoming increasingly important. Amongst the applications that rely on Tagged PDF are:

  • Screen readers for the visually impaired
  • Long-term archiving of PDF files
  • Reliable text search within PDF files
  • Reflowing of PDF pages on small format devices

Structured PDF is Mimotek’s internal format for segmented document pages. While it is not exactly the same as Tagged PDF, the similarity is striking, and either definition could be used within the Mimotek software. For example, it would be possible to export PDF/A files from Mimotek Structuriser, in which case the Mimotek page segmentation software could be thought of as a post hoc Tagged PDF creation tool for complex documents (such as newspapers).

We shall be keeping an eye on future developments surrounding Tagged PDF and PDF/A.