Mimotek’s technology can be applied to a wide range of tasks that require manipulation of the content of PDF files. Applications that have been implemented include:
- Secure redaction of sensitive content from official documents
- Extraction of images and text from consumer catalogues for use in web and smartphone environments
- Creation of customised academic textbooks by combining chapters from a series of source books. Both pages and chapters are renumbered, and a new table of contents is created automatically
- Extraction of article text from newspapers to populate an archive
- Extraction of digital clippings from newspapers
- Creation of ‘e-newspapers’ to allow online reading of the print edition
For further information about how you may be able to use our technology to meet your requirements, please contact Mimotek.
Mimotek Structuriser is a suite of tools that adds logical structure to a PDF file of a newspaper or magazine page. It segments the page into its constituent articles, identifying individual objects such as lines of text, images and artwork.
The workflow involves three stages:
- An automatic segmentation tool processes the PDF page. It uses visual clues to make a best guess at how the page is divided into articles. The results are added to the PDF file to produce a structured PDF file.
- Since the first stage may not have resulted in totally correct segmentation, the structured PDF file is displayed in an editor for manual correction. The corrected structure is written back into the structured PDF file.
- The content is extracted from the structured PDF file in whatever format is required.
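The three stages above can be sketched as a small data model. This is an illustrative sketch only, not Mimotek's implementation: the class and function names are hypothetical, the page is reduced to bounding boxes with y increasing down the page, and the "visual clue" used in stage 1 is a deliberately naive vertical-gap rule chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class PageObject:
    """One object on the page (text line, image, artwork) with its bounding box."""
    obj_id: str
    kind: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class StructuredPage:
    """A page plus its logical structure: object id -> article number."""
    objects: list
    article_of: dict = field(default_factory=dict)


def auto_segment(page, max_gap=20.0):
    """Stage 1 (toy heuristic, an assumption): reading top to bottom, start a
    new article whenever the vertical gap to the previous objects exceeds max_gap."""
    article = 0
    bottom = None
    for obj in sorted(page.objects, key=lambda o: o.y0):
        if bottom is not None and obj.y0 - bottom > max_gap:
            article += 1
        page.article_of[obj.obj_id] = article
        bottom = obj.y1 if bottom is None else max(bottom, obj.y1)


def apply_corrections(page, corrections):
    """Stage 2: the operator's edits override the automatic guess."""
    page.article_of.update(corrections)


def extract_articles(page):
    """Stage 3: export object ids grouped by article number."""
    articles = {}
    for obj in page.objects:
        articles.setdefault(page.article_of[obj.obj_id], []).append(obj.obj_id)
    return articles


page = StructuredPage(objects=[
    PageObject("head_a", "text", 0, 0, 300, 30),
    PageObject("body_a", "text", 0, 35, 300, 100),
    PageObject("head_b", "text", 0, 140, 300, 170),
    PageObject("body_b", "text", 0, 175, 300, 240),
    PageObject("caption_b", "text", 0, 270, 300, 280),
])
auto_segment(page)                         # caption_b is wrongly split off as article 2
apply_corrections(page, {"caption_b": 1})  # manual fix: it belongs to article 1
articles = extract_articles(page)
```

In the example, the automatic pass separates the caption from its article because of the large gap above it, and the manual correction reattaches it before extraction.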
There are two main software components:
Mimotek Structuriser Server
This is an automatic process that monitors a hot folder and processes any PDF file that it finds there. Its primary purpose is to segment PDF pages into articles, to produce a structured PDF file.
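As a rough illustration of the hot-folder pattern described above, the following sketch polls a folder, hands each PDF to a handler, and moves it out of the way. It is not the Structuriser Server code: the function names, the `done` folder, the polling interval and the `handler` callback are all assumptions made for the example.

```python
import shutil
import time
from pathlib import Path


def process_hot_folder(hot, done, handler):
    """Process every PDF currently in `hot`, then move it into `done`.

    `handler` is a hypothetical callback standing in for the real work
    (for example, segmenting the page). Returns the processed file names.
    """
    hot, done = Path(hot), Path(done)
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf in sorted(hot.glob("*.pdf")):
        handler(pdf)
        shutil.move(str(pdf), str(done / pdf.name))
        processed.append(pdf.name)
    return processed


def watch(hot, done, handler, interval=5.0, max_cycles=None):
    """Poll the hot folder every `interval` seconds.

    With max_cycles=None this runs until interrupted.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        process_hot_folder(hot, done, handler)
        cycles += 1
        time.sleep(interval)
```

A production watcher would also need to avoid picking up files that are still being copied into the folder, for instance by waiting until a file's size stops changing.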
The server can also be configured to perform one or more of the following actions on the PDF file:
- Split multi-page PDFs into single page files.
- Convert between single newspaper pages, readers’ spreads and printers’ spreads.
- Optimise the size of a PDF file. This involves reducing the file size by down-sampling images and using the appropriate compression techniques.
- Normalise a page by removing any inbuilt rotation.
- Extract content from a structured PDF file.
- Rasterise a page to generate a bitmap (for example for use as a page thumbnail).
- Adjust page margins.
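To make the difference between readers' spreads and printers' spreads concrete, the sketch below computes which page numbers sit side by side in each layout. It assumes a saddle-stitched publication whose page count is a multiple of four; the source does not say which imposition schemes the server supports, and the function names are hypothetical.

```python
def readers_spreads(n_pages):
    """Pages as a reader sees them: front cover alone, then facing pairs,
    then the back cover alone."""
    if n_pages < 4 or n_pages % 4:
        raise ValueError("page count must be a positive multiple of 4")
    spreads = [(1,)]
    spreads += [(p, p + 1) for p in range(2, n_pages, 2)]
    spreads.append((n_pages,))
    return spreads


def printers_spreads(n_pages):
    """Pages as they are printed side by side for saddle stitching
    (an assumed imposition scheme): outermost pages pair up first."""
    if n_pages < 4 or n_pages % 4:
        raise ValueError("page count must be a positive multiple of 4")
    spreads = []
    low, high = 1, n_pages
    while low < high:
        spreads.append((high, low))          # front of the sheet
        spreads.append((low + 1, high - 1))  # back of the sheet
        low += 2
        high -= 2
    return spreads


print(readers_spreads(8))   # [(1,), (2, 3), (4, 5), (6, 7), (8,)]
print(printers_spreads(8))  # [(8, 1), (2, 7), (6, 3), (4, 5)]
```

Only the centre spread, (4, 5) in an eight-page edition, is the same in both layouts, which is why converting between them requires splitting and recombining single pages.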
Mimotek Structuriser ClipEdit
The accuracy of the segmentation calculated automatically by Structuriser Server depends on the details of the page design. For a regular page the segmentation is generally correct, but on more complex pages it is difficult to identify articles accurately without manual intervention.
Structuriser ClipEdit allows an operator to view, and if necessary edit, the result of the segmentation applied by Structuriser Server. ClipEdit reads the structured PDF files created by Structuriser Server and saves any edits applied by the operator back into the structured PDF file.
Mimotek Structuriser, as described above, only requires a PDF file as input. However, if extra information is available it may be possible to automate the page segmentation fully, with no need for manual editing.
Many newspaper editorial systems can export an XML version of each page, article coordinates, or both. Mimotek XMLHelper uses this export as a secondary feed alongside the primary PDF file of the page. While the PDF file describes only the visual appearance of the page, the companion XML file describes the logical structure of the same page. Using the structure definition from the XML to guide the segmentation of the PDF can significantly improve the quality of the automatic segmentation, and so reduce the need for manual correction.
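The idea of letting the XML guide the segmentation can be illustrated with a minimal sketch. The XML layout used here (a `page` element containing `article` elements with `x0`, `y0`, `x1`, `y1` attributes) is invented for the example, since every editorial system exports its own format, and the centre-point rule is one simple way to match objects to articles rather than the method XMLHelper necessarily uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: one bounding box per article, in page coordinates.
SAMPLE_XML = """
<page>
  <article id="a1" x0="0" y0="0" x1="300" y1="400"/>
  <article id="a2" x0="300" y0="0" x1="600" y1="400"/>
</page>
"""


def load_article_boxes(xml_text):
    """Read article id -> (x0, y0, x1, y1) from the assumed XML layout."""
    root = ET.fromstring(xml_text)
    boxes = {}
    for art in root.iter("article"):
        boxes[art.get("id")] = tuple(
            float(art.get(key)) for key in ("x0", "y0", "x1", "y1")
        )
    return boxes


def assign_objects(objects, boxes):
    """Assign each PDF object to the article whose box contains the
    object's centre point; objects outside every box map to None."""
    result = {}
    for name, (x0, y0, x1, y1) in objects.items():
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        result[name] = next(
            (
                art_id
                for art_id, (ax0, ay0, ax1, ay1) in boxes.items()
                if ax0 <= cx <= ax1 and ay0 <= cy <= ay1
            ),
            None,
        )
    return result


pdf_objects = {
    "headline": (10, 10, 290, 40),
    "photo": (310, 50, 590, 200),
    "folio": (700, 10, 720, 20),
}
assignment = assign_objects(pdf_objects, load_article_boxes(SAMPLE_XML))
```

Objects that fall outside every article box, such as the page folio in the example, are left unassigned so that they can be reviewed or discarded.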
While this approach offers reductions in the effort required for manual corrections, it does require extra effort to administer the correct feeds. In particular, it is essential that the PDF and XML feeds are correctly synchronised.
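One simple safeguard for keeping the two feeds in step is to check, before processing, that every PDF page has a matching XML file. The sketch below assumes the feeds share a file stem (for example `2024-05-01_p03.pdf` and `2024-05-01_p03.xml`); that naming convention and the function name are assumptions for illustration only.

```python
from pathlib import Path


def pair_feeds(pdf_names, xml_names):
    """Match PDF and XML files by shared stem (assumed naming convention).

    Returns (paired, pdfs_missing_xml, xmls_missing_pdf).
    """
    pdfs = {Path(name).stem: name for name in pdf_names}
    xmls = {Path(name).stem: name for name in xml_names}
    paired = [(pdfs[s], xmls[s]) for s in sorted(pdfs.keys() & xmls.keys())]
    pdfs_missing_xml = sorted(pdfs[s] for s in pdfs.keys() - xmls.keys())
    xmls_missing_pdf = sorted(xmls[s] for s in xmls.keys() - pdfs.keys())
    return paired, pdfs_missing_xml, xmls_missing_pdf


paired, no_xml, no_pdf = pair_feeds(
    ["p01.pdf", "p02.pdf", "p03.pdf"],
    ["p01.xml", "p03.xml", "p04.xml"],
)
# paired  -> [("p01.pdf", "p01.xml"), ("p03.pdf", "p03.xml")]
# no_xml  -> ["p02.pdf"]
# no_pdf  -> ["p04.xml"]
```

Pages reported as unpaired could fall back to the PDF-only workflow with manual correction.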
For further information, please contact Mimotek.