Mimotek Structuriser 2.0
Many clipping applications work with a bitmap representation of the page. Why does Mimotek not do that?
Mimotek’s technology works directly with the native graphical objects in a PDF file. This has many advantages over an approach that manipulates a page bitmap.
In a ‘real’ PDF file the text is represented by instructions to draw a particular glyph from a particular font at a particular position on the page. So by accessing the text directly within the PDF file, the character codes can be exported without any need to use OCR to guess the character from its shape (as is necessary when extracting text from a bitmap). This results in much higher quality extracted text.
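As a rough illustration of why no OCR is needed, the sketch below models each text-drawing instruction as a small record and rebuilds the text by reading the character codes in reading order. The `GlyphOp` record and `extract_text` function are hypothetical names invented for this example, not part of Mimotek's software, and the sketch assumes that each character code maps directly to a Unicode code point, which real PDF fonts do not guarantee.

```python
from typing import NamedTuple

class GlyphOp(NamedTuple):
    """Hypothetical record for one 'draw this glyph here' instruction."""
    code: int   # character code stored in the file (assumed to equal the Unicode code point)
    font: str   # font name
    x: float    # horizontal position on the page
    y: float    # baseline position; PDF y coordinates grow upwards

def extract_text(ops):
    """Group glyphs by baseline, then read each line left to right, top line first."""
    lines = {}
    for op in ops:
        lines.setdefault(op.y, []).append(op)
    out = []
    for y in sorted(lines, reverse=True):          # higher y is nearer the top of the page
        glyphs = sorted(lines[y], key=lambda g: g.x)
        out.append("".join(chr(g.code) for g in glyphs))
    return "\n".join(out)
```

The characters come straight from the stored codes, so nothing has to be guessed from the glyph shapes.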
Further, a bitmap clipping will always be a bitmap. It is not searchable, cannot be zoomed without losing quality and does not necessarily print well. A ‘real’ PDF clipping, on the other hand, is searchable and can be zoomed and printed at high quality.
Lastly, manual editing of bitmaps relies on drawing boundaries around areas of the page. In Mimotek’s software the individual graphical objects are referenced directly, with no need to draw boundaries.
What do you mean by a ‘real’ PDF file?
This is a PDF file in which full use is made of the rich imaging model of PDF. The text is defined by character codes and high-quality outline fonts, graphics are drawn using vectors and images are included at their native resolution. Almost all professionally produced PDFs follow that model.
By way of contrast, some PDFs simply contain a bitmap of the page, embedded in a PDF wrapper.
Can Mimotek Structuriser process page bitmaps if necessary?
Yes. For example, if the only source material is hard copy, you can create a bitmap PDF using a scanner, and Structuriser can process it. However, this workflow cannot take advantage of the increased quality that would come from the ‘real’ PDF workflow.
The bitmap PDF workflow can also be used as a fall-back in the rare cases where a ‘real’ PDF does not contain the information necessary to extract the content accurately.
What is a ‘Structured PDF file’?
A Structured PDF file contains both a description of the appearance of the document, and a structure tree that describes the logical structure of the document within the same file.
The structure tree is similar in principle to an XML description of the page, and we use it to record how the page content is organised into articles, text blocks, text lines, images etc. Since it describes both appearance of the page and its logical structure, a Structured PDF file contains everything that is required to decompose the page contents and export them in some other format. For example, the structure tree allows all the text blocks in a particular article to be identified, and the relevant text to be output as an XML file or as a PDF clipping.
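To make the idea concrete, here is a minimal sketch that models a structure tree as XML and collects the text blocks belonging to one article. The element and attribute names (`Page`, `Article`, `TextBlock`, `TextLine`, `id`) are invented for this example. They are not the tag names used by Mimotek or by the PDF specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure tree for one page; tag names are illustrative only.
STRUCTURE = """
<Page number="1">
  <Article id="a1">
    <TextBlock role="headline"><TextLine>Council approves budget</TextLine></TextBlock>
    <TextBlock role="body">
      <TextLine>The council voted on</TextLine>
      <TextLine>Tuesday to approve it.</TextLine>
    </TextBlock>
    <Image ref="img-7"/>
  </Article>
  <Article id="a2">
    <TextBlock role="body"><TextLine>Weather: rain later.</TextLine></TextBlock>
  </Article>
</Page>
"""

def article_text_blocks(tree_xml, article_id):
    """Return the text of each text block in the given article, in document order."""
    root = ET.fromstring(tree_xml)
    article = root.find(f"./Article[@id='{article_id}']")
    if article is None:
        return []
    blocks = []
    for block in article.findall("TextBlock"):
        # Join the lines of a block with spaces; images and other children are skipped.
        lines = [line.text or "" for line in block.findall("TextLine")]
        blocks.append(" ".join(lines))
    return blocks
```

Calling `article_text_blocks(STRUCTURE, "a1")` returns the headline and body text of the first article only, which is the kind of selection that an export to XML or to a PDF clipping depends on.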
Below is an example of the dual representation of a segmented newspaper page: visual representation on the right and structure tree on the left.
Is Structured PDF a proprietary format?
No. Adding a structure tree to a PDF file (resulting in files known as ‘Tagged PDF’) is part of the PDF specification (ISO 32000). In general the structure tree is optional, and is required only for PDF files that conform to the PDF/A or PDF/UA standards. The vast majority of PDF files contain no such information.
Doesn’t having two representations of the document in one file lead to bloated files?
No. The implementation is such that the structure tree typically adds no more than 10% to the size of the PDF file.
What hardware/software environment is necessary to run Mimotek Structuriser?
The software runs on Windows XP / 7 / 8 or Windows Server 2003 / 2008 / 2012.
The Structuriser Server component has no special hardware requirements, but its throughput will be improved on hardware with a faster processor, more RAM and/or multi-core processors.
Structuriser ClipEdit requires at least a dual-core processor, and at least 4 GB of RAM is recommended.
Are there any recommendations for the monitor?
The ClipEdit user interface is best suited for a wide screen display (e.g. 16:9 or 8:5 aspect ratios), although it is also usable on a legacy 4:3 format screen.
ClipEdit also supports a dual monitor configuration, which is particularly helpful if you work with page spreads. In this configuration, the page (or spread) window is located on the left-hand display, and the article (clipping) window on the right-hand one.
What about Unix or Mac OS?
These environments are not currently supported.
Can this technology be integrated into my existing system?
Certainly. Many of Mimotek’s tools are available as Windows DLLs that you can call from your own code. For example, the ClipEdit user interface can be displayed in a window of your own application.
What are the output options?
The system can be configured to generate any combination of the following:
- PDF article clippings, optionally packaged ready for distribution on A4 (or other) sheets with metadata, logos, page thumbnails etc
- Text content, marked up in various XML formats including:
  - NITF (suitable for Kindle)
  - ePub (suitable for iPad)
  - NewsML
  - METS/ALTO
  - Other XML schemas according to the customer’s requirements
- Extracted images (JPEG, JPEG2000)
- Image representations (JPEG, JPEG2000, PNG, GIF etc) of pages or articles
- Page thumbnails, optionally with article locations highlighted
…and of course, the (edited) page-level segmentation structure is stored in the Structured PDF file.
How does Structuriser deal with articles that continue from one page to another?
If an article is printed on both pages of a spread, the clipping can be extracted as one unit from the spread.
When the article continues on another page, a link between the two parts is created during the manual correction process. The PDF clipping then consists of a single file with two (or more) pages, each containing one part of the complete article. A single XML file is generated for the whole article.
Can clippings of over-size articles be formatted to fit on A4 sheets?
The PDF clippings can be automatically ‘reshaped’ to fit within a user-specified rectangular area. The approach is to re-arrange the article by moving complete lines of text and images so that the story is reflowed onto as many sheets as are required. Since the text maintains its fonts, point sizes and line-breaks, the original look and feel of the article are preserved, while the clipping can be delivered on whatever paper size is required.
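The core idea of moving complete lines and images onto as many sheets as needed can be sketched as a simple greedy fill. This is a deliberately simplified, single-column illustration and not Mimotek's actual reshaping algorithm. The `Item` record and `reflow` function are hypothetical names, and heights are assumed to be known in points.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One unit that is moved whole: a complete text line or an image (hypothetical model)."""
    label: str     # the text of the line, or a name for an image
    height: float  # vertical space the item needs, in points

def reflow(items, sheet_height):
    """Place whole items onto sheets in order, starting a new sheet when one is full.

    Items are never split or re-broken, so each line keeps its original line-breaks.
    """
    sheets, current, used = [], [], 0.0
    for item in items:
        if item.height > sheet_height:
            raise ValueError(f"{item.label!r} is taller than a sheet")
        if current and used + item.height > sheet_height:
            sheets.append(current)
            current, used = [], 0.0
        current.append(item)
        used += item.height
    if current:
        sheets.append(current)
    return sheets
```

A production layout would also need to handle multiple columns, image placement and spacing, which is where the manual adjustment described in this answer comes in.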
The automatic algorithms that control the reshaping of clippings work well with regularly designed articles, but when an article has graphically rich features, it is more difficult to carry out successful reshaping without human intervention. The ClipEdit application allows an operator to adjust the layout of such clippings manually.
The clippings can be delivered on complete pages that also include metadata, branding items (logos or mastheads), copyright notices and page thumbnails.
I have a requirement to extract text as a specific form of XML. Is that possible?
The software allows you to define an XML template that is used to generate many customised XML variants. If this does not satisfy your requirement, we can develop a specific XML driver according to your specification.
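The general idea of template-driven XML output can be sketched with Python's standard library. The template syntax, placeholder names and element names below are assumptions made for this illustration. They do not represent Structuriser's own template format.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical output template; $headline and $paragraphs are filled in per article.
TEMPLATE = Template(
    "<article>\n"
    "  <headline>$headline</headline>\n"
    "  <body>\n$paragraphs\n  </body>\n"
    "</article>"
)

def render_article(headline, paragraphs):
    """Fill the template, escaping &, < and > so the result stays well-formed XML."""
    paras = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return TEMPLATE.substitute(headline=escape(headline), paragraphs=paras)
```

Changing only the template text yields a different XML variant from the same extracted content, which is the point of a template-based approach.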
How accurate is the extracted text?
The text is extracted directly from the PDF file, so there is no unreliable OCR process to introduce errors. Unless there are encoding errors in the PDF fonts, the extracted text is 100% accurate at the character level.
Newspaper typography can be challenging, since the short line lengths lead to wide variations in word- (and letter-) spacing and to heavy use of hyphenation. Mimotek’s code has been tuned to take account of these issues and we use spell-checking to resolve ambiguities, leading to a very high level of accuracy at the word level. Finally, words that do not appear in the spelling dictionary are highlighted in ClipEdit’s XML editor, so the operator is alerted to possible word-spacing problems.
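One way a spelling dictionary can help with uncertain word spacing is sketched below. This is an illustrative heuristic under simple assumptions, not Mimotek's tuned code. The `resolve_gap` function and the idea of passing the dictionary as a plain set of lower-case words are hypothetical.

```python
def resolve_gap(left, right, dictionary):
    """Decide whether an uncertain gap between two fragments is a real word space.

    Returns (words, needs_review): the chosen reading, and whether an operator
    should check it because the dictionary could not settle the question.
    """
    joined = left + right
    split_ok = left.lower() in dictionary and right.lower() in dictionary
    joined_ok = joined.lower() in dictionary
    if joined_ok and not split_ok:
        return [joined], False          # only the joined reading is a known word
    if split_ok and not joined_ok:
        return [left, right], False     # only the two-word reading is known
    # Both readings or neither reading is known: keep the space and flag it.
    return [left, right], True
```

For example, with a dictionary containing "council" but not "coun" or "cil", a loosely spaced "coun cil" is joined, while a case such as "to day" versus "today" is flagged for review, much as unknown words are highlighted for the operator.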
Can the process be made entirely automatic, without a manual correction stage?
Structuriser Server segments pages automatically, with no manual intervention. The accuracy of this process depends crucially on the page layout.
Pages that have a regular layout (for example, above left) can be segmented automatically to a high degree of accuracy. But a more creatively designed page (above right) will be more challenging. Indeed on such pages it is often difficult for a human to decide which combination of text and graphics goes together to make up a story! So the success of automatic segmentation (with no manual intervention) varies considerably between publications, and even between different pages in the same publication. If it is vital that absolute accuracy be achieved, it will usually be necessary to view each page in Structuriser ClipEdit and make corrections as required.
Do I always need to use both Structuriser Server and ClipEdit?
In a production environment, you would normally use Structuriser Server to perform a ‘first-pass’ automatic segmentation, and then use Structuriser ClipEdit to view the structure and correct it if necessary.
Alternatively, you can skip the first stage and add the structure within ClipEdit. This takes slightly longer, but is useful for testing, training etc.
For some publications you may also be able to use the output of Structuriser Server, without correcting it in ClipEdit. See the previous entry for details.
Why does Mimotek build its software around libraries licensed from Adobe Systems?
Using Adobe’s PDF Library to perform basic tasks like reading, writing, parsing and rasterising PDF files has many advantages:
- This is the same code as Adobe uses in its Acrobat product, so we can take advantage of the massive engineering investment that Adobe has made in that application. In particular, the libraries include many fixes for dealing with corrupt or damaged PDF files, so Mimotek products automatically inherit that resilience.
- Compatibility with the latest version of PDF is guaranteed.
- Upgrading to each new version of PDF is simply a matter of integrating a new release of PDF Library. This gives us access to new features automatically, with no development effort on our part.
- Our developers do not need to spend time on supporting the tasks that are handled by PDF Library. This means that they can concentrate on developing product functionality, rather than fixing issues concerning the reading and writing of PDF files.