Modern formats, in brief
By and large, when people share a document (whether it’s formatted text, visual media, or some combination), they believe that when they both look at it, they’re seeing the same thing. Modern users have arguably been taught/coerced into tolerating small differences (color pallets that are varying degrees of rich-to-washed-out, poop emojis with browser-dependent swirls), but in the large, we expect that when we pull up the same file as someone else, we’re looking at the same thing.
The fact that this often holds is somewhat of a minor miracle, given that modern users may be using completely different software stacks, including operating systems, programming languages, and application codebases. Nevertheless, it’s often achieved because
- different development teams have an incentive to make their tools compatible, at least on basic document features;
- formats are often defined independently of any single application codebase by a standards body.
A ready example of such a format ecosystem is the Portable Document Format (a.k.a. “PDF” to basically everyone). PDF has approximately 60 viewers across actively used platforms ; its formal definition is defined by the international standard family known as ISO 32000, with the latest edition being ISO 32000-2:2020 published in December 2020 . The PDF standard is formally developed in the consensus-based, open forums that ISO processes require and that is coordinated and supported by the PDF Association , the peak industry body for PDF. The PDF Association also provides a public errata process  for capturing and providing feedback to implementers.
Despite PDF’s ubiquity and longevity, its readers, writers, and the standard are continuously (if not always dramatically) updated to account for various flaws and inconsistencies. In this post, we’ll describe a relatively major instance of such an inconsistency, which we helped to unearth while developing new tools for defining formal document standards and analyzing extant document processors in the large.
Even though document viewers largely are able to show different people the same thing when they open the same document, there’s plenty of good reasons for things to go wrong. This stems partly from the fact that a well-written standard will typically strive to place as few requirements on a document as possible, in the interest of not overly constraining how document writers and readers work under the hood. This leaves plenty of room for different viewers to get creative in how they process unspecified features. Even when a standard is quite clear about content that a document must not contain, viewers are sometimes compelled to handle it anyway: no one wants to have the PDF viewer that refuses to open a document when there’s another option around that seems to do something reasonable.
Second, it’s extremely hard to write a practical format specification that methodically and completely defines whether every possible sequence of bytes is in a format, and how each document in the format should be rendered. The current PDF 2.0 standard weighs in at 1,000 pages of English, and covers everything from the fundamental structure of a PDF’s pages to graphics, text, fonts, and multimedia (and much, much more). PDF 2.0 initially took over nine years to create after the fast-track adoption of Adobe’s PDF 1.7 specification as the original ISO 32000-1:2008 standard for PDF. After PDF 2.0 was first published in 2017 as ISO 32000-2:2017, it was then updated just three years later to reflect implementation experience, errata, and minor technical changes occurring in related technical areas. With two formal meetings every year, supported by multiple ad hoc discussion groups working on specific tasks and project leaders performing editing between each meeting, PDF 2.0 has been continuously updated and reviewed over a total of 13 years as various implementers consult it, note incompletenesses, and propose solutions. But there’s nothing like a guarantee that the current version has all ambiguities removed once and for all: if it ever had such a guarantee, it would be nothing less than historic.
Recently, as part of a multi-year research effort to improve document security (more on that below), Galois helped the PDF Association to find an ambiguity in the standard that in principle can, and in practice does, cause different PDF viewers to render completely different graphical content in a PDF, and that has existed since the very first Adobe PDF 1.0 specification way back in 1993.
The issue centers on the definition of Inline Images, a construct for including images in the content of a PDF. Like many other data objects in a PDF, inline images are represented as key-value dictionaries: key-value bindings define things like the image’s dimensions, masks and filters that must be applied to it, and colors used to render its content. Unlike all other data objects in a PDF, keys in an inline image may be referred to using either their full name or an abbreviation: e.g., the width of an image might be given using the full name Width or the abbreviation W.
Inline images have always been targeted at smaller images, with Adobe PDF 1.0 explicitly stating “The in-line format should be used only for small images (4K or less) because viewer applications have less flexibility when managing in-line image data.” The motivation for inline image key abbreviations isn’t given in the original Adobe PDF specifications, but we might surmise that they enabled smaller PDF files and were more efficient to compress on computers of the era with far less memory and far less capable CPUs.
Sounds simple enough as a feature, although it raises the question of whether the value bound to a full name or its abbreviation should be used if both occur in the same inline image. Like, say, this PDF does, right here (feel free to download: it doesn’t actually harm your system, it’s just hard to predict what exactly it will display).
The file contains eight images, defined by messing with various fields as follows:
- Only key names are full: all PDF processors should definitely support this one;
- Only key names are abbreviated: definitely shouldn’t see any differences here either;
- Width and Height keys have different and incorrect values compared to the respective abbreviations, W and H;
- The Filter key is being tested with an incorrect value for the inline image data (whereas F is correct). This also likely means the inline image pixel data will confuse the PDF parser and it may not be able to process further test cases;
- The ColorSpace key uses an extreme CalRGB colorspace which will make the image appear different in failing processors, whereas CS uses RGB (Device Red-Green-Blue);
- The Decode array is the linear inverse of D and thus the image will display differently in failing processors;
- The Interpolate key is true and, for those processors that support image interpolation, the image will display differently in failing processors;
- The DecodeParms key is missing key information which will make the image display differently in failing processors.
The upshot of all of these tweaks is that if a PDF viewer always goes by abbreviations when they’re available, then all of the images will get displayed, and they’ll all look the same. In reality, here is what some highly prominent viewers do (names withheld so this doesn’t come off like a call-out, although honestly, it’s not the viewers’ fault: expected behavior simply wasn’t written down anywhere):
Which is all to show that different processors aren’t internally consistent with themselves like we would hope, and they certainly aren’t consistent with each other.
Surprisingly enough, the standard defines no course of action in cases like these. Maybe the more immediate concern is that different PDF viewers clearly end up rendering them in very different ways.
Getting a fix
Once the issue is raised, it’s not hard to patch the standard and implementations to follow the patch. If anything, there are almost too many reasonable patches to choose from:
- The PDF standard states quite explicitly that all data objects must not have multiple occurrences of the exact same key: if a document has a would-be definition of a data object with multiple bindings to the same key, then that document, technically speaking, is not a PDF. It seems consistent with this philosophy to generalize the notion of “same” to “equivalent” and say that each full name is equivalent to its abbreviations. Then, a document that binds, say, both the keys Length and L is not a PDF.
- If we’re inclined to allow both full names and abbreviations, and we want to think of a dictionary definition as a sequence of updates to a record or array, then we could say that the last value bound is the one to be used. This definition feels inline with the theory of arrays, an axiomatic theory used to formalize the behavior of array data structures and computer memory, often used by program verifiers.
- Other than those, we could just say to always use the full name or the abbreviation or the first that occurs.
In the end, after some discussion, the PDF Association’s PDF Technical Working Group decided that “in the situation where both an abbreviated key name and the corresponding full key name from Table 91 are present, the abbreviated key name shall take precedence.” In a sense, the decision reaches even beyond the definition of the current PDF version (2.0) in that it acts as industry guidance for how actively maintained processors of previous versions can be altered to be consistent.
The final decision and the process used to arrive at it illustrate an important point about practical format designs: it’s tempting and not poorly justified to want to revise definitions to be aesthetically pleasing according to various tastes or to bring out parallels with well-studied formalisms, but often the adjustment to a standard that will likely result in the least toil and trouble is the one that strikes a balance between
- describing many existing implementations as they are;
- having a simple enough description that it can be articulated cleanly and patched into implementations that don’t happen to be doing it already.
Improving this situation somewhat with DaeDaLus, Talos, and the FAW
In work supported by the DARPA SafeDocs program , we’ve been working to address a host of problems, many of which can be viewed as solving general forms of the problems that arose in the process of finding and fixing the issue with inline images described above.
Saying what we mean, in DaeDaLus
It was hard to realize that the PDF standard was ambiguous because the PDF standard is 1,000 pages of English. Ideally, format definitions would be written in a data description language that’s intelligible to humans but is also machine-readable, in the sense that it would be equipped with
- a definition checker that can automatically check if a format is ambiguous (like a static type-checker checks for problems in a program);
- a parser-generator that would translate the definition into a program that, given a document, either accepts or rejects it as a member of the format.
Such data-definition languages are about as far as you can get from a new concept in computer science: they’re just another name for classes of languages in the Chomsky hierarchy, which is one of the earliest results for organizing computational problems and machines that solve them , and which includes regular expressions and context-free grammars, formalisms that have been widely studied and which serve as the foundation for modern, ubiquitous, practical tools, such as grep and ANTLR . But practical data formats like PDF aren’t regular or even context-free: they have components where the semantic interpretation of an early part of the document determines the structure of a latter part (e.g., an integer that defines the number of packets to follow), or which can be defined only by inspecting the document over many passes. So at present, many practical format definitions remain in the domain of language that is natural and informal.
In an effort to allow format definitions a more formal footing, we’ve been developing a novel data-description language, named DaeDaLus, that is expressive enough to define real formats precisely and unambiguously. DaeDaLus remains a very (very) experimental prototype, but it’s completely open-source and available on GitHub. If a format designer writes up their definition in DaeDaLus, it doesn’t leave room for ambiguities like the one concerning inline images. In fact, it was the process of trying to formalize the existing PDF standard in DaeDaLus that unearthed the issue in the first place.
Without giving a full tutorial on DaeDaLus, here’s a representative snippet that defines the BitsPerComponent entry of an Inline Image, and that’s hopefully self-contained enough for you to muddle through:
The basic idea behind the definition is to say that an Inline Image is a sequence of key-value pairs and that the resulting semantic value is a dictionary that binds each parsed key to its parsed value. In a sense, there’s no deep magic here: defining Inline Images explicitly like this won’t somehow guarantee that you’ll define that format as you intend. In fact, some of the available constructs (like empty and Insert, for maintaining maps/dictionaries) and parser libraries (like Fold, which extends the textbook Kleene iteration to maintain an accumulated value) make it pretty easy to get at least some working definition written down pretty quickly. But:
- writing the definition out like this at least ensures that it’s well defined whether any given document is in the format or not. Format experts might make well-informed arguments about why the definition should be changed going forward, but at least behaviors won’t be left open for implementations to diverge on in the meantime;
- the act of writing the definition explicitly almost inevitably leads you to want to simplify it as much as possible, because writing and maintaining definitions that are overly complicated for no reason isn’t fun (same as for usual programming).
In fact, it was specifically the act of writing out something like the above snippet and trying to simplify it that led us to prod the PDF Association about issues that ultimately led to the spec change: it seemed annoying that the same type of value is parsed for both the key strings BitsPerComponent and BPC, which led us to hope that we could just simplify everything by only having one BitsPerComponent key that results from parsing either the strings “BitsPerComponent” or “BPC.” The standard was ultimately revised in a way that doesn’t allow this, but it was the act of trying to simplify that unearthed the issue.
All of that said, while formal definitions help format experts to be sure that they’ve handled every case of a format’s definition, intuitive understanding of format definitions (and for that matter, computer programs) comes from examples. In an effort to help format experts understand the formats that they write, we’re developing a tool called Talos that, given a format definition, automatically generates documents that are in it (if you’re familiar with program analysis tools, think of it as a symbolic execution engine for a data-description language). Talos support is still experimental, but the current version is available in the DaeDaLus repository.
Identifying funny business with the FAW
The Inline Image issue brings up at least two other key aspects of practical format design.
- Even if a format definition hasn’t been formalized, we can still tell that there’s an issue by just cross-checking what a bunch of existing parsers do on files in the aggregate: if they produce very different results, then we already know that there’s a problem.
- When format experts decide that a definition needs to be updated, a critical factor (possibly the overriding one) in deciding how to update it amounts to articulating what the popular processors are already doing.
In an effort to support these and other workflows, we are developing the Format Analysis Workbench (FAW), a tool whose primary goal is to enable users to understand how formats are processed by their existing processors. For a format of interest, the FAW can be instantiated with a document corpus, a set of document processors, and a set of output patterns. It can then be used to process the corpus using the processors and subdivide the corpus, based on which documents result in which outputs. In addition, it can identify files that cause notable rendering differences between multiple parsers.
Where we go from here
Day-to-day, the world manages to use data formats pretty effectively, arguably in part because format standards are published as sensible, reasonably precise guidelines, document producers try not to push the limits of the format too hard when they can help it, and document consumers attempt to be as flexible as possible, typically above and beyond what the standard calls for. However, as attackers begin to work more and more at the edges of behavior left undefined by standards and implemented ad-hoc by producers and consumers in isolation, well-intentioned but incomplete specs will no longer suffice. We’ve been working hard on tools that make it feasible to iron out actual document formats in all of their messy details and understand documents formats as they’re actually used: there are plenty of interesting problems (both theoretical and practical) left to bite off, so stay tuned.