Document and protocol formats are the languages in which computing systems exchange information. The notion of a “document format” certainly isn’t foreign to most computer users: every day, vast numbers of users directly create, edit, and share documents and media in formats such as PDF, JPEG, and Word. But the concept of a “format” is quite flexible in principle, and it describes just as well the messages exchanged deep within the guts of operating systems and networking protocols.
The theory of describing formats using formalisms such as regular expressions and context-free grammars goes back almost to the birth of computer science, starting more or less with the definition of the Chomsky hierarchy in 1956. Today, there are plenty of practical frameworks for defining formats and parsing them that do the job well enough to get by, and many of them see plenty of day-to-day use at Galois.
However, the art and science of format definition and parser construction still has plenty of open problems. Languages based on formalisms such as regular expressions and context-free grammars are simple and self-contained, and when they fit the format of interest they’re hard to beat, but they can’t fully describe an industrial-strength format like PDF, whose syntax includes context-sensitive features such as data-dependent lengths and cross-reference tables. These shortcomings are evident to end users in a consistent stream of critical, system-level security exploits that result from under-specified formats and buggy parsers.
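To see why context-free grammars fall short, consider that a PDF stream object declares its own payload length in a /Length entry, so the number of bytes the parser must consume depends on a value read earlier in the input. A minimal Python sketch of such a length-prefixed parse (a simplified stand-in for PDF’s real stream syntax, not a conforming implementation):

```python
import re

def parse_stream_object(data: bytes) -> bytes:
    """Parse a simplified PDF-style stream whose /Length entry
    determines how many payload bytes follow: a data dependency
    that no context-free grammar can capture."""
    # Read the declared length from the dictionary header.
    m = re.match(rb"<<\s*/Length\s+(\d+)\s*>>\s*stream\r?\n", data)
    if m is None:
        raise ValueError("malformed stream header")
    length = int(m.group(1))
    start = m.end()
    payload = data[start:start + length]
    if len(payload) != length:
        raise ValueError("truncated stream payload")
    # The payload must be followed by the 'endstream' keyword.
    if not data[start + length:].lstrip(b"\r\n").startswith(b"endstream"):
        raise ValueError("missing endstream")
    return payload

doc = b"<< /Length 5 >>\nstream\nhello\nendstream"
print(parse_stream_object(doc))  # b'hello'
```

The regular-expression part handles only the fixed-shape header; the interesting step, slicing out exactly `length` bytes, is ordinary imperative code, which is precisely what classic grammar formalisms cannot express.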
Funded by the DARPA SafeDocs program, Galois (in close collaboration with Trail of Bits and students and faculty at Cornell, Penn State, Princeton, Purdue, and Tufts) is currently performing research focused on improving the state of the art and science of defining formats, building parsers, and finding the security vulnerabilities that result when either goes wrong.
To that end, we are excited to share an update on our work on DaeDaLus, a new data description language, and on the Format Analysis Workbench (FAW). Although there’s plenty of work ahead, after about a year of research the tools we’ve built have already contributed to the design of a very practical, very complex format: we’ve uncovered seven novel issues in PDF that the PDF Association aims to fix in the next version of the standard.
Getting Precise About Real-World Formats
Although decades of research have explored the tradeoff between making data description languages (i.e., languages in which people write machine-readable format definitions) more expressive and keeping their formats efficient to parse, the dust hasn’t yet settled on a winning language.
At Galois, we’re leading the design of a novel data description language—named DaeDaLus—that aims to aid in shaping expressive format definitions that are easily understandable. DaeDaLus is designed to be expressive enough to describe practical formats precisely, but also declarative and self-contained enough to be understood with a reasonable amount of effort, especially by format experts who aren’t necessarily expert programmers.
Traditional parser-combinator libraries are a powerful and useful tool for rapidly building programmer-understandable parsers for practical formats. However, the very hallmark that gives parser combinators their flexibility (they build parsers as ordinary values in a general-purpose language, usable by arbitrary programs) can work against them: the resulting definitions can be hard to read as standalone specifications. To address this issue, DaeDaLus’s parsing algorithms seek to ensure that a parser defined in DaeDaLus also acts as a declarative format specification that non-programmers can easily understand.
DaeDaLus is still relatively young, but our work with it so far has been promising: over the span of a few months, we’ve not only defined the entire language and built first prototypes of its parsing algorithms, but we’ve used it to define a machine-readable definition of a “surface” syntax of PDF objects. DaeDaLus has been expressive enough that we’ve been able to transcribe the details of the PDF standard into it.
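For a flavor of what a “surface” object syntax covers, here is a deliberately tiny Python tokenizer for two kinds of PDF objects, names and integers. This is a hypothetical illustration of the territory, not the DaeDaLus definition itself, and it covers only a small slice of PDF’s object syntax:

```python
import re

# Token patterns for a small slice of PDF's object surface syntax:
# name objects like /Type and signed integers like -12.
TOKEN = re.compile(rb"""
    \s+                           # whitespace between objects
  | (?P<name>/[^\s/<>\[\]()]*)   # name object
  | (?P<int>[+-]?\d+)            # integer object
""", re.VERBOSE)

def tokenize(data: bytes):
    """Split input into (kind, value) tokens, rejecting anything else."""
    tokens, pos = [], 0
    while pos < len(data):
        m = TOKEN.match(data, pos)
        if m is None:
            raise ValueError(f"lexical error at byte {pos}")
        if m.lastgroup == "name":
            tokens.append(("name", m.group().decode()))
        elif m.lastgroup == "int":
            tokens.append(("int", int(m.group())))
        pos = m.end()
    return tokens

print(tokenize(b"/Type -12 /Pages 3"))
# [('name', '/Type'), ('int', -12), ('name', '/Pages'), ('int', 3)]
```

The full surface syntax also includes reals, strings, arrays, dictionaries, streams, and indirect references, plus escape and quoting rules for each; transcribing all of those details precisely is exactly the job a machine-readable DaeDaLus definition takes on.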
DaeDaLus is currently supported by parsing algorithms implemented in Haskell, so it can be integrated into a Haskell project and deliver parsing results as Haskell datatypes. DaeDaLus is now open-source software, available from GitHub. In the future, we hope to add more efficient parsing algorithms and parser generators, as well as self-certifying parsers (i.e., parsers that are either correct by construction or that accompany their results with easily-checkable certificates of correctness).
Formats from Giant Tarballs of Documents
It’s no small thing to manually translate standards written in natural language into a machine-readable form, primarily because:
1. It can get pretty exhausting after a while; and
2. A lot of the time, it’s not even clear what you actually want. In reality, formats aren’t just official standards documents; they’re implicit understandings of which features users decide they want, and which features parser implementers decide they’re going to support.
We’re working in close collaboration with Trail of Bits to develop a toolkit that a format expert can use to generate a machine-readable description of formats as they’re actually used, derived from existing documents and parsers. Trail of Bits has recently released a suite of tools—PolyFile and PolyTracker—that enable users to understand existing documents and parsers. PolyFile takes a document and a library of known formats and labels the document’s bytes as pieces of candidate formats; PolyTracker tracks which components of a given parser process which pieces of an input.
DaeDaLus will be supported by the Format Analysis Workbench (FAW), which enables a user to run a suite of existing parsers over a corpus of documents and quickly navigate the plethora of error and warning messages that practical parsers generate. Our goal is that, using the FAW, a format expert can quickly determine which documents should act as the ground truth of an implicit format in use, balancing their informal understanding of the meaning of various parser error messages against the practical implications of accepting or rejecting the sets of documents that trigger such messages. Our hope is that in the future, this will be useful both to global document standards bodies and to organizations looking to define restricted, trusted formats for internal use (e.g., a version of PDF that does not contain JavaScript).
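The core of that triage can be pictured as an inverted index from parser messages to the documents that trigger them. The Python sketch below is a toy stand-in for what the FAW aggregates at scale; the documents and messages are invented (mutool and qpdf are real PDF tools, but these strings are not their actual output):

```python
from collections import defaultdict

# Hypothetical corpus results: document -> list of (parser, message) pairs.
reports = {
    "a.pdf": [("mutool", "warning: repairing xref"), ("qpdf", "ok")],
    "b.pdf": [("mutool", "ok"), ("qpdf", "error: bad /Length")],
    "c.pdf": [("mutool", "warning: repairing xref"), ("qpdf", "error: bad /Length")],
}

# Invert the table: which documents trigger each (parser, message) pair?
by_message = defaultdict(set)
for doc, results in reports.items():
    for parser, message in results:
        if message != "ok":
            by_message[(parser, message)].add(doc)

for key, docs in sorted(by_message.items()):
    print(key, sorted(docs))
```

With this view, an expert can ask the kind of question the FAW is built around: if we treat “repairing xref” as disqualifying, which documents leave the ground-truth set, and do we actually want to reject them?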
That’s not to imply that parsing doesn’t leave a bit of time for fun and games. A recent competition hosted by Kudu Dynamics tested the parsing prototypes of the various groups carrying out research under the SafeDocs program. Every team had its strengths, but in the end, a particular piece of hardware will be calling SW 6th Ave in Portland home for the foreseeable future…
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.