DaeDaLus and SPARTA: Enabling Format Users to Describe Extant Formats Precisely and Enabling Programmers to Parse Them Safely

Funded by DARPA’s SafeDocs program, this project aims to achieve three major goals: design a novel language for precisely describing practical formats, design a framework for learning formal descriptions from extant documents, and enable the generation of correct and performant format parsers.

We exchange information every day through widely-used document formats such as PDF, DOCX, XLSX, and between Internet of Things devices using DDS streaming formats. The popularity of these file formats, combined with the lack of clear and explicit descriptions of which documents are actually in them, makes them highly susceptible to malware exploits, such as the “macroless” DDE Exploit that recently affected Microsoft Word documents.

Attackers deploy malware exploits for these files by tricking ubiquitous document processors, called parsers, into believing that malicious files are harmless documents, deceiving the parser into either validating a malicious file as safe; or accidentally carrying out the attack in the act of attempting to process the file. These attacks get through because the specification of what documents are safe is typically left implicit. Parser programmers must constantly work to track what properties of a document have indeed been validated as a parser executes, and if these properties suffice to carry out a desired action. In practice, this is an overwhelming task.

Over the course of this project, we will provide users with trustworthy formats by:

  • Designing a rich language of data formats, called DaeDaLus, which will naturally and precisely describe the most detailed features of complex, practical formats.
  • Designing tools that learn DaeDaLus descriptions from a format’s extant representation as a large corpus of documents. Such tools will be usable by format users who may not necessarily have any background in programming.
  • Designing a parser toolkit for the DaeDaL formats, called the Safe PARser Toolkit for Assurance (SPARTA). SPARTA will provide a language for programming parsers, along with heavily automated tools that generate parsers from DaeDaLus formats and verify the correctness of parsers in the language.

Describing Practical Formats Precisely

Many of the practical formats that we rely on evolved to be usable by becoming highly flexible. For example, a PDF might simply contain formatted text, but it might also contain pictures, videos, and JavaScript code. The cost of this flexibility is that precisely describing exactly which documents belong in a format becomes extremely challenging. For example, simply knowing how big one part of a document should be may require a deep inspection of the content in an earlier part of the document.

A major result of this project will be a data-description language, named DaeDaLus, that will enable format users, including ones without experience in programming, to understand specify practical formats precisely. The key features of DaeDaLus design and toolchain will include:

  • A carefully designed language of natural but powerful data properties, and language features for using them to dictate expected document structure.
  • A type system, supported by an automatic type checker, that a format writer will be able to use to specify the semantic value that should result from correctly parsing a document.
  • A format checker, named ChkDDL, to which format users will be able to pose queries regarding the safety and security of the format itself. Such queries might include “Does this format overlap with a different, existing format? Could a single document in this format be interpreted multiple ways? Does this format enable a malicious document writer to hide information not visible to an end-user?”

Format Descriptions from the Documents You Already Have

DaeDaLus will be supported by a toolchain that will enable format users to quickly obtain DaeDaLus descriptions of the document formats that they rely on. To obtain a complete description of a format in DaeDaLus, a format user will be able to provide a corpus of existing format documents to a powerful tool, called LearnDDL. LearnDDL will automatically generate a DaeDaLus format that naturally describes the existing format by employing sophisticated techniques from machine learning.

Optionally, the format user will use other tools in the framework to learn normalizations from relatively complex formats that describe formats as they are used to simpler formats that can be more easily understood by humans and parsers alike (e.g., by removing redundant whitespace symbols).

Critically, the DaeDaLus formats, including those generated by LearnDDL, will be a declarative, complete, and precise description of which documents are in a format. But for format users to be able to work with documents in a format, they still need a parser to process format documents.

From Descriptive Formats to Safe and Fast Parsers

Developers will obtain correct, high-performance format parsers by using the Safe PARser Toolkit for Assurance (SPARTA). SPARTA will include a Domain Specific Language (DSL) designed specifically for developing parsers, named Pickle. While programmers will be able to develop Pickle parsers completely from scratch, SPARTA will also enable them to generate at least major parser components from DaeDaLus formats automatically.

We expect a typical usage of the Pickle toolchain to develop a parser for a DaeDaLus format will proceed as follows:

  • The developer will invoke a parser-generator tool named Pillage that will generate a Pickle program that parses exactly the given DaeDaL format. To do so, Pillage will employ type-directed synthesis.
  • The developer will manually edit the generated Pickle program as needed, potentially to replace automatically-generated code with customized algorithms, and use the Pickle-type system to ensure that the resulting code remains correct.
  • The developer will use one of SPARTA’s compilers from Pickle to a conventional language, such as C++, to generate high-performance code that can directly interact with existing codebases, accompanied by a machine-checkable proof that the code is correct.
  • Optionally, the programmer may also use SPARTA to generate format normalizers that simplify a document before it is given to the developed parser.

Safer Formats Will Make a Safer World

DaeDaLus and SPARTA aim to help put an end to issues caused by unclear file formats and unsafe parsing. The frameworks, toolkits, and toolchains we are developing will be designed for customizability while taking advantage of new advances in automation and machine learning. If successful, systems built using these implementations will be invulnerable to a wide class of malware attacks.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-C-0073. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).