At Galois, we’re always looking for better ways to build reliable systems. A big part of that is doing our own research. At 130 people, we’re one of the world’s largest formal methods research teams! But another part of our work is understanding new tools and techniques from the research community. We’re interested in solving problems using any approach that works, including learning from other formal-methods researchers. Of course, it can be difficult to know how well a technique will work in practice from reading a research paper, so we often test new techniques out ourselves to learn more about them.
One area that we’re very interested in at the moment is black-box testing of network protocol implementations. Two important examples are the ubiquitous TLS protocol and QUIC, a new transport-layer protocol under standardization at the IETF that will underlie HTTP/3. These protocols are critical for security but are notoriously difficult to get right, and implementations of TLS have often suffered from security vulnerabilities in the wild. To make matters worse, network protocols are inherently difficult to test because exercising a bug might require that several nodes play out a long sequence of interactions.
Traditionally, protocol developers use interoperability testing, in which, e.g., a server running implementation A of QUIC is tested against a client running a different implementation B. By running the protocol between the two implementations, long sequences of complex interactions can be generated. This approach also provides a limited sort of adversarial testing, as implementation B might have interpreted the QUIC specification differently, resolved ambiguities differently, or might behave erroneously. The problem with this approach is that it is very hard to achieve high coverage of the QUIC specification: coverage depends on the implementation of B, the precise sequence of network operations, and other contingent factors. Testing erroneous behavior is even more challenging.
That’s why we were excited to see two papers from Ken McMillan and Lenore Zuck which use some pretty clever ideas from formal methods to address the shortcomings of interoperability testing:
- McMillan, Kenneth L., and Lenore D. Zuck. “Formal specification and testing of QUIC.” Proceedings of the ACM Special Interest Group on Data Communication. 2019.
- McMillan, Kenneth L., and Lenore D. Zuck. “Compositional Testing of Internet Protocols.” 2019 IEEE Cybersecurity Development (SecDev). IEEE, 2019.
McMillan and Zuck develop a testing tool, built on top of the Ivy formal verification framework, that uses specification-based testing to exercise much more diverse scenarios. In their approach, McMillan and Zuck develop a formal model of the IETF QUIC specification and use it to randomly simulate network nodes that send packets to the implementation under test. For example, a QUIC server implementation would be tested by randomly simulating a QUIC client (or clients) using the formal model. Compared to interoperability testing, the randomized, simulated client produces much more diverse stimulus than a concrete implementation B would when testing server A. McMillan and Zuck find a bunch of interesting bugs in implementations of QUIC, including vulnerabilities to denial-of-service attacks and an information leak similar to the famous Heartbleed bug. Sounds pretty cool!
So we were curious what it would be like to apply this new tool ourselves. There’s a lot of great formal methods research happening at the moment, but in our experience, there can be a gap between research and what’s needed for a practical engineering tool. Academic researchers are focused on pushing the frontiers, which means that research tools sometimes have rough edges that make them difficult for new users. Research tools can also be specialized to a particular use case in a way that makes them hard to generalize. We wanted to replicate McMillan and Zuck’s results and understand how useful their approach would be for testing other QUIC implementations.
The good news is that the tool worked as we’d hoped! We applied it to Picoquic, a simple QUIC implementation, and even managed to find a non-conformance problem in the version we tested. That’s a good indicator that this approach could be practical in the future. That said, this is definitely a research tool rather than something ready for deployment to customers. Using it currently requires a deep understanding of the Ivy language, the QUIC protocol, and the target implementation. We’re confident that these aspects of the tool could be improved with more engineering work. So, in general, the results of our investigations were very positive, and we’re excited to apply this idea ourselves in the future.
If you are curious to try it out, we packaged our experiment using Docker for easy replication and made it available at https://github.com/GaloisInc/ivy-quic-testing-demo.
Let’s dig a bit more into McMillan and Zuck’s approach and see what we find.
Overview of the Tool
Ivy is a general-purpose formal verification tool, but McMillan and Zuck use it specifically for black-box testing of protocols. Ivy’s testing of QUIC works by analyzing events visible on the wire rather than internal events of the implementation. This means Ivy only needs to be able to communicate with the implementation under test over the network, and it requires no source-code access or instrumentation. The testing tool masquerades as just another peer on the network, which makes it easy to deploy.
Ivy models are written in the Ivy modeling language, which is approachable to anyone who has programmed in, e.g., C or Python. Models in Ivy are written using imperative (i.e., C-like) statements together with pre- and post-conditions expressed in first-order logic. These models are very flexible and can specify the behavior of protocols with unbounded numbers of peers, unbounded data, unbounded sequences of network packets, etc. Ivy translates these models into the language of the Z3 SMT solver to generate tests and check for compliance.
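To give a flavor of what that looks like, here is a minimal Ivy snippet in this style. It is our own toy example with made-up names, not an excerpt from the QUIC model:

    #lang ivy1.7

    type cid_t                         # an uninterpreted type of connection ids

    relation requested(C:cid_t)        # state: which ids have been requested so far

    after init {
        requested(C) := false          # initially, none
    }

    # an action with a precondition (require), an imperative state update,
    # and a postcondition (ensure) expressed in first-order logic
    action open_connection(c:cid_t) = {
        require ~requested(c);         # precondition: the id must be fresh
        requested(c) := true;          # imperative update
        ensure exists X. requested(X)  # postcondition
    }

    export open_connection             # the environment (e.g., a tester) may call this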
The QUIC model consists of a monitor: an (infinite) state machine that observes the events visible on the wire. Each wire event is associated with a precondition, which stipulates in which states of the monitor the event is allowed to take place, and with an update of the monitor’s state. A sequence of packets observed on the wire is correct if, when we run the monitor on each event in order, no precondition is ever violated.
This modeling style enables what is called assume-guarantee testing1 of network components. For example, to test a server, we check that, as long as the packets sent by the clients never cause a precondition violation (the assumption), the packets sent by the server never cause a precondition violation either (the guarantee).
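To make that concrete, here is a toy two-event protocol written in this monitor style. Again, this is our own illustrative sketch with invented names, not an excerpt of the actual QUIC model (which funnels all packets through a single, much richer packet event and relies on a network shim to decide whether an event was produced by the tester or observed from the implementation under test):

    #lang ivy1.7

    type cid_t

    relation hello_sent(C:cid_t)       # monitor state: ids the client has used
    relation answered(C:cid_t)         # monitor state: ids the server has answered

    after init {
        hello_sent(C) := false;
        answered(C) := false
    }

    # the two wire events of this toy protocol
    action client_hello(c:cid_t) = {
        hello_sent(c) := true          # state update: remember the client's id
    }

    action server_hello(c:cid_t) = {
        answered(c) := true
    }

    # assumption: client packets generated by the tester must use a fresh id
    before client_hello {
        require ~hello_sent(c)
    }

    # guarantee: the server may only answer an id the client actually used, and only once
    before server_hello {
        require hello_sent(c) & ~answered(c)
    }

    export client_hello
    export server_hello

When a test is generated from a model like this, the require on client_hello plays the role of the assumption (the solver picks arguments that satisfy it), while the require on server_hello is the guarantee checked against what the implementation actually sends.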
In practice, a test of a QUIC server might proceed as follows:
- Ivy initializes the model and then calls the Z3 SMT solver to find a random packet that a client can send to the server; this will likely be a client HELLO packet, which is used in QUIC to establish a new connection. Z3 finds values for all the fields in the packet (e.g., the connection id) such that those fields agree with the precondition of the client HELLO packet, and Ivy calls its network shim to actually send that packet on the wire. The monitor state is then updated according to the monitor’s state-update function. For example, the monitor remembers the connection id used in the client request.
- The server then responds, likely with a server HELLO packet. Upon receiving the response, Ivy calls Z3 again, but this time to check that the received packet satisfies the precondition that the monitor associates with server HELLO packets. For example, it checks that the destination connection id is the same connection id that the client used in its HELLO packet, which is available in the monitor state.
- Subsequently, Ivy calls Z3 again to find a new packet that can be sent to the server, and the exchange of packets between the simulated client and the server continues in this fashion.
Upon failure, Ivy displays a high-level event trace and the location of the assertion that was violated.
Since the model is used both to generate packets to send to the implementation under test and to check that the responses are correct, a failed test may reveal not only a) bugs in the implementation, but also cases in which the model is b) too strong or c) too weak. The model is too strong when a precondition is too restrictive and fails on a packet that the authors of the IETF specification intended to be legal; if a component under test generates such a packet, a spurious precondition violation will occur. Conversely, the model is too weak when it can cause a packet that is not intended to be legal to be generated during a test; such a packet is likely to be rejected by the component under test, to make it crash, or to make it behave unexpectedly. In both cases, we are likely to get an error trace that can be inspected.
In other words, protocol testing with Ivy allows us to test both the implementation and the model at the same time! This is important because, like programming, developing models is an error-prone process and models must be debugged too. This means that a user can begin with a candidate model that is gradually refined at the same time that bugs are fixed in the implementation under test. One note of caution is that when a test failure occurs and Ivy displays an error trace, figuring out whether we are in situations a), b), or c) is a manual process of interpretation, requiring close inspection of the trace, the Ivy model, and the IETF specification.
Because the specification expresses the whole protocol for a network of any size, testing can start with any component, and networks of any size can be tested. The user can add logical conditions that restrict (or disable) protocol steps, thereby narrowing down the set of interactions produced. Conversely, test generation can be made more permissive by removing logical conditions, which has the effect of allowing more interactions. The specification can also be modified to allow invalid requests in order to test the component for robustness against bugs or security attacks.
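For example, continuing the toy model from above, narrowing the simulated client down to a single connection id is just a matter of layering one more monitor on top of the specification (again, purely illustrative):

    individual the_cid : cid_t         # an arbitrary but fixed connection id

    # an extra condition layered on top of the model: generated client hellos
    # may only use this one id, which narrows the interactions the tester explores
    before client_hello {
        require c = the_cid
    }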
Finally, we should note that Ivy’s protocol-testing facility is not specific to QUIC and can be applied to other protocols. However, Ivy comes with a QUIC specification (based on IETF draft 23) and a configuration to use it to test sample QUIC implementations (Quant and Picoquic).
Results
According to the QUIC specification, only packets containing ACK-eliciting frames should be acknowledged with an ACK. An ACK frame itself is not considered ACK-eliciting, since acknowledging ACKs could otherwise lead to an endless ACK loop. The Ivy QUIC specification did not capture this requirement, so we added it ourselves. Then, we tested the Picoquic server using one of the predefined test scenarios that come with the Ivy distribution. In this test, Ivy simulates a client downloading an HTML file from the Picoquic server under test. In each run, Ivy picks at random a behavior of the client that is consistent with the Ivy QUIC specification, thus exposing the server to a wide range of client behaviors. This exposed a case in which the Picoquic server acknowledges a client packet containing only an ACK frame, which violates the specification (note that we used an old version of Picoquic, written against IETF draft 23, because the Ivy QUIC specification follows IETF draft 23). Our results can be replicated using the Docker setup hosted at https://github.com/GaloisInc/ivy-quic-testing-demo.
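To give an idea of what adding such a requirement involves, here is a deliberately simplified, self-contained sketch of the rule. It is not the actual edit we made to the Ivy QUIC model (which states the rule over the model’s own packet and frame structures); it only shows the shape of the constraint:

    #lang ivy1.7

    type cid_t

    # per connection: has the client sent an ACK-eliciting packet that the
    # server has not yet acknowledged?
    relation owes_ack(C:cid_t)

    after init {
        owes_ack(C) := false
    }

    # a client packet arrives at the server; `eliciting` records whether it
    # carried any ACK-eliciting frame
    action client_packet(c:cid_t, eliciting:bool) = {
        if eliciting {
            owes_ack(c) := true
        }
    }

    # the server sends an ACK on connection c
    action server_ack(c:cid_t) = {
        require owes_ack(c);           # no ACK unless an ACK-eliciting packet is pending
        owes_ack(c) := false
    }

    export client_packet
    export server_ack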
Lessons Learned
- The first snag we ran into on the project was that the Quant server didn’t seem to interoperate well with the underlying TLS library embedded in the test generator (the Quant server returned “TLS error 120” upon receiving packets from the test generator). We did not understand the cause of this error, and we ended up switching from Quant to Picoquic as a workaround. Picoquic uses Picotls under the hood, which is also what McMillan and Zuck’s original test generator uses.
- The Ivy specification covers a wide range of scenarios, and its breadth can be overwhelming. It must be augmented with additional constraints if one wants to target a particular scenario. The test framework is designed to randomly explore all permissible actions, so hitting a particular test requirement is unlikely without a custom spec that narrowly constrains the simulated components to particular steps. Our experience with the demo scenario above showed that creating a specific test scenario requires a good understanding of both the Ivy specification and the IETF specification.
- The test output, in particular the Ivy traces, is difficult to interpret. It took a lot of digging into the code to understand which actions originated from the test environment and which originated from the system under test. This is not a fundamental limitation, but it would take time to polish the tool to make the output more user-friendly.
- When a test case fails, one gets an event trace and a pointer to the violated assumption on a code line in the Ivy spec. Only someone deeply familiar with both the Ivy spec and the IETF spec can understand why a test has failed. In general (and perhaps unsurprisingly), interpreting an error requires deep knowledge of the QUIC protocol. Arbitrary debug events can be inserted into the spec to facilitate debugging (see the sketch after this list).
- The Ivy QUIC specification covers a large part of the protocol logic but is not complete. It also does not model serialization formats or the TLS logic, and therefore does not generate randomized tests covering those aspects (it relies on concrete implementations of those to produce test inputs).
- The current QUIC specification written in Ivy is based on the QUIC IETF draft number 23, which is outdated. To apply Ivy to test a new implementation, the Ivy specification must first be updated to the corresponding IETF draft version.
- We have not tried the offline trace-checking mode, and PCAP trace capture does not currently work.
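To illustrate the debug events mentioned in the list above: in Ivy, an imported action can serve as a simple debug event, and, as we understand the tool, calls to it appear in the event trace that Ivy prints. Continuing our toy model from the overview, a sketch might look like this:

    # an imported action used purely as a debug event in the trace
    import action show_hello(c:cid_t)

    after client_hello {
        call show_hello(c)             # record the connection id in the test trace
    }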
Conclusions
At Galois, we’re big fans of formal verification, which is the gold standard of systems assurance. But for many projects, developing a formal proof would be too expensive. At the same time, it can be difficult to test a complex system such as a network protocol. The exciting thing about McMillan and Zuck’s work is that it suggests we can have some of the best of both worlds. By using a formal model, we can systematically test a very complex protocol implementation without the cost of formally verifying the whole codebase. We’re really happy with the results we got in this preliminary study, and we hope to see a lot more research on this type of compositional testing in the future.
1Giannakopoulou, Dimitra, Corina S. Păsăreanu, and Colin Blundell. “Assume-guarantee testing for software components.” IET Software 2.6 (2008): 547-562.