Crux in SV-COMP 2022: A Retrospective

This year Galois participated in the annual Competition on Software Verification (SV-COMP) for the first time with our verification tool, Crux. Preparing Crux to be in competition shape has been a project several years in the making, and we are very pleased to have reached this milestone. In this post, we will take a closer look at what it takes to participate in SV-COMP, what changes we had to make to Crux to get there, how Crux fared in the competition, and what our plans for future competitions are.

Our motivations

SV-COMP is an annual competition whose results are announced at the colocated TACAS conference. Since SV-COMP 2022 is the 11th iteration of the event, several competitors have had many years to hone their tools. As newcomers, our primary goal is not to compete for the top scores. Instead, our goal is to increase Crux’s visibility in the software verification community and to contextualize Crux’s capabilities against other tools. Moreover, the competition results can inform the future development of Crux features.

An SV-COMP primer

To participate in SV-COMP, you first need a software verification tool capable of verifying properties of C or Java programs. Next, you must declare which properties your tool is capable of analyzing, which could include:

  • Is an error function (e.g., abort()) reachable?
  • Are there any instances of signed integer overflow?
  • Is a program memory-safe?
  • Does a program always terminate?

See the competition rules for a complete list of properties and their definitions.

The competition organizers curate a suite of benchmark programs that declares which properties should hold for specific programs. For example, some benchmark programs are expected to invoke an error function, while others are expected never to do so. Besides the properties above, there are some other unique categories that one can choose to participate in or not, such as programs derived from large software systems (e.g., the Linux kernel or BusyBox) or programs that use concurrency. It’s worth noting that many of the programs are contributed by the competitors themselves!
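
To make this concrete, here is a minimal sketch of what a reachability benchmark can look like. This program is illustrative rather than taken from the suite; reach_error and __VERIFIER_nondet_int follow the naming conventions that SV-COMP benchmarks use for the error function and for nondeterministic inputs.

    extern void reach_error(void);           /* the error function whose reachability is checked */
    extern int __VERIFIER_nondet_int(void);  /* models an arbitrary, unconstrained int input     */

    int main(void) {
        int x = __VERIFIER_nondet_int();
        int y = x;
        if (x != y) {        /* this condition can never hold...          */
            reach_error();   /* ...so the error function is unreachable   */
        }
        return 0;
    }

A verifier’s job here is to decide whether reach_error() can ever be called; for this program, the expected verdict would be TRUE.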

The benchmark suite is considered the source of ground truth when it comes to determining whether a property should hold or not. Tools that compete in SV-COMP are scored on how many programs they can correctly show to satisfy (or, where applicable, violate) the property being tested. To earn points, a tool must do two things:

  1. It must report TRUE (for properties that hold) or FALSE (for property violations), and this answer must match what is recorded in the benchmark suite.
  2. The tool must produce a witness automaton that demonstrates the correctness or incorrectness of a program.

The second requirement is the more interesting one, and we will talk about it in more depth later. For now, it suffices to say that it isn’t enough for a tool to just blindly guess TRUE or FALSE—it must also show its work. For the full details on how tools are scored, see the SV-COMP rules. Note that as an alternative to reporting TRUE or FALSE, a tool can also report UNKNOWN, which results in zero points and avoids the possibility of being penalized for an incorrect verdict.

Finally, once all of the competing tools have been scored, the SV-COMP organizers announce the results for each category. Tools that place in the top three in a specific category receive special honors. In addition, there are sometimes special honors for tools that do well in other respects, such as the tools with the lowest energy consumption.

Crux: a competition training montage

Galois’s Crux tool is well suited to handle the kinds of programs that comprise SV-COMP’s benchmark suite. Crux takes the source code of a program as input and produces a summary of whether the program contains any correctness issues, such as undefined behavior or memory-safety violations. Moreover, Crux supports specifying functional correctness properties as inline code. For more details on how this works, see here for a tutorial on using Crux with C, or here for a demonstration of Crux with Rust. Besides C and Rust, Crux also has experimental support for C++, Java, Go, WebAssembly, and concurrent programs.

When preparing Crux to enter SV-COMP for the first time, we decided to focus on Crux’s strengths. First, we chose to only verify single-threaded C programs, as Crux’s support for C is more mature than it is for other languages. We also decided to only enter the categories that check for error-function reachability and signed-integer overflow, as Crux was already adept at checking these properties. While it may have been possible to make Crux check for other competition properties, this would have required significant engineering work to accomplish, and we did not do this for lack of time.

Having narrowed our focus to C programs of a particular flavor, what else remained to be done? Broadly speaking, we ran into six kinds of challenges when preparing for the competition. (If you’d like to skip straight to the results, go to the “The competition results” section.)

Picking an LLVM version

Crux’s C frontend works by leveraging the LLVM compiler toolchain to compile C programs into bitcode, which is amenable to further analysis. When we first started testing Crux on SV-COMP benchmark programs, we encountered around 500 programs that would incorrectly fail to verify because the corresponding LLVM bitcode would reach an unreachable instruction. This should never happen under normal circumstances, and further investigation pointed to a bug in LLVM 10 triggered by compiling certain kinds of infinite loops, such as this one.
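
Our understanding is that the problem involves infinite loops without observable side effects, which LLVM 10 could erroneously remove. A minimal illustration of the shape (not the linked benchmark itself) is:

    int main(void) {
        while (1) {
            /* no side effects: LLVM 10 could delete this loop entirely,
               letting control fall through into an unreachable instruction */
        }
        return 0;
    }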

We ultimately resolved the issue by switching from LLVM 10 to LLVM 12, which was then the latest released version. This was a somewhat nontrivial task, as LLVM 12 introduced several new constructs requiring additional Crux support. Our work paid off, however, as we never encountered the unreachable-instruction bug again.

Adding overrides

SV-COMP benchmark programs make use of a wide variety of functionality from the C standard library. Most of the time, an LLVM bitcode file will treat C library functions as abstract, so it is up to Crux to determine what semantics to give the function. For example, Crux has special semantics for the malloc() and free() functions that allow Crux to reason about memory allocations. These special semantics are called overrides in Crux parlance.
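
As an illustration (hypothetical code, not a benchmark program), each of the library calls below typically appears in the compiled bitcode only as a declaration or intrinsic with no body, so Crux must supply its semantics via an override:

    #include <math.h>
    #include <string.h>

    int uses_libc(double d, unsigned int u) {
        double m = fabs(d);                 /* floating-point utility */
        int ones = __builtin_popcount(u);   /* bit-twiddling builtin  */
        char buf[8];
        memset(buf, 0, sizeof buf);         /* memory utility         */
        return (m > 1.0) + ones + buf[0];
    }

    int main(void) {
        return uses_libc(1.5, 7u);
    }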

Whenever Crux encounters a library function that it does not have an override for, it will fail by default. As a result, we spent a considerable amount of time adding new overrides to Crux to prepare it for the enormous diversity of library functions in the SV-COMP benchmark suite. These overrides range from floating-point utilities to bit-twiddling functions to special functions only used in LLVM’s bitcode format. While we made significant progress, there are still around 30 library functions we have yet to add support for. We ultimately decided to report UNKNOWN whenever any such functions were encountered to avoid being penalized on a large swath of programs.

Fixing some Crux issues

Testing Crux against SV-COMP’s benchmark suite revealed assorted problems with Crux. Some problems were legitimate bugs, such as a bug in Crux’s memory model, which could cause incorrect floating-point values to be loaded from memory. Other problems were missing pieces of functionality. For instance, some benchmark programs declare their main function as int main(int argc, char** argv) instead of int main(), but Crux only supported the latter at the time. As it turns out, none of the programs actually use argc or argv in a meaningful way, so adding support for the former was straightforward.

Not fixing other Crux issues

Not all issues revealed during testing were as straightforward to resolve. Some benchmark programs make use of the long double type, which Crux does not currently support. Other programs made use of inline assembly code, which is also unsupported. Robustly supporting these in Crux would likely require a lot of work, so for the time being, we made Crux report UNKNOWN on all such programs to avoid penalties.
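
For illustration, these are the kinds of constructs involved (hypothetical snippets, not from the suite; the inline assembly is x86-specific):

    long double scale(long double x) {
        return x * 2.0L;                        /* long double arithmetic is unsupported */
    }

    int read_one(void) {
        int value;
        __asm__ ("movl $1, %0" : "=r"(value));  /* inline assembly is also unsupported   */
        return value;
    }

    int main(void) {
        return (int)scale(1.0L) + read_one();
    }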

Creating witnesses

SV-COMP uses witness automaton files as a way to check that a tool’s TRUE or FALSE verdict can be reproduced by an external program, called a validator. Witness automata contain a control-flow graph that provides hints to a validator as it steps through the execution of a program. For FALSE verdicts, a witness file should guide a validator to a state in which a property is violated (e.g., a line of code that invokes an error function). For TRUE verdicts, a witness file gives program invariants to assist the validator in proving that a property holds. See this repo for examples of what witness automata actually look like in practice.

Crux had no prior support for generating witness automata, so this is something that had to be designed from scratch before entering SV-COMP. This was one of the more challenging aspects of the competition for various reasons. One reason is that Crux does not infer loop invariants or function specifications, so it is unclear what information Crux should include in witnesses for TRUE verdicts. Another challenge is that the behavior of validator programs is not always easy to predict: a witness might be accepted by one validator but rejected by another. In our experience, figuring out which information to include in a witness was often a matter of trial and error.

For programs with FALSE verdicts, Crux will generate a straight-line path of nodes leading from main() to the property violation, where each edge transitions to the head of a different basic block in the LLVM bitcode. For programs with TRUE verdicts, Crux does something dumb but simple: it generates a trivial witness consisting of a single node with no edges. This provides no hints to the validator, but even that is enough for a decent chunk of programs to be validated. (More on this later.)

Handling undefined behavior

By default, Crux imposes very strict validity checks on the programs it verifies. If Crux encounters undefined behavior or other situations not fully specified by the C standards, it will conservatively reject the program as invalid. In the context of SV-COMP, however, this is not always desirable, for a couple of reasons. One reason is that an SV-COMP benchmark program is only correct (or incorrect) with respect to a specific property. For example, a program that is invalid in the no-signed-integer-overflow category can simultaneously be valid in the no-error-function-is-reachable category. With Crux’s default settings, however, it would check for both of these properties, causing it to flag the program as invalid in both categories. To accommodate this, we had to ensure that certain validity checks in Crux could be toggled so that each property is only checked in the relevant SV-COMP categories.
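
Here is a sketch of such a program (hypothetical, not from the benchmark suite): it can overflow a signed integer, so it is incorrect for the no-signed-integer-overflow property, yet it never calls an error function, so it is correct for the no-error-function-is-reachable property.

    extern int __VERIFIER_nondet_int(void);

    int main(void) {
        int x = __VERIFIER_nondet_int();
        int y = x + 1;   /* overflows when x == INT_MAX: a violation in the overflow category */
        (void)y;         /* but no error function is ever called, so reachability still holds */
        return 0;
    }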

Another reason why Crux’s strict validity checking can work against it is simply that a lot of SV-COMP programs do unsavory things. Many benchmark programs will read from uninitialized memory or make use of undefined behaviors that are unrelated to the property being tested. We aren’t the first to have noticed this, either: this analysis found that 14% of programs in SV-COMP 2016 that were marked as correct exhibited some form of undefined behavior. As one example, consider this excerpt from a 2022 benchmark program:
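
(A minimal sketch of the pattern follows; the array size and surrounding code are illustrative rather than verbatim.)

    int main(void) {
        int x[16];              /* x is declared but never initialized               */
        int max = x[0];
        for (int i = 1; i < 16; i++) {
            if (x[i] > max) {
                max = x[i];     /* the maximum of whatever values x happens to hold  */
            }
        }
        return max;
    }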

Note that x is declared but not initialized to a specific value. Despite this, the program later computes the maximum value in x. This is an extremely common pattern in SV-COMP’s benchmark suite, but Crux will reject such programs by default, since it is not clear what value x should have. Nevertheless, the pattern is so common in the benchmark suite that we needed to do something about it, lest we leave large chunks of SV-COMP out of Crux’s reach.

Ultimately, we tackled this issue by adding a lax-loads-and-stores option to Crux. When this option is enabled, reading from uninitialized memory (such as x above) will return an uninterpreted value of the appropriate type instead of throwing an error. While we originally implemented lax-loads-and-stores specifically for SV-COMP, it has ended up being useful in other contexts as well, such as verifying programs involving C bitfields.
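
To sketch the bitfield case (hypothetical struct; our understanding is that assigning to a bitfield compiles to a read-modify-write of the containing storage unit, so even the very first store reads memory that was never initialized):

    struct flags {
        unsigned int a : 1;
        unsigned int b : 3;
    };

    int main(void) {
        struct flags f;   /* f's storage is uninitialized                              */
        f.a = 1;          /* read-modify-write of the containing unit reads the other,
                             uninitialized bits; lax-loads-and-stores permits this     */
        return f.a;
    }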

The exact status of reads from uninitialized memory locations remains a source of some uncertainty and debate among the C standards and verification community. See, for example, this commentary on the C11 specification, which does a good job explaining the historical and current status of the standard and some of the remaining questions.

The competition results

How did we do? The full results of SV-COMP 2022 are here, but we’ll give you the highlight reel. First, let’s look at how many programs Crux looked at during the competition, alongside how many points Crux scored:

A guide to reading this table:

  • The status column represents the number of individual programs, while the raw score column represents the points scored from those programs.
  • The correct row records each time Crux accurately identified a valid (correct true) or invalid (correct false) program. Moreover, each program’s witness automaton was confirmed by a validator.
  • The correct-unconfirmed row is like the correct row, except that the witnesses could not be confirmed by a validator. As a result, these are worth no points.
  • The incorrect row records each time Crux reported an incorrect verdict, both false positives (incorrect true) and false negatives (incorrect false). These result in points being deducted from the overall score.

Our total score was -23,560. This certainly isn’t anything to brag about, but we argue it’s not as abysmal as it looks. The ReachSafety category, which tests if an error function is reachable, has a subcategory of programs entirely derived from large software systems, such as Linux kernel drivers. Crux does quite poorly on these programs, and 1,578 of our 1,641 incorrect results come from this subcategory alone. Because SV-COMP heavily penalizes incorrect answers, this subcategory subtracted 24,948 points from our score!

Thankfully, the competition organizers are aware of the punishing nature of the software systems programs, and they have put these programs into a separate group for ranking purposes. If the software systems programs are excluded, then Crux scored 1,097 points in the ReachSafety category, ranking 8th out of 12 competitors. Crux also scored 290 points in the NoOverflows category (which tests for signed integer overflow), ranking 14th out of 16.

Then again, the rankings aren’t really as interesting as figuring out why Crux performs the way it does on certain problems. To a large extent, Crux’s performance can be explained by Crucible, the symbolic execution engine that Crux uses under the hood. Crucible works best on programs with clear termination criteria, and since the NoOverflows category largely consists of such programs, Crux does better there.

The programs in the ReachSafety category, on the other hand, are a much more mixed bag. Many of these programs include for loops with symbolic iteration bounds, which are difficult for Crucible to analyze. Usually, Crucible will handle these programs by bounding the number of loop iterations or by timing out, ultimately reporting an UNKNOWN answer. Unlike other tools competing in SV-COMP, Crucible makes no effort to infer loop invariants, which puts Crux at a disadvantage on these kinds of problems.
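
For instance, a loop of the following shape (illustrative, not a specific benchmark) has no fixed iteration count for symbolic execution to unroll:

    extern unsigned int __VERIFIER_nondet_uint(void);

    int main(void) {
        unsigned int n = __VERIFIER_nondet_uint();  /* symbolic iteration bound     */
        unsigned int sum = 0;
        for (unsigned int i = 0; i < n; i++) {      /* iteration count depends on n */
            sum += 1;
        }
        /* Proving facts about sum here requires either a loop invariant or an
           explicit bound on the number of unrolled iterations. */
        return 0;
    }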

In the soundness direction, however, Crux’s results are very good: it reported only a single false positive, which we are currently investigating. Having just one false positive across the benchmark suite increases our confidence in Crux’s overall robustness.

One final aspect of the results that surprised us was how few correct-unconfirmed true results there were: just nine altogether. This is despite the fact that the witnesses Crux generates for TRUE results encode no program invariants whatsoever. In all likelihood, the validators that check these witnesses are smart enough to infer many invariants without any hints, which bails us out in most cases. Although we could try to improve the quality of TRUE witnesses, having only nine failures here suggests that this need not be a high priority in the future.

Next steps

Entering Crux into SV-COMP was a unique learning experience and one that we hope to partake in again in the future. Now that we have one competition under our belt, we have a much better idea of what to focus on for future competitions. In no particular order, we should try some combination of the following:

  • Add overrides for library functions that Crux does not currently support.
  • Improve the quality of witness automata for FALSE verdicts. In particular, adding hints about what values a function’s arguments have would likely go a long way.
  • Enter more SV-COMP categories. We opted out of some categories, such as memory safety, due to a lack of time to prepare. With more time, we might be able to add the necessary functionality to Crux.
  • Work on fixing incorrect verdicts. While we made some progress in this direction with the lax-loads-and-stores option, there are many more sources of incorrectness that warrant further investigation.
  • Try to eliminate some undefined behavior from SV-COMP benchmarks. While we made an effort to accommodate undefined behavior found in the benchmark suite, we feel that this approach will only get us so far. Fortunately, the competition organizers have been receptive to patches that avoid undefined behavior.

We’d also love it if you tried Crux on your own programs and opened issues if anything goes wrong. Let us know what you think, be it as a competitor or a spectator.