We think that Rust is a great language, and maybe you agree! Unfortunately, even if you do, there’s a good chance whatever application you’re working on is written in some older language such as C. To help with this, Galois has been developing c2rust, an automated transpiler (source-to-source translator) from C code into Rust code. c2rust can take almost any C and turn it into C-like Rust code, the first step in creating a new Rust application. And we’re building more features to turn C into safe, idiomatic Rust code.
Recently, we have been experimenting with LLMs to help with transpilation from C to Rust. This blog post describes one such experiment, in which we built an analysis for determining the nullability of function arguments in C. This is a necessary stage in the c2rust translation pipeline, and we already have an existing interprocedural static analysis tool that performs this task. We built a companion LLM-based tool using GPT-4o and compared the performance of our static and LLM-based analyses.
Our results are promising. With a modest amount of effort, we were able to build an analysis that generated useful nullability results. We examined by hand a few results where the static and LLM-based analyses disagreed. In these cases, the LLM analysis was more often correct in restricting the result, rather than making it more permissive.
In the end, our experiment reveals clear pros and cons for both our static and LLM-based tools. The static analysis tool generates high-confidence results at near-zero cost to run, but is slower and more expensive to develop and modify, and is limited to a restricted set of well-behaved programs. By contrast, the LLM-based tool is cheaper to develop and modify and can be applied to almost any program, even one that does not compile. However, it is also more expensive to run and has a high probability of error. Because the LLM-based tool may make mistakes, it can only be used in cases where the results can be checked by some other mechanism.
We plan to continue experimenting with LLM-based solutions as part of our ongoing work to increase the accessibility and utility of our formal verification and software assurance tools.
The problem we want to solve is the following:
> Given some C function `f()` that takes pointer argument `x`, is it guaranteed to be safe to pass `x == NULL`?
There are two possibilities. If every dereference of `x` is guarded by a check that it is not `NULL`, then `NULL` can safely be passed for `x` (and so we say `x` is *nullable*). If, instead, there is at least one access to `x` that is not guarded by such a check, then the caller must not pass `NULL` for that argument, as doing so would cause undefined behavior.
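For instance, in the following sketch (a hypothetical example, not code from any real project), the first function's argument is nullable while the second's is not:

```c
#include <stddef.h>
#include <string.h>

/* 's' is nullable: the only dereference (inside strlen) happens after an
 * explicit NULL check, so callers may safely pass NULL. */
size_t safe_strlen(const char *s)
{
    return (s != NULL) ? strlen(s) : 0;
}

/* 'dst' is NOT nullable: it is dereferenced unconditionally, so passing
 * NULL here would be undefined behavior. */
void store(int *dst, int value)
{
    *dst = value;
}
```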
This is a tricky static analysis problem because it requires figuring out (1) where the value `x` flows in the program, and (2) what accesses to `x` are downstream of null-checks (if there are any). Many types of analysis are possible, and as usual in formal methods the problem is in general undecidable.
For `c2rust` we previously built a simple interprocedural static analysis to solve this problem. Our analysis is flow-insensitive, so it does not consider null checks, which makes it easier to implement but limits its precision. This analysis performs acceptably well by itself, but there are many cases where it is overly conservative.
Of course, we could make our analysis more precise without using an LLM. For example, we are currently building a runtime / dynamic analysis to augment the static analysis results. We could also replace our simple static analysis with one that is more precise and takes advantage of the information provided by null checks. However, each of these approaches has costs, both in terms of the engineering effort to develop the analysis and in trade-offs between precision, speed, predictability, and so on.
`c2rust` uses the result of our existing static nullability analysis to determine how to translate C pointers. The starting point for translation is that a C pointer is converted into a raw Rust pointer (`*const T` or `*mut T`). The analysis then attempts to determine whether the C pointer was nullable. Where a pointer is `NON_NULL`, it can be translated to a reference, `&T`. If it is nullable, we must use a Rust type that allows such values, `Option<&T>`, which is more verbose and less idiomatic.
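As a rough sketch of this mapping (the Rust types in the comments simply follow the rule just described; they are not actual c2rust output), consider a hypothetical function with one checked and one unchecked pointer parameter:

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* 'head' is dereferenced unconditionally, so it would be inferred NON_NULL
 * and could be translated as a plain reference (roughly `head: &Node`).
 * 'fallback' is only dereferenced behind a NULL check, so it stays nullable
 * and needs an optional reference (roughly `fallback: Option<&Node>`). */
int first_value(const struct node *head, const struct node *fallback)
{
    if (head->next == NULL && fallback != NULL)
        return fallback->value;
    return head->value;
}
```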
In the worst case, we can conservatively assume all pointers are nullable, so this is a safe failure mode for the analysis. The consequences are more significant if we wrongly infer that a function parameter is `NON_NULL`: the parameter is then translated to `&T`, and any caller that passes `NULL` creates an invalid reference, which is undefined behavior in Rust.
As an alternative to our static analysis, we built an analysis using a state-of-the-art LLM, GPT-4o. We used a standard prompt-construction approach, with a Python driver script and JSON-formatted results. The analysis pipeline works as follows:
1. Search the target C file for function names
2. For each function name `f()`, construct a prompt consisting of:
    1. The whole C file
    2. A query asking the LLM to determine nullability for each argument of `f()`
3. Query GPT-4o via the OpenAI API
4. Parse the resulting JSON output
We also instructed the LLM to produce a natural-language analysis which justified the claimed result. This served as a lightweight ‘chain-of-thought’ and allowed us to examine the LLM’s reasoning in cases where it made mistakes.
We experimented with several different prompt structures and designs. Our final prompt is as follows (here `FUNCNAME` was replaced by the target function):
```
Your task is to analyze the usage of each pointer argument of the
function 'FUNCNAME' as defined in this C code.
For each argument of 'FUNCNAME':
(1) Determine the type of the argument.
(2) If the argument has a pointer type, inspect the uses of the
argument in the code of 'FUNCNAME', to determine if it is ever dereferenced
without a null pointer check, that is, any violations of CERT Rule
EXP34-C "Do not dereference null pointers".
(3) if the argument is ever dereferenced without a null pointer check,
indicate "unchecked dereference" and list the places in the code where
the dereferences occur.
Analyse each argument of function 'FUNCNAME'.
For each argument of 'FUNCNAME', compute these values:
'argument_name', containing the name of the argument,
'argument_type', containing the argument type,
'is_pointer_type' containing a boolean indicating
if the parameter is a pointer type or not,
'unchecked_dereference_locations' should contain a list of places
anywhere in the code that might dereference the argument without checking
for a null pointer.
These values for each pointer should be returned to the user
in the 'response' function.
In the "analysis" field of the 'response' function provide details as to
the reasoning behind your answers.
Go slow and try hard to analyze the program correctly.
```
Here are some results we generated from analyzing our standard benchmark application, lighttpd, comparing the static and LLM-based analyses. We ran the tool on the `lighttpd-rust/src/*.c` set of modules (https://github.com/immunant/lighttpd-rust.git, commit 385ca548).
Here, the ‘Manually inspected’ column was generated by looking at a few cases where the static and AI results disagreed on the nullability of a function argument. Intuitively, our manual inspection is looking for the same thing as the LLM-based analysis, namely the presence/absence of null-checks guarding dereferences.
Note that the manual inspection process (and the LLM) might disagree with the static analysis: unlike the static analysis, both take into account the null checks that guard dereferences.
We’ve included a specific case analysis of some instances where the LLM and static analysis disagree. See [Interesting Cases](#interesting-cases) at the end of this document.
Our LLM-based analysis was remarkably successful, given that it required no work on building domain knowledge or infrastructure. The key limitation is that the LLM is not always right, and when it is wrong, the errors are not predictable. This means that results from the LLM cannot be used unless errors can be tolerated – for example, if they do not lead to unsafe code.
The LLM-based analysis also has a real monetary cost: analyzing `lighttpd` cost us over $100. By optimizing our scripts to use the OpenAI API more efficiently, we expect we could bring this down to $10 or much less.
Intuitively, the static and AI-based tools are quite different in how they ‘see’ the code. The LLM is able to use information such as variable names and idiomatic code patterns that the static tool cannot take advantage of. On the other hand, the LLM may be fooled by (for example) mis-named arguments or even comments, which the static tool does not even take into account. In some cases, the LLM-based analysis is fooled in quite simple ways that could be detected by a local static analysis.
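For example (again a hypothetical illustration rather than one of the actual lighttpd cases), a misleading parameter name or comment can pull a text-based analysis toward the wrong answer, even though the guard is right there in the code and trivially visible to a syntactic check:

```c
#include <stddef.h>

struct config { int port; };

/* The name 'required_cfg' and the comment below both suggest the pointer is
 * always provided, but the code checks for NULL before dereferencing it, so
 * the argument is in fact nullable. */

/* The server configuration; callers must always supply it. */
int get_port(const struct config *required_cfg)
{
    if (required_cfg == NULL)
        return -1;
    return required_cfg->port;
}
```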
We expect that an LLM analysis will work well for C code, but the same analysis would work much less well for a more obscure imperative language such as Ada. This is simply because the datasets used to train these models include a large amount of C code. We would also expect the analysis to be less successful on Rust code for the same reason.
The trade-off space between the static and LLM-based tools is interesting: the static tool is predictable and essentially free to run, but expensive to extend and restricted to well-behaved programs, while the LLM-based tool is cheap to build and applicable to almost any code, but costly to run and fallible.
Here are some interesting cases where the LLM wrongly reports an argument as nullable. In each case, the quoted text comes from the LLM's natural-language explanation of the code.
We conclude from this experiment that GPT-4o is viable as a first approximation to a static analysis for this kind of task. However, the LLM does not produce perfect results. If we wanted to use the LLM-based analysis in practice, we would need to consider carefully where mistakes in the analysis might result in unsoundness in the translation. Despite this, we are optimistic about the utility of LLMs as an alternative to static analysis. It is remarkably easy to build an LLM-based analysis compared to the (often arduous) process of building a static analysis tool, and many of the LLM failures we have encountered could be mitigated by quite simple static tools or symbolic/LLM refinement loops.
We could improve our prototype in several obvious ways. As discussed above, we are currently completing work on a runtime analysis for nullability; once that is done, we can compare it with the LLM-based analysis as well. We have not experimented much with the (huge) space of possible prompts, nor with other LLM tricks such as multi-shot prompting or including few-shot examples in the prompt. Additionally, nullability is only one property of interest as we develop our transpilation tool.