by Mark Tullsen, Stuart Pernsteiner, and Mike Dodds
Overview
We think that Rust is a great language, and maybe you agree! Unfortunately, even if you do, there’s a good chance whatever application you’re working on is written in some older language such as C. To help with this, Galois has been developing c2rust, an automated transpiler (source-to-source translator) from C code into Rust code. c2rust can take almost any C and turn it into C-like Rust code, the first step in creating a new Rust application. And we’re building more features to turn C into safe, idiomatic Rust code.
Recently, we have been experimenting with LLMs to help with transpilation from C to Rust. This blog post describes one such experiment, in which we built an analysis for determining the nullability of function arguments in C. This is a necessary stage in the c2rust translation pipeline, and we already have an interprocedural static analysis tool that performs this task. We built a companion LLM-based tool using GPT-4o and compared the performance of the static and LLM-based analyses.
Summary of results
Our results are promising. With a modest amount of effort, we were able to build an analysis that generated useful nullability results. We examined by hand a selection of cases where the static and LLM analyses disagreed. In these cases, the LLM analysis was more often correct when it tightened the result (treating an argument as non-null) than when it loosened it (treating an argument as nullable).
In the end, our experiment reveals clear pros and cons for both our static and LLM-based tools. The static analysis tool generates high-confidence results at near-zero cost to run, but it is slower and more expensive to develop and modify, and it is limited to a restricted set of well-behaved programs. By contrast, the LLM-based tool is cheaper to develop and modify and can be applied to almost any program, even one that does not compile. However, it is also more expensive to run and has a high probability of error. Because the LLM-based tool may make mistakes, it can only be used in cases where its results can be checked by some other mechanism.
We plan to continue experimenting with LLM-based solutions as part of our ongoing work to increase the accessibility and utility of our formal verification and software assurance tools.
Problem description: what is nullability?
The problem we want to solve is the following:
> Given some C function `f()` that takes pointer argument `x`, is it guaranteed to be safe to pass `x == NULL`?
There are two possibilities. If all dereferences of `x` are protected by checks that `x` is not `NULL`, then `x` can safely be passed as `NULL` (and so we say `x` is *nullable*). If, instead, there is at least one dereference of `x` that is not guarded by such a check, then the caller must not pass `NULL` for that argument, as doing so would cause undefined behavior.
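To make the two cases concrete, here is a minimal hypothetical sketch (not taken from any real code base) with one parameter of each kind:

```c
#include <stdio.h>
#include <string.h>

/* 'name' is nullable: every dereference is guarded by a NULL check. */
size_t name_length(const char *name) {
    if (name == NULL)
        return 0;
    return strlen(name);
}

/* 's' is NOT nullable: it is dereferenced (by strlen and by printf's "%s")
   with no preceding check, so passing NULL here is undefined behavior. */
void print_label(const char *s) {
    printf("label (%zu chars): %s\n", strlen(s), s);
}
```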
Determining nullability is a tricky static analysis problem because it requires figuring out (1) where the value `x` flows in the program, and (2) which dereferences of `x` are downstream of null checks (if there are any). Many kinds of analysis are possible, and, as usual in formal methods, the problem is undecidable in general.
For `c2rust` we previously built a simple interprocedural static analysis to solve this problem. Our analysis is flow-insensitive, so it does not consider null checks, which makes it easier to implement but limits its precision. This analysis performs acceptably well by itself, but there are many cases where it is overly conservative.
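For example, consider a caller like the following hypothetical sketch. The null check means `NULL` can never actually reach `use_config()`, but because our analysis ignores the check, it still reports the callee's parameter as nullable:

```c
#include <stdlib.h>

void use_config(const char *path);   /* defined elsewhere; dereferences 'path' unconditionally */

void init(void) {
    const char *p = getenv("APP_CONFIG");   /* may return NULL */
    if (p != NULL)
        use_config(p);   /* NULL cannot flow into this call, but a
                            flow-insensitive analysis still sees a
                            possibly-NULL value being passed and
                            conservatively marks 'path' as nullable */
}
```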
Of course, we could make our analysis more precise without using an LLM. For example, we are currently building a runtime / dynamic analysis to augment the static analysis results. We could also replace our simple static analysis with one that is more precise and takes advantage of the information provided by null checks. However, each of these approaches has costs, both in the engineering effort needed to develop the analysis and in trade-offs between precision, speed, predictability, and so on.
Use of nullability in c2rust
`c2rust` uses the result of our existing static nullability analysis to determine how to translate C pointers. The starting point for translation is that every C pointer is converted into a raw Rust pointer (`*const T` or `*mut T`). The analysis then attempts to determine whether the C pointer was nullable. Where a pointer is `NON_NULL`, it can be translated to `&T`. If it is nullable, we must use a Rust type that allows such values, `Option<&T>`, which is more verbose and less idiomatic.
In the worst case, we can conservatively assume all pointers are nullable, so this is a safe failure mode for the analysis. The consequences are more significant if we wrongly infer that a function parameter is `NON_NULL`:
- When the function is called in internal code that is visible to c2rust, we will insert an `unwrap()` in the caller, which might panic at run time. However, this mostly shouldn’t happen (except when overriding the static analysis with results from the dynamic analysis or LLM analysis), because we can see all callers and can determine whether they potentially pass null.
- When the function is externally visible and may be called by code that is not known to c2rust, such as through an FFI call, we might change the type to `&T` while the calling code still passes null for that argument (see the sketch below). This is undefined behavior, because null is not a legal value of type `&T`. As a consequence, accuracy of the analysis is much more important for exported functions.
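Here is a hypothetical sketch of that second case (`buffer_append` and `plugin_reset` are invented names, not lighttpd code):

```c
#include <stddef.h>

struct buffer;   /* opaque type owned by the translated library */

/* Hypothetical function exported by the c2rust-translated crate; the C
   header seen by external code still advertises a plain pointer for 's'. */
extern int buffer_append(struct buffer *b, const char *s);

/* External caller, compiled separately and invisible to c2rust. */
void plugin_reset(struct buffer *b) {
    /* Legal against the original C API if 's' was nullable, but undefined
       behavior if the Rust signature has been narrowed so that 's' is a
       reference (&T): NULL is not a valid value for a Rust reference. */
    buffer_append(b, NULL);
}
```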
Design of LLM-based Analysis
As an alternative to our static analysis, we built an analysis using a state-of-the-art LLM, GPT-4o. We used a standard prompt-construction style, with a Python driver script, and JSON results. The analysis pipeline works as follows:
1. Search the target C file for function names.
2. For each function name `f()`, construct a prompt consisting of:
    1. the whole C file, and
    2. a query asking the LLM to determine nullability for each argument of `f()`.
3. Query GPT-4o via the OpenAI API.
4. Parse the resulting JSON output.
We also instructed the LLM to produce a natural-language analysis which justified the claimed result. This served as a lightweight ‘chain-of-thought’ and allowed us to examine the LLM’s reasoning in cases where it made mistakes.
We experimented with several different prompt structures and designs. Our final prompt is as follows (here `FUNCNAME` was replaced by the target function):
```
Your task is to analyze the usage of each pointer argument of the
function 'FUNCNAME' as defined in this C code.
For each argument of 'FUNCNAME':
(1) Determine the type of the argument.
(2) If the argument has a pointer type, inspect the uses of the
argument in the code of 'FUNCNAME', to determine if it is ever dereferenced
without a null pointer check, that is, any violations of CERT Rule
EXP34-C "Do not dereference null pointers".
(3) if the argument is ever dereferenced without a null pointer check,
indicate "unchecked dereference" and list the places in the code where
the dereferences occur.
Analyse each argument of function 'FUNCNAME'.
For each argument of 'FUNCNAME', compute these values:
'argument_name', containing the name of the argument,
'argument_type', containing the argument type,
'is_pointer_type' containing a boolean indicating
if the parameter is a pointer type or not,
'unchecked_dereference_locations' should contain a list of places
anywhere in the code that might dereference the argument without checking
for a null pointer.
These values for each pointer should be returned to the user
in the 'response' function.
In the "analysis" field of the 'response' function provide details as to
the reasoning behind your answers.
Go slow and try hard to analyze the program correctly.
```
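For illustration only, a response for a hypothetical two-argument function might be shaped roughly like this. The field names are the ones requested in the prompt; the exact schema is fixed by our function-calling declaration (not shown), and the function and values below are invented:

```json
{
  "analysis": "'item' is dereferenced as item->len with no preceding NULL check; 'n' is not a pointer type.",
  "arguments": [
    {
      "argument_name": "item",
      "argument_type": "struct item *",
      "is_pointer_type": true,
      "unchecked_dereference_locations": ["item->len in the first statement of the function body"]
    },
    {
      "argument_name": "n",
      "argument_type": "size_t",
      "is_pointer_type": false,
      "unchecked_dereference_locations": []
    }
  ]
}
```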
Results
Here are some results we generated by analyzing our standard benchmark application, lighttpd, comparing the static and LLM-based analyses. We ran the tool on the `lighttpd-rust/src/*.c` set of modules (https://github.com/immunant/lighttpd-rust.git, commit 385ca548).
Here, the ‘Manually inspected’ column was generated by looking at a few cases where the static and AI results disagreed on the nullability of a function argument. Intuitively, our manual inspection is looking for the same thing as the LLM-based analysis, namely the presence/absence of null-checks guarding dereferences.
Note that the manual inspection process (and the LLM) might disagree with the static analysis:
- The static analysis is performing a forward dataflow analysis: “do any callers pass NULL here?”
- The LLM and manual analyses are both kinds of backward analysis: “will this function generate undefined behavior if the argument is NULL?” That is, the manual and LLM analyses answer the same question as each other, but a subtly different question from the static analysis, as the sketch below illustrates.
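As a hypothetical sketch (again, not from lighttpd) of how the two views can legitimately diverge:

```c
#include <stdio.h>
#include <stddef.h>

/* Every dereference of 'label' is guarded, so a backward analysis
   (manual inspection or the LLM) reports it as nullable... */
static void log_event(int code, const char *label) {
    if (label != NULL)
        printf("event %d: %s\n", code, label);
    else
        printf("event %d\n", code);
}

/* ...but no caller ever passes NULL, so a forward dataflow analysis
   reports the parameter as NON_NULL.  Both answers are defensible;
   they simply answer different questions. */
void run_events(void) {
    log_event(1, "start");
    log_event(2, "stop");
}
```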
A Few More Observations
- When the AI disagrees with the static analysis by saying that a pointer must be non-null, it is right in every case we checked (correct 6/6 times), but when it disagrees by saying a pointer may be nullable, it is mostly wrong (correct in only 2/8 cases).
- In cases where the AI wrongly claims a pointer is nullable, it often reports in its natural-language analysis that there are no visible dereferences, even though there is a clearly visible `p->field` operation near the start of the function with no null check. That is, the analysis is misunderstanding the code structure, often in a simple way.
- In all the cases we examined where the AI produces a correct answer (whether nullable or NON_NULL), the accompanying natural-language analysis is also correct.
We’ve included a specific case analysis of some instances where the LLM and static analysis disagree. See [Interesting Cases](#interesting-cases) at the end of this document.
Reflections
Our LLM-based analysis was remarkably successful, given that it required no work on building domain knowledge or infrastructure. The key limitation is that the LLM is not always right, and when it is wrong, the errors are not predictable. This means that results from the LLM cannot be used unless errors can be tolerated – for example, if they do not lead to unsafe code.
The LLM-based analysis is also costly in money terms. Analyzing `lighttpd` cost us over $100. With optimizations of our scripts to use the OpenAI API more efficiently, we expect we could bring this down to $10 or much less.
Intuitively, the static and AI-based tools are quite different in how they ‘see’ the code. The LLM is able to use information such as variable names and idiomatic code patterns that the static tool cannot take advantage of. On the other hand, the LLM may be fooled by (for example) mis-named arguments or even comments, which the static tool does not even take into account. In some cases, the LLM-based analysis is fooled in quite simple ways that could be detected by a local static analysis.
We expect that an LLM analysis will work well for C code, but that the same analysis would work much less well for a more obscure imperative language such as Ada. This is simply because LLM training corpora include a large amount of C code and comparatively little Ada. We would also expect the analysis to be less successful on Rust code, for the same reason.
The trade-off space between the static and LLM-based tools is interesting:

- Confidence in results
  - Static tool: high confidence in results, low probability of error
  - LLM-based tool: low confidence in results, high probability of error
- Cost of developing the analysis
  - Static tool: expensive to develop, deep expertise needed
  - LLM-based tool: very cheap to develop, little expertise needed
- Cost to modify the analysis
  - Static tool: high cost to modify, small space of designs
  - LLM-based tool: very easy to modify, large space of designs
- Cost to run the analysis
  - Static tool: nearly zero, though more sophisticated analyses may take time
  - LLM-based tool: significant monetary cost per run
- Roadmap to increased capability
  - Static tool: stable research topic, slow progress year-on-year
  - LLM-based tool: rapid progress; the next generation of GPT is likely to make the analysis better
Interesting Cases
Here are some interesting cases where the AI wrongly reports nullable. In each case, the quoted text comes from the LLM’s natural-language explanation of the code.
- [`h2_send_settings_ack`](https://github.com/immunant/lighttpd-rust/blob/385ca548101242cfb8f8bc61533f1073042b5036/src/h2.c#L280) `con`: “Upon analyzing the function body, no dereferences of the pointer ‘con’ were found.” However, there is an unchecked `con->write_queue`.
- [`h2_recv_headers`](https://github.com/immunant/lighttpd-rust/blob/385ca548101242cfb8f8bc61533f1073042b5036/src/h2.c#L1337) `con`: “The code carefully uses the pointers without any null pointer dereference issues.” However, there is an unchecked `con->h2`.
- [`gw_handle_trigger`](https://github.com/immunant/lighttpd-rust/blob/385ca548101242cfb8f8bc61533f1073042b5036/src/gw_backend.c#L2855) `srv`, `p_d`: “Both parameters srv and p_d are pointer types. Further inspection is required to check for dereferences without null checks. We collected broader relevant contexts for additional automated pinpointing.” In this case the AI seems to have failed to do any analysis, as it doesn’t even claim that `srv` and `p_d` are accessed safely. `srv` is accessed via an unchecked `srv->srvconf`. `p_d` is copied into a local `p = p_d`, and then `p` is accessed via an unchecked `p->cvlist`.
- [`SHA1_Transform`](https://github.com/immunant/lighttpd-rust/blob/main/src/algo_sha1.c#L70) `buffer`: The AI correctly recognizes that `state` is dereferenced through `state[0]` etc., but fails to recognize `memcpy(…, buffer, …)` as performing a dereference of `buffer`.
- [`cgi_trigger_cb`](https://github.com/immunant/lighttpd-rust/blob/385ca548101242cfb8f8bc61533f1073042b5036/src/mod_cgi.c#L1131) `p_d`: “The ‘p_d’ parameter is assigned to a local variable ‘p’ of type ‘plugin_data *’. The ‘p’ variable is dereferenced (e.g., ‘p->cgi_pid’). However, before any dereferencing, it is implicitly checked if it is NULL (no code paths indicate unguarded dereferencing).” The AI correctly notices `p->cgi_pid`, but there are no “implicit checks” before the dereference.
- [`deflate_compress_cleanup`](https://github.com/immunant/lighttpd-rust/blob/385ca548101242cfb8f8bc61533f1073042b5036/src/mod_deflate.c#L1421) `r`: “The ‘r’ argument is not dereferenced within the function.” It’s dereferenced via `r->conf` if the call to `mod_deflate_stream_end` returns an error.
Next steps
We conclude from this experiment that GPT-4o is viable as a first-approximation static analysis for this kind of task. However, the LLM does not produce perfect results. If we wanted to use the LLM-based analysis in practice, we would need to carefully consider where mistakes in the analysis might result in unsoundness in the translation. Despite this, we are optimistic about the utility of LLMs as an alternative to static analysis. It is remarkably easy to build an LLM-based analysis compared to the (often arduous) process of building a static analysis tool. And many of the LLM failures we encountered could be mitigated by quite simple static tools or symbolic/LLM refinement loops.
We could improve our prototype in several obvious ways. As discussed above, we are currently completing work on a runtime analysis for nullability; once it is complete, we can compare its results with the LLM's as well. We have not experimented much with the (huge) space of possible prompts, nor with other LLM techniques such as multi-shot prompting or including few-shot examples in the prompt. Additionally, nullability is only one property of interest as we develop our transpilation tool.