Thus far in our ongoing series on artificial intelligence, we've spoken in depth about questions of trust, human perception, and the limitations of generative models. We have focused specifically on large language models (LLMs), due in part to their recent successes and media attention. We've explored questions of data, testing, and broad model implications. However, LLMs represent only a single subset of the much broader category of machine learning (ML) models, many types of which have remarkable, immediate, and wide-ranging applications, from personalized healthcare to autonomous vehicles.
For this post, we're going to zoom out to machine learning models more broadly: building from prior blog threads and diving deeper into what it means to develop verifiable machine learning models for use in safety critical systems. The meteoric growth of machine learning, in particular its use in safety critical systems such as medical devices, smart grids, and autonomous vehicles, has resulted in significant interest in methods to formally evaluate models and systems with AI components. Galois has an interdisciplinary team of formal methods, machine learning, and human factors researchers working on this problem right now, and our team continues to grow. It's a critical turning point for ML, and an exciting time to be working at the intersection of these fields.
In his 1949 paper Checking a Large Routine, Alan Turing asked the seminal question of formal methods: "How can one check a routine in the sense of making sure that it is right?" Since then, the field of formal methods — the set of techniques that utilize mathematics to rigorously and efficiently analyze system behavior and prove code correctness — has grown to become an integral component of the development of safety critical systems, as well as a methodology for software assurance more broadly. Techniques such as static analysis are being used to ensure the absence of run-time errors in everything from your smartphone to the International Space Station.
Complexity brings risk. Software and machine learning tools have never been more complex, and there has never been a greater need for the assurance and risk mitigation provided by formal methods. And yet, the vast majority of machine learning models — the very models that are revolutionizing how we interact with technology — simply do not lend themselves to formal verification. Current methods are insufficient.
In short: cutting-edge, novel research is desperately needed to crack the code of how to apply formal methods to the machine learning puzzle.
Here at Galois, we are particularly interested in developing novel methods and integrating tools to ensure that systems developed with machine learning components are trustworthy, safe, and secure: in other words, that the model's underlying behavior aligns with expectations, without any side effects. This is an entirely non-trivial task when it comes to ML.
But first things first: Before we dig into the challenges, it’s important to understand the context. What does it actually mean to verify machine learning models for use in safety critical systems? And what’s the difference between “testing” and “verifying” an ML model anyway? Why does it even matter?
Testing vs Verifying
The most common method used to validate ML models is testing: checking how well the model works on a set of previously unseen data. If you are familiar with software testing, you can think of the two (testing ML models and testing software) as roughly analogous. In both cases, the idea is that you gain an understanding of how well a system will perform (or fail to perform) when it's deployed in the real world. While standard testing suffices for many applications, the ever-growing complexity of software in general and machine learning models in particular means that testing is always incomplete. There are always more scenarios to test, more paths to validate, and thus always room for error, especially in ML models.
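For readers who want something concrete, here is a minimal sketch of what test-based validation typically looks like in practice. It uses scikit-learn and synthetic data purely for illustration (none of the names below refer to any real system of ours); the point is simply that the model is scored on data it never saw during training.

```python
# A minimal sketch of test-based validation: hold out data the model
# never saw during training and measure how well it performs on it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for whatever the real application provides.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

However high that held-out accuracy is, it only speaks to the finite set of scenarios the test data happens to contain.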
By contrast, formal methods provide a more thorough analysis than empirical testing by efficiently reasoning over all potential behaviors rather than a limited set of tests. Formal verification (a subset of formal methods approaches) enables reasoning over infinitely many scenarios at once, allowing entire categories of flaws to be ruled out. However, though fun and useful, formal analysis is much more difficult to perform, and not every model or task requires it.
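To make the contrast concrete, here is a toy sketch (using the z3-solver package and a made-up "clamping" function, not any real system) of how a solver can check a property over every real-valued input at once, rather than over a finite list of test cases.

```python
# Instead of testing a handful of inputs, ask an SMT solver whether ANY
# real-valued input can violate the property -- covering infinitely many
# cases in one query.
from z3 import Real, Solver, If, Or, sat

x = Real('x')
# Toy "controller": clamp a command to the range [-1, 1].
clamped = If(x > 1, 1, If(x < -1, -1, x))

s = Solver()
# Search for a counterexample to the property |clamped| <= 1.
s.add(Or(clamped > 1, clamped < -1))
print("counterexample found" if s.check() == sat else
      "property holds for every real-valued input")
```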
Now, if your immediate thought is “Wow! How exactly do you perform infinite tests in finite time?”, please consider joining us for a career in formal methods land, dive into our other blog posts, listen to public tech talks we’ve hosted, check out our Building Better Systems podcast, or keep reading to see what it means to intersect formal methods with machine learning models.
The Case for Formal Methods
Historically, we have seen machine learning applied to fairly low-risk tasks, such as playing games or serving personalized (i.e., targeted) advertisements. In these cases, testing sufficed. However, in recent years we have seen growing interest in the use of machine learning to support high-stakes decision making in systems with serious consequences: everything from medical devices and bail decisions to defense platforms and large language models. In these cases, testing has been shown time and again to be insufficient.
In sectors where safety and security are critical—such as healthcare, energy, aeronautics, defense, and finance—the repercussions of hidden defects and security gaps can be catastrophic. And even with rigorous testing, there are countless ways for things to go wrong. Traditional test-based validation techniques simply do not provide the requisite high-confidence assurance guarantees.
When Testing Isn’t Enough
Below are three key questions to consider that can help determine when to test a system versus verify a system, and inform test design. We’ll highlight challenges inherent to the most prevalent class of ML models, neural networks, though the considerations hold more broadly.
1) What are the consequences of an incorrect model?
Is the model making recommendations for movies you should watch next from a set of curated films, or is it making decisions about how to drive a car? In the former case, if the model has a flaw or makes a mistake, a user may get a movie recommendation that isn't quite what they wanted to see; but we know it will be a film from our curated list, so the consequences are low and testing is sufficient. In the latter scenario, however, the model is being used to make decisions which can affect human lives. Here, a flaw or mistake has such high potential consequences that the system is considered safety-critical. It is these systems for which we believe verification is vital.
I should note that we view "affecting human lives" very broadly: it is not just autonomous vehicles and medical devices that affect human lives, but also more mundane-seeming algorithms such as chatbots and video recommendations drawn from non-curated, open-internet content.
2) How completely do the test cases represent the scenarios encountered in the real world?
Scope: Tests must be designed such that they represent the full scope of scenarios a model will encounter in the context in which it is implemented.
Complexity: Designing these test cases is dependent on the complexity of the model input space. For example, you can imagine that testing a model where the input space consists of 10 discrete inputs is a very different task than testing one where the input space consists of a potentially infinite number of real-world scenarios.
Machine learning models are incredibly effective at learning complex system dynamics from a relatively limited input space: for example, modeling the human glucose-insulin regulatory system using only insulin pump and continuous glucose monitor data. When designing test cases, it is natural to assume that if you cover the entire sensor range of glucose and insulin data, you have adequate coverage of real-world scenarios. However, both of these inputs (glucose and insulin) have hidden within them signatures of a variety of other metabolic functions, such as the impact of meals, exercise, and hormonal fluctuations. Determining how to quantify and represent these hidden states is vital for properly understanding model dynamics through testing and verification.
The Unknown: Test design is further complicated by the difficulty of anticipating use cases that have never been seen before. Humans and the world at large are inherently complex and creative. It is almost impossible for people designing model tests to anticipate exactly how a novel system will be used. This is exactly why you see recent generative model developers following the "throw it in the wild, and then adjust" mindset, and also why we have phases of real-world clinical trials for pharmaceuticals and medical devices. However, any trial period will be limited, subject to selection bias, and may not hold for long-term use. What if instead we had a better solution, one that intersected engineering, formal methods, psychology, and machine learning?
Inputs and Data: When considering machine learning models, the input space (the data) is typically not only highly complex, but also difficult to reason over. As an example, consider large language models: what does it mean to have a representative set of possible inputs when the potential inputs can be any question or statement a user might think of? We can say we pull a large enough example set from all of the text online, but because these models condition on everything that came earlier in their context, ordering and sequence matter. This is exactly why, despite exhibiting high evaluation accuracy, these models are prone to "weird" and "unexpected" behaviors that have to be "fixed" by adding safety layers and boundaries on top of the model after the fact: testing did not sufficiently cover the input space.
3) Is the model smooth?
The consideration of model "smoothness" is inherently tied to the prior consideration of scenario representation. Researchers think of smoothness for ML models in a sense similar to mathematical sensitivity analysis: in short, "smoothness" indicates the degree to which "similar" inputs yield similar outputs.
If a model is smooth, such as the basic function y = 2x, then inputs which are "close," such as x = 2 and x = 2.1, give outputs which are also "close" (4 and 4.2). Furthermore, we can say that any input between the two we have tested will give us a result that is also between 4 and 4.2.
The problem with testing neural network models is that they frequently exhibit almost no smoothness, to the point where an imperceptible level of noise in an image can lead to a picture of a dog being classified as an ostrich. This inherent property means that we can never test enough to know that a classic neural-network-based image classifier used in an automotive vision system will not mistake a stop sign for a tree and fail to brake. While various mitigation approaches exist, none fully ensure the problem is overcome.
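To build intuition for why this happens, here is a toy numpy sketch (no real image model involved, all values made up): in a high-dimensional input space, a perturbation that changes each individual feature by an imperceptible amount can still swing a linear score dramatically, because the tiny changes all push in the same direction.

```python
# Toy illustration of non-smoothness in high dimensions: a per-"pixel"
# change of 0.01 shifts the score of a linear classifier by 100.
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                      # number of input features ("pixels")
w = rng.choice([-1.0, 1.0], d)  # weights of a toy linear classifier
x = rng.normal(0, 1, d)         # an input the classifier sees

eps = 0.01                      # imperceptible per-pixel change
x_adv = x + eps * np.sign(w)    # nudge every pixel slightly "with" w

print("original score:  ", w @ x)
print("perturbed score: ", w @ x_adv)   # shifted by eps * d = 100
print("max pixel change:", np.max(np.abs(x_adv - x)))
```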
Bringing Formal Verification to Machine Learning
Suppose we have convinced you that for high-risk scenarios, simply testing ML models is not enough, and we instead want more formal guarantees. Wonderful! But here's the catch: most traditional formal methods techniques simply don't work when applied to machine learning, and in particular to machine learning models used as components of safety critical systems. They fail due to difficult-to-define specifications and unique challenges in scalability, soundness, and precision.
Recent years have seen the advancement of novel techniques for applying formal methods to machine learning; however, most techniques proposed so far can be applied only to relatively small, already-trained models.
Thus, we are faced with a dilemma: ML tools are being widely adopted for use in critical systems, and applied formal methods are needed to provide assurance guarantees, but current approaches fall short.
So what needs to happen in the formal methods field to meet this challenge? What does it take to get there? And why is it so difficult?
These are exactly the questions and challenges that we think make research fun! Here at Galois, we're all about harnessing the power of formal verification to secure real-world critical systems. The intersection of formal methods and ML presents an exciting branch of research where we're making big strides in assessing and securing ML models for high-risk scenarios.
In this final section, we focus on key challenges in applying formal methods to machine learning. For those interested in a detailed overview of current technical approaches, we recommend this survey paper on neural network verification approaches, and this tech talk on verification of data-driven models for healthcare applications. In the tech talk in particular, you can see how we design clever optimization schemes for performing "infinite patient trials" in order to reason about the efficacy and safety of an automated insulin pump, as well as how we verify neural network models for conformance to known physiological requirements. In this work, we use techniques such as reachability analysis, counter-example search, and falsification, amongst others.
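To give a flavor of what reachability-style reasoning looks like, here is a deliberately tiny sketch of interval bound propagation through a two-layer ReLU network. The weights and input bounds below are made up for illustration; real verification tools use far tighter and more scalable abstractions than this.

```python
# Interval bound propagation: push a whole box of inputs through a tiny
# ReLU network and get guaranteed bounds on every possible output.
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return (W_pos @ lo + W_neg @ hi + b,
            W_pos @ hi + W_neg @ lo + b)

def interval_relu(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

# Hypothetical 2-2-1 network, weights chosen purely for illustration.
W1, b1 = np.array([[1.0, -2.0], [0.5, 1.5]]), np.array([0.1, -0.2])
W2, b2 = np.array([[1.0, -1.0]]), np.array([0.0])

# Consider every input in the box [0, 1] x [0, 1] at once.
lo, hi = np.zeros(2), np.ones(2)
lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
lo, hi = interval_affine(lo, hi, W2, b2)
print(f"output guaranteed to lie in [{lo[0]:.2f}, {hi[0]:.2f}]")
```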
Defining Specifications
In order to take a formal-methods-based approach, we must define key properties (or specifications) which must hold for a model to be deemed "correct." Defining these specifications, especially for complex machine learning models, is perhaps the most challenging aspect of formally verifying them. For example, consider again the example of a neural network performing image classification: we would love to ensure that images are classified correctly (i.e., a stop sign is not mistaken for a tree), but to turn this into a specification we need to define what that means algorithmically. If your network's inputs are images represented as matrices of pixel values, and its output is a label, how do we define the relationship between pixel values that represents a stop sign rather than a tree? How do we expand this more broadly to represent a notion of general correctness?
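One specification that can be written down precisely is local robustness: "no perturbation of size at most epsilon may change the predicted label." Below is a hedged sketch of that property encoded as an SMT query with z3, for a deliberately tiny linear model with made-up weights; verifying real networks against specifications like this requires specialized tools.

```python
# Encode "a small perturbation must not change the label" as a
# satisfiability query: does any input within the epsilon-ball flip the
# classification of a toy linear model?
from z3 import Reals, Solver, And, sat

x1, x2 = Reals('x1 x2')
eps = 0.1
cx1, cx2 = 0.5, 0.5          # the concrete input we want robustness at

# Toy "network": class A if score > 0, class B otherwise.
score = 2.0 * x1 - 1.0 * x2 - 0.3
# The concrete input (0.5, 0.5) has score 0.2, so it is classified A.

s = Solver()
s.add(And(x1 >= cx1 - eps, x1 <= cx1 + eps,
          x2 >= cx2 - eps, x2 <= cx2 + eps))  # perturbation ball
s.add(score <= 0)                             # can the label flip to B?
if s.check() == sat:
    print("not robust: counterexample", s.model())
else:
    print("robust: no perturbation within eps flips the label")
```

Note that this only captures one narrow property at one input; a general notion of "classifies images correctly" remains far harder to pin down.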
The reason defining specifications is such a key challenge for machine learning verification more broadly is that the places ML excels are exactly those places where we ourselves do not have a good definition of system dynamics: scene recognition, language modeling, complex biological processes, etc. At Galois we are focusing on key questions such as "How can we automatically derive specifications from physics-based models to prove over ML models?" and "What does it mean to develop a 'verifiable' ML model?"
Complex Development Pipelines
Unlike traditional software, the machine learning pipeline encompasses data, model training, and deployment. When the result of your ML pipeline is highly dependent on the data going in, techniques such as type checking are no longer sufficient, and many tools are subject to floating-point error. ML is subject not only to programming errors, but also to errors and attacks in the data, especially for models which adapt over time. In terms of model training, ML models are rarely uniquely convergent: every time a training algorithm is run, a different model results, to the point where the soundness and reliability of models have become a key challenge in ML.
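To see the non-unique-convergence point in miniature, here is a hedged sketch (scikit-learn, synthetic data, all details invented for illustration) that trains the same small neural network twice with different random seeds; the two runs typically reach similar accuracy while learning visibly different weights.

```python
# Train the same tiny neural network twice with different seeds and
# compare the learned weights: same pipeline, different resulting model.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a simple nonlinear target

models = [MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=seed).fit(X, y)
          for seed in (0, 1)]

w0, w1 = (m.coefs_[0] for m in models)
print("max weight difference:", np.max(np.abs(w0 - w1)))
print("accuracies:", [m.score(X, y) for m in models])
```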
Complex Operating Environments
Many safety critical systems need to not only operate in, but interact with, highly complex environments. Think, for example, of medical devices that control human biological processes, and autonomous aircraft and vehicles. In many cases, the sole reason for integrating ML into these systems is specifically to deal with uncertainty of the operating space. We don’t see ML being used to turn on windshield wipers in the rain, but rather to identify, model and anticipate behaviors of people, cars, and objects in the outside world. If we want to utilize formal methods to verify that these models will perform well in their complex operating spaces, we need to be able to first model these complex environments and interactions.
Here at Galois we have been working on projects in modeling complex systems, inferring human behavior and intent, and reasoning under conditions of extreme uncertainty, amongst others.
Looking Ahead
Hopefully by now we have convinced you of the need for novel research at the intersection of machine learning and formal methods. Addressing these open challenges requires interdisciplinary teams that include not only formal methods and machine learning researchers, but also human factors researchers, mathematicians, and domain experts across each application space, whether pilots or clinicians. We've been working to build our team internally as well as through collaborations, and we encourage anyone who wants to collaborate with us to reach out to any of our research leads in the field: Shauna Sweet, Katrina Schleisman, Eric Davis, David Burke, Taisa Kushner, and Walt Woods.