Digitally Engineering Infinite Patient Trials

In prior posts, we’ve discussed techniques and benefits of digital engineering (DE) applied to system design, in particular hardware and software for cyber-physical systems (computer systems that interact with the real world). In this post we’ll simultaneously go big and small to present a case study in how formal methods + models can enable us to design infinite patient trials (big!) for medical devices (small!). 

First, let’s introduce the problem space. 

We Need Better Clinical Trials for Medical Devices

In the early 20th century, Sulfanilamide was commonly used in powdered form to treat streptococcal infections. In 1937, popular demand for a liquid form of the drug prompted the S.E. Massengill Company to develop, test, and begin selling Elixir Sulfanilamide. Unfortunately, testing at that time relied only on flavor, appearance, and fragrance, and the elixir was never tested for safety or toxicity. In the months that followed, over 100 people across 15 states tragically died in what became known as the Elixir Sulfanilamide Disaster. Congress acted quickly and the following year passed the Federal Food, Drug, and Cosmetic Act, establishing a set of laws giving the US Food and Drug Administration (FDA) authority to regulate safety of foods, drugs and cosmetics. Since then, the act has been amended over 30 times as science and technology advances and we better understand what it takes to rigorously evaluate safety. 

The Federal Food, Drug, and Cosmetic Act represented a turning point in how medical professionals and government regulators assured the safety of medical innovations. Today, we are at another crossroads. While visions of healthcare technology futures tout personalized data-driven medical devices and treatments, we continue to rely heavily on a clinical trial process developed for the evaluation of pharmaceuticals which are (relatively) static. This process works in steps. Hardware and software is tested for bugs and errors, lab testing is done to evaluate base-level safety with animal trials or on models, and then devices undergo three phases of human clinical trials beginning from small-scale, in-patient studies with close monitoring of a few individuals and expanding to multi-center multiple-week “free living” studies.

While this process is effective for systematically evaluating medical devices across scenarios of increasing risk, devices are inherently different from pharmaceuticals and it’s time we re-think how we evaluate them.

Hardware + Software + Humans

Medical devices are incredibly complex in their architecture (e.g., involving sensors that depend on biochemical reactions) and functioning environment (e.g., interacting with, understanding, and controlling an individual’s physiology). This means that the types of errors that arise are not only those found in each component (hardware, software, etc.) but in the ways in which the hardware + software + human elements interact. In our current process, we wait too long (often until clinical trials) to test devices in complex environments, and when we do, the testing scenarios are too limited, which can fail to account for errors related to these system-level relationships and interactions. 

In large part, we rely on manufacturers to develop their own hardware and software tests, and we lack the tools for assessing and ensuring test coverage or formal safety in complex environments. Furthermore, despite inevitable integration into clinical ecosystems, most medical devices are developed and tested as stand-alone systems in fairly limited environments which do not rigorously evaluate complex failure modes, even for a single device. This means that critical errors and malfunctions can remain undetected until large phase III multicenter studies where complex hardware + software + human user interactions can be observed. This is, however, too costly, too late, and too limited. Large scale clinical trials are incredibly time, labor, and cost intensive to run. What’s more, stopping one of these trials if or when a problem is detected can be incredibly harmful to participant morale, and negatively impact the reputation of a device for years to come.

Furthermore, even if we wanted to design a clinical trial to adequately capture all potential scenarios and errors, it is impossible to do so in finite time. Thus, we are left with our current system in which we see many rare and critical errors appear only after products are released on the market, resulting in patient harm and device safety recalls. These issues are only increasing: this past year, a report from Sedgwick found device recalls hit an all-time high with 268 recalls in the second quarter alone, most of which were due to safety and software issues. Furthermore, as connectivity of devices grows, so does the likelihood that a critical error or malfunction in single devices can cause cascading effects in our healthcare system.

Moving in a New Direction

The good thing is that regulation is moving in the right direction, especially as interest in machine learning (ML) and artificial intelligence (AI)-based algorithms grows. For example,  the FDA has proposed 510(k) regulatory process updates for AI/ML-based technologies which ask manufacturers to commit to transparency and performance monitoring. The framework is getting there! However, we still need the tools with which to independently test such claims, or ensure performance metrics translate to device safety. This is exactly where the type of work Galois and others are doing can come in: designing methods for simulating “infinite patient trials” in-silico (on the computer) which can enable evaluation of complex hardware-software-human interactions, and identify edge cases prior to in-vivo (human-based) clinical trials. Unlike with in-vivo trials, we can push in-silico simulations to dangerous extremes and observe device behavior which, left unattended, may result in serious patient harm. In-silico trials are also cheaper and faster to perform.

Formal Verification, Math Models & Digital Engineering

So how can we begin thinking about “infinite patient trials?” Infinite, after all, sounds impossible by definition. Not so! The answer lies in a combination of formal verification and mathematical modeling. Through a combination of these fields, we can reason about things such as device efficacy for different individuals, edge cases, and safety, and even find ways to improve device function. In the section below, we’ll outline how formal verification and mathematical modeling work together, and tackle key questions within each area. Next, we’ll dive deeper by going through a case study we’ve worked on. You should view this as just one illustrative example. The techniques and applications in this space are vast, and there is lots of room for novel research!

Math Models

The first key component of being able to run in-silico clinical trials, is the ability to model (or replicate) the “trial” in a virtual setting. There are a multitude of approaches to modeling, and model complexity can vary in fidelity and abstraction level, according to your goals. For example, you may want to study fine-grained data flows between hardware and software, or you may want to abstract these and focus solely on interactions between device algorithm and operating environment (e.g., the human body). Perhaps unsurprisingly, modeling the human (specifically developing personalized models for individuals, rather than a “general” human) is the most complex aspect, and the focus of ongoing research.

In all cases, mathematical modeling can help us develop useful and comprehensible models of this vast and complicated ecosystem. By defining, visualizing, and abstracting the various elements of the complex system in question, mathematical models serve as a kind of “digital map” to enable reasoning about infinite scenarios of critical patient trials before we go into clinical trials, in particular the complex interactions between software + hardware + (virtual) human. 

Formal Verification

The second key component of in-silico clinical trials, is the ability to reason over the infinite real-world scenarios a device may encounter within a finite amount of time. For this, we can’t simply enumerate and run infinite tests, instead we get clever and employ formal verification. Because we are dealing with hybrid systems – meaning we have both discrete components (such as the algorithm within a medical device) as well as continuous components (such as the human body the device is operating on/in) – we can’t use classical formal methods such as temporal logic alone, as they are limited to analysis of discrete systems. Instead, we’ll employ what’s called hybrid systems verification, which is a combination of formal verification, control theory, and optimization. This enables us to abstract away individual test cases and instead define the boundaries of the space of operating conditions (e.g., all scenarios a device might encounter), along with a property we wish to hold (e.g., avoid harm), and construct an optimization problem that says “push this system to extremes, and see if there are a series of actions that result in the system violating our desired property.” 

Putting it All Together

To create an “infinite patient trial” we replace the real human and device in a patient trial with a model of the human and coded-up version of the device. We wrap it all up in our formal verification framework and test “infinite” scenarios to see if we can verify the device’s safety, or give a counterexample for a sequence of events that result in a violation. For completeness, we’ll note that we may find our problem set-up to be unsolvable or undetermined, which opens another fun can of worms we’ll leave for next time.

A Case Study: Optimally Tuning an Artificial Pancreas System

The Artificial Pancreas (AP) refers to a closed-loop system involving an insulin pump, continuous glucose monitor (cgm) and algorithm which seek to automate insulin delivery for individuals with type 1 diabetes. The first such system received FDA approval in 2019, and since then a handful of systems have been approved, and numerous remain in various stages of research and development. 

Unfortunately, for the first on-the-market system, expectations exceeded efficacy and many individuals have been left frustrated with the system’s inability to effectively control their blood glucose levels. A variety of factors contributed to this, one in particular being that while the device is developed to work on a “general” human, every individual using it is unique—significantly different from a “general” human, and meaning that the device needs to be personalized. In order to personalize the device, it has a single “tuning knob” which controls how aggressively the AP system behaves, and while there is a “rule of thumb” for personalized tuning (take your daily insulin, divide it by 135), this initial setting is often less than ideal. 

Consequently, the system requires significant time, effort, and back-and-forth between patients and clinicians in order to tune it. In our work, we used the notion of “infinite patient trials” to achieve three goals:

  1. Identify for which individuals the system works well “out of the box” and for which it does not.
  2. Determine if we can optimally re-tune the system for individuals where it does not work.
  3. Utilize results from the prior two goals to determine if we can learn a novel “out of the box” tuning rule which is easy to implement and improves efficacy across all individuals, while removing the need for patient and physician back-and-forth.

Results from this project were peer-reviewed, published, and presented internationally, and even led to an invitation from the device manufacturer to attend their offices, meet the team, and demonstrate our results. Let’s dive into what we did!

Infinite Patient Trials – Determining Device Efficacy

As we overviewed earlier, we combined math models with formal verification to set up our “infinite patient trials.” For our human models, we developed personalized models for 50 patients using previously collected clinical study data. These models were designed to enable use in the control scheme and capture uncertainty in predictions. We’ll skip over the details, but any readers interested in the heavy math should refer to our full paper. Using these models along with a coded-up version of the AP system, we used a hybrid systems verification approach called forward reachability analysis to determine if the AP system would control an individual’s glucose levels to be in a safe (euglycemic) range after six hours of use. 

Reachability Analysis

Reachability analysis is the process by which we take a system from an initial operating space (blood glucose and insulin levels), and model where the system will be after a fixed amount of time. We then see if this “reachable” set is within our desired safe operating space (euglycemic blood glucose range), or if there is a possibility that the system could push an individual into unsafe blood glucose levels which are either too high (hyperglycemia), or too low (hypoglycemia). As the set of initial conditions (past blood glucose and insulin levels) is an infinite set, we utilize a clever optimization scheme to check “worst case” bounds. 

For those interested in the details, we formulate our problem as a mixed integer optimization program. The AP system we are testing delivers a new insulin dose (control input) every five minutes, but since blood glucose dynamics are nonlinear, the worst case scenario at 10 minutes might not be the same as the sequence that pushes someone into the worst case scenario at 120 minutes. So, instead of just keeping track of worst case levels (reachable set) every five minutes, we “unroll the loop” out to six hours by connecting 72 copies of the AP “loop” in series and optimize across the entire time horizon. This enables us to efficiently determine the maximum and minimum possible blood glucose values – the full analysis takes only about 0.6 minutes per person to run on a laptop. 

Initial Conditions – Blood Glucose and Insulin History Before AP Use

For our initial conditions, we assume the individual could have any history of prior insulin and glucose levels, as long as they are within a safe (euglycemic) range of 70-180 mg/dL. To do this, we convert basic physiological laws into computer-readable equations, and cap insulin active in the body (insulin-on-board) at a clinically reasonable rate, to ensure we are setting the device up well for achieving effective control.

  1. Blood glucose can change at a max rate of 1 mg/dL/minute
  2. Insulin-on-board can range from 0.5 to 4.0 units / hour

Results: It works well for some, terribly for others!

Because the glucose-insulin regulatory system is slow to change, we need to give the device time to function. Thus, after six hours of use we check to see how well the system has kept people in the safe euglycemic range using the out-of-the-box tuning setting. What we find is that for some individuals, it works really well, while for others there is a high likelihood that their blood glucose levels fall too low, or remain too high. Overall, we find that the standard out-of-the-box rule is only effective at keeping 14.8% of individuals in a safe range at six hours!

An example run for an individual who remains well controlled (left) compared to one for whom the device has a high likelihood of leading to glucose levels that are too high or too low (right). 

Can we improve?

This doesn’t mean the device will fail for the remaining 85.2% of individuals, it means instead that they will have to take significant time and effort going back and forth with their physician trying to evaluate their data and adjust the system settings – a process which is inefficient, frustrating, and has no guarantee of converging on the best setting. We wanted to see if we could do better.

Fortunately, finding the optimal tuning setting in a virtual patient trial is much faster and easier than for a real individual: we can quickly narrow in on the optimal tuning value using binary search, adjusting aggressiveness up or down based on the verification results. 

However, this approach of developing a personalized patient model and then optimizing for a best-fit parameter is not a feasible approach for widespread device use. What we want to do is build from these optimal results to see if we can find a pattern, generalize the results, and develop an improved out-of-the-box tuning rule that is easy to use (requires no new data collection or tests), and provides a better starting point than the current out-of-the-box tuning.

To do this, we searched for patterns in how much we had to re-tune from baseline. After many failed analyses, we found a promising result: that there was a strong relationship between the amount of re-tuning an individual needed, and their HbA1c levels, a standard biomarker of how progressed your diabetes is. Specifically, two clusters emerged: the current setting worked well for people with high HbA1c levels (>7), but those with low HbA1c required significantly more aggressive tuning. This was very exciting because testing HbA1c is a very standard test, and every individual with type-1 diabetes knows their HbA1c value.

Based on this finding, we fit a new tuning rule which uses only an individual’s HbA1c levels and daily insulin requirements – no formal modeling, additional tests, or analysis required to use it “out of the box”! We also included a safety saturation to ensure the system does not get too aggressive:

How well does the new rule perform?

Now we want to test our new tuning rule relative to the old out-of-the-box settings. Again, we’ll use our infinite patient trial approach. To our excitement, we found this new rule gave an 82% improvement in system performance in terms of how many individuals were now guaranteed to remain safely controlled. Furthermore, this new rule ensured 100% of people would avoid severe hypoglycemia, which is the most critical risk of an AP system. This is notable as we are not using the perfectly optimized setting for each individual, but rather testing our newly developed “out of the box” tuning rule. These results hold significant promise for future clinical trial evaluation, and the device manufacturer brought us in to demonstrate our results for their team.

Verification results for each individual with the original out-of-the-box tuning rule (left) and our new rule (right). Individuals are represented as a vertical stack, arbitrarily ordered by their daily insulin requirements.  The large dot represents the mean attainable maximum and minimum blood glucose values after 6 hours of system use, while the smaller dots give the probabilistic 95% confidence intervals. The green shading denotes the optimal euglycemic range.

So what?

In this work, we were able to show how in-silico “infinite patient trials” can be utilized to rigorously evaluate and improve safety of a medical device, filling a gap between traditional component testing and large-scale clinical trials. By combining mathematical modeling, formal verification, and optimization we were able to evaluate the device in a complex environment of human-hardware-software interactions. By cleverly pushing a device towards extremes we can catch complex failure modes quickly and cheaply prior to a large-scale clinical trial. Of course, any results found in-silico, such as our improved tuning rule, would need to be tested in real clinical trials before we can have certainty in the real-world efficacy, but research like this is laying the groundwork for a new, better way for evaluating medical devices.