Organizations seeking to integrate digital-first practices into their engineering processes often quickly discover a common roadblock: critical dependencies on the individual expertise of specific employees embedded in legacy workflows. This discovery has prompted some to ask what role generative technologies might play in supporting digital engineering transformation efforts: “Can they help us reduce reliance on individual experts, or boost the robustness and resilience of our engineering workflows? Can these models effectively democratize engineering expertise for the masses?”
As we delve into this intersection of digital engineering and generative models, it’s important to address the elephant in the room: large language models (LLMs). In the weeks immediately following the public release of ChatGPT, it seemed as though LLMs could do no wrong. Yet the narrative in tech and popular media headlines soon shifted dramatically as it became readily apparent that language models might proffer more problems than solutions.
Research into the behavior and performance of these models continues to grow, and the truth about large language model performance (and generative technologies more broadly) is undoubtedly somewhere in the middle.
But as someone recently pointed out to me, the middle is rarely where you think it is.
So when it comes to the application of generative technologies for potentially high-risk and high-stakes applications, where exactly are we? Is “somewhere in the middle” a place from which we can operate and implement with confidence? What is the current risk profile, and is it one that we understand enough to accept? Perhaps more importantly, where are we today relative to where we want to go, and how do we get there?
“Are we there yet?”
I readily acknowledge, and am excited by, the prospect that LLMs and similar technologies hold promise for advancing automated or semi-automated workflows. There are a number of ways in which we can imagine applying these technologies to bolster or transform digital engineering pipelines. We might wish to provide robust natural language support for code development to expedite deployment of new technologies in a quick-turn development environment. LLMs could be leveraged to reverse-engineer high-level specifications from existing legacy code bases to help with ongoing system maintenance, or to provide rapid and unprecedented access to poorly documented code so that critical patches can be applied. Perhaps high-capacity language models could facilitate automated support for proof engineering and proof repair.
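To make the specification-extraction idea a bit more concrete, here is a minimal sketch of how such a request might be framed. It assumes a hypothetical `llm_generate` callable standing in for whatever model endpoint an organization actually has; the prompt wording and the `LegacySpec` container are illustrative, not a tested recipe, and the model's output is a draft for expert review rather than an authoritative specification.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for whatever generative model endpoint is available.
LLMClient = Callable[[str], str]

@dataclass
class LegacySpec:
    """Illustrative container for a recovered high-level specification."""
    source_file: str
    summary: str

SPEC_PROMPT = """You are assisting with maintenance of a poorly documented codebase.
Read the following legacy code and draft a concise high-level specification:
intended behavior, preconditions, postconditions, side effects, and invariants.

--- BEGIN CODE ({filename}) ---
{code}
--- END CODE ---
"""

def extract_spec(filename: str, code: str, llm_generate: LLMClient) -> LegacySpec:
    """Ask the model to reverse-engineer a specification from legacy code.

    The result is a draft for expert review, not an authoritative specification.
    """
    prompt = SPEC_PROMPT.format(filename=filename, code=code)
    return LegacySpec(source_file=filename, summary=llm_generate(prompt))
```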
All of these envisioned applications require building upon or extending the models that are publicly available, because rather than centering on the production of natural language, digital engineering applications of high-capacity models demand performant generation of code and code-like artifacts.
One might ask: “What is it that we as a research community need to extend, and how significant is that lift? After all, shouldn’t the trajectory of development continue to improve model performance in a predictable fashion? Although Stanford researchers recently demonstrated that emergence may simply be an artifact of measurement, didn’t they also show a consistent upward trend in generative performance as models continue to grow in size? What are we even worried about, anyway?”
In response, I would first say that we need to attend carefully to the targeted domain. Our digital engineering applications are focused on safety-critical and secure system implementation. We therefore need to differentiate between generative model performance measured against codebases characteristic of those employed in safety-critical applications versus performance measured against languages that feature prominently in examples found across the internet (complete with tutorials!) and that were specifically designed to mimic the syntactic structure of natural language (e.g., Python). I would fully expect LLMs to do a passable job of generating coherent and consistent code when training has provided access to a sufficient number and diversity of examples, and when the predictive patterns learned on natural language remain informative for generating exemplars in that language. Yet our experimentation continues to expose clear limitations of LLMs as applied to code and artifact generation of the kinds involved in creating high-assurance secure systems, particularly under their current training regime(s).
Of course, this isn’t to say that there are not already candidate mechanisms for improving generative model performance. In fact, with a review of the available literature, it’s possible to construct a notional architecture integrating many of these ideas for fine-tuning models and remediating output, as well as providing feedback for the construction of more effective prompts to elicit desired behaviors.
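One possible shape of such a notional architecture, reduced to a minimal sketch: issue a prompt, check the output against whatever validators the pipeline already trusts (compilers, test suites, proof checkers, static analyzers), and feed failures back into the next prompt. The `llm_generate` and validator callables below are assumptions standing in for organization-specific components, not a prescribed design.

```python
from typing import Callable, List, Tuple

# Assumed interfaces: a model endpoint and a trusted validator
# (compiler, test suite, proof checker, static analyzer, ...).
LLMClient = Callable[[str], str]
Validator = Callable[[str], Tuple[bool, str]]  # (passed, diagnostic message)

def generate_with_remediation(task: str,
                              llm_generate: LLMClient,
                              validators: List[Validator],
                              max_rounds: int = 3) -> Tuple[str, bool]:
    """Generate an artifact, check it, and feed diagnostics back into the prompt.

    Returns the last candidate and whether it passed every validator; failure
    after max_rounds is surfaced rather than silently accepted.
    """
    prompt = task
    candidate = ""
    for _ in range(max_rounds):
        candidate = llm_generate(prompt)
        failures = []
        for check in validators:
            ok, diagnostic = check(candidate)
            if not ok:
                failures.append(diagnostic)
        if not failures:
            return candidate, True
        # Feedback for prompt refinement: append diagnostics to the original task.
        prompt = (f"{task}\n\nThe previous attempt failed these checks:\n"
                  + "\n".join(f"- {msg}" for msg in failures)
                  + "\nRevise the output to address them.")
    return candidate, False
```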
But beneath this sketch lies a still-unanswered and more foundational question that each organization seeking to leverage generative technologies must answer: what does it mean for a generative model to be performant when integrated into a safety- or mission-critical digital engineering pipeline or workflow? Put simply, what do we expect and need these models to do?
I find it incredibly interesting that in discussions around LLMs there is broad recognition that these models are immensely powerful, even as the working definitions of performance remain thin and imprecise. Performance is often measured in terms of perceived utility, which is qualitative as well as task- and user-dependent; accuracy measures, when available, are also necessarily task-dependent and narrowly scoped.
Of course, measurement of performance matters less to an individual user writing a Python script, or a developer who wants to author code for a “hello, world!” demonstration. The bar for “good enough” is much lower when there is a wide or even undefined range of acceptable system behavior, when there are minimal consequences for error, and when processes don’t need to be repeatable or even correct in order to be useful. In low-risk applications of generative models, interpretation is actually critical to performance: ambiguity offers opportunities for correctness, and variably close is often close enough. In high-risk digital engineering workflows, approximation can’t be automated, and “opportunities for correctness” are instead areas of ill-defined (and likely unacceptable) risk, because errors lead to cascading failures.
There is also another point that needs to be made here about interpretation and the implications of requiring a human in the loop to ensure the utility of generative model output: it defeats the purpose. After all, a reliance on interpretation of output for success wouldn’t remediate dependencies on expertise. It would architect those dependencies, and the very brittleness organizations are seeking to avoid, into the emerging digital pipeline.
Digital engineering is fundamentally about principled design. If generative models are going to be integrated into highly precise digital engineering workflows, our thinking about and evaluation of these models need to be just as exacting. Automating generation of code and code-like artifacts requires that we provide clear and precise definitions of generative performance and utility. We need to define performance parameters for generative model application because effective optimization requires the definition of clear objectives.
Defining where we want to go
The challenge of defining desirable behaviors is, I think, also a particularly exciting research opportunity: code and related artifacts both require and enable the clear and precise definition of performance metrics, without needing to engage with theories of cognition, challenges of interpretation, or the anthropomorphization of these models, all of which arguably add unnecessary layers of complexity to research into LLM performance. Code and code-like artifacts have specific requirements which lend themselves to the construction of unambiguous, quantifiable metrics. In particular, in pipelines featuring the integration of exquisite code, the tooling around such artifacts is, by design, often fragile to subtle incorrectness.
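To make “unambiguous, quantifiable” concrete, here is a minimal sketch of the kind of metric that code artifacts admit: rather than asking whether output “looks right,” count whether each generated candidate passes a set of yes/no checks. The parse check and the pass-rate calculation are illustrative only; a real pipeline would substitute its own compilers, test harnesses, or proof checkers.

```python
import ast
from typing import Callable, Iterable

def parses(candidate: str) -> bool:
    """Binary, unambiguous check: does the candidate parse as valid Python?"""
    try:
        ast.parse(candidate)
        return True
    except SyntaxError:
        return False

def pass_rate(candidates: Iterable[str],
              checks: Iterable[Callable[[str], bool]]) -> float:
    """Fraction of candidates passing every check.

    Each check is a yes/no predicate (parser, compiler, test harness,
    proof checker), so the resulting metric is quantifiable and repeatable.
    """
    candidates = list(candidates)
    checks = list(checks)
    if not candidates:
        return 0.0
    passed = sum(1 for c in candidates if all(chk(c) for chk in checks))
    return passed / len(candidates)

# Example: score two toy candidates against the parse check alone.
print(pass_rate(["def f(x):\n    return x + 1\n", "def f(x) return x"], [parses]))
```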
I would argue both that clear definitions of performance matter here and that they will underscore critical differences between the resource availability and associated incentive structures of the commercial sector (within which, and according to which, publicly available generative models have been optimized) and those of safety-critical systems. The former is a resource-rich, low-risk environment where “close enough” is often sufficient; the latter domain features data scarcity and sparsity and demands first-time precision against low-probability events.
We need to extend publicly available models because our concerns as developers and engineers focused on safety-critical systems are fundamentally distinct from those within the commercial sector, and we need to optimize differently.
Two steps forward requires taking two steps back
To answer the question I posed at the beginning, I want to first take stock of where we want to be. Broadly, I would say that we envision using generative models to, in essence, bring expertise to the masses, transforming what has historically been exclusively the domain of experts into a much broader, more accessible, and more sustainable development pipeline. Natural language support of code development, automated specification extraction and generation, and automated proof engineering are all geared toward opening up the aperture for the application of rigorous best practices in development and design.
Unfortunately, where we currently are is not a pipeline but what I refer to as an inverted funnel. The bottleneck of expertise hasn’t been removed; it has simply been relocated. Generative models produce output that requires expert review, interpretation, filtering, and even augmentation.
If we truly want to enable digital engineering pipelines on safety-critical systems and workflows, we need to develop a framework and language for articulating precisely what it is we want and need. We then need to engage in research programs that build the necessary foundations to achieve those goals.
It is critical to the effective application of generative models in digital engineering efforts that we are able to identify and characterize where the bounds of coherent generation lie, and how well these models recover the full range of the desired output space. The near-frictionless generation of text that consistently has surface-level coherence but lacks groundedness or alignment to semantic meaning makes LLMs quite different from other models we have built and used in the past to reason about software and systems, including earlier (and “smaller”) learning systems based upon neural networks or symbolic logic, and those that are formal methods-centric. Because of their open availability, fluid user experience, apparent generalizability across a wide range of tasks, and rapid uptake in consumer products, there is growing momentum driving adoption; but in the context of safety-critical systems, model utility does not and cannot replace correctness. There is an emerging body of research intended to examine and enforce the correctness of generated content. There needs to be a parallel research effort to examine, characterize, and improve the achieved coverage of these models, particularly within specialized domains.
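One way to begin characterizing coverage, sketched below under strong simplifying assumptions: enumerate (or sample) the equivalence classes of outputs a task admits, canonicalize each model generation into one of those classes, and report which classes the model never reaches. The `canonicalize` function and the enumerated class set are placeholders for domain-specific machinery, not a claim about how such an analysis must be done.

```python
from collections import Counter
from typing import Callable, Iterable, Set

def coverage_report(generations: Iterable[str],
                    expected_classes: Set[str],
                    canonicalize: Callable[[str], str]) -> dict:
    """Map sampled generations onto known equivalence classes of valid outputs.

    Returns the fraction of the expected output space that was reached and
    the classes that were never produced, i.e., the blind spots that matter
    most in specialized, safety-critical domains.
    """
    observed = Counter(canonicalize(g) for g in generations)
    reached = expected_classes & set(observed)
    missed = expected_classes - reached
    fraction = len(reached) / len(expected_classes) if expected_classes else 0.0
    return {"coverage": fraction,
            "missed_classes": sorted(missed),
            "class_counts": dict(observed)}
```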
We need to decide if and when we want novel generation versus predictable pattern regeneration. Large generative architectures are compelling in that they appear to readily generate diverse and even novel content in response to well-designed prompts, but this impression remains largely unexamined and unverified. It also runs counter to what transformer architectures mathematically support, which is distributional replication rather than a true creative capability. Do we expect and require that generative models fully explore or exercise an anticipated output space? There are workflows where heterogeneity is a strength, and others where that same variability is a liability. There are scenarios in which consistency is critical to safety and security and others in which that same consistency may be exploitable.
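The choice between novelty and predictable regeneration can itself be made measurable. A minimal sketch: issue the same prompt repeatedly and tally how many distinct (canonicalized) outputs come back; a workflow that requires consistency wants that count pinned near one, while a workflow that values heterogeneity wants it high. The `llm_generate` and `canonicalize` callables are again assumed stand-ins for organization-specific components.

```python
from collections import Counter
from typing import Callable

def repeat_prompt_profile(prompt: str,
                          llm_generate: Callable[[str], str],
                          canonicalize: Callable[[str], str],
                          trials: int = 20) -> Counter:
    """Issue the same prompt repeatedly and tally distinct canonical outputs.

    One distinct output indicates the predictable regeneration some workflows
    require; many distinct outputs indicate the variability that is a strength
    in some settings and a liability in others.
    """
    return Counter(canonicalize(llm_generate(prompt)) for _ in range(trials))
```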
As we seek to integrate generative models, where we are today is at a point where leaning in means taking a step back. We need to ask ourselves what we need and recognize that it is very likely that we are not there yet, and also that we cannot rely on the perceived trajectory or momentum of commercial development to address or attend to our central concerns. This is work we will have to engineer ourselves.