System engineering involves a delicate interplay between three tasks: specification (what a system should do); implementation (what a system actually does); and verification (determining whether they agree). Already, novel generative AI technologies have emerged that can assist with implementation and verification (e.g., respectively, Microsoft's Copilot, and formal verification tools developed under DARPA's PEARLS Artificial Intelligence Exploration program). Yet the first task, specification, remains a largely manual and time-consuming process.
This raises the question: Could a system leveraging generative technologies assist with specification?
If so, such a breakthrough could represent a paradigm shift in how systems are designed, implemented, and maintained. AI assistants could dramatically reduce the human time and effort required to develop specifications: enabling rapid prototyping, automating the more rote elements of the process so that engineers can focus on more complex tasks, and ultimately reducing costs as the number of engineer-hours needed drops. Done well, AI assistance could also help mitigate human error and improve specification quality and consistency, all leading to system verification processes that are simultaneously more rigorous and more efficient.
Recent advances in AI raise the possibility that generative technologies could plausibly assist operators in authoring specifications that meet the necessary requirements and constraints. That is, AI systems may be able to help designers generate system specifications that are accurate, complete, and sufficiently granular to support system implementation and verification of the corresponding code in a way that also enables continuous integration, deployment, verification, and certification: a kind of AI-assisted DevSecOps.
Why This is Difficult
To define some terms: by a specification, we mean a clear description of the desired behavior and structure of some system component. This definition is intentionally quite broad, and includes high-level textual specifications, architectural specification languages such as AADL, function interface specifications such as APIs, protocol models such as state machines, executable reference implementations, and code-level specifications such as types, code contracts, and logics. The most common kind of specification in engineering practice is the concrete test case. A single system may use all of these kinds of specifications at once.
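To make this concrete, here is a minimal, hypothetical Python sketch (ours, not drawn from any real system) in which a single function carries three of these kinds of specification at once: a type signature, a code contract expressed as assertions, and a concrete test case.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict `value` to the closed interval [low, high]."""
    # Code contract: precondition on the inputs.
    assert low <= high, "precondition: low must not exceed high"
    result = min(max(value, low), high)
    # Code contract: postcondition on the output.
    assert low <= result <= high
    return result

# Concrete test case: the most common kind of specification in practice.
def test_clamp_above_range():
    assert clamp(5.0, 0.0, 1.0) == 1.0

if __name__ == "__main__":
    test_clamp_above_range()
    print("test passed")
```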
Specification development is known to be a challenging task. Specifications must:
- Faithfully capture human requirements, which are often contradictory, unclear, or unwritten,
- Be implementable in a correct, efficient, maintainable manner given the available technologies,
- Satisfy constraints imposed by the system context—for example, the specification of a library call must respect the language’s calling conventions,
- Be verifiable, both in the sense that implementations can be checked against specifications and in the sense that the specification can be checked against human intuitions,
- Be developed at many levels of abstraction—for example, a system might have an architectural specification, a specification for each system module, and specifications for each function inside a module (see the sketch after this list), and
- Be updated over time under multiple types of change: changes to requirements, changes to implementations, changes to specification and verification technology, and others.
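As one illustration of the “many levels of abstraction” point above, the hypothetical Python sketch below pairs a module-level interface specification with a function-level contract; all names and bounds are invented for illustration.

```python
from typing import Protocol

class TemperatureSensor(Protocol):
    """Module-level specification: the interface any sensor module must offer."""
    def read_celsius(self) -> float: ...

def to_fahrenheit(celsius: float) -> float:
    """Function-level specification, expressed as a contract plus a formula."""
    # Precondition: the reading must be physically possible.
    assert celsius >= -273.15, "precondition: temperature above absolute zero"
    return celsius * 9.0 / 5.0 + 32.0

print(to_fahrenheit(100.0))  # 212.0
```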
In addition, specifications are often built in retrospect to account for the structure and behavior of pre-existing systems that are part of a legacy ecosystem. Such systems often have messy, complex interfaces and behaviors that are difficult to specify.
AI to the Rescue: A Notional Architecture
Notably, many of the challenges of authoring specifications cannot be reduced to, or solved through, pure optimization or logical deduction. Achieving alignment with human operators, while ensuring that specifications are both well-formed (consistent, complete, realizable, etc.) and accurate, requires careful balancing between competing constraints. For this reason, we imagine a workflow where a human designer works with an AI assistant through a natural-language interface to iteratively develop specifications. The AI system we imagine would propose candidate specifications and then help the designer efficiently navigate the space of possible specifications: tracking constraints and design choices, offering options, and ensuring any strict constraints are respected. The human and AI would work together to refine the specifications from an initial, incomplete state to one that enables formal verification.
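The sketch below is one way to picture that loop. Every function here is a stand-in for a component we have not yet built; none of these names correspond to an existing tool.

```python
def propose_candidates(draft: str) -> list[str]:
    # Stand-in for the generative component: propose refined drafts.
    return [draft + " +detail"]

def satisfies_hard_constraints(candidate: str) -> bool:
    # Stand-in for strict, automatically checkable constraints.
    return bool(candidate)

def ready_for_verification(draft: str) -> bool:
    # Stand-in for "complete enough to hand to a formal verifier".
    return draft.count("+detail") >= 3

def designer_chooses(options: list[str]) -> str:
    # Stand-in for the designer selecting among presented options.
    return options[0]

def refine_specification(draft: str) -> str:
    while not ready_for_verification(draft):
        candidates = propose_candidates(draft)            # the AI proposes options
        admissible = [c for c in candidates
                      if satisfies_hard_constraints(c)]   # strict constraints respected
        draft = designer_chooses(admissible)              # the human steers the search
    return draft

print(refine_specification("initial, incomplete spec"))
```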
Different stages in the specification, implementation, and verification process may require different tools. A tool that helps a user convert English documents into a domain engineering model, for example, might be very different from one that helps convert a domain engineering model into a system architecture.
We propose the diagram below as a generic architecture that could be specialized into multiple domains.
Figure 1. Notional architecture of the inferential system
Here, the human designer (top) generates multimodal inputs (1st box) that represent the current state of the draft specification. These might include requirements documents, source code for implementations, draft specifications, and directives to the tool.
This input is processed by a prompt construction engine (2nd box), which converts these inputs into one or more prompts suitable for consumption by the downstream inference engine. This process might involve significant AI or classical analysis in itself—for example, the tool might take this opportunity to analyze the draft specifications for consistency with any implementation code. The goal is to produce a prompt or set of prompts that maximizes the effectiveness of the generative technologies within the pipeline and ultimately reduces the number of refinement iterations required of the designer.
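A hypothetical sketch of this step follows; the dataclass fields and the consistency pre-check are our guesses at what such an engine might track, not a description of an existing tool.

```python
from dataclasses import dataclass, field

@dataclass
class DraftState:
    requirements: str                       # excerpts of requirements documents
    implementation: str = ""                # existing source code, if any
    draft_spec: str = ""                    # the specification so far
    directives: list[str] = field(default_factory=list)  # instructions to the tool

def specs_consistent_with_code(state: DraftState) -> bool:
    # Stand-in for a classical analysis pass (e.g., checking the draft
    # specification against the implementation).
    return True

def construct_prompt(state: DraftState) -> str:
    sections = [
        "You are helping refine a formal specification.",
        "Requirements:\n" + state.requirements,
        "Current draft:\n" + (state.draft_spec or "(empty)"),
    ]
    if state.implementation and not specs_consistent_with_code(state):
        sections.append("Note: the draft may be inconsistent with the implementation.")
    sections.extend(state.directives)
    sections.append("Propose one or more refined candidate specifications.")
    return "\n\n".join(sections)

print(construct_prompt(DraftState(requirements="The service must reject expired tokens.")))
```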
The constructed prompt is passed to a generative model (3rd box). Again, we are intentionally broad here, accommodating a range of AI/ML architectures, of which large language models (LLMs) are a strong candidate. Models could include a general-purpose LLM such as GPT-4, or a domain-specific model that has been trained or tuned for the particular specification domain and languages.
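As one concrete possibility (an assumption on our part, not a design commitment), a general-purpose model could be reached through the OpenAI Python client, with a domain-tuned model swapped in behind the same function signature:

```python
from openai import OpenAI

def generate_candidates(prompt: str, model: str = "gpt-4", n: int = 3) -> list[str]:
    """Ask the model for several candidate specifications at once."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,   # a general model here; a fine-tuned model name could be passed instead
        n=n,           # request multiple candidates in one call
        messages=[{"role": "user", "content": prompt}],
    )
    return [choice.message.content for choice in response.choices]
```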
The generative model constructs a set of candidate specifications (4th boxes), which are passed to an evaluation engine (5th box). The evaluation engine may perform a significant amount of work to introspect on the proposed specifications—for example, checking that they are mutually consistent. We imagine that this step may involve multiple iterations, and that the evaluation engine will itself construct prompts that elicit more refined specifications from the generative model.
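One hypothetical shape for that inner loop, with the generative model and the consistency check passed in as stand-ins:

```python
from typing import Callable

def evaluate(candidates: list[str],
             generate: Callable[[str], list[str]],
             mutually_consistent: Callable[[list[str]], bool],
             max_rounds: int = 3) -> list[str]:
    for _ in range(max_rounds):
        if mutually_consistent(candidates):
            return candidates
        # Construct a follow-up prompt to elicit more refined candidates.
        followup = ("These candidate specifications conflict with one another; "
                    "revise them so they are mutually consistent:\n"
                    + "\n---\n".join(candidates))
        candidates = generate(followup)
    return candidates  # last round; left for the user to triage in the UI

# Toy demo with stand-ins for the model and the consistency check.
result = evaluate(
    ["spec A", "spec B"],
    generate=lambda prompt: ["spec A (revised)", "spec B (revised)"],
    mutually_consistent=lambda cs: all("revised" in c for c in cs),
)
print(result)
```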
Finally, the result of the generation process is presented to the user through the user interface (UI) (6th box). For example, a user might be presented with one or more modifications of the specification to choose between. Alternatively, the user might be asked a series of natural-language questions intended to refine the requirements. We do not imagine that the user is presented with the raw output of the LLM—the results will be processed to present users with meaningful choices through a predictable interface.
Anticipating Challenges
Adapting this architecture to real-world scenarios is far from straightforward. First of all, it is very generic and could be instantiated in many different ways. Different choices of specification type, interaction mode, domain, system, etc. will present different technical challenges.
Other challenges are likely to present themselves for any instantiation of such a tool. These challenges include:
1. Designing Effective Refinement Loops: Ensuring each human-AI interaction drives the specification closer to finalization.
2. Automated Prompting: Developing prompting strategies that yield useful responses aligned with user intent amid inherent ambiguity.
3. Adapting Models to Niche Tasks: Since generative models typically excel in data-rich areas, adapting them to the specialized, data-scarce task of specification will require careful fine-tuning.
4. Generating High-Quality Change Suggestions: The AI needs to consistently provide high-utility changes that maintain the upward trajectory toward the ultimate specification goal (i.e., suggestions that consistently improve the specification even when they do not yet make it completely correct).
5. Defining Success Metrics: Establishing useful benchmarks to gauge the AI’s effectiveness (e.g., comparing specification task speed with and without the assistant, bug frequency, etc.), as well as training-critical measures of what constitutes success or failure of a particular task; one such metric is sketched below.
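As a toy illustration of the kind of benchmark item 5 calls for, the sketch below computes relative speedup on specification tasks with and without the assistant; the numbers are invented placeholders, not measurements.

```python
def speedup(hours_without: float, hours_with: float) -> float:
    """Ratio > 1 means the assistant made the task faster."""
    return hours_without / hours_with

# Placeholder data: engineer-hours per specification task.
baseline = [12.0, 8.0, 20.0]   # unassisted (invented numbers)
assisted = [7.0, 5.0, 9.0]     # assisted (invented numbers)

per_task = [speedup(b, a) for b, a in zip(baseline, assisted)]
print(f"mean speedup: {sum(per_task) / len(per_task):.2f}x")
```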
Digging In
Generative technologies are exciting because they can bridge between everyday human text and complex technical artifacts—be they digital images or digital circuits. Specifications in software engineering have, up until now, been starkly divided between informal prose and formal models. We are hopeful that generative technologies can create a bridge here as well.
Specification, as we have described it, is a fundamentally human activity, because writing a specification starts with asking what we want a system to do. So our starting point in building these tools will be collecting datasets of existing specifications in any form, and building prototype tools that we can use on our own projects here at Galois. There is a lot to do, and the technological landscape is changing fast. Much that appeared impossible now seems within reach. We are building tools, running experiments, and trying to figure this problem out, and we hope some of you will do the same.