In my ongoing efforts to engage deeply with research into large language models, I have continually wrestled with a persistent sense of dissatisfaction. Unfortunately, the source of that dissatisfaction has been frustratingly difficult to pin down.
At times, I have wondered if I’m not dissatisfied but rather uneasy because I feel unmoored and unable to gain traction. It is virtually impossible to read and deeply appreciate all of the research that is being published on large language models. Understanding the current state of the art with respect to large language models is an exercise in synthesis and, I would argue, principled approximation. Given the rapidity of advancements, there is no single “state” as much as there is a trajectory of development which is being shaped by a distributed network of researchers whose work is having multiplicative effects on the pace of continued research. Training models from scratch is getting faster, and pre-trained open source models make lightweight experimentation readily accessible. Advancements in the tooling and frameworks for domain adaptation promise even more rapid diversification and specialization of models against a wider array of tasks. Processes are evolving and rapidly optimizing to do more with less, be it with lower compute requirements, in less time, or with fewer queries.
As much as I have read about large language models and their continued evolution, I still find myself asking, at what, precisely, are we becoming more efficient? Are we merely driving down the cost of generating sonnets about peanut butter toast? Or are we on our way to reducing the cost of integrating these models as generative components of larger systems? Are we enabling their cost-efficient application to safety- or mission-critical objectives? If the answer to the first question is “yes,” is it the case that the answers to the second and third questions are also in the affirmative? The truth is, I don’t know, but at this time I am unconvinced.
I readily acknowledge that high-capacity generative models have incredible potential and have already demonstrated truly impressive capabilities. I describe those capabilities as the plausible completion of sentences in a way that convincingly deploys context-appropriate syntactic artifacts. That sentence is a mouthful, and it is far from a ringing endorsement.
While I have struggled to articulate what, exactly, is the source of my discontent, some have noted my lack of enthusiasm for these models and understandably accused me of being a skeptic.
In response I want to protest: I don’t feel skeptical! When I think about large language models I don’t register a sense of doubt. On the contrary, I find the simple elegance of the transformer architecture to be inspiring. I have read the foundational paper on self-attention mechanisms at least a dozen times and repeatedly marveled at the simplicity and effectiveness of masked self-attention as a way of achieving the language modeling objective. I find it truly incredible how the insights in that paper have transformed the machine learning landscape since they were introduced five years ago, which raises the question: if I am so excited by the mathematical underpinnings of the technology, why am I not more excited by its current and envisioned future implementations?
Perhaps the answer is that I don’t find the promise of frictionless generation of plausible content particularly compelling. As someone who has devoted her life to solving hard problems, I deeply value friction as a necessary part of the creative process. In fact, I initially drafted this essay longhand. Most of my early drafts are completed in this way, as somehow I find the act of writing by hand to have just enough resistance to assist in my ideation. The speed at which my thoughts come together is better matched to the pen than the keyboard. My handwriting, like my thoughts, is more fluid; typing, which is accomplished in rapid bursts, is efficient, perhaps, but I also find it deeply distracting.
There is a fair amount of research which supports the idea that, as far as creativity is concerned, the challenge of the journey is an important part of – not just a path to – the solution. A colleague recently told me of Robert Bjork’s coining of the phrase “desirable difficulties” to describe this class of challenges. On Sunday over brunch and a deeply satisfying cup of coffee, a good friend who works as a school psychologist was talking about the kids whose cases she manages and the importance of “productive struggles” to their development and success; that conversation reminded me of headlines from a few years back citing neuroscientific evidence of the importance of boredom for creativity.
And just like that, many paragraphs in, amidst piles of handwritten words that have been struck through and rearranged dozens of times on pages reflecting a process that doesn’t translate well (if at all) to the internet, the pieces fall into place: my dissatisfaction lies not with the mathematics of these models but with the absence of a theory of measurement applied to language as it is necessarily interpreted by these models, because language is a noisy measure.
So what does it matter that language is noisy, and what do I mean by “noisy” anyway? (It is worth noting that the latter half of this question contains an important clue.)
I imagine that there is little argument within the machine learning community that language matters, insofar as it applies to the construction of useful prompts for LLMs. As an illustration of this, there is a growing number of papers that fall under the rubric of “prompt engineering.” Broadly, these papers illustrate the capability and dynamism of LLMs to generate coherent text or even usable code and code-like artifacts in response to (often a series of increasingly) well-tailored prompts. This research is wide-ranging, and includes papers on chain-of-thought reasoning in mathematics, as well as papers demonstrating how LLMs can address previously generated coding errors when prompted using compiler output. Across the board, conditions of success are simple: prompt language should be direct, explicit, concrete, clear, and unambiguous. To the extent that LLMs can be presented with clear, atomized, explicit text, there is a measurable difference in the utility of the output.
Anticipating my argument that requiring LLM prompts to be explicit, clear, and unambiguous is somehow limiting, someone might counter by pointing out that people – not just large language models – also benefit from clear, direct, and unambiguous communication!
It is true: conversational vagaries are difficult to navigate. I don’t imagine that I am the only person who has been in a meeting or engaged in a conversation over dinner, only to say something in earnest and then realize that I completely missed the point of what someone else was saying (often sarcastically) or that the words I used, while not inaccurate, conveyed something that was not at all what I wanted to say. However, equating human reasoning with the inferential mechanics of large language models conflates observable behaviors with underlying mechanisms: just because ambiguity presents challenges to people and models alike doesn’t imply similarity in why ambiguity is a problem. And it’s the “why” that matters here.
Here, ambiguity is when the words used do not or cannot completely and uniquely represent the meaning being conveyed. Put another way, ambiguity is the existence of a gap between syntax and semantics. When presented with linguistic ambiguity, a human interlocutor can and will reason through a one-to-many mapping of words to possible meanings, and choose how best to respond based on their interpretation. This is not an error-free process, and misunderstandings happen not infrequently. Ambiguity is challenging for people because there exists a variety of ways in which they could meaningfully respond, and they have to quickly navigate those possibilities. By contrast, an LLM is challenged by ambiguity because, for these models, there is no semantic space to reason through. When syntax decouples from semantics, the LLM can continue to effectively generate what it is designed to produce: syntactically reasonable output, but there is no mechanism that enables automated reasoning over a space of possible intended meanings, and there is no mechanism for guaranteeing that what is produced is semantically relevant.
When confronted with ambiguity, people make mistakes of interpretation. LLMs make mistakes because they don’t interpret at all. And importantly, ambiguity is a feature of language, not a bug. Words point to what we mean, and they do so only imperfectly.
Even if I had the perfect words to describe all the attributes of my mother that I could describe, there would still be aspects of her that I cannot capture with words. There are feelings about her, memories of her, experiences with her, and ways that she moves about the world which are ineffable. They are ineffable by their very nature, and because of how imperfectly and incompletely we perceive and then process and then describe the world. Even if I had unencumbered access to a complete vocabulary, there is no explicit description of my mother that could completely and uniquely capture who she is, much less who she is to me. This fact is not a fact about my mother, but rather is a fact about language: it is an incomplete, imprecise, and often error-prone reflection of the world. Again, language is a noisy measure.
This imprecision, this noisiness, of language is limiting because, as yet, it remains unaddressed and unaccounted for within the inferential framework of LLMs.
I want to draw a distinction here between my concerns about language as a noisy measure of meaning and criticisms elsewhere in the literature about the distribution of language on which these models have been trained. Yes, I am fundamentally of the mind that if you train large language models broadly on all of the digitized text found on the internet, you are going to induce a number of unwanted biases with respect to rhetorical styling and content. Yes, I am very committed to the idea that we need to understand these models’ capability independent of their training set. And no, these concerns are not my focus here.
I am not dismissing large language models wholesale and saying “garbage in, garbage out.” To extend the metaphor, I am suggesting that we may need to sort the trash. I want to provide measured pushback against the hype surrounding LLMs by specifically advocating for deeper theorizing about how these models are applied to language and what assumptions we are making when we unreflectively conflate meaning-as-language-as-tokens. In large language models we have implemented an incredible mechanism for the precise determination of highly contextualized correlational structures. These correlations reflect some unobserved and undetermined combination of both noise and signal, and we have done nothing – yet – to distinguish between the two. If all we care about are syntactic patterns, then perhaps attention is all we need. But if what we truly care about is something more than syntax, and we seek to leverage these models in a way that generates meaning rather than merely its expression, we need more.
In a recent interview, current Google CEO Sundar Pichai described “a spectrum of possibilities… If you look even at the current debate of where AI is today, where LLMs are… you see people who are strongly opinionated on either side. A set of people who believe LLMs are just not that powerful. They are statistical models [which autocomplete sentences]... And there are people saying these are really powerful technologies, you can see emergent capabilities, and so on. We could hit the wall two iterations down. I don’t think so, but it’s a possibility. [Or] they could really progress in a two year time frame.”
I don’t know that I subscribe to the binary he presents, but his point is well made: there is no consensus among experts regarding what large language models are capable of. In other words, the jury is still out. The way forward could be a technological revolution or it could be a bridge to nowhere, though I would argue that what will ultimately determine our path is not prompt engineering or training data curation. Engineering alone cannot determine which of Pichai’s scenarios comes to pass. If what we are after is a revolution, mathematics alone will not get us there. Attention is not all we need.
If we wish for these models to accomplish more than the plausible completion of sentences in a way that convincingly deploys context-appropriate syntactic artifacts, it is important that we meaningfully integrate theory and engineering. If we are to advance the state of the art in a way that is revolutionary, not merely evolutionary, then we need to work together and discern the signal from the noise.