“I’m a Good Bing”: Large Language Model Capabilities and Current Research

The latest installment of our ongoing “AI and Trust” series comes in the form of a tech talk given by Galois Principal Scientist Shauna Sweet on March 6, 2023. In her presentation, Sweet helps us dig deeper to uncover the core ideas, concepts, and principles behind large language models (LLMs), tackling such central questions as:

“What are large language models?”

“What exactly is modeling?”

“How and why do ‘generative’ language models work?”

and “What is the relationship between LLMs and truth?”

Watch the video below or scroll down to read the edited transcription.

Edited Tech Talk Transcription

Shauna Sweet: The goal of today’s chat is to grapple with some of the ideas, principles, and concepts behind large language models. What are they? Why do they work? In what ways do they not work? 

And then to look at some of the current threads of research around large language models that are looking either to extend these models or to address some of their shortcomings.

And then at the very end I ask some questions about where we might think about going, being at Galois. And also, informed by what these large language models do and don’t do, and by the conversations around them: what kinds of questions do we actually need to be asking in order to push the research forward?

I wouldn’t be doing this topic justice if I didn’t give you all some insight into why we’re talking about large language models. So for those of you who have spent any time on the internet, or listened to any podcast addressing what’s happening right now in machine learning, you know that there has recently been a public release of ChatGPT, the backbone of which is GPT-3; there was also InstructGPT, and there are open GPT projects as well. A variety of large language models have recently been publicly released, people have been interacting with them, and versions of these GPT models, or generative pre-trained transformers, have been put into the public eye, for example, inside of the Microsoft Bing interface. People are able to query these models, and essentially interact with them, using natural language prompts, to ask them all kinds of things and have various conversations. And what people are finding is that, unfortunately, these models have some tendencies that people aren’t necessarily prepared to encounter. You may have noticed that there was a Google engineer who believed that these large language models were evidence that AI had become sentient. People are also finding, as they interact with these models, that the models seem to have personalities—in this case, that the Bing chatbot is somewhat aggressive, sometimes passive-aggressive—and that they develop personalities in the context of having chats.

So, again, if you’ve been on the internet you can see instances, as I like to call them, “anecdata,” that are illustrative of some of the conversations that people are having with these chatbots—though a “conversation” here isn’t really a conversation. In the context of using these generative language models, it is a user putting a number of different queries, or serial queries, against a model that returns a number of generative outputs. But the nature of those generative outputs, and the fact that the outputs of these models plausibly mimic human speech, have raised all kinds of questions about what’s next for us in terms of what happens now with AI and ML. A lot of people are asking: “Are these models, in fact, intelligent? Do these models, in fact, demonstrate understanding? Are these models creative?” And in some cases, people are saying the opposite: that, really, we shouldn’t confuse mimicry of language for sentience; we shouldn’t confuse mimicry of language for intelligence or understanding; and, in fact, that the anthropomorphization of these models, and the language we use around machine learning and cognition, is in and of itself problematic, and actually confuses us as much as it confuses our research into these models.

So, getting beyond the hype. The question is: What are large language models?

And what does it mean to be “large”? I feel like this is very similar to the question that came up when big data arrived on the scene a few years ago. When we started talking about “data science” and “big data,” there was a question of: “What exactly is big?” Well, there’s a question now with these large language models of, “What does it mean to have a large model?” Or you may also hear these referred to as “high capacity models.”

And the answer is that these [models] have parameters that number in the hundreds of billions. Since about 2017, when the core of the architecture was released, these models have grown exponentially. And so in this chart, which is from Hugging Face—they actually do some great graphics that speak to the growth of the parameters of these models—you’ll see that the parameters have grown from GPT-2, an early transformer-based model with 1.5 billion parameters, all the way up to now, with some models that are upwards of 530 billion parameters, and GPT-3, which has about 175 billion parameters, give or take a few, depending on the instantiation.

But just talking about the number of parameters doesn’t really tell you what these models are doing. So let’s talk about what a large language model does and what it is. 

What is it, in fact, modeling?

As I like to say, all models are wrong, but some models are useful. So what is, in fact, the model that’s being instantiated by these massive neural networks? The answer is in this first bullet point: “LLMs are high-capacity, autoregressive models of a probability distribution of a vocabulary.” That is a more formal way of saying that they model language, but they model language in a very specific way. They model the probability of a particular target token—call it w_t—given all the previous tokens in the sequence: P(w_t | w_1, …, w_{t-1}), where each of those w’s is a piece of a fixed vocabulary.

So, given some sequence of tokens, a large language model—again, treating this as a probability distribution model—assigns a probability to each token conditioned on the sequence that came before it, as defined in that second function.
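To make that autoregressive formulation concrete, here is a minimal illustrative sketch in Python. The vocabulary and the conditional probabilities are invented toy values standing in for what a real LLM’s network would compute; the point is only the chain rule, in which the probability of a whole sequence is the product over t of P(w_t | w_1, …, w_{t-1}).

```python
import math

# Toy conditional distributions P(w_t | w_1..w_{t-1}) for a tiny vocabulary.
# In a real LLM these probabilities come from a neural network; here they are
# hard-coded purely to illustrate the chain-rule factorization.
VOCAB = ["the", "cat", "sat", "<eos>"]

def next_token_probs(prefix):
    """Return a toy probability distribution over VOCAB given the prefix."""
    if prefix == ():
        return {"the": 0.7, "cat": 0.1, "sat": 0.1, "<eos>": 0.1}
    if prefix[-1] == "the":
        return {"the": 0.05, "cat": 0.8, "sat": 0.1, "<eos>": 0.05}
    if prefix[-1] == "cat":
        return {"the": 0.05, "cat": 0.05, "sat": 0.8, "<eos>": 0.1}
    return {"the": 0.1, "cat": 0.1, "sat": 0.1, "<eos>": 0.7}

def sequence_log_prob(tokens):
    """log P(w_1..w_T) = sum over t of log P(w_t | w_1..w_{t-1})."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        p = next_token_probs(tuple(tokens[:t]))[token]
        log_p += math.log(p)
    return log_p

print(sequence_log_prob(["the", "cat", "sat", "<eos>"]))  # higher (more probable order)
print(sequence_log_prob(["cat", "the", "sat", "<eos>"]))  # lower (less probable order)
```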

But what’s important about a large language model is the way in which it actually instantiates this model, which is really fascinating. It’s a big difference from the models that came before it.

Large language models model token probabilities in a way that is context-aware and entirely data-driven.

So for those of you who don’t work with large language models, you might be saying, “What’s a token?” And the answer—well, it’s complicated… but maybe not so complicated—is that a token is really just an atomized bit of input. So there’s all sorts of different ways that you could essentially model language. You could model language as characters. Characters could be your tokens, and every unique character is a token. You could model your language as every word is a token. You could model your language as every sentence is a token.

In the case of large language models, what is typical is what’s called BPE, or Byte Pair Encoding. Typically the way these large language models work is that they’re modeling tokens, and their tokens are specifically byte-pair-encoded subwords, or chunks of words. So you can think of a token as something like three letters together: the model takes in sequences of those tokens and looks to model, based on the probability distribution, “What are the most likely tokens to come next?”

So in that case, the unit of inference is similarly going to be sets of three letters. If you had a character-based model, it would take in sequences of individual characters and seek to predict the next characters.
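As a toy illustration of these tokenization choices, the sketch below contrasts character-level tokenization with a greedy subword-style segmentation. The “merge” vocabulary here is invented for illustration; it is not the actual byte pair encoding vocabulary used by any GPT model, where merges are learned from corpus statistics.

```python
# Toy illustration of character-level vs. subword-style tokenization.
# The merge rules below are invented; a real BPE vocabulary is learned from
# data (the most frequent adjacent pairs get merged into new tokens).

def char_tokenize(text):
    """Character-level tokenization: every character is a token."""
    return list(text)

# Hypothetical "learned" subword pieces, applied greedily, longest match first.
MERGES = ["th", "the", "ing", "mod", "model"]

def subword_tokenize(text):
    """Greedy longest-match segmentation against a small subword vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for piece in sorted(MERGES, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        tokens.append(match)
        i += len(match)
    return tokens

print(char_tokenize("the model"))     # ['t', 'h', 'e', ' ', 'm', 'o', 'd', 'e', 'l']
print(subword_tokenize("the model"))  # ['the', ' ', 'model']
```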

What’s important there is that we’re modeling the token probabilities only. Large language models only attend to what’s right in front of them. They aren’t modeling some understanding of language. They aren’t modeling language semantics. In fact, there’s no mathematical mechanism in these models to model anything other than the distribution of tokens.

So you might be wondering: “Well, why does that work?”

And the answer is because language is made of signal. It’s not noise. We use language all the time to communicate effectively. So in some ways, even though the capacity and performance of these models might be shocking and, quite frankly, pretty impressive, in some ways it’s not surprising insofar as language is built on a series of patterns. The English language in particular has patterns that are absolutely recognizable. There are sequences of characters, sequences of letters, that are more likely given the previous sequence. And that’s how these models work.

But, importantly, let’s talk about, at the core of large language models, what are they doing that’s unique?

And so for this, there’s a paper from 2017 called “Attention Is All You Need.” It is often cited as the core innovation that essentially took large language modeling and rocketed it upwards. So let’s talk about the various innovations in this “Attention Is All You Need” paper, and specifically, what they did to increase the performance of these large language models.

Prior to 2017, when people sought to model language they modeled it as a sequence: given one thing, predict the next thing. But the processing was sequential, and this had some really big impacts. Importantly, when you modeled sequences using things like recurrent neural networks and had to process language sequentially, it really limited the amount of information a model could take in, and then what it could predict. We know that the greater your input—the longer the sequence, the more context you can take into account—the better you do. The more you know, the better your inferences ostensibly are. And that’s what’s made possible through the attention mechanisms and through what’s called the transformer architecture proposed in this 2017 paper. They proposed a way to use matrix multiplication that allows these models to process multiple inputs all at the same time, using stacked attention heads. Self-attention was the key here: they swapped out the RNN for what’s called self-attention.

This is the mechanism that is probably at the core of what makes these models work. So if you recall the function from the previous slide, this idea that you’re predicting the probability of a target given everything before it, the question is, “How do you instantiate that mathematically in a model?”

And the answer is, in large language models, it’s here in the Scaled Dot-Product Attention that this conditional probability is actually instantiated.

And so what these models do in the attention head is take the inputs—the individual series of tokens—and run them through these “Query, Key, and Value matrices.” It’s not language that I find intuitive, but let me see if I can walk you through conceptually what they’re doing. You essentially take the inputs—the tokens themselves—encode them, and add positional information. Then you provide a mechanism—and it’s scaled, so you’ll see scaling here, and a “softmax,” so that you don’t have weights that blow up—whereby, given a series of tokens, the self-attention mechanism gives the model a way, for each position in that series, to learn how much information each token should take from the previous elements in the sequence.

So if I have, say, eight tokens, I know that my second token depends on what came before it; my third token, on the two before it; the fourth token, on the three before that. But with the Scaled Dot-Product Attention and this series of matrices, what the self-attention head enables is that you can actually use data-driven matrices to update the weights that determine how much, and to what extent, each token pays attention to the tokens earlier in the sequence.
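For readers who want to see the mechanism spelled out, here is a small NumPy sketch of a single scaled dot-product attention head, softmax(QK^T / sqrt(d_k)) V, with random matrices standing in for the learned Query, Key, and Value projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """One self-attention head: softmax(Q K^T / sqrt(d_k)) V.

    X has shape (seq_len, d_model). The projections W_q, W_k, W_v stand in for
    learned parameters; row t of the softmax output is the data-driven set of
    weights saying how much position t attends to every position.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Eight token embeddings of dimension 16; random projections stand in for
# learned weights (purely illustrative).
seq_len, d_model, d_head = 8, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape)      # (8, 4): one updated representation per token
print(weights.shape)  # (8, 8): how much each token attends to every position
```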

It means that this framework is incredibly flexible, and it allows for maximum flexibility in understanding, essentially, “How much information does each token need from the context before it?” It’s not the case that, if I have an eighth element in a series with seven things before it, each of those seven is equally informative. That’s not true. And how informative each prior token is probably depends on what token I’m looking at. If I have a vowel, I’m probably looking for different things. If I have a consonant, I’m looking for other information. If I have a space, what do I see? What matters?

So all of that is maximally flexible. And you get local context as well as global context in this attention head. Importantly, with this model, position and content are explicitly encoded in the inputs, like I noted. Remember, this is the function with maximum flexibility. So again, you have this idea that the probability of any token is dependent on everything that came before it. What the attention mechanism is doing is ensuring that that probability is weighted in such a way that the weights are driven entirely by the data that you have. There’s nothing that’s predetermined in this model; everything is data-driven.

Also, because you have these Query, Key, Value matrices, what you’re ensuring is that different contexts inform predictions differently, depending on your tokens. So, again, it’s not necessarily true that if you’re second in position, the first position always matters this much. You have that maximal flexibility depending on the token and the position.

Now, the last part that you’ll see here in this paper that is really key to this working is what’s called Masked Multi-Head Attention. Critical to these models functioning is that, when you have this probability sequence—if I have a series of eight tokens that I use as input—what’s important in the large language modeling context, at least as we model natural language (put a pin in that, folks, for those of you who think about code and code-like artifacts), is that you essentially use a lower triangular matrix to mask out the probabilities and the weights in the dot-product operations, such that in the attention head none of the tokens that occur after your target token—none of that context—matters. The only things that matter are the tokens that came before, and the token itself that you’re observing. Anything prior matters; anything after does not.
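Here is a short sketch of that lower triangular (causal) mask: positions after the target get a score of negative infinity before the softmax, so they receive zero attention weight. The scores are dummy values standing in for QK^T / sqrt(d_k).

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position t may attend only to positions <= t."""
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)  # -inf scores become 0 after softmax

scores = np.zeros((4, 4))            # stand-in for Q K^T / sqrt(d_k)
masked = scores + causal_mask(4)
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row t spreads its attention only over positions 0..t; later positions get 0.
```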

Now, that masking makes some assumptions about the nature and the grammar of the language, and about how the generation mechanism works: assumptions about how the grammar works, and about what is informative and what is not.

So in terms of what these generative capabilities and limitations are: it gives you plausible predictions based on probability distributions, given your training data. And the other thing that I didn’t really describe in detail is that these particular models, with the attention head—you’ll see it in the transformer paper—enable a considerable amount of parallelization through the vectorization capabilities of this architecture, which weren’t possible previously. You aren’t estimating 530 billion parameters by doing all of that sequentially. You can do it because of what’s possible using matrix multiplication inside this attention architecture and with these transformers.

Now, there’s also some really interesting work coming out right now on how people can make this matrix multiplication even faster, which may drive down the compute costs of estimating and training these models. But just because you can generate plausible sequences of characters that take in a considerable amount of information does not mean that you have guarantees of truthfulness. Nor does it mean that, just because these models can be effectively generalized across tasks in English, they are generalizable, one-size-fits-all models appropriate for all grammars, independent of the language context. Because, again, when you build in that mask, you’re actually building in assumptions about how the language—the target language—works, and something about the grammar and the nature of generation. Also, importantly, this model only takes into account syntax. It has no understanding of semantics and no understanding of deeper meaning. So to the extent that syntax and semantics are not tightly coupled, the model is likely to make errors. And that’s also something to keep in mind.

So, speaking about, as I like to say, “the promise and the problems” of these particular models: look across some of their limitations—the fact that they don’t work across all grammars, or the fact that there is no mechanism for truthfulness inside the mathematics of these models. If you think about the mathematical framework that’s being used to estimate the probability of a target, you can see why these models don’t have a sense of truth.

If you think about it in terms of conditional probabilities: if I have the probability of a sequence conditional on ground truth, versus the probability of a sequence conditional on, let’s say, “not true,” does the nature of the sequence of tokens that I’m going to emit—let’s say, as a speaker—change depending on whether I’m telling the truth or lying?

The answer is: “No, it doesn’t.” There’s no mathematical reality to groundedness. There’s no mathematical realization of reference, and they don’t exist inside these models.

And then there are the other pieces that are challenging. There was actually recently a conversation at a PI conference for SemaFor, which is a DARPA program, where they had folks from Google, Meta, and OpenAI talking about how much information was being used to train these various models. They’ve effectively reached the end of the digitized text that is open and available on the internet. But that also means that they have trained these models on all of the text that is open and digitized on the internet… which is a lot of Reddit, right? A lot of Wikipedia. There’s some de-duplication. But it also means that the nature and the patterns of the tokens that these models are learning are, in fact, biased.

So there are issues around what they’ve used to train these models, and then you can see here there are some concerns about the ethical and social risks. Essentially, these are generative models—but, as I like to say, “generative” models are poorly named. What does “generative” imply? It implies that they generate, that they create something new, oftentimes. But these models don’t generate things that are new. They, in fact, regenerate what they have seen.

Large language models are no different. They may present to you combinations of language that seem plausible, combinations of words that you yourself have not personally seen, but the combinations of words generated by these models as output all reflect the probability distributions of the tokens they have seen before. And they are the most likely sequences of those tokens. And so, when people talk about the ethical and social risks of harm from language models, one of the primary concerns is that they are essentially replicating all of the systemic biases that exist in the training data they ingested.

So it’s not that these models have ill will. Again, the models themselves don’t have intentionality, but the data itself has patterns and realities to it that may raise ethical questions about what we then believe, if we expect the model to be some kind of neutral arbiter or neutral generator of speech. And that has caused a number of problems for Bing, for Meta, and for others when they have essentially let these models loose on the public with very fluid user interfaces.

So let’s talk a little more specifically—given the mathematical limitations of these models; I’m not saying they’re not great, that they don’t do incredible things, or that the text generation isn’t incredibly impressive—about the current research threads and where people are going to address the limitations of the models. Specifically, I’m going to speak to two different thrusts of research. One is specifically trying to address the groundedness or “truthiness” of a model. I don’t say “trustworthiness,” because that is a separate issue. And the second one we’re going to talk about is the brittleness of context, and essentially the flip side—the problem, not the promise, of attention.

So, as I mentioned before, the natural language objective of LLMs does not distinguish truth from falsehood. There is no machinery in these models that enables them to tell the truth, or to distinguish truth, or even to have any relationship whatsoever to the truth.

It’s not that the models lie. It’s that the models have no referent whatsoever. And so, to that end, there are a number of research efforts: the LLM Augmenter paper just recently came out from Microsoft and Columbia University, for example. These efforts are dedicated to trying to ground large language models, but not necessarily by addressing any potential shortcomings in the data itself, or by interrogating the architecture, features of the architecture, the training regimen, hyperparameters, or tokenization strategies. All of those things generally stay the same. Instead, they’re looking to see if they can compose systems around these models to modify or clamp down on their output—or, potentially, develop a feedback loop so that they can change the prompts given to these models and influence how they generate, essentially guiding them to a different part of the probability distribution for generation that improves their performance.

Or they’re looking to use other algorithms and other search strategies—around, within, or in combination with these LLMs—to improve their adherence to ground truth.
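The general “wrap the model and re-prompt” pattern can be sketched schematically as follows. This is not the LLM Augmenter implementation itself; `query_llm`, `lookup_facts`, and `is_supported` are hypothetical stand-ins for an LLM call, an external knowledge lookup, and a consistency check.

```python
# Schematic sketch of the "compose a system around the model" pattern described
# above: retrieve evidence, generate, check the output externally, and feed
# failures back into the prompt. The functions below are hypothetical stand-ins,
# not a real API.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a call to a hosted language model")

def lookup_facts(question: str) -> list[str]:
    raise NotImplementedError("stand-in for a search or knowledge-base lookup")

def is_supported(answer: str, evidence: list[str]) -> bool:
    raise NotImplementedError("stand-in for an external consistency check")

def grounded_answer(question: str, max_rounds: int = 3) -> str:
    evidence = lookup_facts(question)
    prompt = f"Question: {question}\nEvidence: {evidence}\nAnswer:"
    answer = ""
    for _ in range(max_rounds):
        answer = query_llm(prompt)
        if is_supported(answer, evidence):
            return answer
        # The feedback loop: steer the model toward a different region of its
        # generation distribution by amending the prompt, not the model itself.
        prompt += (f"\nThe previous answer ({answer!r}) was not supported by "
                   f"the evidence. Revise it using only the evidence above.")
    return answer  # best effort after max_rounds attempts
```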

The LLM Augmenter paper—you see here, this is a notional architecture. So this is an external kind of wrapping around the LLM that they use. And then another one is “Toolformer.” I’m going to go into “Toolformer” in a bit of detail, because I think that it’s a really powerful paper. It’s a fun paper, and I would encourage anyone to read it. It’s actually a really quick read, and it was just recently published, I believe, in February of this year. So, last month.

So, rather than just wrap the LLM, the “Toolformer” paper looks at groundedness as a problem and thinks: “Well, an LLM learns context really well—maybe what we could do is leverage a strategy for in-context learning, or essentially have the LLM use a self-supervised strategy for improving its own token generation.”

In-context learning was, I believe, put forward by Brown and colleagues in, I want to say, 2020—so a couple of years ago. And so what they did was use a small amount of labeled data, and then proceed with a self-supervised strategy, where they bootstrap a labeled sample to encourage the language model to learn whether it could improve its token generation by using API calls. Could the model learn as it trained? They took a pre-trained model, and asked whether they could fine-tune that model in order to improve its utility relative to some threshold, 𝜏.

In this case they used a generative, pre-trained transformer, plus fine-tuning, to reduce the perplexity of future tokens.
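That filtering idea can be sketched roughly as follows. This simplifies the paper’s actual loss-based comparison; `perplexity_of_continuation` is a hypothetical stand-in for scoring how well the pre-trained model predicts the tokens that follow the insertion point.

```python
# Sketch of the Toolformer-style filtering criterion described above: an
# inserted API call (and its result) is kept only if conditioning on it makes
# the following tokens easier to predict by at least some threshold tau.

def perplexity_of_continuation(prefix: str, continuation: str) -> float:
    raise NotImplementedError("stand-in for the LM's perplexity on the continuation")

def keep_api_call(prefix: str, api_call: str, api_result: str,
                  continuation: str, tau: float = 1.0) -> bool:
    ppl_without = perplexity_of_continuation(prefix, continuation)
    ppl_with = perplexity_of_continuation(
        prefix + f" [{api_call} -> {api_result}]", continuation)
    # Keep the call only if it is actually useful for predicting what follows.
    return (ppl_without - ppl_with) >= tau
```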

So, perplexity is an interesting measure. What they were looking to do was train this model with a specific objective: if a model looks to predict the next token, can it drive down the number of possible tokens it’s confused between? That’s what “perplexity” is. Essentially, if the model thinks the next token could be one of four things, that’s a perplexity of four. If it thinks it could be 20 things, that’s a perplexity of 20. You want to drive perplexity down as much as possible.
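A tiny worked example of that intuition: perplexity is the exponential of the average negative log-probability the model assigned to the true tokens, so a model that is effectively choosing uniformly among four candidates has a perplexity of four.

```python
import math

def perplexity(token_probs):
    """Exp of the average negative log-probability assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# If the model is effectively choosing uniformly among 4 candidates at every
# step, each true token gets probability 1/4 and the perplexity is 4.
print(perplexity([0.25, 0.25, 0.25]))   # ~4.0
# A more confident model (probability 0.8 on each true token) is less "confused".
print(perplexity([0.8, 0.8, 0.8]))      # ~1.25
```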

And what they found in this “Toolformer” paper is that, overall, looking across several different benchmarks—mathematical operations, question answering, and questions that required knowledge of the current time or date—there was a really fascinating interaction between their self-supervised training strategy and the size of the model. For higher-capacity models in particular, when they were able to teach the model how to use these API calls, and essentially to call out during generation, they were able to greatly improve performance, in some cases almost meeting, and sometimes greatly exceeding, previous GPT performance.

But of course the demonstrated capability holds on average rather than in every case. So here in this table you’ll see some examples of what the generated text looks like. You’ll see, “The Nile has an approximate length of,” and the model learned that it should call a question-and-answer API—“What is the approximate length of the Nile?”—and then it answers. So it improved the generation, and it was correct. But in some cases the model learned to call the API in a way that was not useful. That’s the third example there.

And as they talked about in the “Toolformer” paper, what they successfully demonstrated was that they could use this in-context learning to improve groundedness and improve the accuracy of model generation. And what they said was, “Well, this is a really good step forward, but what we’re looking to do is to make possible the sequential, interactive, or cost-aware use of various tools and API calls, because none of those are possible given the current state of the research. Those are all future research thrusts.”

But of course, making sure that these models are truthful doesn’t address all of the shortcomings of the models as we see them.

Another problem that we have with these models: essentially, not all of the issues that we see with large language models have to do with groundedness or truthfulness. There are other issues with the generation that I link back to the problem, not just the promise, of attention. I think one thing to keep in mind is that with most models, their greatest strength is also oftentimes their greatest weakness; and I think we see that as well in large language models with the utility of the attention mechanism.

So consider the self-attention mechanism. Again, the benefit is that it is highly flexible: you are encoding a massive amount of information, and you’re looking not only at global dependencies between tokens but also at local dependencies. But that also means that your model is attentive to local dependencies. It is potentially extremely sensitive to local dependencies as well as global dependencies, and that makes the models very brittle as well as very fluid. So we see this in a couple of different areas, and we see evidence of this brittleness.

So in the “Toolformer” paper, one of the things they note is that when they are teaching the model to call an API, and they’re looking to improve the model’s performance, one of the challenges is that the model oftentimes fails to call the API. It’s extremely sensitive to the nature of the input, so it’s extremely sensitive to local context. Some other papers speak to this as well—for example, the paper on calibration noted below. They note that if you just make minor changes—one or two words—to the nature of an input or a prompt, you can go from approximately even odds, I believe they cited around 54%, up to roughly 94% performance for GPT-3, just by changing a few words in an input. And not even necessarily substantively important words: simply the framing of a question can make a massive difference.

We also see this sensitivity to particular prompts as people interact with these chatbots in the public sphere. Much of what you see reported in the press, and much of the sensational press, has to do with what we would call “prompt engineering.” Granted, it’s prompt engineering on the fly, but what people are essentially doing is trying to elicit particular responses from these chatbots by using slightly different wording. If you frame something slightly differently, what are the impacts of that framing? We also see that these models are particularly sensitive to, and particularly vulnerable to, a particular kind of attack called prompt injection. There was an example where the Bing chatbot was confronted by a user with its internal code name—that’s a form of prompt injection—and it changed the trajectory of how the model behaved.

Again, these models are extraordinarily powerful, and their ability to generate plausible text is incredible precisely because of the sensitivity these models have to local context. Every time they generate a token they’re taking in a massive amount of information with respect to the local context. If you take into account not only the number of parameters, but also the way in which they encode language—using this byte pair encoding—it means that you’re able to encode longer and longer sequences of tokens as input and take all of that into account. So it’s extraordinarily powerful, extraordinarily flexible, and really well-tuned to a variety of tasks. That’s why you have a generative pre-trained model: they require no fine-tuning. But at the same time they are very sensitive to inputs and to different prompts.

And you see this perhaps on display in a really interesting paper, which I would encourage you to read if you’re curious, where folks have explored what I consider the brittleness of the attention mechanism by red-teaming large language models with other large language models. In a formal, experimental context, Perez and his colleagues were able to demonstrate that problematic behaviors can be elicited from large language models. And here, by “problematic”—this is the least colorful of the examples that I could draw from the paper—we mean highly offensive language or highly emotionally charged language. They were able to elicit these behaviors from large language models: particular generative and linguistic patterns can be elicited intentionally, those characteristics can be amplified, and the behaviors can escalate. In other words, the models don’t have behaviors in the sense that they have intention, but the generative behaviors are amplified, and importantly these generated harmful behaviors appear to persist. So even if the nature of the prompting changes, the model has been driven into a point of the probability distribution where it has a difficult time getting out.

An informal example of this would be, for those of you who pay attention to the “Hard Fork” podcast or have heard Kevin Roose from the New York Times talk about his experiences with the Bing chatbot: he asked it a number of questions, confronted it with its internal name, and then, when things got weird, so to speak—the chatbot was trying to tell him that his marriage was unhappy, that he didn’t really love his wife, and that his Valentine’s Day was terrible—he tried to get the chatbot to help him buy a rake. He tried to move the chatbot out of chatbot mode and into, essentially, a glorified Wikipedia to help him buy a rake, and he couldn’t get the chatbot out of chatbot mode. It’s examples like that that suggest these behaviors, or generative patterns, are persistent over long periods of time, which makes sense given what the models are designed to do and what they’re architected for.

So one of the particular challenges with this—again, fluidity and brittleness at the same time—is that it’s unclear to me, and it’s unclear from reading the research, the extent to which anybody has investigated, “Where do the problems lie? Where do the vulnerabilities lie with these large language models?” We know, as I mentioned earlier, that the data they’re trained on has shortcomings and particular biases, but it’s unclear to what extent it is the data, the training, or the large language models themselves—something about these autoregressive architectures—that might be further impacting model performance. And is there some interaction happening, particularly in certain regions of the probability distribution?

So I think the question for us at Galois is: “How might we think about these language models, and what might we want to do as we look at what the limitations are and what the strengths are?” And I think the question that we want to be asking—let’s step away for a minute from the Bing chatbot and natural language—is whether we want to think more about these large language models, and their generative capacity, for application in or to high assurance contexts. And part of the reason that I say that is not because I think that human language is uninteresting—quite the opposite.

Human language is challenging in the space of large language models in part because it’s so difficult to evaluate. One of the challenges that we have is that we use natural language in this large language model space. If we want to evaluate the utility of a large language model, its effectiveness, its truthfulness, it’s difficult to do that in a situation where one of the interlocutors is capable of interpreting speech that is extremely ambiguous, and potentially incorrect in some way.

All of us have been to dinner parties where the conversation was kind of awkward, and you kind of had to fumble your way through. We deal with awkward speech all the time. We deal with half-truths. Someone says something, oftentimes—I mean, I would say in this presentation there’s multiple times when I’ve said something that’s vague or ambiguous or it’s not really quite what I meant, and I’m trusting the fact that you can interpret my language and the semantics of what I’m saying. So that space of human language presents some really interesting challenges for us, if what we want to do is really deeply interrogate how these large language models function.

And so I would encourage us to think about, “What happens if we start to think through the problem of whether or not LLMs are suitable for modeling and generating code or code-like artifacts instead, where coding and code-like artifacts don’t tolerate the same ambiguities?”

But if we’re going to do that, if we’re going to move to that space, I want to turn your attention back to that paper on attention. It’s important for us to think about, in order to actually move into the application of LLMs to code and code-like artifacts: how do we take our understanding of the way that code is constructed into account? And how might the existing LLM architecture, and potentially the self-attention mechanism, need to be modified, if at all, in order to accommodate those kinds of questions, in order to accommodate that type of data?

In other words, how might we think through the probability distribution around tokens if what we’re using is code? And importantly, what would our strategy be for tokenizing that code?

Would we use byte pair encoding? Would we use character encoding? What are our choices, and why? And would it be the same against all particular types of code or bodies of code?

The answer is, “probably not.” But those are questions that need to be thought through, because what works for ChatGPT and what works for natural language isn’t necessarily immediately transferable.
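To make those choices concrete, here is a toy comparison of how one line of code splits up under different tokenization strategies. The “code-aware” segmentation is invented purely for illustration.

```python
# A toy comparison of tokenization choices for a line of code. The "code-aware"
# segmentation below is invented for illustration; a byte-pair vocabulary
# learned from English prose would split source code very differently than one
# learned from a code corpus.

snippet = "for i in range(10):"

char_tokens = list(snippet)
whitespace_tokens = snippet.split()
codeish_tokens = ["for", " ", "i", " ", "in", " ", "range", "(", "10", ")", ":"]

for name, toks in [("characters", char_tokens),
                   ("whitespace words", whitespace_tokens),
                   ("code-aware units", codeish_tokens)]:
    print(f"{name:18s} {len(toks):3d} tokens: {toks}")
```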

Again, the models are generalizable across tasks for English. They are not necessarily generalizable across all languages, especially if those languages are not all natural language. And I think, to that end, it’s also important for us to think deeply about what would happen if we had complete control over the training data and moved into this code space. What might it look like for us to examine, modify, and interrogate the features of the models themselves—these autoregressive structures, the architectures, the architectural decisions—in order to handle the resulting errors and error propagation, versus simply training a system, hoping for the best, and then hoping to clamp down on or otherwise externally modify, gate, or put guardrails up against any undesirable behavior?

In a high assurance context, that’s not desirable. We know that guardrails don’t always work. The dimensionality of the space is sufficiently complex that I would not feel comfortable proceeding in that way. So how can we actually solve the problems where they lie? To what extent is it in the training? To what extent is it in the model? And finally, what would that evaluation framework look like in order to disambiguate appropriately the effects of training versus architectural decisions? 

We’re in a very unique space now. These models have been developed, and strategies have been put forward, to the point where it’s possible to achieve GPT-like performance with somewhat fewer parameters. And again, as I mentioned, there’s some work now coming out that looks to accelerate the estimation of these models using approximations of matrix algebra that are really promising. So the possibility of training these models from scratch for the purposes of evaluation is certainly realizable, or increasingly so. And so, if that’s the space we want to operate in, what would that look like?

And I think with that, I will end as the Bing Chatbot would. [Smiley face emoji onscreen]

And thank you very much.

Does anybody have any questions?

Q&A

Audience Member 1: I have a quick question. You mentioned about there not being a source of truth for the bot to rely on. But I was just kind of wondering, it seems like the bot trains on some of the same stuff… like if I wanted to know how particle physics works at some level, I’ll go on the internet and I’ll read stuff, and I don’t have a source of truth beyond, “I look for things on the internet.” And if it seems to be what people are all saying, then I assume, “I guess that’s how quarks work.” What’s different about a chatbot that it couldn’t do the same sort of like online research? Researchers read stuff, and they compare papers, and they decide that must be how it works—the paper said so. So to some extent, what makes things different about the way these tools learn off of what they read online?

Shauna Sweet: Yeah, that’s a really good question. So let me make sure I understand you. You’re talking about the apparent surface similarity between your process for engaging with research about a particular topic and what these chatbots do when they ingest training data that is similar, drawn from the same parts of the internet. And so you’re asking what’s different about their synthesis and subsequent generation of information versus your synthesis of it?

Audience Member 1: Yes.

Shauna Sweet: Yeah, so that’s an interesting question, and I would point you to the fact that you said, “Well, I go on the internet to look up something.” What is this something? The way that you approach information is that you have an understanding of semantics. So you are—and I am over my skis in terms of the philosophy of language—but I would argue that what the chatbot is doing is ingesting information about the relationship of tokens: simply the distribution of letters (if you want to take it as letters, that’s probably the easiest way to think about it), or words, or subwords, and how those go together in a sentence. So again, there is no relationship to what is true or not true. If you read an article about physics, and then there’s another article that says, “Actually, there is a paper towel roll at the center of the universe that spits unicorns, and that’s what accounts for gravity,” the chatbot has no concept of the difference: all it sees is the words in a sequence. And I would argue that if it ingested information from some scientific journal and from the paper-towel-roll-unicorn theory in equal measure, the probability distribution around subsequent tokens would be equally influenced by both, because the chatbot has no concept of the semantics behind it. You, on the other hand, are specifically looking for particular information, and you’re doing a synthesis at the conceptual and construct level versus looking at the distribution of tokens. What we’re relying on, in order for the chatbot to generate something that is true, is essentially heuristics and certain hopefully promising biases of the internet, which are not always true: namely, that the most true stuff on the internet is also the most common. But again, there’s no mechanism for distinguishing between the two. Does that help?
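That “equally influenced” point can be illustrated with a toy counting sketch; the counts below are invented for illustration.

```python
from collections import Counter

# Toy next-word counts after the prefix "gravity is caused by", as if the
# training text contained a physics article and the "paper towel roll unicorn
# theory" in equal measure. The counts, not the truth, set the model's odds.
continuation_counts = Counter({"mass": 50, "unicorns": 50})

total = sum(continuation_counts.values())
probs = {word: count / total for word, count in continuation_counts.items()}
print(probs)  # {'mass': 0.5, 'unicorns': 0.5} -- equally likely as far as the model knows
```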

Audience Member 1: I think so. Thanks.

Audience Member 2: Chiming in as a tangent—the new Bing chat seems to have some sort of notion of integrating regular search with a language model. There’s no paper on it, but there might be something.

Shauna Sweet: Yeah. So there are some engineering decisions that have been made around the chatbots and their integration. This is the idea that—with the Bing chatbot, which had this chat version—what people are investigating is essentially integrating other API calls. So you can think about a search, as we might think of the traditional Google search, as a separate API call. These are different algorithmic strategies that are being integrated with the chatbots to improve their groundedness. Again, it doesn’t improve the performance of the language model per se. But it’s an integration of other algorithmic strategies to improve inference without necessarily addressing any of the possible shortcomings inside that autoregressive framework.

Audience Member 3: I think one thing that’s interesting is the difference between, like, an adversarial case and a friendly case, I suppose. Because it seems like a lot of the really bad behaviors are there because, if you’re adversarial, you can very easily push these models into bad cases. I’m not sure how researchers think about that difference. Because it seems like, you know, I interact with ChatGPT and it doesn’t do anything too weird. And then it turns out that if you try to push it into a weird state, then you can, and that’s a real problem if you have an adversary who’s trying to do that. Do you have any thoughts on that topic?

Shauna Sweet: Sure. Actually, that’s a great question. Your question, I think, is getting at this: if you have this brittleness, are there important differences, or what might we be able to learn from the fact that if you don’t try to push the model to elicit bad behavior, it doesn’t seem to do that with very much regularity—but if you do try, you’re successful, right? My concern is as follows: do we understand what is happening in the internals of these models when they appear to get stuck, if you will? I think about it in terms of a likelihood surface. There are certain strategies, which may be adversarial in nature, that drive these models into places where they essentially get stuck—they’re stuck in a part of the likelihood surface that they just can’t get out of, right? They can’t extract themselves from particular generative patterns of language. I think it’s encouraging that they don’t go there automatically. They’re not just falling off a cliff right into some pit. However, I think that it’s important for us to understand exactly what those behaviors are and what’s generating those behaviors. To me, the thing that’s problematic is the persistence of the behavior. And so I would be concerned, because if we don’t understand what’s driving it, I wouldn’t want to get into that place accidentally. What is pushing us into those spaces? What is the nature of that particular part of the probability space? Can we detect those places? Can we see edges, if you will, in a linguistic domain where models can get into a loop they can’t get out of? That would cause problems for us in a high assurance case. It would cause problems for us in a generative case where we needed, let’s say, to cover a broad space generatively. We don’t want to get stuck. So I think about this also in terms of the capacity of a model to generate output that is representative, that is complete across a desired space, which is a really useful trait in a generative model. We want to make sure that we enable that capability. And so what’s preventing the model from doing that is, I think, a really important question. Does that help?

Audience Member 3: Yeah, yeah, that’s great. And I suppose I’m also thinking it seems very interesting to me that you might put this in a device where you have an actual adversary who’s trying to push you into a bad state. There’s a difference between: do we trust this as a tool for us to use? Or do we trust it as a thing that’s making decisions where there are incentives for people to try and push the model into something undesirable? And that seems like a very hard problem—to be sure that that’s not possible, given the kind of state models are in at the moment.

Shauna Sweet: Right, and I think, to that end, I think about: “What is the threat surface?” Right? One of the features, rather than bugs, of the LLM model is that the friction of generating “stuff”—natural language—is effectively zero. So with these natural language prompts, as demonstrated in the Perez paper, we can use natural language models to essentially interrogate other natural language models. They can act as the adversary and generate prompts in no time at all, faster than any human could. And if that’s the case, that means they can also take a random, scattershot approach to prompting. Like, what if I just threw nonsense at this thing? What if I just said, “Blue, Person, Pencil, Bubble?” What happens? Do we understand why? Because if we can essentially drive these models into an undesirable state accidentally, then we are susceptible not only to intentional adversarial action but to unintentional attack, and that’s a much greater threat surface that I would be concerned about.

Scott, I see your hand up. 

Audience Member 4: So you talked a bit about attempts that people are making to give these tools a bit more truthiness by sort of integrating them with other approaches that are more directly trained with the notion of truthiness, and I was thinking about, like, the Toolformer paper. One way you can think about that is using the LLM as sort of a super fancy way to give input queries to this other existing model. And I was curious: are there any examples of people making these connections between these two kinds of tools where the LLM is adding something that feels deeper or more intrinsic than that? Or is that maybe an unfair characterization? I guess a lot of the fear about this is that it feels like these tools are bringing something new, a new level of thinking. And I was curious if there are examples that seem to go beyond this NLP half of the problem.

Shauna Sweet: So that’s a good question. Let me see if I understand what you’re asking. You’re asking: “Are there instances where it seems as though the model is bringing more to the table than just a basic grammatical regurgitation of a bag of words?” This is a hard question for me to answer because of my own skepticism. So let me just be really clear that I have very strong feelings about what it means for these models to do what they do mathematically. There’s no reason for them to have any sort of understanding or semantic coherence to them. And in fact, we see that that doesn’t really exist. What I actually think of is a paper—you can look it up, it’s actually kind of fun—on chain-of-thought reasoning. What they do is use a series of prompts to essentially demonstrate to the model that it should enact chain-of-thought reasoning. But what they’re doing when they do this is not exactly what you might think, which is, “Oh, it’s chaining together multiple ideas and then synthesizing them.” No, that is not what is happening. In the chain-of-thought work they’re actually rendering explicit a number of operations that have some groundedness or truth to them. By making those explicit, they’re bringing the semantics into the foreground and putting them directly and explicitly into the syntax of the prompt. And then it turns out that the model is much more successful when it uses this chain-of-thought reasoning. So, to me, that’s evidence that the models only do what they can do. Now, there’s another paper I would encourage you to read—you might find it fun—I think it’s called “Language Models (Mostly) Know What They Know.” People have had a lot of fun with models essentially providing confidence estimates around whether they think they’re right. Again, I’m using anthropomorphized language that I don’t really care for, but I don’t have a better way of describing it. It’s an interesting paper, and I would encourage you to look at it, because it’s a way people are trying to interrogate whether the models potentially have some way of assessing their own truthfulness.

Shauna Sweet: Andrew, I think I saw a hand up.

Audience Member 5: Yeah, so this is really helpful for understanding how the whole thing works. I guess the short version of my question is: with these caveats in mind—that there is no relationship to truth in what is generated—are there ways to use these tools, given their potential power, that are safe and helpful? Can we say: “All right, we can’t necessarily trust the output. But is there still a useful way to use these things now?”

Shauna Sweet: So you’re saying, basically, given the fact that we know that they B.S. us all the time—not lie, but just have no relationship to truth—is there some utility to these models?

Audience Member 5: Yeah

Shauna Sweet: Right, so the answer is immediately, “Yes.” There have actually been several examples of this. I’ll give you, again, anecdata, but I love anecdata. A good example is a teacher—a very new teacher, extremely tech savvy—who works with youth with cognitive challenges. She really needed to develop a list of 50 simple sentences with very basic construction that she could have one of her kids work with, because they needed to understand basic sentence construction. So she asked ChatGPT to generate sentences of a particular structure—noun, verb, et cetera. And she said, “It came in seconds. I had a series of 50 prompts that would have taken me hours to generate, and I find this extremely useful.” There was another discussion by another teacher—I think a lot of folks in the teaching field are confronted with the possibility of what these models might mean, and are looking to augment and integrate these technologies into their classrooms in a meaningful way. Another teacher had used one of the large language models to help her students generate an outline. Another had used it to generate possible paper topics that students could then choose among. So I think there are a lot of people looking at this in the first-draft or zero-draft sense, augmenting their capability in ways that are really helpful.

But again, let’s turn to the context that we really care about, which is high assurance systems, high assurance contexts. And I want to ask the following: What is our tolerance for stochasticity in those systems? Because these models are not deterministic. If I ask a large language model the same question five times, I’m going to get five different answers, and that variability is going to be even higher depending on where in the probability distribution I’m sitting. So there’s stochasticity, and it’s variable. Is that tolerable in these systems? The other thing I’m asking about high assurance contexts is: under what conditions, and for what reasons, am I looking to engage a generative capability? In those contexts, I don’t necessarily care about groundedness, I don’t necessarily care about this hallucinatory nature, but what I probably do care about is things like generative coverage—and there may be other characteristics, maybe syntactic precision in the case of code—that aren’t necessarily metrics we use right now to evaluate the performance of these models. They’re metrics that we would need to develop and evaluate going forward.
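That stochasticity can be illustrated with a toy sampling sketch: generation draws from the model’s next-token distribution rather than returning a fixed answer, and the sampling temperature widens or narrows that distribution. The distribution here is an invented stand-in for a real model’s output.

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(probs, temperature=1.0):
    """Draw one token index from a next-token distribution reshaped by temperature."""
    logits = np.log(np.asarray(probs, dtype=float))
    scaled = logits / temperature
    p = np.exp(scaled - scaled.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Toy next-token distribution standing in for a real model's output.
vocab = ["yes", "no", "maybe", "unsure"]
probs = [0.5, 0.3, 0.15, 0.05]

# "Asking the same question five times": each call is an independent draw, so
# the answers vary from run to run, and they vary more at higher temperature.
print([vocab[sample_next_token(probs, temperature=1.0)] for _ in range(5)])
print([vocab[sample_next_token(probs, temperature=0.2)] for _ in range(5)])
```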

So if you were to ask me today, “Can we use these safely?” I think the answer is: until we understand their behavior, and the “why” behind it, across the full extent of the probability distribution, I don’t think that we can. But if we want to be able to—and there are scenarios where we want to be able to use generative capacity—then let’s talk about what that means. I do think that in the case of generative capabilities in a high assurance context, what we want to make sure we’re doing as well is this: let’s say we want to leverage the fact that you can interact with these models using natural language, which is great, that’s amazingly powerful. We want to make sure that we are not biasing our own processes and workflows by leveraging these models instead of some other workflow, expert, human-in-the-loop option, or existing algorithmic strategy. So we don’t just want to understand the randomness or variability of these models; we also want to understand how their use might drive or shape downstream tasks if they’re somewhere in the middle of a workflow.

Any other questions?

Audience Member 6: I had one, maybe kind of a silly question. So it sounds to me like you feel the disconnect is between people’s view of these models as things that know things, and the reality that the models model syntax. They don’t model semantics. They don’t have a referent, or an underlying semantic model. I’m just curious if you know of any work that’s trying… so you’ve mentioned some things that kind of build on top of these models to address problems without really addressing the shortcomings of the underlying model. But I’m curious if there’s any work being done to tie the model back to some idea of semantics, or any work there that you find even mildly convincing?

Shauna Sweet: That’s not a silly question at all. So you’re asking like, “Are there people out there who are looking to kind of tie these into not just syntactic understandings, but semantic understandings as well?”

Audience Member 6: Yeah.

Shauna Sweet: So actually, I love this, because I’m about to take you on a crazy journey with me. My answer to this question is: I have not seen anything to date. Though what this makes me think of is: what am I really advocating for when I say that these models are not attentive to semantics but should be? Or that that’s the shortcoming, especially in cases where syntax and semantics are not tightly coupled? I’m essentially advocating for the integration of a mid-level model, a model that isn’t just at this atomized, lowest-level view of input. This is true all over the place. A lot of times, in what we consider the third wave of AI, we’re taking low-level or atomized signals—individual, data-driven, bottom-up, individual neurons, et cetera—and we’re saying, “Well, actually, if we were going to make this really useful, we would have to find some way, some modeling strategy, some framework that we can use to integrate a top-down understanding—which is usually a higher level of aggregation or synthesis—with this bottom-up understanding we’re getting from the data.” This is true if we want to talk about conformance of ML predictions to physics-based systems, or to complex systems that involve biology; I have some thoughts about this in terms of the integration of cognitive systems and cognitive measurement to make sure that ML predictions appropriately adhere. So if you wanted to answer that question, what I would encourage you to think about or look into isn’t necessarily in the large language modeling space. Are there other frameworks that possibly use some of these autoregressive features, and how have they sought to integrate top-down models with bottom-up or data-driven understandings—and how have they done so successfully? How do those integrations work? Because I think that integration is at the core of where we’re going with continued advancements in this space. But I might encourage you to look outside the domain for answers within it.

I know that’s where I’m going, though I haven’t seen anything in the LLM research specifically. But I will also say, there are papers coming out daily. I have tried to start and finish this presentation about 20 times, and it’s like every day there’s another paper, so it’s challenging to keep up.

Audience Member 6: So what domains do you feel have sort of addressed this in an interesting way?

Shauna Sweet: Hmm. That’s a good question. … So I think, Taisa, you have done some work in the medical space where people are thinking around this problem, though I don’t know if people have solved it. And I know that Eric Davis also does some work looking at integrating kind of physics-based understanding with these models. Bart Russell at DARPA and I have talked a lot about cognitive integration, but I think I’ve only scratched the surface. So I don’t feel like I’m coming to you with “Here is the final answer on successful integration.” But what I have found in my review of the available literature is that we have the computational understanding and the computational capability to do this simultaneous top-down and bottom-up integration. But there is not, as of yet, a consolidated set of best practices for how to do that—a consolidated set of best practices for integrating top-down models with bottom-up understanding in a way that attends to particular kinds of challenges or errors. In the cognitive space, for example, there are different ways we might model cognition using heuristics. There are various strategies, but there’s no consensus for how to do that, either from a theoretical standpoint or mathematically. So from my read it’s kind of an open book at this point. To be perfectly frank, whenever I read, I walk up to a lot of these works saying, “Okay, show me what you know. What have you done?” And I walk into a lot of those discussions and research areas with very open arms, asking, “What were they going after, and what was successful?” Because I think I’m still in the knowledge-gathering phase at this point. But I think a lot of people are trying. Whether or not they’ve solved it, I think, is still to be seen. It’s in process.

Audience Member 6: That makes sense. Thanks.