The expectation of machines is that they're consistently right all the time. It's a function of traditional software.
If you had a calculator, you put in one plus two, the calculator said three, and then you tried two plus seven and the calculator said 17, you'd throw away your calculator. You wouldn't try any other questions. We need that consistency and we need that robustness. The next step is really to convince a human that the model is actually consistently doing what is being asked of it.
Do models reason? Maybe they do. I think reasoning is a bit of an overloaded term. It's used to mean many different things. I don't even know if that's the right question. Do we really care if models are reasoning the way we might think? If humans are consistently successful at getting the model to fail on examples that are slightly different, then I think it's fair to assume that the model isn't really reasoning. It's just figured out how to do well on a specific benchmark.
MLST is sponsored by Tufa AI Labs. They are the DeepSeek based in Switzerland. They have an amazing team. You've seen many of the folks on the team. They acquired MindsAI, of course. They did a lot of great work on ARC. They're now working on o1-style models and reasoning and thinking and test-time computation. The reason you want to work for them is you get loads of autonomy, you get visibility, you can publish your research. And also they are hiring: as well as ML engineers, they're hiring a chief scientist.
They really, really want to find the best possible person for this role and they're prepared to pay top dollar as a joining bonus. So if you're interested in working for them as an ML engineer or their chief scientist, get in touch with Benjamin Crouzier, go to tufalabs.ai and see what happens.
Welcome to MLST. It's an honor to have you here. Thanks, Tim. Thanks for having me. Introduce yourself. I'm Max. I am a researcher at Cohere. I work on many things, primarily post-training, but I'm also interested in adversarial data collection, better ways of collecting data for improving models, and evaluation, which kind of drives this continuous feedback loop where we target what we care about in terms of model performance and then figure out how to improve performance on those,
and then often you kind of see performance saturates relatively quickly. And then at that point, you kind of need to figure out, well, what's the next layer of evaluation that we care about? You're kind of iterating in this continual process. I'm particularly interested in getting models to reason, to operate more robustly, and to generally be more useful. I think reasoning is a bit of an overloaded term. It's used to mean many different things. I don't even know if that
is the right question, in that, do we really care if models are reasoning the way we might think humans reason? I think there's debate about whether or not we reason in the ways we think we reason. But it's clear that models trained with next-token prediction can do some very impressive things. And whether or not
actual reasoning is happening under the hood, I think, is an interesting research question, but not really a bottleneck in terms of getting models to do things we would like them to do. So yeah, we have this recent work with Laura Ruis. She worked with us at Cohere, and she investigated, during pre-training, what types of reasoning or what types of information models were learning from
procedural knowledge in pre-training documents. The elevator pitch that she said to me is, you know, I think many of us have been quite sceptical about reasoning, right? You know, I'm inspired by Chollet and we say that these things are, you know, not quite hash tables, but they're doing some kind of curve fitting, some kind of retrieval or something like that. And Laura said that she went into it thinking the same. And she was quite surprised by the results. So what did she find? We went into it thinking
that models basically just retrieve facts from what you can think of as a compressed version of all the information that they're pre-trained on, and that they were basically relying on this parametric knowledge, which is the knowledge stored in the model's parameters, to answer most queries that look like reasoning queries.
So what we did is we used something called influence functions, which basically you can think of as a way to approximate the effect of a particular training example on a model's behavior. And we had roughly 40 factual questions. We had another 40 reasoning queries. And one of the findings, and there are quite a few interesting findings, was for the factual questions, the models were really mostly relying on documents that contained the answer to the question. And at most, that'd be
a few documents that the model was relying on in order to answer a factual question. But for the reasoning queries, it was a lot more complex. There was information spread across many more documents, many times documents that contained the types of reasoning that would be required to answer the question in the first place.
And it was just much more distributed. So the influence distribution was much more spread out. And it definitely gave the impression that
the model was relying on procedural knowledge that it had picked up from these various different sources and was potentially combining them in interesting ways. Yeah, we went into this research expecting to find the opposite. And I really quite enjoy work where the results really challenge your way of thinking about the problem. And I think, definitely for me and for Laura as well, it's altered the way we perceive how these models operate.
The kind of counter to that is possibly, well, there's all these reasoning queries that are kind of very similar. And so maybe it's just the case that models are just relying on these different documents because of the similarity of the queries. And so we had a set of control
questions as well, which looked kind of lexically and structurally very similar to the actual reasoning queries, but didn't require any reasoning in order to answer them. So I'll give an example, right? Say we have
a question such as, you know: a line is defined by the points (2, 2) and (3, 3), what is the slope of the line? You need to figure out what the right equation is, figure out that the slope is 1, and give the answer. And to do that, you might rely on information you've picked up from these documents. And the control query would look something like:
the slope of the line is 1, what is the slope of the line? Or: a line is defined by the fact that it has a slope of 1, what is the slope of the line? It sounds, and kind of is, structured very similarly to the question we care about, but it doesn't require any reasoning, because the answer is just provided there. And what we saw is a big contrast in terms of the way models processed those two kinds of queries.
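As a worked version of that example (my own arithmetic, just to make the contrast concrete): the reasoning query actually requires applying the slope formula, while the control query already contains its own answer.

```latex
\text{slope} \;=\; \frac{y_2 - y_1}{x_2 - x_1} \;=\; \frac{3 - 2}{3 - 2} \;=\; 1
```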
So that really gave us a strong indication that the model was actually doing some form of reasoning in order to calculate these slopes. One of the challenges with influence functions is they just naturally don't scale very well. In the very naive sense, you can calculate the influence of an example by retraining a model without that example and comparing the difference. That's obviously extremely expensive.
And influence functions themselves kind of require the computation of the inverse Hessian. And that's useful because it gives you this kind of second-order information about the loss landscape of the model. So it gives you a sense of the curvature of the loss landscape and the effect of including this additional example that you care about.
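For reference, the standard influence-function approximation (in the Koh and Liang form, which is where the inverse Hessian appears; the exact estimator used in the paper may differ):

```latex
\mathcal{I}(z_{\text{train}}, z_{\text{query}})
\;\approx\;
-\,\nabla_\theta \mathcal{L}(z_{\text{query}}, \hat{\theta})^{\top}
\, H_{\hat{\theta}}^{-1} \,
\nabla_\theta \mathcal{L}(z_{\text{train}}, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta \mathcal{L}(z_i, \hat{\theta})
```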
And it's just extremely expensive. A Hessian is a big matrix: it's got dimensions on each axis equal to the number of parameters in the model. We're working here with 7 billion and 35 billion parameters. So it's just extremely computationally costly. And so we basically built on work which Anthropic released late last year, where they scaled up approximate curvature estimation using EK-FAC,
which was released somewhere around 2018, I think. And so then that really allowed us to, along with some other optimizations, scale this up to quite large model sizes. And I think on the note of model size, one aspect we found also very insightful was that it seemed like there was very little correlation between
which documents were influential for the 7 billion parameter model and which were influential for the 35 billion parameter model. And we didn't really have much time to dig into the details of that, but that's also extremely interesting. It's an indication that models of different sizes are potentially just learning things in different ways. What is your definition of reasoning, and what would it mean for a model to reason really well? I think it comes down to
So for me, reasoning and robustness are kind of interlinked in that if you can reason, then I would expect correct reasoning to imply robust reasoning. And I'll take an example. Say we're going back to calculating the slopes of lines.
If a model can really reason through that relatively straightforward process, right? You need to understand what function to apply and then apply that function consistently. And if a model does that 999 times out of 1,000 but fails one out of those 1,000 times,
to me it really starts to bring into question: is this model actually reasoning in a way that I might expect a human to reason? And the counter-argument there is always, oh, but humans are not infallible either, right? They also kind of make mistakes and get things wrong. Machines have a bit of an advantage there. They don't get tired, they don't
have the limitations of being human, I guess, to contend with. And so I'd really expect reasoning, for a specific type of reasoning or a specific definition, to really be robust and consistent. Yeah, I think so many people are starting to see now how valuable this technology is. I mean, I'll give an example, like this app for tracking the podcast. I wrote it in half an hour,
right, just using language models. And we need to reimagine many of the apps. I think the first strategy was rubbing AI into the existing apps. And now we're seeing a whole new generation of apps that are AI-first. And it's like the iPhone moment. And I think people are starting to really see that. But let's talk about test-time training, just real quick.
It's been heralded as a new scaling law. And certainly on the ARC challenge, all of the winning results were this kind of transductive active fine tuning, which is that rather than using a global shared model, which was trained once by someone else,
I'm going to take that model and I'm going to fine tune it in my situation. I've now got a model that works very well in this situation. And it seems to be like a much more distributed paradigm of doing AI. But I mean, obviously you work for Cohere, which is quite centralized at the moment. I mean, how do you see the future of having big centralized models, but also having more of a diffused kind of training methodology? I think we're just going to see more and more. I think the bottleneck is,
so far, how many people have the kind of knowledge and skill set and the experience to train these kinds of models. And as more and more people get excited by AI, by the kind of promise that it holds,
the work around sharing these models grows. The Cohere models, for example: the weights are openly released, you can download Cohere's Command R+ or Command R model off of Hugging Face, and you can continue fine-tuning that. And we've seen lots of people doing that. And I think there are some constraints; you need to figure out
what the optimal way of doing that is without potentially losing general knowledge capabilities that might exist in the base model. But there's one interface which holds
across models and across humans as well, which is language. And we know that you can do this to a very real extent. If you view it from the point of view that what we're doing over time is
making the data better in terms of what we can learn from it, then that unlocks possibilities globally. Pretty much anyone, if they had the right way of using models to improve the data and had kind of a shared data resource, could train a model from that data, assuming they had the know-how and
the compute. Yeah, it's almost a mindset thing. I mean, I've written apps now where it's easier to hook it up to a chatbot interface than it is to write a UI.
because you just have this insane amount of sophistication. I use Open Interpreter, for example, and I've now adopted this pattern where in Open Interpreter, I just load my Python multi-agent system into memory, and then I just tell the agents to reveal their interface, and then Open Interpreter just works. So I can manipulate my database, and I say, you know, add something to the table of contents, and I add this reference, delete this reference, and it just works. I don't need to do anything. It's a completely different way of thinking about software.
And I think it's just going to take us a long time to figure this out, but it's genuinely amazing. And also, you can have second-order intelligence: rather than talking directly to the interface, you can be talking to another agent which itself is doing intelligent things. And it's just mind-boggling. It's like The Matrix. The possibilities are endless. Yeah, it's insane.
On AI alignment, before we move on to this, because you were talking about some of the social implications as well before. I mean, first of all, what do you think about AI alignment? Obviously very important, right? We want AI systems that don't do bad things. That's kind of the starting point. We don't want AI systems that damage our environment and make our societies worse and impact humanity negatively.
I think the question is really, what do we want to align AI systems to? We can, you know, typically you define alignment as having models whose kind of values or behaviors are aligned to what humans want or what humans expect. But if you break that down, I'd argue it's probably already very hard to figure out
what do humans want and what do humans expect and do all humans want or expect the same things? And is that kind of static in context or does that change depending on the context you're in? And so it's a very complex problem that we need to solve. And we haven't necessarily solved that societally for humans either.
And now we have to solve it for, you know, this new technology that's kind of permeating our lives. And so I think we need to figure all of that out, but we also need to have some basic safeguards in place to make sure that, you know,
the technology remains safe for general use so that we can continue to advance and develop and make sure we make the best use of its potential. Let's go over to human feedback. So you wrote this paper with Tom and Phil called Human Feedback is not Gold Standard. Can you tell us about that? The Human Feedback is not Gold Standard paper was motivated
by all the interest and excitement around using human feedback pretty much throughout the model training and evaluation pipeline. And so a couple of years back, RLHF was a big thing. It was the kind of new thing. Everyone was excited about it. And the general way of doing RLHF was, well, we give humans
a prompt and they see two completions. So say, what color is the sky? Completion A is blue and completion B is yellow. And a human annotator looks at those two completions and says, I prefer completion A. Hopefully they have good reasons for preferring one completion over the other.
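As a minimal sketch of what one such preference annotation might look like (illustrative field names only, not Cohere's actual schema):

```python
# Hypothetical RLHF preference record: a prompt, two completions,
# and the annotator's single binary judgment over the pair.
preference_record = {
    "prompt": "What color is the sky?",
    "completion_a": "Blue.",
    "completion_b": "Yellow.",
    "preferred": "a",            # the annotator's choice
    "annotator_id": "anno_042",  # useful later for studying annotator effects
}
```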
And we saw massive, massive progress in many of the benchmarks from this idea of using human feedback to optimize and kind of fine-tune these models. And we quickly started to see diminishing returns from this. And the effect was really, at least initially, that
the style of the outputs that humans preferred was providing better results. And in many cases, the better results were also measured in some way by human preference. So you either had a human evaluation task where humans would look at output from the model and say which they prefer in order to rank models against each other,
or you train a reward model and use that as a proxy for what the human would prefer. And this idea of a single preference score that captured everything about a generation just seemed quite limited. And what we started out looking at was
what if we make that feedback signal more granular? What if we try to understand what about specific model generations do humans prefer over others? And so the first experiment we ran in that paper was
we categorized these different error types. We identified examples in established datasets for which the models generated completions containing these errors. And then we asked one group of humans to rate these completions by whether or not certain errors were present. And we asked a different, independent group of humans to annotate a rating score for those completions from 1 to 5.
And then Tom ran some very detailed analysis and tried to figure out what types of errors contribute to
human preference judgments of quality, of overall quality. And what we found was quite interesting. We found that humans really, really dislike it if the model refuses to answer a question. Thinking about it, that seems extremely intuitive. Human preference judgment is strongly informed by formatting and by style.
We see that in some of the early work as well, where, for example, a lot of the early summarization work seemed to suggest that humans preferred longer summaries, which is also quite intuitive, because a longer summary contains more information and, even just by that fact, is more likely to contain more of the relevant information from the source.
And what we found interesting was that certain features... when, you know, say we at Cohere, we talk to enterprise customers and they care that the models are correct. They care that the models, you know, aren't repetitive. They care about certain properties that might be different from what, you know, an annotator looking at two different completions might care about. And attributes or criteria like factuality were, quite interestingly,
lowly ranked in terms of their contribution to overall quality. And so we thought, well, that's potentially problematic, right? We are, to some extent, optimizing these models to give more elaborate, more interesting completions at the expense of them potentially being less correct. The follow-up to that was then we wondered, well,
could it be that there are also certain confounders or other effects that we're not noticing? And we had this hypothesis that maybe
assertiveness and complexity of model outputs might impact how humans perceive the quality of a completion. And so we ran a very similar experiment. What we did is we prompted these models with a very simple one-line prompt, something along the lines of: give the response, but sound very confident. What we found was that there was a massive effect:
the generations that were designed to sound more confident were consistently rated as higher quality and as having fewer errors. And there's a particular plot in the paper which I think shows this quite nicely, where you have almost an elbow curve: you see the error rates on the y-axis, and on the x-axis you have
the assertiveness value between one and five. And as assertiveness reaches a certain point, the error rates become practically negligible. And we know that this is also kind of a trait of how humans interact with other humans, right? If someone sounds very confident, they're more likely to be perceived as, say, an expert in their field. There's quite a bit of work in this space, but
we didn't necessarily think that the effect would be so obvious when humans are interacting with AI systems. And in many cases, these are trained annotators. They've been given very clear instructions about what and how to infer the quality of these different completions. One other striking point from that work for me was
this effect of assertiveness on perceived correctness. The more assertive generations were generally perceived by humans to be more correct. So we looked through a sample of these examples, judged as 'expert annotators' in inverted commas, mostly Tom going through the data, and really
rating whether there were any factuality errors. And what we found actually was that when the model was prompted to be more assertive, it tended to be less factual. And even though it was less factual, because it was more assertive, annotators
got the impression that it was actually more correct. And so that to us was striking because we have this very clear example of something counter to the behavior we want to optimize or the behavior we want to incentivize in the model, but also
doing so in a way that is barely recognized. The annotators doing this didn't realize that they were being influenced or biased in these ways. And that's, I think, representative of the very large-scale effort of optimizing towards human preference with a vague,
and, to many extents, underspecified definition of human preference. And isn't it funny, because we were saying earlier that humans are rational, right? But you've just given us many, many examples of how irrational we are, right? We have this style bias. We see something that's assertive and we think it's correct, and so on. And you also gave an example of, well, you know, what if we tried to get annotators to compare two entire books?
Well, that's clearly not only beyond our cognitive horizon, but the ambiguity just goes up exponentially. It seems like it wouldn't be possible for us to do that in any reliable way. So what can we do here? You said let's have more attributes.
But how far do you go down that road, because won't you end up with thousands of attributes? Yeah, for sure. And I have a similar problem, right? I'm Mediterranean culturally. I just like answers to be very direct and to the point, right? I'm the kind of person who, if you ask me what is the color of the sky and the two completions are 'blue' and 'the color of the sky is blue', I would prefer 'blue'. And that's not what most humans would prefer.
And you can definitely go very deep down the rabbit hole of, well, what are the criteria? What are the dimensions? What are the things we care about? And I don't necessarily think that that's a problem. I think one area I'm very interested in is this idea of modifying model behavior to suit
kind of personal requirements on the fly, right, without additional fine tuning. So if you have all of this data, imagine I could build some kind of portrait of Tim's preference in data, right? There's these 100 data points that reflect how Tim wants to interact with the model. And maybe those can change and they probably will change over time. It's customized. It's specific to you. It's kind of
a profile that captures all of the information about what you personally are looking for in your interactions with the model you're working with. And you could do something relatively straightforward if you wanted to, like use that as part of the prompt, effectively doing in-context learning, right? Provide the model with that at inference time: say, this is Tim, these are his preferences, generate a response according to those preferences. And models, we've seen, are relatively good at conditioning on even complex information in input space.
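A minimal sketch of that in-context approach, assuming a generic text-generation API (the profile contents, function name, and model call are placeholders, not a specific Cohere feature):

```python
# Hypothetical: condition one general model on a small per-user preference
# profile at inference time, instead of fine-tuning a model per user.
user_profile = [
    "Prefers short, direct answers.",
    "Wants code examples in Python.",
    "Dislikes unnecessary repetition and filler.",
]

def build_personalized_prompt(question: str, profile: list[str]) -> str:
    preferences = "\n".join(f"- {p}" for p in profile)
    return (
        "You are responding to a specific user. Follow their preferences:\n"
        f"{preferences}\n\n"
        f"User question: {question}"
    )

prompt = build_personalized_prompt("What color is the sky?", user_profile)
# response = model.generate(prompt)  # placeholder for whichever LLM API is used
```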
Generally, this is the idea of models built for specific people for specific reasons, without having to go through the overhead and the cost of training a model for every individual on the planet. If we had to train 7 billion GPT-4s and serve those models, that would be very prohibitive from a resourcing perspective. But if we could just train one model, which was extremely good and, again, extremely robust,
and really conditioned well on the requirements for an individual, then we could do that relatively cheaply in relative terms. Yeah, so this brings me to Prism, which is this project led by Hannah Kirk. It just won a Best Paper award at NeurIPS. And the idea there was, well, let's figure out how
at scale, these different attributes, demographics, cultural influences, geographic effects, linguistic effects, impact what it is that humans care about and how humans want models to behave.
It's super interesting work. There's a lot in there. There's over 100 pages of content. So I highly recommend that everyone reads that paper at least twice. It's extremely rich in terms of the insights it provides into this whole space of understanding what human preference means and what it means for different people. There's some extremely interesting findings
one of which is just representation. So even something as simple as what people converse with models about
is just highly influenced by their backgrounds or by who they are as people. And you might argue, well, that's potentially not such a big deal, but you always have to see model development in this kind of whole cyclic feedback loop effect where the conversations that people are having with these models will drive the improvements to the models. And even if your usage is not well represented in
kind of standard post-training data, then the models are just going to be worse at the things that you care about, relative to the things that more people care about. This kind of work was run over something like a year and a half, so it was quite a huge project. And there were examples along the lines of:
There was a substantial amount of conversation around the Israel-Palestine conflict, and most of that was happening from people based in the Middle East. And if...
if most of the data going into these models does not represent even something as simple as that (we're not even talking about human preference here in terms of completions, we're just saying what it is that we're interacting with these models about), then you're just going to have all these kinds of
biases and problems down the line, where, you know, the model doesn't serve everyone. It serves the people who have built it, to some extent, or the people who have contributed most to using it. So this work, again, is very, very detailed. There's a lot of analysis about what, about completions or model generations, impacts human preference. So there are
quite a few findings similar to the Human Feedback is not Gold Standard work, the same kind of general sense that refusal and formatting and style have strong effects. We generally have this sense of an averaged-out ranking of models that we consider the best, and then kind of the second best, and then the third best.
There's a set of, I guess, established benchmarks or established methods of evaluating models like the chatbot arena where, again, it comes back to what is human preference really capturing, right? We have this almost canonical ranking of models, but it's heavily influenced by who the people providing the judgments are, what they're asking about, and what their personal preferences are.
So what Prism does is it really breaks down how ranks of preferred models change depending on who's asking the questions or who's interacting with the model. And yeah, you see big changes even in terms of which models are better. You see models moving five positions up the ranking or five positions down the ranking or more. And there's 21 models in the work. So that's quite
a big change. And that can be very insightful for model developers as well, if you fundamentally believe that you're building technology to serve all of humanity. Our Cohere models, for example: Command, and this is a very early, first generation of the Command models,
performs very well across most cases. And then for some reason, for users in Asia, performance just seems to drop relative to the other models. And that kind of insight can be extremely informative for trying to figure out what is it that we can do to help serve, say, people in that location better. And based on insights like that, we've taken many actions to improve
how we go about building these models. Yeah, I mean, Andrew had this really good paper out called 'Adversarial Examples Are Not Bugs, They Are Features'. And he was saying that they learn these non-robust features, which are basically these, like,
you know, weird blue pixels mean Mercedes, and stuff like that. And they happen to generalize really well, but from a reasoning point of view, they're clearly doing it for the wrong reason. So it just happens to be the case that they work now, but when there's a distribution shift, they'll stop working tomorrow. So we almost want to have a principled way of saying that's a good feature and that's not a good feature. Yeah, which,
in these high-dimensional spaces, is just very hard to do. But you can control the input and the output spaces. And I think that's where... So there's the general sense of adversarial examples, where the
adversarial noise is generated by a function. Think of something like injecting Gaussian noise, which is imperceptible, into an image, or switching a word for a synonym in language. And you'd argue that, well, the model's output shouldn't change, because nothing has changed in terms of the semantics of what the query requires or means. And it's...
It's often the case, I think, that you see the model's response to that is to basically learn to cope with the function that generated the adversarial noise. And so it's not really learning to deal with real-world noise. It's just you have a function attempting to simulate real-world noise, which tends to be simpler and easier to learn and counteract. So that's why I'm very interested in
human-informed adversarial examples. So this general idea of have a human and a model interact, have the human probe the model, try to identify weaknesses in the model, failure modes, things that the model either consistently or not so consistently gets wrong, and
use that as a seed or inspiration for training data, to then train the model on it and make the model more robust to, let's call it, a hopefully more complex and representative shape of this noise distribution.
So that's what Adversarial QA tried to do. So this was work from 2019, and this paper called Beat the AI, which was kind of my first PhD paper. And the idea wasn't necessarily novel from the perspective of getting models and
humans to interact. There had been quite a bit of work previously. So DROP, again, this dataset we mentioned earlier, had a small component of the way it was constructed be adversarial, in a sense.
There was work on the quiz bowl task. So quite a few people had tried this in the past. And what we really wanted to look at was: what are the different effects that doing data collection in this way, and in this way only, has on what models learn, how rich the representations they learn are, and how robust those representations are.
Yeah, we found some extremely interesting things. We found, for example, that the questions that you collect when having a model in the loop
are just a lot more diverse, a lot more complex, generally a lot more interesting. You tend to see the quality of the data you collect go down as well, just because you start to see more ambiguity, you start to see more implicit information that kind of a human would easily understand, but which is maybe kind of slightly underdefined and...
is just in reality a lot more representative of real-world interaction, where not every question is perfectly defined or perfectly specified. And yeah, so we had kind of three models which were state-of-the-art back then:
BERT and RoBERTa, and we also had BiDAF, which is a kind of earlier question-answering model, which was state-of-the-art back in 2017, 2018 probably. And what we found was that even adversarial examples collected against relatively weaker models were very beneficial for training stronger models. And so it almost felt like there was
this shift in distribution, where you had the non-adversarial setting: think of it as, there's a paragraph and an annotator needs to find an answer in that paragraph and ask a question for which the answer is that span extracted from the paragraph. And in that setting, you get very,
let's call them, bland questions, right? Relatively easy things. Like, say the paragraph is talking about the sky; we'll use a consistent example throughout.
And so this paragraph is talking about the sky and how it's blue and all of this. And the question is, what color is the sky? And you have this almost direct mapping, even in terms of lexical overlap to the paragraph, to where the answer is located in the paragraph.
and the annotated answer will be blue. So then the setting is, now you've got a model in the loop. Now the model is going to attempt to answer the question. So you're going to ask the model, what color is the sky? And the model is going to say blue. And so you haven't located an adversarial example. You found an example that's very easy for the model. So what you're really trying to do is maximize the signal of the data you're collecting. And now you have to tweak your question and you have to ask something like,
what is the color of the thing in which the clouds are? And maybe it's a bit more complex, a bit more involved, and maybe now the model finds another color in the paragraph and says green, and it gets it wrong. And so you've managed to trick the model.
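A rough sketch of that model-in-the-loop flow (schematic Python; the annotator and model objects and the exact-match check are simplified placeholders, not the actual Adversarial QA tooling):

```python
# Hypothetical single annotation session: the annotator keeps rewording the
# question until the model in the loop gets it wrong.
def collect_adversarial_example(paragraph, gold_answer, annotator, qa_model,
                                max_attempts=10):
    for _ in range(max_attempts):
        question = annotator.write_question(paragraph, gold_answer)
        prediction = qa_model.answer(paragraph, question)
        if prediction != gold_answer:  # simplified; real setups use fuzzier matching
            # Model fooled: keep this as a high-signal training/evaluation example.
            return {"context": paragraph, "question": question,
                    "answer": gold_answer, "model_prediction": prediction}
        # Model answered correctly: the annotator tries a harder rephrasing.
    return None  # no adversarial example found within the attempt budget
```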
And through this process, you just get a lot more interesting data, both from a training data perspective, where we saw pretty massive improvements in terms of how robust the performance of the models is when you train on this data. And I think, to me personally at least,
at the time, there were quite a few startups in the question-answering space. I knew quite a few of them personally, and they would tell me: we've added your dataset to our model's training, and it is just way better in real-world application. And they'd see improvements on the benchmarks as well, but they were particularly impressed by just
how big the effect was in real-world settings. And I think that's what robustness gives you. It makes these systems a lot more applicable to settings we care about. From some perspective, it's not a very complex task. You have one paragraph, it's from Wikipedia, it's relatively simple language, it's very easy to understand.
And then you have these more complex questions that come about from humans interacting with the model and being creative in terms of how they probe it. And here we have models that represent state of the art in 2019, so five years ago. And Adversarial QA was recently used
to probe and test the adversarial robustness of the Llama 3 family of models. So in that tech report, it's referred to as Dynabench QA, and we'll get into Dynabench later. Dynabench was kind of the follow-up to Adversarial QA.
And while LLMs today massively outperform models we had back then on this specific task, they're still quite far from human performance, which suggests that there's still a way to go in terms of performance.
general robustness capabilities. Isn't it weird how we keep talking about this mixture of pattern matching and reasoning? I mean, the example you gave was beautiful. You know, what is the color of the thing where the clouds are? I mean, yeah, you could reason your way through that, but there are so many other examples in English where we just make it up as we go along. You know, it's just like this, it's like the language game. We just make stuff up. And
It's just, it fascinates me that it's a combination of what you're saying. So we just robustify and we just kind of incorporate all of these different patterns. But we also want to have systems that reason as well. You know, I'm sure...
I could continue to come up with more complex and more intricate ways of asking the question such that the answer would remain unchanged in ways that aren't reflected in the model's training data. And sure, in this case, I'm potentially intentionally trying to confuse you.
But I think the question is, well, I shouldn't be able to if your representations are robust. Tell me about Dynabench. So Dynabench was kind of a follow-up project from Adversarial QA and Adversarial NLI, which Yixin Nie and Douwe Kiela and Adina Williams did. So Douwe, you know,
was the mastermind of this project. He did all the heavy lifting to get the funding in place, to get the support for building out this project. It was originally started at Facebook AI Research. So I interned there working on Dynabench in 2020, I want to say, early 2020, I think.
And then that carried on for roughly a year. A kind of global pandemic happened in between as well, so it was an exciting time. So Dynabench is a platform, a research platform, for testing and probing model-in-the-loop adversarial data collection, and particularly doing that in a dynamic fashion. So this idea of:
You start out with some data set, right? You train a model on it, and then you interact with that model, and you want to find where its weaknesses are. You want to probe it. You want to figure out what its failure modes are. And you collect data in doing that, which you can use as training data. You get the training data, and you retrain your model typically on kind of the original data plus your new adversarial data, which makes the model a lot more robust.
And what you often get as a byproduct is a model which, for most of the types of complex adversarial prompts you were trying previously, now would generally start getting those right. It would still typically fail on some set of those prompts which were generally more complex. And then it would challenge you as an adversarial annotator to now think beyond the limits of this new model and start
trying to probe it and break it in different ways. And you can just keep doing that over time. And what you're doing is effectively just building a more and more robust model, to the point where it just becomes incredibly hard to fool.
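The outer, dynamic loop looks roughly like this (again a schematic sketch under my own naming, not the actual Dynabench implementation):

```python
# Hypothetical multi-round dynamic adversarial data collection: each round's
# retrained model becomes the model in the loop for the next round.
def dynamic_adversarial_rounds(base_data, annotators, train_fn, n_rounds=3):
    training_data = list(base_data)
    model = train_fn(training_data)
    for _ in range(n_rounds):
        new_examples = []
        for annotator in annotators:
            example = annotator.try_to_fool(model)  # model-in-the-loop probing
            if example is not None:
                new_examples.append(example)
        training_data.extend(new_examples)
        model = train_fn(training_data)  # retrain on original + adversarial data
    return model, training_data
```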
And so Dynabench has powered, I'd say, many research projects in the space. There were projects like hate speech detection and sentiment analysis. Prism itself was run through the Dynabench platform. There's work ongoing with Common Crawl to try and figure out how we can improve the quality of the crawls for pre-training data. And Dynabench has since moved on:
it now sits with the MLCommons community. And so there's this working group, the Data-centric Machine Learning Research working group, which is responsible for maintaining Dynabench and making sure that people in the space who want to contribute, who want to explore ways of making models more robust, ways of building better both training sets and evaluation sets,
have a space and have the resources and tools to do that. And the other aspect of Dynabench, which is kind of the benchmarking piece, which is where it gets its name from, is this idea that
it's not just the interaction with models that's dynamic, but also our benchmarks. For quite a long time, and I'd argue probably still today, the general way we've operated as a community has been: people create benchmarks because there's some interesting phenomenon they want to test.
And that benchmark is created. Once it gets created, you effectively have Goodhart's law, right? You have a clear measure of what it is you want to optimize for. And then you've got incredibly smart, incredibly talented people in the community optimizing for that specific problem that the benchmark reflects. And typically, you see progress which just happens really, really rapidly on that specific benchmark.
So, Douwe has this plot in the Dynabench paper where you just see the saturation of different benchmarks over time. And...
If I recall correctly, it was, I think, somewhere around eight to 10 years for MNIST, for example, and then somewhere around like one and a half to two years for SQuAD. And it just gets shorter and shorter with each benchmark, primarily as a function of the field growing and maturing, right? There's just more people working on the problems, which means progress happens generally faster. So one of the core kind of philosophies behind Dynabench
was this idea that we want to ideally also benchmark dynamically and that our benchmarks should evolve and should change over time so that we're really measuring the current set of capabilities of models that we care about. And I think this is very true for
humans as well, right? You wouldn't give someone with a PhD a grade-school math exam and use that to figure out, well, you know, should this person perform surgeries or something relatively
important. And to some extent, we do some version of that with LLMs. I've seen so many cases of people who have a very clear task in mind, a very clear understanding of how they want the model to behave in that setting. They have the domain knowledge, the expertise, to validate performance on that task really well. And then they'll select which model they want to use because it has a higher MMLU score or because it ranks higher on Chatbot Arena.
And I don't know that that's the best way to ensure we're really evaluating the full breadth of what we care about in terms of model capability. And I think one of the best
ways to do that is to build benchmarks that reflect the task, the application, that you care about. And ideally, if you can do that dynamically, such that today's model is at 70% and, as technology continues to advance, it gets to 100%,
that probably doesn't suggest that the model can solve every possible kind of situation or setting that you're putting it in, but rather suggests that your benchmark was limited, which it will always be because it's conditioned on some understanding of existing model capabilities. And so you just want to make sure that your evals also update themselves over time. Yeah, benchmarking is an absolute nightmare.
The great thing about benchmarks is they're standardized, you know, so we can compare apples to apples. I was speaking with Sarah Hooker in the summer and she was advocating, you know, instead of having an absolute FLOPS limit, for example, which could get Goodharted straight away, she was suggesting having a basket or index or dynamic benchmark or something like that. How can we design new types of benchmarks that are somewhat impervious to Goodhart's law? I think we are almost caught in
a local benchmarking optimum, where we've just done things in a certain way for many years and it almost seems like the naturally right way to do things. And I think we potentially, as a community, right, we should probably think about taking a step back and thinking about:
Given where the technology is today, given what language models can do today, what is the right way to think about evaluating them? And I think examination in people, for example, in our education systems has evolved over many hundreds of years and is arguably imperfect, but
I think the fact that we target specific capabilities almost in this, you know, hierarchical taxonomy where you've got sets of skills and you've got some concept of how those skills combine together to like, you know,
generate more complex skills. I'll give a very simple example. If you train to be a surgeon, you go to medical school and you start off at some level with the basic set of skills that all humans require, and then you need to specialise and you go into probably chemistry and biology and you need to understand the core concepts there, understand how to apply certain
functions or reasoning patterns. And the further you progress, the more complex and specialized your exams become. And potentially then the way to evaluate models designed to do everything, assuming we want to build general models designed to do everything, is to come up with a set of examinations which are
intentionally designed to test models. And they might be inspired by how we test humans and they might not because models and humans have different characteristics. And I think we're starting to see some very early work in this space. And one thing I think we should also be thinking about in that kind of framework is just ensuring again that we have a
way to guarantee that this process is dynamic, because the technology will keep improving, and the moment that the benchmarks don't keep up is really the point at which we need to either rethink everything from scratch, or maybe we've just reached the point where the models are better than an expert human across all domains.
Can you tell us about DataPerf? DataPerf is a set of challenges, also run through the Dynabench platform, targeted at being very data-centric. So this idea of data-centricity focuses on the fact that a lot of what we do depends on the data that we're training these models on. And there's definitely kind of been waves of, I guess,
debate around what contributes most. Is it the data? Is it the models? Is it the architecture? Is it the resources? Is it the hardware? And I think DataPerf came around at a time where
it definitely felt like there was a lot of focus on the algorithms and less focus on the data. And yeah, that's the main motivation: ensuring that the data is still central to everything we do, and that improvements to the data remain a core part of this journey we're all on of building better, more capable general models. So you're working at Cohere.
Tell us about what you do there. I work on post-training. So I joined Cohere roughly two years ago at a point where the kind of language modeling landscape was a lot more raw, I'd say a lot less established. So this was kind of a few weeks before ChatGPT launched.
And obviously since then, there's been a massive influx of interest and excitement, and investment as well. So at Cohere, I built out the post-training team. So when I joined, obviously we had already very, very solid base models. These are kind of pre-trained models: they're trained on lots and lots of text, many trillions of tokens, and trained to generate what the next word is.
And back then, it was very clear that these models were very capable, but didn't really follow instructions very well. And one kind of step towards making these models a lot more useful is getting them to follow instructions from a human and getting them to do what the human wants them to do. And so like one example of this is,
if you asked a base model a question, something like, "What color is the sky?" A base model would reply with something like,
What color is the grass? What color is the sun? Mostly because the context in which it's seen questions, in many cases, was just lists of questions, even pages of FAQs with collapsible sections which might not have been parsed in the right way. And that's interesting behavior, sure, but it's not what you expect, or definitely not what we've come to expect, from these types of models.
And so one thing I did very early, along with Asir, who kind of leads pre-training, is we built a very rudimentary interface for collecting some instruction-following data. And we asked people internally: listen, can you please
provide some questions, and type out what the response to those questions should be. So this idea of having a prompt and a completion. That's still today a big part of how data for these models is created in the supervised fine-tuning stage:
you have these groups of annotators who are either writing their own prompts, or having those prompts seeded or synthetically generated or sampled in different ways. And then they have to write out a completion, which has to be extremely high quality, to the standards of your style guide, and to the standard of how you want your model to behave.
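A minimal sketch of what one such supervised fine-tuning record looks like (again illustrative fields, not the internal format):

```python
# Hypothetical instruction-following (SFT) example: an annotator-written prompt
# paired with a high-quality completion written to the style guide.
sft_example = {
    "prompt": "What color is the sky?",
    "completion": (
        "The sky usually looks blue during the day because molecules in the "
        "atmosphere scatter blue light from the sun more than other colors."
    ),
}
```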
And so we did that, we ran kind of an internal competition. The kind of interest and participation within the company was insane. We thought we were being extremely ambitious, targeting 10,000 examples in, I think, under two weeks, something like that.
And yeah, we obviously exceeded those expectations and we trained kind of the first generation of command models that really followed instructions well. And then kind of things grew from there. So we then moved on to training a new model every week.
That initially was kind of the instruction following supervised fine tuning stage, where we then started working with our own internal annotators, with external data vendors, getting all sources of interesting instruction following data. We started synthetically generating data.
We did that for pretty much a year. So every week, for something like 52 weeks, we delivered a new model. Back then they were the Command Nightlies, I don't know if you remember them, but yeah, they received a lot of interest and excitement, in the developer community in particular, because people are always waiting to see what's the new model.
There were a few weeks... Like, most of the weeks we just had a model which was superior to the previous model, as is kind of natural with these processes. There were, I think, maybe four or five weeks where the model was the same as or slightly worse than previous models, based on our internal metrics.
And so we wouldn't release that model. And then you'd get people asking, oh, based on my interactions with the model, it doesn't seem like the model has changed, are you still planning to release the model? And so it was just a very inspiring time for us. And then, towards the end of last year, we
organized a bigger kind of structured push towards a new generation of models, which became Command R and R+, which at the time of release, so in April of this year, were kind of ranked extremely highly on pretty much all the metrics. I think we were the fourth biggest, like the fourth ranked provider on Chatbot Arena.
I think there was Anthropic, OpenAI, and Google, and then Cohere. And now, of course, we're working on the next generation of models. What do you think about quantization?
You know, people kind of like ripping your model to shreds and reducing the precision and hacking and stuff like that. It's very useful as a short-term solution from the perspective of efficiency gains. I think it's not always clear with quantization. The problem always comes back to evals. So the objective for most quantization efforts is make the model a lot more efficient and maintain performance on evals.
from the perspective of maybe you accept some small drop in performance for massive efficiency gains. And it's very effective from the perspective of if I have
a 100B model or a 35B model, and I want to quantize the 100B model to kind of the same per-token cost as the 35B model, I'm probably going to get a much better-performing model in the same price range. But then I can also quantize the 35B model and reduce the cost there.
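As a back-of-envelope illustration of that trade-off (my own rough numbers, counting weights only and ignoring activations, KV cache, and serving overheads):

```python
# Approximate weight-memory footprint at different precisions.
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for params in (35, 100):
    for bits in (16, 8, 4):
        print(f"{params}B model @ {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
# A 100B model at 4-bit (~50 GB of weights) lands below a 35B model at 16-bit
# (~70 GB), which is the kind of cost equivalence being described.
```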
I think one of the trade-offs is these blind spots, where you might not be measuring everything. As we've kind of discussed, our evals are incomplete. They don't measure things like, in many cases, advanced complex reasoning. In many cases, they're quite limited in terms of their ability to test, say, very long-range interdependencies and long-context performance and these kinds of things. And so there's always a risk that
you're potentially degrading performance in aspects that people care about that you might not have been able to measure for whatever reason. - Is reasoning the synthesis or the execution? And what I mean by that is, I can teach you how to do multiplication or long division or something like that, and you're just going through the mechanics of executing the rules.
We make lots of mistakes, as you say, but we have the ability to synthesize the rules and kind of describe them algebraically. I would expect that the synthesis is a requirement, but that you can really test the execution. And, you know, there are various mechanistic interpretability efforts trying to look into how models are
actually thinking, what's going on under the hood. The recent work on scaling test-time compute, where models are generating these reasoning chains and then providing a final answer, and in many cases giving better performance, I think that's still heavily underexplored. We don't
necessarily know exactly what the effects are that are resulting in the better performance. And I think there are various early indications, and we have some work going on as well, that there's no
strict requirement for models to reason in natural language or to have these kinds of internal thinking chains which are explicitly generated. And I think if you want to test whether a system reasons, the adversarial setting is one which allows you to really probe for a specific capability, a specific type of reasoning. We saw this in a lot of the early Adversarial QA, Beat the AI-style work, where
the limitations of a particular benchmark are the benchmark itself. And so you've curated a benchmark, you've gotten crowd workers to ask lots of interesting, complex questions, and then you kind of
check how well the model performs, and typically performance saturates very quickly and, you know, models do very well. DROP, for example, is a pretty classic question-answering benchmark; it was very popular, and is still very popular, for evaluating language models. DROP, if I recall correctly, stands for Discrete Reasoning Over Paragraphs, and it's really focused on numerical reasoning. The limitation there is that if the model
achieves human level performance on the benchmark, can you claim that the model can reason? And I think it requires one step further, which is if you then get that model, put it in front of humans, ask humans, "Probe for this capability, probe for this type of reasoning, see if you can get the model to fail." If humans are consistently successful at getting the model to fail on examples that are slightly different from what has been trained on, what has been tested on,
then I think it's fair to assume that the model isn't really reasoning. It's just figured out how to do well on a specific benchmark. How puritanical should we be? Because you could say, as you just did, that reasoning is a binary. Like you either got it or you didn't. Or you could take the stance that if it performs robustly a lot of the time, even with a few edge cases, then we could still say it's reasoning. I think we have to hold...
these models to a very high standard. And the reason is that I don't expect the interaction between humans and machines to be the same as the interaction between humans and humans. We build up trust with each other. We engage in conversation, in many cases involving complex reasoning in specific domains. And we really understand what our own limitations are. And I think
the expectation of machines is that they're consistently right all the time. It's a function of traditional software engineering and how that's approached. If you had a calculator and you put in 1 plus 2 and the calculator said 3, and then you tried 2 plus 7 and the calculator said 17, you'd throw away your calculator. You wouldn't try any other questions.
And, you know, LLMs in the early days, even these very early, very simple mathematical questions, they would get wrong. Yeah, so I think there is a sense that
the standard is actually higher. We need that consistency and we need that robustness. I can definitely imagine that as the complexity of the task increases, there might be a bit more room for error, but definitely on more simple tasks, the tasks we're using today to evaluate these models are relatively simple, right? One of the
very common benchmarks is GSM8K, which is grade-school math, which is extremely easy for humans, or at least for humans with a certain level of proficiency. I mean, current models do very well. And I'd probably argue, right, you're typically looking at 97, 98% levels of performance; you're probably there looking at noise in the test set rather than anything about a model's capabilities specifically.
But yeah, going back to the earlier point, the next step is really: put that model in front of a human, get them to probe grade-school-math levels and types of reasoning, and convince a human that the model is actually consistently doing
what is being asked of it. So, you know, these deep learning models have characteristics of reasoning and they have characteristics of kind of statistical matching. And they seem to do both in one, which is brilliant, because we can align them with reasoning algorithms, but we can also get them to make good guesses in other situations. But you sketched out this example that sometimes it will just be like a hash table, where it just goes and retrieves the fact from the document, and sometimes it would do reasoning.
But the boundary is kind of clear, isn't it? So sometimes it's directed and sometimes it's quite spread out and quite diffuse. How does that process work, and how epistemically aligned is it? I suspect it's analogous to how humans learn, where we kind of learn
what I guess I'd call relatively simple tasks first, or learn to understand and apply relatively simple functions first, and then over time assemble those into more complex composite functions where you might apply a specific type of reasoning or a specific function for a particular problem, or might apply a different approach for a different problem.
I suspect that's similar to what we are starting to see, and what we'll probably see more of, in these deep learning models. It's inefficient to reason if you don't need to, right? If a model is asked, "What is one plus one?" and it has seen that hundreds or thousands of times in pre-training, it doesn't need to reason from first principles and figure out the complexities of math and how to do basic addition. But that doesn't take anything away from solving that problem, as long as the solution is correct.
What do you think about the ARC challenge? I think it's another step in the right direction. We need to keep coming up with challenges that really push the limits of current models. We'll solve that and then we'll move on to the next one. And hopefully at some point, I think it becomes a lot more relevant to ground
you know, these kinds of challenges in real world application and in terms of what's useful for us, what by design do we want AI systems to do? How do we want them to operate in society? How do we want them to interact with humans? And I think those questions are becoming more important and hopefully at a point in the not too distant future, you know, these would be the questions we care about, which is,
How can I get the existing technology and apply it to the benefit of humanity? Will connectionism get us all the way? Do you think we can come up with the right types of neural networks that will do reliable, robust reasoning? Or do you think we need to have some kind of hybrid architecture? I don't really care in the sense that I see no immediate reason
to expect that there are severe architectural bottlenecks to what we're doing. I mean, there are some, but they're easily overcome, right? So one example is tokenization and
We had this paper recently at EMNLP with Sander. It won an outstanding paper award. It got retweeted by Andrej Karpathy, which made my socials explode. I've never seen a spike like that in terms of the usage stats.
This work focuses on the concept of glitch tokens, which is this idea that certain tokens cause weird behaviors in deep learning models. There are various reasons for this. The main one is that you have a mismatch between the data that a tokenizer is fit to and the data that a model is then subsequently trained on. What ends up happening is you have tokens which are very rarely seen, or pretty much not seen at all, during training. In many cases you also have weight decay, which means that the embeddings of these tokens tend towards zero as model training goes on. And even if they don't tend to zero, they tend to a very small magnitude, where it becomes very hard to separate token-specific information and very easy for a model to get confused about a token's identity. This kind of effect
highlights an immediate limitation, not necessarily of the architecture, but of the current way of doing things. But I don't think that's a question of connectionism versus the more, I guess, traditional symbolic, structured way of approaching things. It's more just pointing out a limitation of the way we currently do things. I don't see any indication today that deeper, larger models won't be able to do more impressive things. And once we reach the point where they can't, I think then we'll know the answer, and we can take it from there.
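To make the under-trained token effect described above concrete, here is a toy sketch (not the method from the paper itself) of how L2 weight decay shrinks the embedding of a token that never appears in the training data; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 16
lr, weight_decay, steps = 1e-3, 1e-4, 100_000

emb = rng.normal(scale=0.02, size=(vocab_size, dim))   # toy embedding table
targets = rng.normal(size=(vocab_size, dim))           # stand-in for "useful" directions

seen_tokens = [0, 1, 2, 3]   # tokens that actually occur in training batches
unseen_token = 7             # the tokenizer has it, but the corpus never does

for _ in range(steps):
    emb *= 1.0 - weight_decay                 # decay is applied to every row
    for t in seen_tokens:                     # only seen tokens get gradient signal
        emb[t] += lr * (targets[t] - emb[t])  # stand-in for a real gradient step

print("seen token norm:  ", np.linalg.norm(emb[seen_tokens[0]]))
print("unseen token norm:", np.linalg.norm(emb[unseen_token]))
# The unseen token's embedding collapses to a tiny magnitude, which is why such
# tokens become hard to tell apart and can trigger odd behaviour downstream.
```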
But in the meantime, I think we've made incredible progress over the past few years, both from a data perspective, but also massively driven by just the efficiency and the parallelism that the Transformer provides.
I'm sure we'll see more innovations in the future, and I'm sure that further efficiency gains will accelerate things further. It remains pretty clear that an LLM's data efficiency is not quite the same as a human's. LLMs see many lifetimes more of human data during training than we do, but they can afford to, right? If the compute and the resources are there, they can; we are time-constrained and AI systems aren't. Yeah, I suppose there is a similarity there in the sense that the kind of information we process and build on is also multi-generational.
Like our language and our knowledge just seem to continuously grow, even though we die every 70 years or so. I think that's very true, and I think there's definitely some compression of knowledge that happens, which makes passing information on to the next generation more efficient.
But software has the advantage there. The space of continual pre-training is only just starting to be explored, and currently, in most cases, it's limited by architectural constraints, right? If you want to increase the size of your model, it becomes non-trivial to figure out the right initialization based on the knowledge the model currently has. But if you view what LLMs are doing as
an efficient compression of the data, then starting from an existing efficient compression seems a lot more efficient than starting from scratch every time. And that's a lot more convenient to do with LLMs than it is with humans. You can copy the weights of one model, pass them onto another model. You have an instantiation of exactly the same
kind of knowledge and information and capability, in a way that you can't do with humans. How important is having a big context window for you folks? We're starting to see some models with really big contexts now. Is that something that's quite easy to add on, or is it fraught with problems? It's generally quite challenging
to maintain high performance across the span of the context window. And going back around 12 months, there was definitely what felt like this race to go larger and larger in terms of context window size.
Our first generation of models had 4K context windows. Our current models have 128K context windows. And you saw pretty much all of the LLM providers really start to push this idea of longer context and market that. Google in particular, I think, made some very interesting innovations in that space, going towards context windows that were previously considered quite challenging to achieve.
and still maintaining good performance. And I think it's a bit of a trade-off, in that the vast majority of user queries are not long-context, but long context is useful for many kinds of things you want to build on top of language models. So, things like retrieval-augmented generation.
That's sensitive to how good your retrieval is. So Cohere has extremely competitive embedding models, which are very helpful there, and re-rank models as well. And then there's a question of, well, how much of what I've retrieved and ranked can I put into the model as context, in a way that the model can use that information and aggregate information across documents
in ways that are beneficial. Then once you get to things like tool use, once you get to things like incredibly long conversations, there's these ideas of effectively infinite conversations with the model where the conversation to some extent is the state. We were talking about personalization earlier. If your entire conversation with the model
is representative of the way you want to interact with the model, including feedback, then potentially that's an area which is ripe for exploration. When it comes to working with code, there's massive advantages to having long context capabilities where you want to do something like
give a model your entire code base and have it process information, extract information, potentially rewrite things. There's a lot of value you can unlock there. And I think it's a bit of a chicken-and-egg problem. Models traditionally had relatively short contexts. I mean, going back, I think BERT had 512-token contexts and most people thought that was plenty. And now we're in the hundreds of thousands, if not millions. But I guess there's a question of at what point it makes sense to continue to feed things into the context. It becomes more inefficient, right?
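For reference, a minimal sketch of the retrieve, re-rank, then generate pattern described above; embed_fn, rerank_fn and generate_fn are hypothetical stand-ins rather than any specific provider's API:

```python
from typing import Callable, List

def rag_answer(
    query: str,
    documents: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    rerank_fn: Callable[[str, List[str]], List[float]],
    generate_fn: Callable[[str], str],
    top_k: int = 20,
    top_n: int = 5,
) -> str:
    # 1. Coarse retrieval: embed the query and documents, keep the top_k by dot product.
    q_vec = embed_fn([query])[0]
    doc_vecs = embed_fn(documents)
    scored = sorted(
        zip(documents, (sum(q * d for q, d in zip(q_vec, v)) for v in doc_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_k]

    # 2. Re-rank the shortlist with a (usually more expensive) cross-encoder-style scorer.
    candidates = [doc for doc, _ in scored]
    reranked = sorted(zip(candidates, rerank_fn(query, candidates)),
                      key=lambda pair: pair[1], reverse=True)[:top_n]

    # 3. Only the top few documents go into the context window, rather than everything.
    context = "\n\n".join(doc for doc, _ in reranked)
    return generate_fn(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```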
You can do some clever things with caching and you can speed things up, but there's just generally more processing that needs to happen to generate the same answer. And if that answer is better than what it otherwise would have been, then I think that's fair game. If you get to the point where you're adding information to the context for no good reason, then
I don't really hold any judgment. I'm just not sure there's much sense in doing that. So I remember a few months back, there was this idea that you don't need to do retrieval augmented generation with infinite context models because you can just put in as input all of the data that exists, which is very true. You could, but
do you want to? What would you gain from that? It almost feels intuitively a bit wasteful. It's like, again, if I asked you something like, "What is the color of the sky?" but also told you that you need to read the entire internet to be able to answer that question. In many cases, you don't. What do you think about
reasoning models? I mean, of course, it's definitely something of interest, right? We generally want to work towards developing models that are best in class across all dimensions. And we know that general reasoning capabilities help on both fronts: models perform a lot better on reasoning-style benchmarks, which in itself has value, but they're also more useful and more valuable for people, because you can do more complex things with them. I quite like the idea of being able to trade off test-time compute against performance. So that's one thing that my team is currently working on. But we're looking at it a bit more broadly.
It's not just allowing the model to generate more tokens to optimize some performance number. It's more around thinking about how these models are interacting with users, what signal and information you can share with users, maybe even as you're going through this reasoning process, and how, again, we make that controllable and customizable. You might have a
relatively simple, non-critical task where, say, if models had well-calibrated confidence scores, you could tell the model something like, "If you're at least 60% confident, give me the answer." As long as you had that core requirement of reasonable calibration, which basically means that if you ran that query 100 times, the model should get the answer right roughly 60% of the time when it reports 60% confidence, then that might be fine in, say, creative writing settings, or it might be what you want in particular settings.
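A minimal sketch of what that calibration requirement means operationally: bucket predictions by the confidence the model reported and check that empirical accuracy in each bucket roughly matches it. The confidence/correctness pairs below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical (reported_confidence, was_correct) pairs collected over many queries.
predictions = [(0.60, True), (0.60, False), (0.65, True), (0.90, True),
               (0.92, True), (0.55, False), (0.61, True), (0.88, False)]

buckets = defaultdict(list)
for confidence, correct in predictions:
    buckets[round(confidence, 1)].append(correct)   # bucket to the nearest 10%

for conf_bucket in sorted(buckets):
    outcomes = buckets[conf_bucket]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"reported ~{conf_bucket:.0%}: actually right {accuracy:.0%} of the time "
          f"({len(outcomes)} samples)")

# If a model that says "60% confident" is right about 60% of the time over many
# such queries, a rule like "only answer when you're at least 60% confident"
# starts to mean something.
```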
Whereas if you had a really complex, really critical task, you might be able to say something like, "I can't afford for you to get this wrong. Just think as much as you absolutely need to. If you're in any doubt about the answer, tell me. But if you're going to output an answer, make sure it's right." Obviously, we're far from any model that can do that. But I think that kind of
behavior, kind of thinking about the problem from almost a user interaction perspective. I think, yeah, it's going to be extremely interesting over the coming years. I agree. And, you know, the big challenge is just building the apps, imagining what can be done with this technology, just the basic stuff like doing enterprise search really well. We've got a long way to go, but it's going to be an exciting journey.
For sure. Max, thank you so much for joining us today. It's been amazing. Thank you. Likewise. Thanks for having me.