You could... I could just hear, like, the slight creepy sounds of the background and the footsteps.
Three things you're going to learn: everything about sparse autoencoders, what the right way is to do causal interventions on models and how to think about it, and why mechanistic interpretability can help us make models safe. Machine learning is, in some sense, a really weird field, because we produce these artifacts, neural networks, but no one designs them. They're not like a computer program where someone made a design and then wrote the code. We just create some flexible architecture, shovel lots of data in, and an effective system comes out, but we have no idea what's going on inside or how it works. And these are really impressive systems. They can do lots of things, lots of complex software engineering tasks and complex reasoning, and get IMO silver medals.
We just don't understand how they work and what's happening inside. We have computer programs, essentially, that can do things that no human programmer knows how to write. And my job is to try to fix this. Is AGI an actual risk? That's a pretty polarized question, with lots of passionate people and strong opinions on both sides, but frustratingly little empirical evidence. And I think one of the things interpretability could potentially give us is a clearer sense of what's going on inside these systems.
Like, do they do things that we would call planning? Do they have any meaningful notion of goals? Will they do things like deceive us? The more we can understand how these things manifest, and whether they occur in situations where they shouldn't, the more I think we can learn. And this seems like a really important research direction to me.
The robustness of its representation post-steering kind of indicates to me that it's more than just keyword matching. It actually understands what the thing is.
It's pretty clear to me that it's not just keyword matching, because we observe things like multilingual features, where the same text in different languages lights the feature up. You tend to see this beyond a certain model scale. In Scaling Monosemanticity they had multimodal features: the Golden Gate Bridge one lit up on pictures of the Golden Gate Bridge.
And, like, with my interpretability hat on, I don't find this very surprising, because if you can map your inputs into a shared semantic, abstract space, you can do more efficient processing. Like, of course it does this.
And there's some interesting work, like the Do Llamas Think in English? paper from, I think, Chris Wendler, that seemed to show that the model decides what to say and what language to say it in at kind of different points, and you can causally intervene on them separately. But I think if you have the kind of n-gram-matching, stochastic parrot perspective, you would not predict this.
And I don't really understand the people who hold that perspective now, which, to me, is clearly falsified. So, to understand sparse autoencoders better, we need to first begin with the problem they're trying to solve. So you can think of the neural network as follows.
It's made up of a bunch of layers. You put in some input and it gets converted to a vector, or a series of vectors in the case of the transformer, and then each layer transforms this into a new vector, or series of vectors. These are the activations in the middle.
So the layers in the middle have intermediate activations. We believe that these activation vectors often represent concepts: features, properties of the input, something interpretable. But it's just a vector. We need some way to convert it into something that is meaningful.
And sparse autoencoders are a technique that tries to do that. They basically decompose the vector into a sparse linear combination from some big list, a dictionary, of meaningful feature vectors. We hope that these feature vectors correspond to interpretable concepts, and it's sparse in the sense that most vectors are not part of the combination on any given input.
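To make the decomposition described here concrete, the following is a minimal illustrative sketch (not from the episode) of a sparse autoencoder in PyTorch. The dimensions, the ReLU encoder, and the L1 sparsity penalty are common choices but are assumptions here, not a description of any particular lab's implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose an activation vector into a sparse combination of dictionary features."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)   # maps activation -> latent coefficients
        self.decoder = nn.Linear(n_latents, d_model)   # its columns act as the feature dictionary

    def forward(self, activation: torch.Tensor):
        latents = torch.relu(self.encoder(activation))  # non-negative, hopefully sparse coefficients
        reconstruction = self.decoder(latents)          # sparse linear combination of feature vectors
        return reconstruction, latents

# Training objective: reconstruct well while keeping most latents at zero.
sae = SparseAutoencoder(d_model=768, n_latents=16384)   # sizes are illustrative
acts = torch.randn(32, 768)                             # stand-in for residual-stream activations
recon, latents = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * latents.abs().mean()  # L1 term encourages sparsity
```

The decoder's columns play the role of the dictionary of feature vectors, and the L1 term pushes most latent coefficients to zero on any given input, which is the sparsity being described.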
I think that if you go on something like Neuronpedia, they'll often show you different intervals, like what is between the fifty-eighth and seventieth percentile of activations, and that gives you some examples. And this is useful for getting a kind of broad view. Also, just looking at the causal effect is interesting. Like, there was a feature that lit up on fictional characters in general, but especially Harry Potter, and when I steered with it, it was kind of only a Harry Potter feature.
You can join MindsAI, which is part of Tufa AI Labs, to do cool ARC research and eventually beat ARC. After that, they're going to work on LLM-based reasoning, following in the footsteps of o1. All pure research, no product, well funded. Go to tufalabs.ai. We'll be covering some of their research talks in Zurich in the coming months.
We're also sponsored by CentML, the insanely optimized model serving platform. It's best in class compared to others: four hundred percent faster tokens per second than the hyperscalers, and over seventy percent cheaper, just eight cents per million tokens on Llama 8B on your own private endpoint. You get ten free credits to evaluate the platform.
So what are you waiting for? Sign up at centml.ai. This is quite a long podcast. The first part of it is an outside section that I recorded with Neel, and then the rest of it is the inside component, where we talked a lot about mechanistic interpretability and sparse autoencoders.
Enjoy.
Machine learning is, in some sense, a really weird field, because we produce these artifacts, neural networks, but no one designs them. They're not like a computer program where someone made a design and then wrote the code. We just create some flexible architecture, shovel lots of data in, and an effective system comes out, but we have no idea what's going on inside or how it works.
And these are really impressive systems. They can do lots of things, lots of complex software engineering tasks and complex reasoning, and get IMO silver medals. And, like, we just don't understand how they work and what's happening inside.
We have computer programs, essentially, that can do things that no human programmer knows how to write. And my job is to try to fix this. So I run the Google DeepMind mechanistic interpretability team.
And mechanistic interpretability is a type of AI interpretability that says: I believe that neural networks, in general, learn human-comprehensible structure and algorithms inside, and that with enough care and attention and rigor, we can go through them and at least start to uncover parts of this hidden structure, and go beyond just treating it as a weird black-box system, or a thing we take gradients through to light up parts of the input, and actually understand how the algorithm is represented in the weights and activations. And it's really hard, but we've made some progress that I
will talk about today.
Yeah. What is reasoning to you?
Do you mean what it is to know when an LLM is reasoning, or just what reasoning is in the philosophical abstract?
The philosophical abstract.
God, the philosophical abstract. So, I'm pretty sympathetic to it being the application of some kind of logical rules of inference: you've got some knowledge, and you do things with that knowledge to produce more knowledge. Though it's kind of unclear whether there needs to be a sense of intentionality and agency behind it. Like, if a squirrel has the learned reflex of 'noise, run up tree',
is this reasoning or not? I don't know. And I think there's also an interesting question with LLMs of: is the LLM's forward pass the thing we're analyzing, or is it the LLM as a system, including long generations and chains of thought, or ridiculously long chains of thought in the case of o1? Is that the thing that we consider to be reasoning? Because that can clearly reason. It produces a logical stream of inferences and you can give it arbitrary problems, like benchmarks full of olympiad-style questions, and models can do those with chain of thought. While it seems a much higher bar to say that it can do that kind of thing within the forward pass.
And honestly, that seems more like a claim about the internal cognition, and I know that's why we've got the field of mechanistic interpretability, but we're much hazier on what's going on there. In terms of inference-time compute versus having kind of phase transitions or emergent things during training, that seems hard to compare to me. Like, when you train, when you spend more compute on a model, it will have more capabilities.
Some might be learned in a sudden way, some might be learned gradually. I don't really care; I don't think that's very relevant here. If you want to train a model with less compute but spend lots more on inference-time compute, that will get you a different trade-off.
I basically just buy that, for many use cases, the ratio of inference-time to training-time compute has been quite off historically, and that in future it's going to be more balanced. I think that we don't yet know what these new inference-time-compute-focused systems are going to look like economically. Like, we have scaling laws for loss curves, but how does this cash out in economically useful things?
So, for example, one thing I found quite interesting in the o1 system card is that it was only about as good as Claude 3.5 Sonnet at agentic tasks. I think that inference-time compute systems are likely to be most useful for that kind of thing, because you're happy waiting a while while it just goes off and does something you don't need to interact with. And I predict we just haven't learned how to use them.
And I predict that we haven't learned the right scaffolding to make them good at this. And this just gives me very wide error bars on how the economics are going to shake out.
You know, having a good Twitter, I think, is paramount to getting intelligent commentary.
I have occasionally seen intelligent commentary on there.
Here's a good example. So, Joe Smith said: ask him literally anything about why AI safety is important, so he can dunk on you. Ding dong.
Alright, why is AI safety important? So, I think dunking is distasteful, but also... I think that human-level AI is just a thing that is technologically feasible, that will happen sooner or later, and a bunch of people are trying to build it. And this just seems like an incredibly big deal that will massively change the world. And if you have these intelligent autonomous agents, there are ways this can go well and there are ways this can go badly. And it's just really important that we spend a ton of research energy on things like safety and interpretability, so we can get this right.
Daniel Filan said: how much progress have we made in mech interp over the last five years, and are we on track?
I feel reasonably good about where the field is at. I think that we're trying to solve a really hard problem, and we might fail. Though I also think that progress in mech interp often looks like uncovering more and more of the structure in a system, such that even if we don't get to the kind of really ambitious goals in the field, which I think are often not very realistic, I hope we get to the point where we're able to do interesting, useful things with these systems.
And is that on track? Is that enough? Kind of unclear. I feel like we've made quite a lot of progress.
I feel like there are a lot of things I now understand much better, like superposition and what to do about it, how to do principled causal interventions to try to work out circuits, how to engage with the transformer architecture, what kinds of structure to expect in these models. But also, there's a ton we don't know. So I'm split.
Is there a numeric benchmark that has a measurable score in mech interp? You know, something like the BLEU score in NLP. Do you think that would be useful? And how close are you to getting that?
I think the problem here is that summary statistics can always lie to you. You're taking a rich, complex, confusing, high-dimensional object and you're trying to compress it down to a number, or a bunch of numbers. Now, if you have a well-defined technique where you know what it's supposed to be doing, and you're just trying to measure something, that's fine. Like, with a sparse autoencoder, how good is it at reconstructing the model at a given level of sparsity? That is pretty measurable. Or how good is it at helping me unlearn a fact? That's also measurable. And these are just things that I would like sparse autoencoders to be good at.
But there are lots of questions like: does this model exhibit superposition? Have we truly understood the circuit, or could there be, I don't know, a positive and a negative bit cancelling each other out that we hadn't noticed, but that comes apart on some other input and is important? And it's just really hard to come up with something that properly captures these. Though, as the field matures and becomes more paradigmatic, this is a direction I want to see.
So, Korea said: how promising do you think control vectors, circuit breakers and RepE are?
Do you know what that is?
So that stands for representation engineering, sorry, RepE, which is similar to other techniques like activation steering and control vectors.
And yeah, so the key idea here is basically that you define a vector in some way. Like, you give the model the prompt 'I love you' and the prompt 'I hate you', you take the difference in the residual streams, and you've got a kind of love vector. Then you add this in and some change happens, like it outputs happy text. And circuit breakers, I believe (I don't currently remember the details of the paper), is basically a way to make models, if they see something harmful, not output the harmful thing, using control vectors and some kind of training. But I don't really remember the details, so sorry. I think this whole focus on activations is an interesting direction, and it seems like an interesting technique. I like the fact that this fairly interpretability-inspired, model-internals technique can do cool things. I mean, I think Golden Gate Claude was very similar to this technique. It happened to use an SAE, but that probably wasn't necessary, and Golden Gate Claude was hilarious. And there also seem to be ways you can make this more useful, like reducing hallucination rates or increasing truthfulness. Is it going to be enough to align, to help us align, a human-level system? Highly unclear, but it might help a little.
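As a rough illustration of the contrastive-prompt recipe described here, below is a hedged sketch of activation steering using a small open-source model via Hugging Face transformers. The layer index, scaling factor, and prompts are arbitrary assumptions; real steering work tunes these carefully and this is not the exact method of any paper mentioned above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM with a residual stream works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
LAYER = 6  # which block's residual stream to read and steer (a free choice)

def resid_at(prompt: str) -> torch.Tensor:
    """Residual stream at the last token position of `prompt`, taken at layer LAYER."""
    ids = tok(prompt, return_tensors="pt")
    hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, -1]  # shape (d_model,)

# The "love vector": difference of residual streams for two contrasting prompts.
steer = resid_at("I love you") - resid_at("I hate you")

def add_steering(module, inputs, output, scale=4.0):
    """Forward hook: add the steering vector to the residual stream at every position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steer
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I think that you are", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))  # hopefully happier text
handle.remove()
```

The design choice worth noting is that the intervention is a single added direction, exactly the "fairly interpretable, model-internals" flavour of technique being praised.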
Annus says, and this will echo Walid Saba (Walid Saba is a friend of the show; he is a GOFAI person, just to put that in context), he said: assuming that explaining a decision means doing an inference in reverse, so from the decision you backtrack to find the steps that led to it, and given that neural networks are incapable of doing that because there's no invertible composition, how can we truly achieve
explainability?
So that just isn't really how I think about the problem. First of all, you don't need to care about the invertibility thing, because models are a stack of layers. Each layer is a much simpler function, and you can just analyze the activations in between. So it's not like I just have the output and the input and need to somehow reverse engineer a black box.
We have the weights, we have the activations. We know every mathematical detail of what is going on in the system. We just, by default, have no idea what any of it means.
And that is the problem we're making progress on. And, you know, if you have one activation and another activation, and you ask how you got from A to B, you can make some inferences. You can look at the weights, you can do causal interventions. I am not holding myself to the standard of proof of 'we have inverted a mathematical function'; I agree that neural networks are not invertible. But I am holding myself to the standard of 'I have good evidence that my story is faithful to what's going on, or at least part of what's going on, inside the model'. And there are a lot of ways you can try to find that.
Which I'm sure we'll talk about. I must admit I'm genuinely confused about this. So I don't think... well, I mean, maybe our brains are compositional, but it feels like our brains maybe aren't that different from neural networks. Of course they are different, right? Because the neurons can talk to each other almost independently and you can get these repeating cycles and so on, whereas neural networks are left to right. But if we just leave that aside, they are similar in the sense that they're very sub-symbolic, they're very diffuse, they're very complex, they're entangled. Yet when you look at our mind, and when you do psychology, it appears as if we can do compositional reasoning. So isn't this weird? There are two ways of looking at things. You can look at a neural network and think, oh yes, it's blown up and it's entangled and it's complex, and different bits of the neural network apparently do the same type of thing. Could it ever be compositional in some sense?
It feels more confusing to me when we're discussing a single forward pass. But if you give it a scratchpad to do chain of thought on, it can obviously do all of this. It just writes it down and then does inference on that. And is this different from me thinking verbally through a problem and writing it down? Is that different?
Why do you think that self-prompting, chain-of-thought rationales, gives them uplift?
To me,
it was really intuitive that you get an improvement in capabilities when you have chain of thought. Like, when I have paper to write down my thoughts on, I am smarter. I can do harder problems.
In terms of computational complexity, the model, rather than going through every layer once, can transmit some information back, and it can also just parallelize things more. So it can have one token where it's doing the 'what should my plan be' calculation, another one where it's working out what step one of the plan tells it to do, and another one where it's executing what step one tells it to do. And if you've got to do all of these in the same token, the model gets very crowded. There are lots of concepts interfering with each other in the residual stream, and this makes it more error-prone for the model. And also, there is just some inherently serial nature to a lot of this computation.
The rationale might be noisy. It might be wondering what the average number of carrots eaten every day in Belarus is. And in problems that require backtracking, the previous trajectories might become noise and the model might become distracted. So it's not definitely a win.
And I expect there are situations where it will make the model worse. For example, there was this delightful paper from Miles Turpin on unfaithful chain of thought, where you give the model a bunch of multiple-choice questions, each with a chain-of-thought response and then an answer, give it ten of these as a few-shot prompt, and then give it another one.
In the few-shot prompt, the correct answer was A for all of the questions, and in the final one the correct answer was B. And they found that if they prompted it to give a chain of thought for the final question, it gave a bullshit chain of thought for why the correct answer was A, and then said A. While if they didn't ask it for a chain of thought, it was more likely to say B. In some ways, is this a kind of boring example? Because the prompt probably triggered the few-shot pattern matching more strongly. But it's also cool.
I would bet that the model has an internal representation of 'the answer is A; I should generate a specious explanation based on this fact'. What I would love is for someone to do a mech interp project on what the hell is going on there.
But from a model theory-of-mind perspective, does that in any way denigrate the idea that they have goals and intentionality? If it's such a complex token-by-token space... I mean, maybe we don't have goals and intentionality either, but does that in any way denigrate the kind of theory of mind of these agents?
I expect goals, mechanistically, to basically look like: you have some criteria, you consider actions, and you judge actions according to those criteria. I expect this to be easier with an internal scratchpad, but my general intuition is that if a bad model can do something with an internal scratchpad, a good model can do it without the internal scratchpad, and can do even better stuff with a scratchpad. And algorithmically, it doesn't seem like evaluating actions should be that hard. I think this seems like quite an interesting thing to study in more toy RL settings, where there seems to be decent evidence that things engage in planning. Like, Erik Jenner has this delightful paper looking at Leela Zero that I think found evidence that it was thinking at least two moves ahead.
Do you have any advice for fresh PhDs in mech interp?
Don't spend too much time reading papers. I think the standard academic approach to getting into a new field is 'I will read all of the literature', and you should read some of the literature (I have a reading list we can probably link in the description), but I think you should also just spend a lot of time coding. Play with models, play with sparse autoencoders, play with whatever you're curious about. There's pretty good tooling and tutorials nowadays.
And we will hopefully have some of those in the description. It's just a very empirical field. You want to build intuition, and you can often get something like a cool blog post out of not that much work and playing around. Also, get mentorship. Ideally a supervisor who spends a bunch of time with you, or postdocs in the lab, but try to find ways to spend time with them, maybe by focusing on being useful to them, like helping with their projects. I'm a big fan of pair programming as a way of learning technical skills.
If you can find anyone who's better at ML coding than you, or at interp coding than you, and convince them to pair with you, maybe as part of helping them with some project or something, I think that can be a great way to learn. And I think lots of people in academia somewhat neglect technical skills (some are excellent, so no shade), and just, the better you are at coding, the faster your experiments and the more research you'll do. Be skeptical. It's really easy to come up with a beautiful, pretty idea when the idea is actually completely wrong.
You get attached to it. You don't notice. An adviser, or even just peers, who can red team your ideas are really helpful.
You should also just spend a lot of time doing that yourself. How could this be false? What are the ways this could not be true? What are experiments that would distinguish true from false? Yeah.
Philipson said: how on earth (and that's 'how', full stop, 'on', full stop, 'earth', full stop) did you manage to get to where you are at just twenty-five years of age? Share your hero's journey, please.
Well, so the whirlwind life story is: I finished a maths degree at Cambridge. I was going to do a master's, but why on earth would you do a master's during a pandemic? So I spent a year doing internships at various safety labs, like DeepMind and the Future of Humanity Institute and the Center for Human-Compatible AI at Berkeley. I wasn't super thrilled about any of the specific agendas I'd worked on, but I was like, I believe that AGI is a big deal and I should work on that. And then I massively lucked out
and got a job offer to work with Chris Olah at Anthropic. I spent some time there, ended up leaving for health reasons, then spent about a year independent, and then ended up at DeepMind. In terms of how this happened, I'm kind of confused, to be honest. I think the key ingredients were getting into a field early that was growing really fast, and I'd like to think I've done my bit to help it grow, which has also been useful to me.
I'm really good at mentoring people and I just find it really fun. I'm really good in the sense that I can have people produce cool papers without me spending that much time helping them, and I also find this a very fun thing to do, which means that I both have my name on a lot of papers and I'm a better researcher for it. And, yeah, luck, and having great mentors like Chris Olah, and finding a field that was just a very natural fit for my tastes and talents. I think it is unfortunately harder to get into mech interp now than it was three years ago, sorry. The field has grown, more people are interested. It's terrible.
Now, one day we will get Chris on the show.
I'd be so excited. Yeah.
The man is an absolute living legend.
It's so great. I think he's only ever done one podcast, on 80,000 Hours, but it's so good. People should definitely listen to it. And as context for people who have lives: Twitter has recently been very interested in the fact that I'm twenty-five, which is very confusing.
Oh, what are they saying?
Let's see. So someone made a meme where they said: today is Neel Nanda's nineteen-thousandth birthday, please comment happy birthday Neel if you believe that reverse engineering neural networks is important for reducing AI risk. And someone also wrote this hilarious post, which was completely bullshit, about Neel Nanda, the eighteen-year-old prodigy heading AI safety inside Google, who was apparently writing source code at eighteen months old. And I'm very confused.
Yeah, the Twitter meme-sphere is a very strange place. We have now produced the most comprehensive YouTube video on SAEs. What is an SAE?
SAEs, sparse autoencoders, are a tool that lets you look inside a model as it runs and see what it's thinking. I think this is one of our most exciting tools to date for interpretability. But also, most people don't really get how to think about them.
And I have a lot of hot takes, having spent several years discussing them. So if that sounds like a good time, maybe you should check this out now.
Why is mech interp relevant to AI safety?
Yes, so I think there's this interesting question of motivation where, in my opinion, mech interp is just a scientifically fascinating field. We just don't get how these systems work internally, but we can make progress.
And there are all of these beautiful puzzles to discover, and that is part of why I work in the field. But a large part of why I work in the field is that I'm pretty concerned about existential risk from AGI, and I think this is a promising way to make progress on it.
At a very high level, I just think that it's much harder to make a system safe, or evaluate how safe it is, if you don't know what's going on inside. And I think any progress we can make on deciphering its internals is great on that front. A particularly concerning capability, and one that I think matters here, is deception: systems that are capable of deceiving us.
And, you know, trying to evade our evaluations. If it can evade your evaluations, every other evaluation is kind of useless. And it may be quite hard to figure this out by just interacting with the model, because if the model is smart enough, it can notice what you're doing. But I think it's possible that, if we can look inside deeply enough, we might be able to detect when this is happening, and potentially form a deeper understanding of it, and potentially stop it.
I think another angle would just be that there's a lot of confusion around whether alignment is a big deal and whether these risks matter. I think this is quite bad. I think it would be so much better if there were a good empirical grounding here that we could agree on, and empirical questions to study.
But there are lots of things, like can these systems plan, do they have goals, that there's no real consensus about, or about whether this will happen. And I think that mechanistic interpretability is potentially a promising tool for questions like that, because they are fundamentally questions about cognition. I don't think mech interp is the only tool for answering these questions, but I believe it can be a promising one.
Is it possible that there would be a level of sophistication in training where the model would deliberately, almost, encrypt itself, to stop you from putting probes in and understanding what it's doing?
Maybe. My intuition is that that seems significantly harder than learning to lie to people, because it just has so many examples of that in the training data, whereas how to warp your internals to evade interpretability tools is not a thing it has done or gotten feedback on. It's like, I don't really know how to warp my brain so aliens can't read my mind. And, arguably, there are patterns of cognition that should be easier or harder to get at. It could do things like not thinking about its goals very often. Maybe that would happen, but I don't know. I predict there will be a window where this stuff is useful before we get models that are so galaxy-brained it's hard to know what tools will work on them.
Going way back: my girlfriend was actually very excited when she heard I was coming back on, because she said the previous episode was so relaxing to listen to that she used it to help her fall asleep.
I did say last time that Neel's dulcet tones will melt the stress away, and I wasn't lying. But there's the other side of that, which is that you're an incredible researcher in mech interp. And I think a good lead-in for this is, in your own words, what is mech interp,
in short?
Mechanistic interpretability, or mech interp for short, is a subfield of the general study of interpreting AI, neural networks, black boxes, whatever you want to call them. And it's kind of based on the hypothesis that neural networks, when trained, have learned some kind of human-comprehensible algorithm, but they don't have any incentive to make this legible to humans.
But there is some underlying structure and algorithm there, because that is what does the computation that produces the outputs. And by careful and principled reverse engineering and scientific scrutiny, we can decipher parts of those hidden mechanisms. Ambitiously, this would look like understanding the entire algorithm, but even just understanding what the intermediate variables are, parts of that structure, feels like meaningful and exciting progress.
To me, I mean, some people argue that this is a mammoth task. Some might argue that it's infeasible, or maybe even unnecessary. What would you say to those people?
Well, unfeasible and unnecessary are two very different things. My reply to unfeasible is that I think it's reasonable to say that fully reverse engineering, say, GPT-4 into source code is not very realistic, but I think there are a lot of important, useful things we can understand about these models with partial progress. And personally, a large part of why I am in this field is that I'm pretty concerned about existential risks from AGI.
And I think it's important that we have stronger methods for studying these systems and ensuring safety. And I think that even partial progress in mechanistic interpretability can help us get there, along with it just being scientifically fascinating and beautiful. Like, it kind of weirds me out that people are just okay with this:
systems that can do a bunch of software engineering tasks that seem really complex and difficult, and we have no idea how to make a system that does that in terms of designing it. We don't understand what is inside these things. They are just things that work. And we've made progress.
But I still feel pretty confused at a deeper level about what happens inside them. Regarding unnecessary: what does necessary even mean? I think it is important to do research that helps us make these systems safer, and I think that one such pathway is interpretability.
I'm not arguing that if we don't do any interpretability research, we will never produce safe systems; that would be unreasonable. I'm not arguing that if we don't pursue specifically mechanistic interpretability, but pursue some other philosophy, we will never make safe systems. I am just saying I think this is a promising approach, I think it is helpful, I think it is helping us find real, interesting scientific truths about these models, and I want to find them.
So if those are the goals of mech interp, what are the sub-goals?
Yeah, so I maybe divide the field into three conceptual areas. There's basic science: what on earth is going on inside models? There's automation: taking the techniques and tools for understanding systems and scaling them up, maybe getting an LLM to do them, maybe just having an algorithm that can do them. And then there are practical applications:
doing things in the real world where the goal isn't necessarily to advance our understanding of interpretability, it's to do something useful and get feedback from reality in the process. And historically, most of the field has focused on basic science. I think this was correct, and I think we made a lot of progress. Going forward, I'm excited about exploring the other areas more and building on all of this progress. Within basic science,
a decomposition that I think is often useful is: if you think of the model as a computer program, in some sense there are two tasks. What are the variables, and what is their state at each line of the program? And what are the lines of the program, what is the code, what is the algorithm? In mech interp jargon, you call the variables features, and their values are how much they light up, and you call the algorithms, or lines of code, circuits.
And generally, we study the model's activations, the things it computes on the fly when you give it an input, because we believe those represent the variables, like what the model is thinking about, in some sense, if you'll forgive the anthropomorphism. On the other hand, circuits live in the parameters of the model. They are a thing learned during training that doesn't depend on the input, but is responding to the input and routing it down different channels and things like that.
You may have already answered this question in your last answer, but just to hammer it home a little bit: before, we had things like LIME and SHAP, where we might be doing black-box model interpretability, you know, with a surrogate model or something like that. Now you seem to be talking about what algorithms these models are running. How do you contrast those
different views?
Yeah, so that's not quite my field, so while I'll say things about it, they could be wrong. I'm sure there are tons of papers with all kinds of variants of these things, so if I express any critique, I'm sure someone can send me a paper that addresses it, or something I don't know about. But at a high level, my understanding of LIME is that it's roughly: let's take a local approximation to the model that's linear, or something similar.
I think this is a pretty reasonable thing to do, but I think the question is how you're applying the technique. It kind of feels like more of a philosophical difference. Historically, a lot of interpretability work has just looked at the inputs and the outputs and hasn't looked at the internals, or has only used them to compute gradients or something.
And mech interp is trying to go beyond this and look at things like the activations of the model, which, again, people have done before, but it's also trying to understand the internal connections and the parameters and how this all connects up, and just pushing for a more ambitious standard. And techniques like LIME and SHAP may be useful, may not be useful, but they are a tool you apply in some situation. For example, attribution patching, which we'll probably discuss more later, very roughly approximates a causal intervention using gradients.
I have a blog post introducing it, and there are also a bunch of other papers that do similar things from previous years; everyone always reinvents everything. And this is quite similar to LIME in spirit, I think, albeit a different approach to finding the linear approximation. Gradients are really cheap and they seem to work basically fine. But with attribution patching, you typically apply it to the model's activations rather than the model's inputs.
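To make the contrast concrete, here is a minimal sketch of the idea on a toy two-layer function rather than a real transformer: the exact effect of patching a clean activation into a corrupted run, versus the gradient-based first-order estimate that attribution patching uses. All the shapes, weights, and function names here are made up purely for illustration.

```python
import torch

# Toy stand-ins: an intermediate "node" activation, and the rest of the model from
# that node to a scalar metric (e.g. a logit difference).
W1, W2 = torch.randn(16, 8), torch.randn(1, 16)
def node_act(x):          # the intermediate activation we might patch
    return torch.relu(W1 @ x)
def metric_from(act):     # downstream computation, from that node to the metric
    return torch.tanh(W2 @ act).squeeze()

x_clean, x_corrupt = torch.randn(8), torch.randn(8)
act_clean = node_act(x_clean)
act_corrupt = node_act(x_corrupt).requires_grad_(True)

# Exact causal intervention: patch the clean activation into the corrupted run.
exact_effect = metric_from(act_clean) - metric_from(act_corrupt)

# Attribution patching: first-order Taylor approximation of the same quantity,
# using the gradient of the metric at the corrupted activation.
metric_corrupt = metric_from(act_corrupt)
grad = torch.autograd.grad(metric_corrupt, act_corrupt)[0]
approx_effect = (grad * (act_clean - act_corrupt)).sum()

print(float(exact_effect), float(approx_effect))  # close when the patch is "small"
```

The appeal is cost: one forward and backward pass gives an approximate answer for every node at once, whereas the exact intervention needs a separate forward pass per node.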
I believe SHAP is kind of similar. It's an attribution method where you ablate a bunch of inputs, and you ask what the Shapley value is by which each one affects the output. And again, this is a reasonable tool you could apply to the model's activations.
I haven't actually seen it deployed there; I'd be kind of curious to see the results. Ideally, you have a more rigorous way of doing the thing, and then you try using a cheaper tool and see if it's a good approximation, though the constant challenge in interpretability is: what even is the ground truth? Okay, so going back to the question. I think in some ways mechanistic interpretability is kind of continuous with what's come before.
Ultimately, the goal is to take a model and try to understand what happens inside. There are some cultural differences, like different people with their own ideas. Originally it was vision and then it moved to language models, and there are some people who have been doing language model interp forever. But it's more of a perspective and philosophical difference than necessarily a technical one; it's not just introducing a new technique, is what I'm trying to say.
Yeah, you know, I was speaking with Andrew Ilyas from MIT, and he is working on datamodels, which I guess is not too dissimilar to something like SHAP, where you actually model how changes in the dataset affect the overall predictive architecture. But there's always this thing that it doesn't scale very well, right? And I guess the question would be: what would it mean to reverse engineer a language model?
So there are kind of two things that I think would both be reasonable senses of this phrase. The first would be that you just fully understand the model: you have a human-comprehensible algorithm that acts basically exactly like the model, and you've reverse engineered it from the parameters. This would be an ungodly massive program, though.
An interesting thing is that I expect it to be really wide rather than deep. Like, I expect there to be lots of different programs that kind of happen in parallel, and then most of them are discarded and a few are used, such that there's some hope there. But this is incredibly ambitious. I don't really think it is realistic, unless we automate it enough that we can just have another LLM do it. The second sense, which I think is more realistic, is that given any input, I will be able to give you a story of the computation done in the model to produce the output.
And to me, this kind of thing is a lot more amenable to just looking at the activations at each step and doing causal interventions to understand how things depend on each other, or maybe even looking at the weights: it's a matmul, and we know how matmuls work, matrix multiplication. And maybe this is also a good thing to add to my previous answer about differences from other kinds of interpretability. To me, the neural network is some kind of substrate that represents an algorithm in its weights, and the activations are somehow each step in this algorithm.
And I want to understand as atomic a step as I can, using tools like activation patching, which I think we'll discuss more later. I don't just want to treat it as a black box going from the start to the end. I want to get as zoomed-in a view as I can, and ideally piece together how this happens and how you get from the start to the end.
Yeah, you said something
which was along the lines of, and Walid says something similar, which is that neural networks are wide but shallow, and symbolic methods are very deep and very narrow. And there's always this notion that they're quite blown-up circuits. And then, is mech interp kind of identifying an algorithm in an input-sensitive way? Because if you think about it, surely it must be a superposition of algorithms, because the entire model is this gnarly mess of all of these things mixed together. And I guess your suggestion is that, given an input example, we look at the behavior of the model and we try to infer what the algorithm was that took place.
Yeah, so there are maybe three lines of research that I think are worth calling out. There's what people might think of when they think of mech interp: reverse engineering a thing from its weights, fully mathematically understanding it. I did this for a tiny model doing modular addition; people have done this for other tiny systems.
We've made a bit of progress on language models, but I basically think no one has satisfyingly done this for language models, and it definitely does not seem to scale, so I'm a bit pessimistic on this, though I hold out some hope we can rescue it at some point. Then there's the causal intervention school of thought. That's: I think of the model as a computational graph. We have nodes; each attention head is a node, each MLP layer is a node, or maybe each neuron is a node.
And I do causal interventions where I change one node from one input to another input, and I see how that changes downstream nodes and how that change cascades to the output. And yes, this is fundamentally very input-sensitive. Often you'll do this on a fairly narrow distribution, like sentences of the form 'person A and person B went to the location, person A gave an object to', and the model completes it with person B. And, you know, that's a very narrow context in some sense.
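Here is a hedged sketch of that kind of node-level causal intervention (activation patching) using the TransformerLens library, on an indirect-object-identification-style prompt. The choice of layer and head, the prompts, and the metric are arbitrary examples for illustration, not results from the paper mentioned next.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # small model, just for illustration

clean   = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"
answer, distractor = " Mary", " John"

def logit_diff(logits):
    """Metric: how much the model prefers the correct name over the distractor."""
    last = logits[0, -1]
    return (last[model.to_single_token(answer)] - last[model.to_single_token(distractor)]).item()

# Cache every activation on the corrupted prompt so we can patch pieces of it in.
_, corrupt_cache = model.run_with_cache(corrupt)

layer, head = 9, 6  # which node to intervene on; these indices are arbitrary examples
hook_name = utils.get_act_name("z", layer)  # per-head attention outputs at that layer

def patch_head(z, hook):
    # Replace one head's output on the clean run with its value from the corrupted run.
    z[:, :, head, :] = corrupt_cache[hook_name][:, :, head, :]
    return z

baseline = logit_diff(model(clean))
patched = logit_diff(model.run_with_hooks(clean, fwd_hooks=[(hook_name, patch_head)]))
print(f"logit diff {baseline:.2f} -> {patched:.2f}")  # a big drop flags this head as important
```

Sweeping this over every layer and head, and over different activation types, is the standard way this school of thought maps which nodes matter for a given narrow distribution.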
The relevant paper there is Interpretability in the Wild by Kevin Wang. And so this is very input-dependent, or partially input-dependent, because model components kind of do many things; the jargon is that they're polysemantic. On a narrower distribution, only one of those things might turn up, but you can't make a general claim. And then the third family is work with sparse autoencoders.
These are basically a technique to take activations that are full of all kinds of stuff and mean lots of different things, and decompose them into a larger and sparser and, importantly, more monosemantic representation. You get a bunch of latents (some people call them features, but I don't really like that word, which we might get into later), and each of these latents hopefully corresponds to some concept. And by itself, this isn't giving you an algorithm; it's just studying activations.
It's just telling you what variables are there at that step. But I think there's a lot of exciting work to be done in converting that, doing circuit finding with SAE latents as your nodes. And if they truly are monosemantic, then this feels like it might get you something that I would consider more input-independent, though it's kind of complicated and there are various improvements to this, which we can discuss later.
But you can maybe think of the recent history of mech interp as: people really wanted input-independent algorithms, and that was really hard. We had lots of success with input-dependent stuff, and that seems good enough to be useful.
And a lot of people are working on that, and we're hoping we can get less and less input dependence. Also, there are a lot of tasks where you don't need to be input-independent; you can just study that domain. And I think both of these are reasonably likely to work out.
Yes. So you mentioned polysemantic and monosemantic. Maybe we should just quickly define that.
Yes. So the idea is that you can think of any number in a model as being a detector over the input: it is big on some inputs and not on others. This could be a neuron, or just any other element of an activation, or even a projection onto some direction. And we call it monosemantic if there is a shared property of all of the inputs that significantly cause it to light up, and we call it polysemantic if there does not seem to be a shared property, or there are several clusters, or maybe it's a complete mess. And this is always hard, because it's an inherently subjective definition. What does it mean to have a shared property? And my typical answer is just: I don't know, man, it's usually pretty obvious in practice. And there are some edge cases where you might mislabel a thing because you missed the pattern, but that seems basically fine. Some philosophers in the audience may be screaming at me right now.
Yeah, but isn't it one of the benefits, I guess, of neural networks that (you can call it sub-symbolic) the knowledge is entangled and distributed over many shared neurons? And I guess to interpret them, we need to disentangle them.
But maybe the brain works in a similar way as well. I mean, what you're describing here is that there's this kind of set of circuits, they get activated in some kind of task-specific way, and then we can disentangle the representations such that they have an intelligible single meaning, and then we can use that to reason about the circuit or the program that ran
in the network.
I think it is an important fact about neural networks that they can kind of do a bunch of things at the same time with the same component. Like, an attention head can do different things in different contexts. What I mean is that the same component on different inputs can do different useful things, not that on one input it can do three things at once; that's much harder. And the jargon for this is superposition. So maybe let's zoom out and discuss some empirical observations about neural networks before I try to explain what we think is going on. Empirically, a neural network's components are often polysemantic: neurons will respond to many different things. And empirically, concepts are often distributed; they often seem to be represented as linear directions in activation space. Like, if you take a linear combination of neurons, it lights up when the concept is there and doesn't light up much when it's not there.
And what we think is going on is superposition: the idea that there are more concepts than dimensions, each one is linearly represented with its own direction (I'll justify the linearity assumption more later), and these are all added together in a way where you can lossily extract one by projecting onto its direction.
There will be other things with non-zero dot product, which will interfere, but the interference is kind of tolerable: it causes a bit of error, but not so much that it wasn't worth having that feature there at all. And my guess is that this is actually a pretty important part of why neural networks are so effective. And interestingly, there's a sense in which transformers, or residual networks in general, are a lot better suited to superposition, because there's lots of addition to and reading from this shared residual stream, in a way that lets them represent a ton of things in linear superposition. Whereas if you have nonlinearities in there, it gets a lot messier.
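A tiny numerical illustration of that picture, as a sketch: more feature directions than dimensions, stored as a sparse sum and read back out by projection, with small but non-zero interference. The sizes and feature indices are arbitrary.

```python
import torch

d_model, n_features = 64, 512          # far more concepts than dimensions
torch.manual_seed(0)
# Each feature gets a random unit direction; in high dimensions these are nearly orthogonal.
directions = torch.nn.functional.normalize(torch.randn(n_features, d_model), dim=-1)

# Build an activation as a sparse sum of a few active features (superposition).
active = {3: 1.0, 170: 0.8, 411: 0.5}   # feature index -> strength; arbitrary example
activation = sum(scale * directions[i] for i, scale in active.items())

# Read a feature back out by projecting onto its direction.
readout = directions @ activation
print(readout[3].item())        # roughly 1.0: the stored value, plus a little interference
print(readout[7].abs().item())  # small but non-zero: interference from the other features
```

The interference on inactive features is the "bit of error" being described; it stays tolerable as long as only a few features are active at once.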
And this is a mechanistic hypothesis that would explain the polysemanticity and the distributed representations. There are also a bunch of different senses of superposition.
You can have representational superposition: this is when you've got some activation representing stuff, and it just squashes in more concepts than it has dimensions. For example, the embedding matrix of GPT-2 is fifty thousand by seven hundred and sixty-eight for the smallest model, so it squashes fifty thousand tokens into a tiny space, but the model still clearly knows the difference between the tokens. There's computational superposition, which is when you have something like an MLP, a matrix multiplication and a nonlinearity, and it computes more new features than it has neurons.
One example where I'm pretty confident this is going on is facts. Models know a ton of facts, and I would just be shocked if they only knew as many facts as they have neurons. These seem to be represented in some kind of computational superposition. We actually have a cute investigation called fact finding, where we tried, and basically failed, to figure out how exactly the computational superposition works mechanistically, but we're pretty confident it's going on. And the final kind would be circuit, or weight, superposition, where you have a lot of different algorithms in the parameter matrices added together.
It's less obvious to me that this is a big deal, because you've got something like n-squared parameters for every n-dimensional activation, which is more space. But it does seem to be there a bit. For example, we find that when we take sparse autoencoders, or a variant called transcoders, and multiply weight matrices together, where we think the start is interpretable and the end is interpretable, you often get connections between things that seem semantically totally unrelated but don't co-occur in practice: this thing never lights up when that thing also lights up. And we think this is just superposition representing stuff, and you get interference, which is really annoying. But it's a thing.
You know, like in the physical world we live in, you can do analysis at multiple scales: you can look at the mitochondria in your body, or you can think of yourself as an agent, or you can think of the ecosystem. And I guess it's a similar thing with neural network analysis: even the inductive prior of a transformer, you're talking about residual streams and adding things together and so on, and this is a mode of analysis.
And I think it's the case that you could take any transformer and you could represent it with a blown-up MLP. So there is an MLP that will do the same thing.
And then technically, your abstraction and analysis that you use for the transformer version would still work for the MLP, but the MLP is a completely different space. It goes back to what we were saying before, that neural networks are very wide and shallow, and that leads to this kind of confection of little unintelligible circuits. And could there exist some much more abstract decomposition of a neural network that would be far more explainable, you know, that would be a better theory of what's going on in the neural network?
So essentially, could there be a different architecture that is a lot more inherently interpretable? Maybe an architecture, or maybe just a type of analysis, that would better explain it. I guess humans need quite macroscopic priors to understand things. You know, when we get to the really low level, it seems
increasingly unintelligible.
Yeah. So I am excited about research that is trying to find these more macroscopic, higher-level things. For one, there was some interesting work with image models.
There's a paper called Branch Specialization which found the following. A fun fact about the original AlexNet is that it has two separate branches, because they had two GPUs with bad interconnect, and the branches don't really interact very often, and these ended up specializing. From what I remember, I think one was doing colors and one was doing shapes, or something. And they found that another image model that wasn't trained like that still kind of had neurons clustering into parallel branches. If we could find this kind of macroscopic structure in an LLM, that would be super cool.
A problem with this is that superposition is most effective with things that don't co-occur, because if they're both happening at once, the interference gets way worse, whereas if when one happens the other doesn't, you just need to tell that the first thing happened and the second thing didn't. This means that if you have two modules, like the biology neurons and the, I don't know, generating-news-articles-about-sports neurons or something, these would actually be great to put in superposition. So the structure isn't going to be super legible to us. And one hope is that with sparse autoencoders helping us disentangle these things, we can do more to find this kind of high-level structure. And that direction, I think, could be pretty cool for people to investigate. It's not something I've seen that much work on yet.
I mean, this just blows my mind. Have you seen that there's a type of... you can decorate tables: you put water on them, you put two electrodes on either side, and you see this kind of electric patterning that burns a kind of tree structure. It looks like a lightning bolt, like a tree structure. And, you know, it's incredible. But the thing we're talking about here is that neural networks are mostly grown, and the growth process can be influenced through inductive priors, of course. But when you really dig into it, and I'm really stretching the analogy here, it's a bit like evolution, right? Just these weird specializations and local dynamics that form during the training process. It's kind of like a living process in some ways.
Yeah. So I think evolution is actually a really good analogy. There's a sense in which biological organisms are the product of a billion-year-long optimization process.
You have evolution optimizing for something like inclusive reproductive fitness, and you randomly move around in DNA space, and you end up with the human brain. Like, what? And when we look inside biological systems, they often make sense.
There's structure in the organs, and we've learned so much about our bodies, and then there's so much that we're still deeply confused about, and also just lots of random dumb stuff, like the longest nerve, which goes from here upwards and then back down to the place it ends, which is particularly foolish in giraffes. And there's just a bunch of stuff where it's like, oh man, if I was designing this, I would not have done that. And I'd bet there's all kinds of stuff like that inside neural networks.
I mean, you'll often observe something and go: this is kind of weird and confusing, I don't know what to do with that, I'm going to move on, it probably doesn't matter that much. Or there's this kind of weird phenomenon where I don't see any real reason for it, like self-repair: if you delete a layer or an attention head, often other layers will change their behavior to compensate, to recover the performance.
And, like, what? It would make sense if this happened in models you train with dropout or something, a procedure that does that, but why does it happen in models that don't have that?
Yeah, and that's quite biomimetic as well, because the brain has the same thing. If you have a stroke, different parts of the brain can kind of take over that function. And there's a lot of self-organization and self-repair, because you have these little units that can be repurposed to do completely different things.
And one thing that I often think about is: imagine we could do counterfactual analysis with the real world. That would be great. Imagine if we could just run evolution on planet Earth again.
You know, we might not evolve, or maybe there's something evolutionarily fit about having bipedal walkers with big brains and so on. And I guess you must see this with neural networks, because you are seeing the same architecture trained different times, for longer, for shorter, on different types of data. And are you seeing the same kinds of motifs coming up again
and again?
So, this general idea is the universality hypothesis: that circuits are universal and just recur in models trained on the same thing. The strong version of this is empirically false; there are real differences between models. But a weaker form might be true, like there are some things that recur, or there's some small set of things and some sample of those appears. Like, I supervised this great paper from Bilal Chughtai called A Toy Model of Universality. Actually, an important part of that paper turned out to be a little bit wrong and was corrected by a great follow-up work from Dashiell Stander, but the universality part still stands. It was basically: we were studying algorithmic models, and in these algorithmic models there were a couple of different algorithms the model could use, like five or something, and every time we trained it, it just got a seemingly random sample of those five. And no one has studied this enough on language models to really know if something similar happens; my bet is that it probably does. And there are some recurring motifs. For example, induction heads are a super simple kind of circuit that basically lets the model do the following: if it sees the word 'Tim' in a sentence, lots of things could come next, no offence, but if it has seen 'Tim Scarfe' previously and it sees 'Tim' again, it knows that 'Scarfe' is likely to come next. Induction is just a simple two-head circuit implementing this, and it seems to show up in basically every model I've ever looked at. In the relevant paper where we studied this, led by Catherine Olsson, we looked at a bunch of internal Anthropic language models up to thirteen billion parameters, and it was there in all of them. I've got a scrappy blog post doing the same thing, looking at around forty open-source models, and it's also there. So at least sometimes universality seems true. I think sparse autoencoder universality is a pretty interesting direction: how do the features compare between models and training runs and datasets? Comparing code models to normal language models could be cool.
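A quick behavioural check for induction, in the spirit of what's described here (a sketch, with an arbitrary choice of model and sequence length): repeat a random token sequence and compare the loss on the first copy with the loss on the repeat. The repeat has no real-world structure, so the only way to predict it is to copy from the earlier occurrence.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # any model should show the same pattern

# A sequence of random tokens repeated twice, with a BOS token prepended.
seq = torch.randint(1000, 20000, (1, 50))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, seq, seq], dim=1)

loss_per_token = model(tokens, return_type="loss", loss_per_token=True)[0]
first_half, second_half = loss_per_token[:50].mean(), loss_per_token[50:].mean()
print(f"loss on first copy {first_half:.2f}, on repeat {second_half:.2f}")
# The repeat is far easier: evidence of induction-style "A B ... A -> B" behaviour.
```

The circuit-level version of this analysis looks for heads whose attention pattern points from each token back to the token after its previous occurrence, which is the two-head mechanism described above.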
Yeah, I mean, this is another thing I think about a lot, this concept of almost platonic knowledge: that the universe might be generated by some kind of computer program, and we somehow acquire that knowledge. Which does lead to the thought experiment that if there were another civilisation... I mean, the world works a certain way, and I'm pragmatic about it.
I think a lot of knowledge is constructed and social and relativistic, but it feels like there are some ground rules around how the universe works. And on the knowledge thing as well:
you were just talking about facts. Right now, neural networks like language models are incredibly good at memorising knowledge, but what they don't have is that degree of certainty; they don't have epistemic faithfulness. Some people have done things like retrieval-augmented generation and so on. But do you think that's an in-principle problem, or do you think that in the future there could potentially be more certainty about what knowledge the model has?
So I would actually say that I think we're already seeing some meaningful progress here. I think there are two problems here that are important to distinguish. There's "the model knows something, but it's a false fact" versus "the model doesn't know anything, so it falls back on the general language model prior and just babbles". I personally consider the first one out of scope, and I consider the second one to be what we mean by hallucination, mechanistically.
I don't expect the first one to ever fully go away, in the same way that I know I have lots of false beliefs; sure, smarter models will have fewer false beliefs, but I don't think that's going to fundamentally go away. And I think that, mechanistically, just making stuff up is different. There was this great Nature paper recently from Sebastian Farquhar on semantic entropy, which is basically: you generate a bunch of things, you group the ones that mean the same thing together, and then you take the entropy of that distribution. And this turns out to be a pretty good sense of how uncertain the model is. There was a fun follow-up paper using it to train a probe, which seems to do a decent job of predicting when the model hallucinates. And I'm currently supervising a project from Javier Ferrando and Oscar Obeso trying to understand the mechanisms of hallucination in more detail. They found this super cool thing, in work we haven't released yet, like an entity-detection circuit: the model has sparse autoencoder features for "I recognise this movie name" and for "I don't recognise this movie name", and these are causally relevant to whether the model will say "I don't know" or just babble, or tell true facts versus babble for one it does know about. If it knows about the movie, messing with this feature makes it say "I'm sorry, I can't help you"; if it doesn't know, it would normally say "I can't help you", but this can get rid of that and it will instead babble. So there seem to be real mechanisms here, and I think there's a lot of progress that can be made.
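(To make the semantic entropy idea concrete, here's a toy sketch of mine, not the Nature paper's code: sample several answers to the same question, group the ones that mean the same thing, and take the entropy of the resulting distribution. The `same_meaning` check is a crude stand-in; the paper uses bidirectional entailment with an NLI model.)

```python
import math

def semantic_entropy(answers, same_meaning):
    """answers: sampled model outputs for one question.
    same_meaning(a, b) -> bool: stand-in for the paper's entailment-based clustering."""
    clusters = []
    for a in answers:
        for c in clusters:
            if same_meaning(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    probs = [len(c) / len(answers) for c in clusters]   # probability mass per meaning
    return -sum(p * math.log(p) for p in probs)         # high entropy ~ likely confabulation

# Toy usage, with exact string match standing in for "means the same thing".
print(semantic_entropy(["Paris", "Paris", "Paris", "Lyon", "Rome"], lambda a, b: a == b))
```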
That's really interesting. You know, I was interviewing Akbir Khan, and he's a really cool guy. He had these debate papers, one of which won an award at ICML, and essentially it was about having a kind of pool of agents, almost like a judge and a couple of critics, getting them to argue it out over ten iterations to get closer to the truth, right? And we see now with reasoning... I mean, to me, reasoning is about closing a knowledge gap.
So I don't know something, and the trick is the model telling itself: I need to start prompting myself to reason now, because I know that I don't know. And that's the thing, isn't it? Do you think, in principle, a model
might know that it doesn't know? I think yes. I mean, the work I just described is an example of such a thing: it's distinguishing between entities it knows and entities it doesn't know, at least in some narrow domains like movies. In the fact-finding project I mentioned, we also found that when you give the model a fake athlete's name, it does seem to act differently than when you give it a known athlete's name, which was actually kind of interesting. We found that the early MLPs would still generate a kind of sports direction, but the attribute-extracting attention heads, which look at the athlete's name where the facts are looked up and move them to the end, wouldn't look at the unknown names, even though the MLPs were still producing a hallucinated sport. Which is kind of cool.

I wonder what the role
of reasoning and thinking out loud here is, because, you know, again, we can debate whether or not the models do it internally or whether it's a form of externalisation. So the model thinks in a system-two way, and that's almost how it reconciles what it knows internally into some kind of calculus it can reason with and perform rationalisation on. I guess, from a mechanistic interpretability point of view, how does letting the model perform some reasoning before you then
analyse it change things? Say you ask it for a fact, it gives an answer, and then it does a bit of introspection on
the answer. Yeah, my intuition is that perhaps the model on its own is quite limited, and if you let the model ponder and consolidate and think about what it knows, will it be able to better know what it knows and what it doesn't know?
"Inference-time compute is helpful" feels like a pretty uncontroversial statement now. Honestly, I don't know if I've got a more interesting answer than that: it will have more chances to notice something going wrong.
And I don't really know what the circuit for "is this fact true?" looks like, but it wouldn't surprise me if it's at least a bit different from the fact-recall circuit. And there are some facts where it can identify that something is false without actually knowing the answer: it just kind of babbles something, it wants to say something, but then it looks back and thinks, that's probably false, I'm not really sure. I don't know if anyone has really looked into this, so I'm purely speculating right now.
But yeah, I mean, the intuition I have is that it feels like a lot of the reasoning we do is actually a form of tool use. I mean, language is a tool, much like we were saying earlier, the mental equivalent of a physical tool. And we learn how to reason; we learn it at school. We learn to apply all of these different rationales.
And applying these rationales in token space kind of helps us make sense of what we know deep down in our minds. And maybe there's an analogy there for language models, I don't know.
Possibly. I mean, I think there's a kind of deduction you can do on certain facts: like, I claimed this, but maybe I recall a bunch of facts about the claim I just made and see if any of them have any bearing on it. And a general fact about language models is you only get one pass through, unless you're doing some kind of chain of thought or sampling.
So this means that you can't do that much computation in a single forward pass, and you can do much more if you pass on a bit of information you got from the first bit of processing to the second bit, the third bit, and so on. Or possibly there might not even need to be communication between the different bits of processing; it's just easier for them to happen on separate tokens, so they don't interfere with each other by all happening in the same place.
So for folks at home who want to get into mechanistic interpretability: this is still quite a nascent field, although, as you said, it's getting much more mature. What could folks at home do to get started?
Yes. So I think the field has grown, but there are still a lot of cool problems to work on and a lot to be done. In terms of reading papers, I have a reading list that we can put in the description; you can also just Google "Neel Nanda mech interp reading list" and you'll find it. I also have a guide to getting started in the field, though it's unfortunately a bit dated. And ARENA has a fantastic set of coding tutorials that we should link to as well. I'd basically recommend maybe skimming a paper or two to see how interested you are, doing the ARENA tutorials to get your hands dirty and understand the tooling, and then doing a mix of reading papers and doing experiments riffing off those papers. I also think it's a lot easier if you have collaborators, or at least people to chat to; the EleutherAI Discord, the Mechanistic Interpretability Discord and the Open Source Mechanistic Interpretability Slack are all great places for that. And yep, I encourage people to just do a small project, write a blog post about it, put it on your website or somewhere like LessWrong, put yourself out there a bit, try to get feedback from people. But more importantly, just get your hands dirty, actually try things and follow your curiosity rather than just
reading fifty papers.

So what is activation
patching, and what are contrast pairs? So the goal the activation patching technique is trying to achieve is attributing some model behaviour, in particular some kind of numerical output like the log prob of the correct answer, to some model component on some data distribution: how important is this attention head, or this SAE latent, or this neuron, for the model answering something? There are a lot of ways you can do this, but generally you want to be causally intervening on the component to change its value.
But if you've got a chunky component like a head or a layer, which has quite a big output, a vector in a high-dimensional space, it's not clear what you should replace it with. The default thing would be to just replace it with zeros, like what dropout does; this gets called knockout or ablation. The problem is that this can sometimes just break the model, because it's very off-distribution. So there will be components that aren't relevant to a task
but act more like a bias term or something. So, for example, in GPT-2 Small, MLP 0 is basically always used to enrich the tokens; it doesn't seem to do much else. But if you delete it, everything breaks, because its output is kind of added to the embedding to form the effective embedding that everything else sees.
And so the next level would be mean ablation: you just replace it with the mean over some corpus, and I think this is a lot more reasonable. But an even more interesting thing you can do is activation patching, where you have two inputs that are similar but differ in a key detail. For example, "The Eiffel Tower is in the city of" and "The Colosseum is in the city of"; these will have different answers, Paris and Rome. You have some metric, say the difference between the log prob of Paris and Rome, which is quite nice because that's equal to the logit difference of Paris and Rome, because maths. And then you swap some activation value from the Paris prompt into the Rome prompt, and you see whether that makes the other prompt say Paris.
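(The "because maths" step is just that the softmax normaliser cancels when you subtract two log probs from the same distribution; a worked note of mine, with the $z$'s being the final-position logits:)

$$
\log p(\text{Paris}) - \log p(\text{Rome})
= \Big(z_{\text{Paris}} - \log \sum_j e^{z_j}\Big) - \Big(z_{\text{Rome}} - \log \sum_j e^{z_j}\Big)
= z_{\text{Paris}} - z_{\text{Rome}}.
$$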
The first one is called denoising, because you can think of the Rome input as the bad, noisy input, and you're denoising one component, replacing it with the clean thing. And patching from Rome into the Paris prompt and seeing if it damages "Paris" is called noising, because you're kind of messing up one component and seeing if it's important. And the really nice thing is that you get your choice of baseline. A pair of prompts like this is called a contrast pair, where you want them as close as possible apart from some key detail, because this means that things like the "I'm doing factual recall right now" feature is still there, the "I want a city" feature is still there, but the "which city is it in" bit is not still there. And you can have different prompts that make different changes.
For example, "The Eiffel Tower is in the country of": now you're analysing the relationship part of factual recall. So activation patching gives you this really fine-grained tool for analysing different kinds of information. The denoising-versus-noising distinction is actually quite important. You can think of noising as asking: is this bit necessary for the computation, or at least was it used; if I get rid of it, does anything get damaged? You can think of denoising as: was the thing sufficient? Like, does the output of this node, from then on, cause the output we want?
This does not mean it's the only relevant node, but it means it's enough of an information bottleneck to contain the key information. Like, if you have three steps in the process, denoising any one of those steps should be enough. And you want these in different situations. So, for example, say you think you have found a circuit: a three-stage thing with a few nodes at each stage.
You can test this by either noising everything not in the circuit and seeing how badly it breaks, which is a way to test the whole circuit at once, or you can denoise each slice of the circuit at a time and see if that is sufficient. It doesn't make sense to denoise two slices, because the second slice kind of doesn't care what the first slice is doing, since you're patching in its values.

And yeah, I think activation patching is a really cool and powerful technique. It masquerades under many, many names: causal mediation analysis, in the paper where it was first used, from Jesse Vig in 2020; or interchange interventions, from Atticus Geiger's work; or resample ablation, which is mostly just used for the noising part; or causal tracing, which is just used in the ROME paper. So many names, it's really annoying. I personally like "activation patching" and try to converge on that name. For people who want to learn more about it, I wrote a tutorial piece with Stefan Heimersheim called "How to use and interpret activation patching". It's not really a research paper; it's just an intuition dump of how to think about this technique. And I think it's a pretty powerful tool that's useful in a bunch of settings when you're trying to understand model components.
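(As a rough illustration of the denoising setup just described, here's a minimal sketch of mine using plain PyTorch forward hooks on GPT-2, not anyone's official tooling; in practice a library like TransformerLens makes this cleaner. It caches one layer's output on the clean Paris prompt, patches its final position into a run on the corrupted Rome prompt, and reports the Paris-minus-Rome logit difference; the layer index is an arbitrary choice.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is in the city of", return_tensors="pt")
paris = tok(" Paris")["input_ids"][0]
rome = tok(" Rome")["input_ids"][0]

layer = model.transformer.h[6]     # which block to patch: an arbitrary choice
cache = {}

def save_hook(module, inp, out):
    cache["resid"] = out[0].detach()              # GPT-2 blocks return a tuple

def patch_hook(module, inp, out):
    hidden = out[0].clone()
    hidden[:, -1, :] = cache["resid"][:, -1, :]   # patch only the final token position
    return (hidden,) + out[1:]

def logit_diff(logits):
    last = logits[0, -1]
    return (last[paris] - last[rome]).item()

with torch.no_grad():
    h = layer.register_forward_hook(save_hook)
    clean_score = logit_diff(model(**clean).logits)
    h.remove()

    corrupt_score = logit_diff(model(**corrupt).logits)

    h = layer.register_forward_hook(patch_hook)
    patched_score = logit_diff(model(**corrupt).logits)
    h.remove()

print(f"clean {clean_score:.2f} | corrupt {corrupt_score:.2f} | patched {patched_score:.2f}")
# If patching recovers most of the clean logit diff, this layer's final position
# carries the "which city" information (a denoising / sufficiency test).
```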
One final thought on that: if you're using this in practice on a larger model, it's often quite expensive to activation patch everything. So the technique I recommend is attribution patching, where you basically approximate it with gradients. I wrote a blog post on this, and my team put out a paper called AtP*, led by János Kramár, measuring in detail whether this is a legitimate technique that works and providing some improvements, especially for dealing with attention layers. And because it uses gradients, you can patch everything in a single backwards pass rather than needing a separate forward pass per patch.
So this can lead to pretty enormous speedups, but it does have accuracy problems, especially near the input. My intuition is that the embedding space of models just isn't smooth in the same way the internals are later on, so gradients tend not to work as well, because it's like fifty thousand discrete points in space, and why would that be locally linear?
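(And a sketch of the attribution patching approximation, mine rather than the AtP* code, reusing `model`, `layer`, `clean`, `corrupt`, `paris` and `rome` from the patching sketch above: the effect of denoising a component is approximated by the dot product of (clean activation minus corrupted activation) with the gradient of the metric, taken on the corrupted run, so one forward-plus-backward pass stands in for a whole sweep of patching runs.)

```python
def run_with_cache(batch):
    store = {}
    def hook(module, inp, out):
        out[0].retain_grad()          # keep .grad for this intermediate activation
        store["act"] = out[0]
    h = layer.register_forward_hook(hook)
    logits = model(**batch).logits
    h.remove()
    return store["act"], logits

clean_act, _ = run_with_cache(clean)
corrupt_act, corrupt_logits = run_with_cache(corrupt)

metric = corrupt_logits[0, -1, paris] - corrupt_logits[0, -1, rome]  # logit difference
metric.backward()

# Linear estimate of "what would denoising this layer's final position do?"
delta = clean_act[0, -1].detach() - corrupt_act[0, -1].detach()
approx_effect = (delta * corrupt_act.grad[0, -1]).sum().item()
print(f"attribution patching estimate for this layer: {approx_effect:.3f}")
```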
Quick digression on our friend Grant Sanderson; I'm one of his biggest fans, he's so great.
For people who don't know, Grant runs the YouTube channel 3Blue1Brown. They make great, fun videos; Grant mostly does maths videos, but he's also been doing AI videos.
And one day I was like, mech interp is full of pretty ideas and visuals and is kind of an unusually good match for an AI topic; why don't I cold email him and see if he's interested? And we had a lovely chat.
One of the things he was thinking about was how to make his transformer MLP video, and he was looking for a good motivating example. I thought that my fact-finding work was actually a good example, and he agreed.
And so he discusses some of that. It's also just a really great video and channel, and you should just watch all of his videos, to be honest. You should especially watch that one, because it also has a bunch of stuff on superposition and how to think about it, and it's way better animated and just better than hearing me talk about it.
No, don't pause it; wait until the end of this video, then go over there. A quick question on Gemma Scope as well. So, as I understand it, I guess this was inspired by the original Microscope project, which was great,
that OpenAI was involved in? Yeah, I think so, yeah.
And have you done something similar for inspecting Gemma? How does
that work? So we did this project called Gemma Scope, and this is basically a family of several hundred open-weight sparse autoencoders on Gemma 2, because sparse autoencoders are a pain to train for reasonably sized models, and we thought this would enable better academic mech interp research. There's a website called Neuronpedia, who we are not affiliated with, but they're great and I love them, and they do things like have a page for every latent direction in a sparse autoencoder, with the text it activates on and a big explanation of what it does and things like that. And they kindly made this gorgeous interactive demo for us, which might be the thing you're thinking of that's similar to Microscope. Gemma Scope itself is actually totally different from Microscope; we just couldn't think of a better name, but I like the name.
So the thinking behind the name is basically: I kind of think of a sparse autoencoder as a microscope for understanding a language model. You pick an activation, you zoom in, you expand it into a sparser and more interpretable form, and you analyse that. This analogy doesn't quite engage with the fact that you then use this to make a reconstruction of the input, but I think it gets the intuition across. And this is a microscope for Gemma, so, Gemma Scope.
Tell me about sparse autoencoders.
Yes. So, OK. I think to understand sparse autoencoders, we need to first begin with the problem they're trying to solve. You can think of the neural network as being made up of a bunch of layers, and you pass in some input.
It gets converted to a vector, or a series of vectors in the case of the transformer, and then each layer transforms this into a new vector or series of vectors. These are the activations in the middle. And often the layers will themselves have intermediate activations in the middle of them.
So you could argue that this isn't really one layer but actually several smaller layers, but whatever. We call each of these intermediate variables an activation; it's a vector. We believe that these activation vectors often represent concepts, features, properties of the input, something interpretable, an intermediate state in the model's algorithm. But it's just a vector.
We need some way to convert it into something that is meaningful, and sparse autoencoders are a technique that tries to do that. They basically decompose the vector into a small linear combination of vectors from some big list, sometimes called a dictionary, of meaningful feature vectors. We hope that these feature vectors correspond to interpretable concepts, and it's sparse in the sense that most vectors are not part of this combination on any given input.
Now, many people will have seen the Golden Gate Bridge example.
I love him so much.
He was a good guy, he was so quirky. Maybe just explain that for folks at home who haven't heard about it, but that was what really brought sparse autoencoders into the mainstream; everyone heard about that.
Yeah. So, OK, Golden Gate Claude. It's a bit complicated; it's very simple in some sense, but I think the interesting takeaway is really more complicated. So, Golden Gate Claude: people at Anthropic took Claude 3 Sonnet,
their medium-sized language model at the time. They found the sparse autoencoder feature for the Golden Gate Bridge, just one of these vectors, and then they clamped it to a high value; meaning, they knew that normally it was somewhere between zero and three, and they set it to thirty.
And this made the model obsessed with the Golden Gate Bridge. It would do things like write recipes that involved a mile-long walk along the beach and things like that. And this was just really fun to play with.
And they had a research demo up for twenty-four hours. I think a common misconception about Golden Gate Claude is that sparse autoencoders were necessary to create this. There's another, kind of similar technique called steering vectors, where you do something like: give the model a bunch of prompts about the Golden Gate Bridge, give it a bunch of prompts about, say, London Bridge or something,
take the difference in activations and average it up. You've got a vector; you add that in. And it's unclear to me whether this would have been better or worse than Golden Gate Claude; no one has really looked into this to my satisfaction. But to me, the exciting part of Golden Gate Claude is less the fact that you can achieve this technical feat, because I believe simpler methods, even a system prompt, would plausibly have achieved it.
The exciting thing is that it shows that sparse autoencoders were doing something real. They found a concept inside the model. It was exciting that it corresponded to the Golden Gate Bridge, because that latent variable systematically lit up on things to do with the Golden Gate Bridge, even pictures of the Golden Gate Bridge, or descriptions in different languages.
And this was causally meaningful. It's very easy to trick yourself by finding a thing that correlates with what you care about but isn't actually what the model is using, and this leads you to a mistaken view of its internals.
But by setting this to a high value, they just obviously made the model's behaviour different in a really interesting way. The behaviour seems kind of qualitatively different from what you'd get from a system prompt, which is also really interesting, though people have observed qualitatively similar stuff from steering vectors. So I think it's more that the act of intervening with a vector inside the model is a powerful thing that we should be exploring.
Yeah, and I was thinking: these models, we only know them to be sycophantic, they're trained with RLHF, and they do what we want them to do. Yet you manipulate their internals and all of a sudden this thing has got a mind of its own.
What do you mean by a mind of its own?
Well, it comes back to this kind of wants, desires, motivations, intentionality type of thing. We only know these models to be very sycophantic, and you modify their internals in the way you just described, and now what it does is kind of divergent from what you
put into it. Yeah, so OK, here's roughly how I think about this. Models are good at simulating different personas, often called the simulators view. They're pretrained on the internet and tons of things; they learn a very diverse range of things, which includes the ability to adopt a diverse range of personas. When they're RLHF'd, or whatever kind of post-training people use nowadays,
you're kind of trying to select a persona for it, and many companies go with this fairly agreeable assistant, what you're referring to as sycophantic. And that is a persona, but I think it's in some sense kind of fragile. They'll often be trained not to move out of this persona easily, but the jailbreak phenomenon shows that it's often not that hard with the right prompts. And if you fine-tune a model to have a new persona, that seems very easy to do. And this kind of interpretability-based intervention is just another thing of that kind.
But it's not at all like interpretability has given us a new capability we didn't have before. It's more that breaking the persona and giving it a new persona was a thing we already knew how to do with existing tools; this is just an interesting new tool that's going to have some different properties that are cool and worth exploring.
But it does seem to indicate a weakness with RLHF, because, as you say, the simulators view is that the model is a kind of superposition of simulacra or role players, and RLHF selectively deletes those role players, just leaving the harmless sycophantic ones. And this rather leads to the conclusion that it doesn't really delete many of them; it's quite a brittle way of making the model present in a certain way, and actually all of these other role players are just there, hidden beneath the surface, and you
can activate them. I completely agree. I think that is a straightforwardly true statement about the current state of our ability to post-train models. Whether it's solvable is kind of unclear. Another fun data point here is that I supervised this paper from Andy Arditi called "Refusal is mediated by a single direction".
We found that you could find a refusal vector by taking prompts like "how do I build a bomb?" and prompts like "how do I build a car?", and taking the average difference; it's unclear if this is a refusal vector or a harmful-question vector or whatever. And then you just delete this direction in the residual stream, like you set the projection to zero, and this means the models no longer refuse. This works in a bunch of open-source models, and it's such a simple intervention.
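(The "set the projection to zero" step is just vector arithmetic; a tiny sketch of mine, where the direction is a hypothetical tensor you'd have computed from the prompt averages:)

```python
import torch

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of resid (..., d_model) along direction (d_model,)."""
    v = direction / direction.norm()
    return resid - (resid @ v).unsqueeze(-1) * v
```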
So can you explain the linear representation hypothesis?
Yeah. So a surprising empirical observation that we've seen many times with models is that concepts will be represented as linear directions in activation space. You can think of this as a combination of neurons, but I generally recommend thinking of it more as a thousand-dimensional space where the neurons are the standard basis, but you can pick whatever direction you want. Empirically, it's often the case that concepts seem to be detected by linear directions. The simplest case of this is when you find one neuron that seems to only light up on certain things; there have been a ton of papers about this.
One of my favourites is the curve detectors paper from Chris Olah, from when Chris Olah was at OpenAI, which found a family of neurons in an image classification model that seem to only ever activate on curves with a certain orientation. But, as I've discussed, this often doesn't work because of polysemanticity. There have been other works that have found directions that are not basis-aligned. One particularly fun strand is the lots of papers that try to find truth directions, and there's kind of an entire field of linear probing, where you basically learn a direction such that, when you project onto it, you detect some concept, given a labelled dataset.

One of my favourite examples of this is a great paper from Kenneth Li on Othello called Emergent World Representations. Othello is a board game, like Go or chess. He trained a model on games with randomly chosen legal moves, so it's just kind of chess-notation-style stuff: you see "played cell 63", then "cell 1", then "cell 17", et cetera, and the model became good at playing legal moves. And he found that you could actually probe for the state of the board, but what he found is that a linear probe didn't work, while a nonlinear probe, like a one-hidden-layer MLP, did.

In some follow-up work, what I found was that instead of representing it in terms of black and white, the model represented it in terms of "does this cell have the current player's colour or the current opponent's colour", because that is an algorithm that's useful on every move, rather than needing to act differently for black and for white. And when you train a linear probe for that, you find it just works, and you can even causally intervene with these probes. I bring up the example not because I want to brag about my papers, but because I think this was a really interesting natural experiment, where there was an initial paper that seemed to provide some legitimately good evidence for a nonlinearly represented thing.
And then in follow-up work, it turned out there was actually a linear representation hiding beneath the surface. Obviously this is all kind of anecdata; we don't have a fully principled study of models showing that every concept is linear.
There was a fun paper, I think from Róbert Csordás, that showed an example of a nonlinear feature in an RNN, so this definitely can in theory happen. My current guess is that most of a language model's computation is linearly represented, quite possibly all of it, but it wouldn't surprise me if there's some weird dark matter hiding beneath the surface. That's a bit of a digression, but the key thing to take away is that many concepts inside language models seem to be represented as linear directions in activation space, but we don't necessarily know what these directions are.
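(A minimal linear-probing sketch on synthetic data, my own toy example rather than anything from the papers mentioned: plant a concept direction in fake activations, fit a logistic-regression probe, and check that the learned direction lines up with the planted one.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512
concept_dir = rng.normal(size=d_model)                  # the planted "true" feature direction

labels = rng.integers(0, 2, size=2000)
acts = rng.normal(size=(2000, d_model)) + labels[:, None] * concept_dir

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
learned = probe.coef_[0]
cos = learned @ concept_dir / (np.linalg.norm(learned) * np.linalg.norm(concept_dir))
print("probe accuracy:", probe.score(acts, labels))
print("cosine with planted direction:", round(float(cos), 3))
```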
And we should talk about steering vectors as well. You just sort of alluded to them,
but give me an example of that. Yeah, so the idea is essentially that you take some prompts with a property and some prompts with the opposite property, or without that property, and you take the difference. For example, you could take "I love you" minus "I hate you", and this produces a kind of fluffy, loving vector that you can add in. Then you take a neutral prompt like "I went up to my friend and said", and it says really happy, excited things, or if you subtract the vector, it says really angry, hateful things. And this works in a bunch of settings. I first saw this on language models in Alex Turner's activation addition paper and Kenneth Li's inference-time intervention paper; Alex did a bunch of stuff like sentiment and weddings, Kenneth focused on truth.
There was also the representation engineering paper, which did a bunch more settings, and it just seems to work pretty broadly. And so, drawing this back to the linear representation hypothesis: the key takeaway here, in my opinion, is that if the linear representation hypothesis is true, this kind of subtraction of two closely related things should isolate the feature you care about. It might also pick up some other stuff, but if you average, you probably wash that out while preserving the thing you care about.
And then you can just add that in. Another consequence of linearity is that you can just add in more features and the model will process what it means reasonably: models can compose concepts that might not have come up during training, and there's circuitry that can deal with this reasonably. A caveat with steering vectors is that a key hyperparameter is the coefficient of the vector.
Typically you want to make it bigger, but if it's too small, it just does nothing, and if it's too big, the model goes mad and spouts gibberish. No one has really studied exactly why, but I think it's that if it's too big, then when LayerNorm scales down the residual stream to be roughly unit norm, because a large fraction of it is now the steering vector, everything else gets small, and this drowns out everything else. And when you multiply by the neurons,
even if the steering vector direction doesn't interfere too much with any given neuron, when it's got such a big coefficient, that interference is actually really bad. But yeah, you can drive models mad; you've got to get it right.
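(Here is a rough sketch of the contrastive steering recipe just described, mine, on GPT-2 with plain forward hooks; the prompts, layer and coefficient are arbitrary illustrative choices. It averages the residual-stream difference between a positive and a negative prompt at one layer, then adds a scaled copy of it back in while generating.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layer = model.transformer.h[6]

def resid_at_layer(prompt):
    store = {}
    def cache(module, inp, out):
        store["act"] = out[0].detach()      # hook returns None, so output is unchanged
    h = layer.register_forward_hook(cache)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    h.remove()
    return store["act"].mean(dim=1)          # average over token positions

steer = resid_at_layer("I love you") - resid_at_layer("I hate you")
coeff = 8.0                                  # the key hyperparameter discussed above

def add_steering(module, inp, out):
    return (out[0] + coeff * steer,) + out[1:]

h = layer.register_forward_hook(add_steering)
ids = tok("I went up to my friend and said", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=True, pad_token_id=tok.eos_token_id)
h.remove()
print(tok.decode(out[0]))
```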
Yeah, I mean, you're kind of pointing to a sort of stability analysis, like there's a critical point, and beyond that the dynamics of the model decohere. That sounds like a fascinating research problem.
Yeah. I think a project that would be really cool to see, and I'm mildly surprised I haven't seen done properly yet, is a kind of chat language model interface with a bunch of steering vectors: either the kind I just described that you get from prompts, or SAE feature directions, or even something you optimise for this purpose.
These could be things like creativity, factuality, verbosity, things people care about in their assistants, how formal versus informal it should be, things like that. And you have sliders that people can move.
And this just changes how the assistant does things. Neuronpedia has a kind of MVP version of this, which is the best I've seen, but I feel like someone could make a really cool, polished version of it, and I haven't seen that yet.
How might those directions interfere with each other? You know, you might have faithfulness and morality and "don't do bad stuff". Do you think there could be weird interactions between them?
So it is in general true, as far as I can tell, that if you try adding a bunch of steering vectors at the same time, the coefficient needed to break the model is much lower. I think this will be a real challenge for getting this to work. Maybe you should fine-tune the directions to not interfere with each other or something,
especially if you're using the kind of optimised direction approach. There was a cool paper called BiPO that seems to have made more effective vectors by doing this. And yeah, in terms of semantic composition, I kind of expect that to be non-trivial, because the vectors kind of interfere with each other, or maybe there's some circuitry that gets confused, or you push the model in different directions.
I mean, in some sense you saw this with Golden Gate Claude. You ask it not to talk about the Golden Gate Bridge, so it doesn't want to, but you make it talk about the Golden Gate with the Golden Gate vector, and it gets really confused and agonised.

Yes. I mean,
my intuition is that when you modify the behaviour of the model in this way, you're kind of pushing it out of distribution, which means you might see a commensurate decrease in capabilities. I mean,
have you seen anything like that? I mean, yeah, steered models are kind of janky: the grammar is sometimes worse, or they'll say weird stuff, or start to spit out random tokens. And generally you can find a kind of good coefficient where it does the thing you want without going mad.
But even then, I would still expect it to have some degradation. I don't know if anyone has really studied this. With Golden Gate Claude it's kind of hard to study, in a sense, because normal models are not supposed to constantly talk about the Golden Gate Bridge, but Golden Gate Claude is, and you need a test that doesn't unfairly penalise it for that. I think you could look at things like its MMLU performance; that's an interesting thing people should do.
So, mathematically, what is the sparse autoencoder doing?
Yeah, so a sparse autoencoder is trying to solve two similar problems: the sparse coding problem of finding this meaningful list of vectors that we think correspond to concepts in the activation space, which is just a fixed list that doesn't depend on the input; and the sparse approximation problem, which is finding a sparse vector of coefficients over this list of vectors that can reconstruct the input.
And there's a whole field of study on the right way to do both of these. The idea is that it's basically a two-layer neural network where the middle state is much wider than the input, and typically it has some activation function.
The simplest case has a ReLU, though we can discuss other ones later. You feed in the activation, then some of the hidden latents light up; because of the ReLU, most are zero. For the ones that light up, you multiply by the decoder vector, the vector in the output weights, which is the thing we hope is the meaningful feature vector, and that produces the output. We train this to reconstruct the input on a bunch of real model activations, typically in the hundreds of millions to billions of tokens, and we have some kind of sparsity penalty, such as an L1, on the hidden activations.
An interesting fact about two-layer MLPs, both transformer MLP layers and autoencoders, is that you can think of each neuron as an independent unit: each one has an encoder vector and a decoder vector. You project the input onto the encoder, you apply the activation, you multiply by the decoder, and then you add them all up. I refer to these units as latents. They're sometimes called features, but I personally find that a bit confusing, because "feature" as a word means an interpretable thing, and latents are sometimes but not always interpretable, and I think it's confusing to assume they are. But yeah, we take these latents, and the hope is that each latent corresponds to an interpretable concept.

People hearing this might think it's a bit odd: you've purely optimised for reconstruction and sparsity, you never had interpretability in the loss function, so what gives? The hope behind sparse autoencoders is that there is a true sparse decomposition, that there are true feature vectors and the activations are a sparse linear combination of those, such that if we just optimise for finding a good sparse decomposition, it will be at least pretty close to the real one, and those are interpretable. And empirically, this often seems to work.
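(To make that concrete, here's a minimal sparse autoencoder sketch, my own rather than any lab's training code: a wide two-layer network with a ReLU, trained to reconstruct activations with an L1 penalty on the hidden latents. The sizes, penalty weight and random "activations" are purely illustrative.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)    # each row ~ an encoder vector
        self.decoder = nn.Linear(d_latent, d_model)    # each column ~ a feature vector

    def forward(self, acts):
        latents = torch.relu(self.encoder(acts))       # sparse, non-negative coefficients
        return self.decoder(latents), latents

d_model, d_latent, l1_coeff = 512, 8192, 3e-4
sae = SparseAutoencoder(d_model, d_latent)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(100):                                # stand-in for streamed model activations
    acts = torch.randn(256, d_model)
    recon, latents = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```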
A lot of people at home will know about autoencoders, going back to the days of MNIST, and typically autoencoders are thinner in the middle than they are on the in and the out, because you're telling the autoencoder to entangle and compress and bottleneck the information. Whereas this seems to be the opposite: you're telling it to disentangle and to blow up what's in the middle. But it seems to suggest that the model knows it has entangled features together and it wants to, in this setting, disentangle
them. So on the first part of your point, you are completely correct. Generally autoencoders are: you have an input, you want to somehow pass it through a constrained bottleneck in a way where the bottleneck is more useful to you, like it's smaller or maybe it's disentangled or whatever, and then reconstruct the input. And often the reconstruction is just a forcing function to make the bottleneck interesting. But really, I think you should just think of this as a bottleneck with constraints.
It would be dumb to remove the sparsity penalty and train an autoencoder with a wider middle, because it's too easy: you could just have a latent for every direction in the standard basis, a positive one and a negative one, and it will perfectly reconstruct things, and I'm like, yeah, that's boring. But sparsity is actually quite a big constraint. An intuition for this: let's say I give you a list of a thousand vectors, and you're in a two-hundred-dimensional space. If I only let you have a 1-sparse combination, that's just a thousand lines through space; that's a tiny fraction of the dimensionality. If I let you have a 5-sparse combination, you have a bunch of five-dimensional subspaces unioned together, and even a hundred-sparse combination is still, mathematically, measure zero; it's an infinitesimal fraction of the larger space, even though it's plausibly kind of close to many points in the space, I'm not entirely sure. But sparsity is just a pretty big constraint.
And the fact that we're forcing it to satisfy that means the autoencoder actually has some pressure on it. Regarding "the model knows": what does it mean for the model to know something? If I try to train a linear probe to detect sentiment, this will probably work; it empirically does work, and that direction is probably in superposition with other things. And, like, does the probe know that it's entangled? That's kind of a point of order.
I think where that was coming from is, you know, we spoke earlier about how the models are not reversible. As these things become entangled together, the model shouldn't in principle know how to disentangle them.
Ah, so it's more of a mathematical question: how is this even possible?
Yes, it's almost like they shouldn't be invertible. These operations, it shouldn't be possible to undo them, going from, you know, right to left, if that
makes sense. Yes, that's a good question. So the way this works is sparsity as a constraint again. If I give you a list of a thousand vectors in a two-hundred-dimensional space and I tell you, here's a vector in that two-hundred-dimensional space, it's a linear combination of some of these thousand vectors, which ones? That's impossible to answer, according to standard linear algebra.
That's a non-invertible function, because there are lots of linear combinations that equal zero, which you could add freely. But there's probably no sparse linear combination that equals zero, and this means it constrains where vectors can be. And this means that if you have a sparse linear combination, it will often be the case that, for example, if you project onto every vector in the set, you'll have a much higher projection on the ones that are actually part of your combination.
And you can, in fact, train sparse autoencoders with a tied encoder and decoder, so for each latent you literally dot product with a vector, apply a ReLU, multiply by the same vector, and try to reconstruct the input. Empirically, they perform better if you don't tie them. The intuition here is that if you've got some features that are disjoint but highly correlated, then you kind of want encoders that push them further apart than they are by default. There was a really fun example of this in Anthropic's Towards Monosemanticity paper, where they were looking at base64-detecting latents, and I think they found one that was for numbers in base64, one that was for letters in base64, and one that was for ASCII text converted into base64. You only want the right one to activate at a time, so you want the encoder to pick out which one it is. It's an interesting empirical observation that you can do this disentanglement, but sparsity is just a really useful prior, basically.

And this is also an interesting property of language: the real world is kind of sparse, in the sense that it's full of concepts and things, but you don't need to be thinking about the theory of relativity if I ask you the name of a singer's latest album or something like that. And this means there are lots of concepts that aren't useful for most other things. If you, I don't know, train a model on a hyper-specific task, it's not actually obvious to me that SAEs will be useful. Like, I think people have tried them on a modular addition network that was trained on exactly that task, and it wouldn't surprise me if
they don't do very well.

How do you know whether one of these latents is interpretable?
Yes. So that's a great question. The first thing to emphasise is, kind of, why is this even a question at all? Sparse autoencoders are an unsupervised technique; that means you don't tell them what to learn, you just tell them "please be sparse" and you pray that something good happens. This would be in contrast to, say, a probe,
where you give it labels, like "this is formal text" and "this is informal text" or something. And this means that at the end you just get this artefact with tens of thousands of latents and no idea what they mean. And empirically, some of these latents don't even seem meaningful. The standard, kind of dumbest approach you can take is to just look at the text that most activates a latent.
You look for a pattern; often there will be a pattern, sometimes there won't be. This is a crude technique, and it is known that it can sometimes be misleading. There's a good paper on this from Tolga Bolukbasi called the interpretability illusion, though I don't actually know if I've seen such an illusion for SAE features; in that paper I think they were just using basis directions in the residual stream, which you have much less reason to believe might be meaningful. And yet there are just some latents, maybe like twenty to thirty percent,
it probably varies, that just really don't have a clear pattern. And the standard thing that gets done is either a human or an LLM looks at these and tries to give an explanation, and you can score these explanations. The idea of having an LLM do this comes from this great OpenAI paper from Steven Bills, called "Language models can explain neurons in language models" or something. Their idea was: you give it a list of dataset examples, some from the highest part of the activation range, maybe some from more in the middle, and it produces an explanation. But it will always make an explanation, even if there's no pattern,
because language models be like that. And what happens is they then give a model the explanation, give it some more text and say, please predict the activation of this neuron or latent. They did it on neurons, but it's much more interesting on latents. And you see how well it does. You can also do kind of cheaper things,
because the full scoring can actually be quite expensive: like, you give it two texts and ask which of these will light up the latent more. There are various kinds of difficulty scaling you can do with that. And there was a nice LessWrong post on innovations for auto-interp, from an author whose name I'd probably butcher, that suggested various improvements like that.
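(A small sketch of the "look at the text that most activates a latent" workflow; my own, and it assumes a `model` and `tok` like the earlier GPT-2 sketches plus a trained `sae` whose input size matches the model's hidden size. The layer and latent index are arbitrary.)

```python
import heapq
import torch

LAYER, LATENT, TOP_K = 6, 1234, 10
top = []                                   # min-heap of (activation, snippet)

def collect(texts):
    for text in texts:
        batch = tok(text, return_tensors="pt")
        store = {}
        def cache(module, inp, out):
            store["act"] = out[0].detach()
        h = model.transformer.h[LAYER].register_forward_hook(cache)
        with torch.no_grad():
            model(**batch)
        h.remove()
        _, latents = sae(store["act"][0])                 # (seq_len, d_latent)
        for pos, val in enumerate(latents[:, LATENT].tolist()):
            snippet = tok.decode(batch["input_ids"][0, max(0, pos - 10): pos + 1])
            heapq.heappush(top, (val, snippet))
            if len(top) > TOP_K:
                heapq.heappop(top)

collect(["The Golden Gate Bridge was shrouded in fog.", "I had toast for breakfast."])
for val, snippet in sorted(top, reverse=True):
    print(f"{val:.2f}  ...{snippet}")
```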
So that's the looking-at-dataset-examples approach. Another approach is causal interventions, à la Golden Gate Claude: you make it big and see what happens; maybe you make it zero or negative and see what happens.
For example, if you set a Harry Potter latent to zero or to negative, the model will often lose the ability to answer factual questions about Harry Potter, which is a very fun, hacky approach to unlearning, though it sadly seems to perform less well than actual unlearning baselines. And, in my opinion, the gold standard is: you give the model a ton of text, you look at all of the times the thing fires, and you then classify all of those by whether they satisfy the property or not. You can sometimes do this algorithmically, for things like "is this Arabic?" or "is this base64?", but the pretraining text is wildly diverse.
So lots of things you would think would work reliably, like a regex, just totally fail. You could also ask a language model, does this fit the explanation? Anthropic did that in their Scaling
Monosemanticity paper, which I quite liked; they did that for their Golden Gate feature. You can also handcraft examples and see if they activate it. One thing that's often missing from these analyses is all the times when the explanation is relevant but the latent doesn't fire, and I'd like to see more work on that. Anthropic had a recent mini-investigation in their monthly updates, I think, where they found that these things often only fire like a third of the time when the text is about the relevant concept.
There's a lot we still don't understand about this. But to summarise that rambling answer: there are various things you can do, and looking at dataset examples is one natural one. I wholly recommend people go poke around on Neuronpedia, both the Gemma Scope demo and the main website, because they have pages with LLM-generated explanations from GPT-4o mini, so not as good as you'd get from the best models and sometimes wrong, plus dataset examples. You can even type in your own text and see whether the thing lights up, and just play around and get a feel for how reliable this is in practice.
Yes. I mean, we should emphasise again, this is an unsupervised method. But what you were alluding to is that there's potential for kind of automating this process, though there will be some brittleness in the explanations.
So, for example, Neuronpedia are taking the thirty-million-ish latents in Gemma Scope and generating labels with GPT-4o mini for all of them, which is great. I think this will be a really useful resource, even if it's sometimes wrong.
That's another question: what error rate are you willing to tolerate? And often my answer is, if it's a heuristic tool for a researcher, a reasonably high one, but if it becomes an important part of my results, I'll go and check it. Maybe I should also just comment a bit on why being unsupervised is kind of important here.
A classic mistake in interpretability is projecting your preconceptions onto the model; you think it works a certain way. The Othello paper we discussed earlier is an example of this, assuming the features were black and white. My modular addition work is an example of this: I initially thought there would be a nice discrete algorithm, and it turns out it was actually using discrete Fourier transforms and trig identities. It's very easy to be misled.
And the more you have techniques that can tell you when you're wrong, the better; the more your techniques just let you confirm a hypothesis you already had, the more limited they are. And I think that having things just pop out is great, because there can be features arising in a model that we wouldn't have expected, and the fact that SAEs can let you do this kind of unsupervised discovery is great.
So, for example, there was a recent blog post from a bunch of my MATS scholars building on the Othello results, where they were analysing sparse autoencoders trained there. There had been some previous claims that sparse autoencoders didn't recover the board state directions, so they weren't working. But what this follow-up work found is that the sparse autoencoders were actually finding a kind of more granular feature, basically "is this vertical column of the board where something was just played, like in cell D6?", because it picks out things in that column, and you can combine these together to get the board state. But this was a more granular thing the model had learned, and I didn't expect that; that was cool.
Yes. Earlier we were talking about the evolution of these useful features, and presumably you can infer from their presence that they must be useful, because otherwise why would they be there? But when I was reading the Golden Gate Claude post that they put out, they were talking about some quite abstract features, things like power-seeking and deception and so on. And you could click on them and see which parts of the data activated those features. To what extent do you think these models can learn very abstract
features like that? Just, clearly yes. The models have clearly got abstractions; they're so capable at this point. But concretely, in terms of actual evidence rather than vibes: in Anthropic's Scaling Monosemanticity paper, one of the things I really like is that it's got all of this rich qualitative analysis of different features, digging into what they mean and their causal effects. They have things like a "this function does addition" latent, and if you clamp it when the model is completing Python code, the code goes from multiplication to addition.
And I'm like, what? Or they have a "fourth item in a list" feature. And they've also got this section on safety-relevant features towards the end, which I think is super interesting. They observe things that seem related to the kinds of things we might be quite worried about in future, more capable AI systems: keeping secrets from its operators, trying to seek power, things like that.
And I think this is not actually that scary, and I think Anthropic do a good job of not scaremongering there. The reason I don't think this is particularly concerning right now is that, you know, these models are trained on characters in books.
Those characters will do things like power-seek and deceive, and it's just useful to be able to model this, to simulate these people. But I also think the fact that we're starting to be able to study things like this with interpretability is really exciting, because I think it's really important. "Is AGI an existential risk?" is a pretty polarising question, with lots of prestigious people and strong opinions on both sides, but frustratingly little empirical evidence. And I think one of the things interpretability could potentially give us is a clearer sense of what's going on inside these systems.
Like, do they do things that we would call planning? Do they have any meaningful notion of goals? Will they do things like deceive us? The more we can understand how these things manifest, and whether they occur in situations where they shouldn't, the more I think we can learn. And this seems like a really important research direction to me.
Yeah, I agree. I mean, the reason that came to my mind was that I looked at some of those activations, because when you click on them it shows you which bits of text in their test corpus maximally activated those latents. And on some of the abstractions they gave, I looked at the top activations and they seemed quite low-level to me, almost like a keyword match. And of course, the deflationary view is that these models are a kind of n-gram model on steroids or whatever. But that might just be an artefact: for whatever reason, the top activations looked quite banal and superficial, but if you look at the whole thing in context, it might conform to what you would expect from a deeper, abstract understanding.
Yeah, so I agree. I think that if you go on something like Neuronpedia, they'll often show you different intervals, like what's between, say, the fiftieth and seventieth percentile of activation, and that gets you some stuff; this is useful for getting a kind of broad view. Also, just looking at the causal effect is interesting. Like, I found things like a feature that lit up on fictional characters in general, but especially Harry Potter, but when I steered with it, it produced only Harry Potter-related things. But what do I think... I think steerability
would indicate that it wasn't a trick. It's almost like the robustness of its representation post-steering kind of indicates to me that it's more than just keyword matching; it actually understands what the thing is.
Yeah. It seems pretty clear to me that it's not just keyword matching, because we observe things like multilingual features, where the same text in different languages lights the feature up. You tend to see these beyond a certain model scale, and not even that big; I think one billion, or even five hundred million parameters, is probably enough to start to see signs of it. And in Scaling
Monosemanticity, they had multimodal features: the Golden Gate Bridge one lit up on pictures of the Golden Gate Bridge. And with my interpretability hat on, I don't find this very surprising, because if you can map your inputs into a shared semantic, abstract space, you can do efficient processing on them. So, of course this will happen.
And there's some interesting work, like the "Do Llamas Work in English?" paper from, I think, Chris Wendler, who I know is a listener of the show. That seemed to show that the model decides what to say and decides what language to say it in at kind of different points, and you can causally intervene on them differently. But I think if you have the kind of n-gram-matching, stochastic parrot perspective, you would not predict this. And I don't really understand the people who hold that perspective nowadays; to be honest, I think it's clearly falsified.
I mean, if Emily Bender was here right now and laid the charge that it's a stochastic-parrot pattern-matching system, what would you say to her?
I guess I would just say: we've observed algorithms inside these things. You can train a tiny model on modular addition, and it learns discrete Fourier transforms and trig identities. We know that transformers have induction heads that seem to be composing into some kind of actual algorithm. We find these abstract multimodal and multilingual features.
And I'm sure you can justify some relaxation of the claim, like "yeah, it does the easy stuff, but it never does the real hard stuff, so it's still stochastic-parrot-ish". But the Othello work is another example. I think that's a clear case of a model forming a world model in some sense: it only ever saw the moves, it never saw the board, but it forms a causally meaningful internal representation of the world.
Yeah, I mean, the steel man of that argument is that they don't seem to do this compositional generalization; they don't have that ability. But then, what is an abstraction if it's not a bag of analogies? And what does an analogy mean? If you capture all of the representations of a concept and link it to a thing called an abstraction, or whatever, and it acts as if it knows the abstraction and can reason with that abstraction, at some point it's a distinction without a difference.
Yeah, I think another important factor here is that, in my opinion, the claims being put forward are kind of for-all claims: that there do not exist instances of the thing not being a stochastic parrot. I'd happily concede there are many instances where it is: they memorize lots of stuff, sometimes they hallucinate, and lots of the time they're just doing basic grammar, like the token "don" is followed by apostrophe-t to make "don't", which is basically an n-gram.
Clearly the models learn this. And I think we're still figuring out how to steer them, how to get them to use the complex, abstract circuitry that we want, rather than the thing that's really useful for predicting the next token. So, for example, even models in the low billions of parameters know how to do addition.
You can give them something like one hundred and thirteen plus two hundred and thirty-seven, and they will often give the correct answer. And that specific sum can't have come up much in the training data; even the general circuitry for addition, compared with something like figuring out whether a full stop is going to come next, comes up far less often. So think how much more of an incentive the model has to devote parameters to that kind of thing. And I think we're just not very good yet at reshaping how the circuitry is expressed in a way that elicits the things we actually want.
Yes. So if I understand correctly, you're saying that the model at the moment is a big soup, and because of pressures from the architecture and the data, there are some really useful circuits that are very robust; then it's almost like there are strata, with a middle tier of things that are slightly less robust and likely to hallucinate, and then just near-pure noise on the outer layer.
Robust isn't quite the axis I'm thinking about. It's more like abstraction, or competence, or usefulness. There is circuitry that can, I don't know, think about complex tasks or produce a plan, in the chain-of-thought sense, and that seems like the thing we really care about. Whereas its enormous store of memorized facts we probably don't care about as much; it's probably memorized a bunch of famous books, and we probably don't care much about that. Though maybe we do: possibly you do want the model to have memorized the whole Bible, I don't know.
Well, how could we do that meaningfully? If we think of it as a kind of geological strata, and there are bits that have meaning, some of it just knowledge, and some of it circuits for doing certain types of reasoning and planning and so on, it's almost like, now we've got this big thing, maybe we'd want to shape it and grow it differently in the first place, but given a big model, how do we factorize it into the bits that we want?
Yeah, so, I'm honestly not that convinced that interpretability is the correct tool here, though it's a direction I'm interested in people exploring. In some sense we have a bunch of tools. There's prompt engineering, which in my opinion is basically just trying to give the model the right magic words to get it to use the circuitry you want and not the circuitry you don't want, because by default the model doesn't know what you want. Fine-tuning is another one, and the kind of chat, instruction-following fine-tuning people tend to do is a very important example of that.
My mental model of fine-tuning is that it's mostly just repurposing existing circuits, though this has not been proven, and I would really like to see it proven; it seems like a super interesting direction. You can also think of steering vectors, or SAE feature clamping, as another, more interpretability-flavoured way of steering these models. Plausibly there are things you could do where you observe that, say, these two attention heads connect with each other in a circuit you don't like, so you break that connection: take the downstream head and subtract the output of the earlier head from its input, and instead add the output of the earlier head on a different input (technically path patching, which we might talk about more later). Maybe that would break the capabilities you don't want in a way that makes the model better. But this is very much speculative. And I'm quite sympathetic to the bitter lesson in many ways: just throwing compute at the problem is often a really good solution. That makes me think things like fine-tuning are going to be quite a hard baseline to beat, unless you're in a situation where your data is really bad, like you don't have much of it, or it's got spurious correlations, or it's noisy, and you maybe want to use interpretability to do better there.
Which brings us back to sparse autoencoders. First of all, there was the vanilla variant, which was described in Towards Monosemanticity, but there's a whole bunch of variants where you change the architecture and the activation function and so on, and your team has been working on one of those, the JumpReLU. Can you just sketch that out?
Yes. So the first problem is shrinkage. It's mathematically the case that if you're using an L1 regularization, each latent will fire less strongly than is optimal purely for reconstruction, because you've got two objectives, and the L1 always wants things to be smaller, so it makes them a bit smaller than they should be.
This is sad, and that's shrinkage. The second problem is a bit more conceptually difficult. The idea is that superposition causes interference. This means that if you have some feature direction, say the dog feature, and you project onto it and plot a histogram of the projections, you'd obviously expect to see a dog mode, say between three and five there's the dog bit, but there might be a lot of interference.
So it might not be the case that when dog isn't present in the prompt the projection is zero; it might be between minus two and two, and very noisy. And it's quite hard to solve this with a ReLU, because what you mathematically want to do is measure the distance from the origin if the projection is between three and five, and otherwise set it to zero. With a ReLU you can't put the threshold at zero, but if you put it higher, like at three, you're instead measuring the distance from three, which distorts things and gives the wrong scale. An activation function which does do what you want is called the JumpReLU. The way a JumpReLU works is that it's basically a normal ReLU,
but you add a threshold theta that's positive, and everything below theta is set to zero. For people watching the video, the graph basically looks like a flat line, a discontinuous jump, and then the identity. This lets you exactly represent things like: if it's above three, take the distance from zero; if it's below three, set it to zero. My team wrote a paper on Gated SAEs, led by Sen Rajamanoharan, who is great at having wild, crazy ideas for new SAE architectures. The idea there was: let's have two encoders, one which figures out which features should fire, producing a binary mask, and another which figures out how strongly they should fire, and we only apply the L1 to the "which features should fire" part. Because it's a binary mask you can't have shrinkage, and we did some magic tricks to make it trainable even though it produces a binary mask.
It was quite hectic, but we also found that if you made the two encoders share the same weight matrix, just with different biases, it performed basically as well with fewer parameters, so life was better. And it turns out that this mathematically reduces to a JumpReLU. We then had a paper on JumpReLU SAEs, which may be the state-of-the-art recipe, it's a little bit unclear, again led by Sen. It's basically a normal SAE, but you replace the ReLU activation with a JumpReLU. A problem with JumpReLUs is this threshold variable: if the threshold is three, then everything above three is the identity, so if you've got an input where the activation is four, the threshold has zero gradient, because making it a bit higher or lower doesn't change anything, which means it's really hard to train.
So what we instead did is optimize the L0 rather than the L1 proxy. Normally you can't do this, because L0 is discrete: it's just "did this feature fire, one if yes, zero if not". But we used straight-through estimators, which basically means that rather than thinking of it as a function of activation strength that is zero and then has a sudden jump to one,
we think of it as a function that is flat for a while and then has a very steep diagonal line up to one, and then is one. And then you take the gradients of this, but only on the backwards pass. So it's discontinuous when you're going forwards, and you intervene on the backwards pass to make it this kind of weird estimator.
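To make that concrete, here's a minimal, hedged PyTorch sketch of a JumpReLU-style SAE with a straight-through estimator on the L0 penalty. It is illustrative only: the module name, the ramp width `bandwidth`, and the initialization are my own assumptions, not the exact published Gemma Scope recipe.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Illustrative JumpReLU SAE sketch (not the exact published recipe)."""

    def __init__(self, d_model: int, d_sae: int, bandwidth: float = 0.001):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.log_threshold = nn.Parameter(torch.zeros(d_sae))  # theta = exp(...) stays positive
        self.bandwidth = bandwidth

    def forward(self, x: torch.Tensor, l0_coeff: float = 1e-3):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        theta = self.log_threshold.exp()
        gate = (pre > theta).float()                  # JumpReLU: zero out anything below the threshold
        acts = torch.relu(pre) * gate
        recon = acts @ self.W_dec + self.b_dec

        # Straight-through estimator for the L0 penalty: the forward value is the hard 0/1 count,
        # but gradients flow through a steep ramp of width `bandwidth` around the threshold.
        soft = torch.clamp((pre - theta) / self.bandwidth + 0.5, 0.0, 1.0)
        l0 = (gate + soft - soft.detach()).sum(-1).mean()

        loss = (recon - x).pow(2).sum(-1).mean() + l0_coeff * l0
        return recon, acts, loss
```

The point of the `soft - soft.detach()` trick is exactly the forward/backward mismatch described above: the forward value is the hard count, the backward gradient comes from the steep ramp.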
And we found that this performed very well, and it addresses both shrinkage and the other, messier problem of how you deal with low-value interference. The other architecture worth knowing about is the top-k SAE, from Leo Gao at OpenAI, who wrote a great paper on this; the other part of that paper was scaling SAEs to GPT-4, for the absolute madmen. It's a super simple idea: rather than relying on the ReLU alone (you can still have a ReLU), you apply a top-k function.
You just take the top, say, three hundred latents, keep those, and set everything else to zero, and this gets you sparsity for free, with no concerns about whether too many features are firing. This also works pretty well. We found that JumpReLU slightly outperformed it; Anthropic found that top-k seemed better; it's a bit unclear, and both seem useful. Top-k makes it very easy to set whatever sparsity you want.
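The top-k version is even simpler to sketch. This is a hedged illustration of the encoder nonlinearity only, with k and the shapes chosen arbitrarily:

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 300) -> torch.Tensor:
    """Keep only the k largest pre-activations per token and zero the rest (TopK-SAE style).

    pre_acts: [n_tokens, d_sae] encoder pre-activations.
    """
    values, indices = torch.topk(pre_acts, k, dim=-1)
    acts = torch.zeros_like(pre_acts)
    acts.scatter_(-1, indices, torch.relu(values))  # optional ReLU on the survivors
    return acts
```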
This is actually quite an annoying problem. We've got this sparsity penalty when training these things, whether it's L1 or L0, and choosing it will change how sparse the SAE is, meaning how many features tend to fire.
It's kind of unclear what the right sparsity is, because intuitively there shouldn't be that many concepts relevant to a given input, so you shouldn't need that many latents; maybe fifty to a hundred would be a typical target, but maybe it's more like twenty, I don't know. And often the lowest-activating features will just be noise, but it's often noise that helps the model reconstruct the input, in kind of an uninterpretable way, because if you're saying "well, if the projection onto this vector is high, I will add a similar vector", then you reconstruct things better. But you often want a good reconstruction, so it's unclear what the right thing to do is. Typically the way we compare SAE families is, rather than just saying which one performs better,
we take a range of sparsity penalties (or different values of k, in the case of top-k) and plot a Pareto curve of how many features are firing against how good the reconstruction is. That can either be how good it is at reconstructing the activation, or, if you substitute in the reconstructed activation and finish running the model, what the increase in cross-entropy loss is, which is in some sense closer to what we care about, because there might just be some uninterpretable garbage in the activation that doesn't matter. Typically this looks like a curve, because having more features firing is better, and then we can compare the curves for the different methods.
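As a hedged sketch of the second metric: splice the reconstruction back into the forward pass and measure the loss increase. The `sae` object (with `encode`/`decode`), the model choice, and the hook point are assumptions for illustration, written in a TransformerLens style.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.8.hook_resid_post"

def splice_in_reconstruction(resid, hook):
    latents = sae.encode(resid)          # [batch, pos, d_sae] sparse codes
    return sae.decode(latents)           # replace the activation with its reconstruction

tokens = model.to_tokens("The Eiffel Tower is in the city of Paris.")
clean_loss = model(tokens, return_type="loss")
spliced_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(hook_name, splice_in_reconstruction)],
)
print("cross-entropy increase from splicing in the SAE:", (spliced_loss - clean_loss).item())
```

Sweeping the sparsity penalty (or k) and plotting this number against the average number of firing latents gives you the Pareto curves being described.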
So when I say a technique is better, what I mean is that its curve is further up and to the right; you should really go look at the papers and the diagrams, I'm not explaining this very well verbally. The other thing that's crucial to check is that you haven't accidentally made things uninterpretable, and you basically do this by running a human interpretability study, or having a language model do it for you.
An interesting observation is that there's been quite a lot of progress on the sparse-reconstruction side of this, but the interpretability numbers really haven't changed that much. One final note on top-k: there's a small improvement called batch top-k, which Bart Bussmann, one of my MATS scholars, made. So, an annoying thing about top-k:
You get to choose the sparsity, which is great, you don't have to tune a hyperparameter and fiddle around, but it means you have exactly the same number of features firing on every input, which isn't really what you want. Intuitively, some tokens are boring,
and some tokens are interesting and should have a lot of stuff happening. What they did with batch top-k is, rather than taking the top k for each token, you take the top B-times-k activations over the whole batch, where B is the total number of tokens in the batch. This means you can have variable numbers of features per input, but on average the sparsity will always be k, say a hundred.
And at inference time, you can just take a typical value of the batch top-k threshold and fix it, so you don't need a batch every time, which is actually identical to a JumpReLU at inference, which is a nice connection, though it's a JumpReLU with the same threshold everywhere. Batch top-k is probably my recommendation for people who want to use top-k-style methods or who really want to control the sparsity. JumpReLU might be a bit better, and it's what we used for Gemma Scope, but I think it's probably more complicated to train and easier to shoot yourself in the foot with, though we have some open-source implementations now.
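A hedged sketch of the batch top-k idea, with the batch treated as a flat pool of token activations; the comment about freezing the threshold is the JumpReLU connection just mentioned.

```python
import torch

def batch_topk_activation(pre_acts: torch.Tensor, k: int = 100) -> torch.Tensor:
    """BatchTopK sketch: keep the (n_tokens * k) largest pre-activations across the whole batch,
    so individual tokens can use more or fewer latents while the average stays at k.

    pre_acts: [n_tokens, d_sae] encoder pre-activations.
    """
    n_tokens = pre_acts.shape[0]
    n_keep = n_tokens * k
    threshold = torch.topk(pre_acts.flatten(), n_keep).values.min()
    # At inference you can freeze a typical value of `threshold`, which turns this into a
    # JumpReLU with one shared threshold for every latent.
    return torch.relu(pre_acts) * (pre_acts >= threshold)
```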
You said that we want to have as many features as possible, and there seems to be something limiting the number of features produced, even across these methods. What are the things limiting the number of useful, interpretable features?
So, I don't know if I agree that we want as many features as possible.
So what do we want? What's the objective? Is it as many useful, interpretable features as possible?
I want to know what's going on in the model: what computation is happening, what the variables are. And I don't know how many variables there are. It's actually probably more complicated than that; I've updated away from the idea that there's some fixed number of variables.
So what can we do with sparse autoencoders, and what evidence do we have that they work?
Yeah, okay. Let's think about this from the perspective of a researcher: you've got a language model, what can you do with it? Typically you need to pick an activation to train your sparse autoencoder on; you essentially need to train a different one for each activation site in the model. Typically the most interesting one is the residual stream, because that's a bottleneck.
It's the sum of all layer outputs so far, so if you understand what's going on there, you've actually answered quite a lot about the computation, while any given layer is only a small fraction of it. For example, in Scaling Monosemanticity, Anthropic just trained SAEs on the middle residual stream. So what can you do with one of these? The natural thing to do is to use it as a microscope.
You run some text through the model, and on each token you see which latents light up. Then, ideally, you have some tooling like Neuronpedia that lets you understand what those latents mean, and you can say: ah, the model is thinking about this, this and this; or, why is the model thinking about this?
That's odd, I'm curious what's up with that. I highly recommend people just go to the Neuronpedia Gemma Scope demo and play with the microscope part, where you can put in text and see what lights up.
It's got a very, very pretty UI, and Neuronpedia also has an API, which means that when you're a researcher messing around with things in a Colab, you can just pull up the dashboard for each feature, because it's kind of a pain making them yourself. So that's using it as a microscope. Another thing you can do is use it for discovery: you just look at latents, randomly chosen, and ask, what does this fire on? Oh, that's an interesting concept, I wouldn't have guessed the model had that concept.
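Programmatically, the microscope loop is tiny. Here's a hedged sketch using a TransformerLens-style model and an `sae` object with an `encode` method (the model name, layer, and hook name are all illustrative), printing the top-firing latent ids per token so you can look them up on Neuronpedia.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-2b")
hook_name = "blocks.12.hook_resid_post"      # illustrative choice of layer/site

prompt = "The Golden Gate Bridge towers over the fog."
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens, names_filter=hook_name)
latents = sae.encode(cache[hook_name])       # [1, pos, d_sae] sparse activations

for pos, str_tok in enumerate(model.to_str_tokens(prompt)):
    vals, idxs = latents[0, pos].topk(5)
    firing = [(int(i), round(float(v), 2)) for i, v in zip(idxs, vals) if v > 0]
    print(f"{str_tok!r}: {firing}")          # look these latent ids up on Neuronpedia
```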
Or maybe you guess at a latent you think should exist, so you come up with a bunch of prompts that might contain it, look at those, and see if you can find a latent that acts like a probe distinguishing the two datasets. The next thing you can try doing is causal interventions,
basically steering. You can either do this by clamping: you pass the residual stream through the SAE, but then you take a latent, say the dog latent or the yelling latent, pin it at a fixed high value, and then proceed. Or you could just do steering: use the decoder vector as a steering vector.
You can think of this as adding some number to the latent's activation, or you can think of it as just adding a vector, in which case you don't even need the SAE. One note: SAEs introduce error. The reconstruction is not the same as the original activation, and these errors, the difference between the original residual stream and the new one the SAE produces, sometimes contain important things, like features that were just too rare to be captured. So if you replace the activation with the reconstruction, that in itself will have an effect on the model, which might drown out the effect of clamping a latent. Typically what we do is add the error term back in, or do some more efficient operation that is mathematically equivalent, because that's a useful control. It's annoying, because it means there's this mysterious node we don't understand, but it also means we can do cleaner investigations.
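Here's what that looks like as a hedged sketch: clamp one latent, but add the reconstruction error back in so the only change you make to the residual stream is the latent you're steering. The latent id, value, model, and hook name are all assumptions.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2-2b")
hook_name = "blocks.12.hook_resid_post"
DOG_LATENT, CLAMP_VALUE = 1234, 8.0          # illustrative latent id and strength

def clamp_dog_latent(resid, hook):
    latents = sae.encode(resid)
    error = resid - sae.decode(latents)      # whatever the SAE failed to capture
    latents[..., DOG_LATENT] = CLAMP_VALUE   # pin the chosen latent at a fixed high value
    return sae.decode(latents) + error       # error term goes back in, so only our edit changes

tokens = model.to_tokens("Tell me about your weekend.")
steered_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, clamp_dog_latent)])

# The cheap alternative: skip the SAE at run time and just add the decoder row as a steering
# vector, e.g. resid + CLAMP_VALUE * sae.W_dec[DOG_LATENT].
```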
You can also use it to figure out what was going on with causal interventions; ablations are probably a more interesting kind of causal intervention. You give it a prompt and some latents light up, say "the Eiffel Tower is in the city of" and it says Paris, and you ask which latents actually matter for saying Paris. You can just delete them one at a time and see how the Paris probability changes. A more ambitious thing is circuit finding. Sparse autoencoders, by default, are just an activation-studying technique:
they find you the variables in the model, but this doesn't inherently give you the algorithms, the circuits. But what we can do, and I think the best work here so far is the Sparse Feature Circuits paper by Samuel Marks and Aaron Mueller: what they basically did is take the outputs of each attention layer and each MLP, and every residual stream, and train a different SAE on each of those.
They did this on Pythia-70M, because it's expensive and annoying, especially because you have so many latents when you have SAEs everywhere and you try to do causal interventions. You've got some prompts; they studied things like subject-verb agreement, "the man says" versus "the men say", or something, and tried to find interesting features for this, where you basically look at the causal effect of each latent when you get rid of it, and then you also look at the connections between them. So you delete a latent in layer two,
you look at the change for a latent in layer four; or maybe you subtract the direction for the layer-two latent from the input to the layer-four one and see what effect that has. One nice thing about sparse autoencoders is that, because they're sparse, often many latents aren't that important, so you don't have to do this edge tracing for everything, which would be a combinatorial nightmare. Also, because they had residual-stream SAEs, they only had to do one step of a path at a time; there was a bottleneck at each layer they were connecting through, so they weren't looking at connections from, say, attention layer one to MLP five. A problem with this approach is the error nodes: the reconstruction error, when you put SAEs everywhere, will just completely break your model, it cascades and destroys performance, which means that finding circuits without the error nodes is kind of sketchy. But if you include error nodes in your circuit, then you've got this massive uninterpretable lump. They got much worse results when they didn't include the error nodes. You can still make some progress, though.
For instance, if you have a subgraph of latents and you think it's a circuit, you can try deleting all of them and see how much that damages model performance, and that can give you some evidence. I'd also say that I think you don't even necessarily need to do the edge-patching stuff, because often you'll have a sufficiently small set of latents
that you can just assume all the edges are meaningful, and that's kind of good enough. One caveat I'll give is that this is very much causal-intervention-based circuit finding. They in fact used a gradient-based approximation to this called attribution patching, which I think I discussed briefly earlier, and which they improved using integrated gradients, a classic interpretability technique; but it's basically trying to approximate the causal intervention of changing a latent's value to its value on another input.
One nice thing is that there's a good default value for latents: zero. Finding a good default value is actually quite hard in general circuit finding, because activations aren't normally near the zero vector, so setting them to zero is really destructive and weird. You could set them to the mean, but maybe the mean has a much smaller norm, and that messes with LayerNorm and things like that. You can also do activation patching, where you replace the latent's value with its value on another input.
This is useful if the latent is active most of the time, though it can actually be misleading to do mean ablation, where you replace it with the average: if the distribution of the latent is zero ninety-eight percent of the time and one hundred the remaining two percent of the time, then the mean is two. But in some sense, being two one hundred percent of the time is less faithful to the original distribution than being zero one hundred percent of the time.
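For what it's worth, the gradient-based approximation is only a couple of lines once the SAE latents are part of the forward graph. This hedged sketch shows the first-order estimate of what zero-ablating each latent would do to some scalar metric (for example a logit difference):

```python
import torch

def attribution_patch_to_zero(latent_acts: torch.Tensor, metric: torch.Tensor) -> torch.Tensor:
    """latent_acts: [pos, d_sae] SAE activations that were re-injected into the model's forward
    pass (so they require grad and the metric depends on them).
    metric: scalar output we care about, e.g. logit(Paris) - logit(London).
    Returns a first-order estimate of the metric change from setting each latent to zero."""
    (grad,) = torch.autograd.grad(metric, latent_acts, retain_graph=True)
    # delta_metric ~= grad * (new_value - old_value); with new_value = 0 this is -grad * old_value.
    return -(grad * latent_acts).detach()
```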
So there are lots of questions we still haven't picked apart here. One of the things we went out of our way to do with Gemma Scope was to have SAEs on every layer and sublayer, to enable this kind of analysis.
And I think you can just about fit all of the narrow ones for Gemma 2 2B on an eighty-gig A100, maybe leaving out the residual ones, and I'm excited to see what circuit analysis can find on those kinds of models. I actually have a pair of MATS scholars, Stepan and Dmitrii, who are going to put out some work on this soon; they've been looking into in-context learning. There was a recent set of cool papers on this idea of function vectors, or task vectors; there was one from Eric Todd's group and one from, I think, Hendel et al., within about a day of each other. The idea is about in-context learning, like few-shot learning:
you have a bunch of examples of the task, and then you give it another input and it does the task, like producing opposites, or whatever. What they found is that at a certain layer you could extract a vector that corresponded to the task, and then, for a different prompt with a different input and a different task, you could swap in that task vector and it would do the original task on the input in the new prompt. So it kind of isolates the few-shot-ness; they also found it sits at a slightly earlier layer than where the model figures out what the input is, which is quite key. What Stepan and Dmitrii found, which I think we put out in a recent blog post, is that there are sparse autoencoder features corresponding to these task vectors, and also sparse autoencoder features acting as task detectors, detecting what task is happening, on earlier tokens, which connect up to them. I'm really curious to see how much more you can uncover there.
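Here's a hedged, toy sketch of the task-vector recipe they're building on: average a mid-layer residual activation over a few few-shot prompts for a task, then add it into a zero-shot prompt at the same layer. The model, layer, and prompts are illustrative; the actual papers are more careful about where and how they extract the vector.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.6.hook_resid_post"       # illustrative mid-layer site

fewshot = ["hot -> cold\nbig -> small\nfast ->", "tall -> short\nwet -> dry\nup ->"]
vecs = []
for prompt in fewshot:
    _, cache = model.run_with_cache(model.to_tokens(prompt), names_filter=hook_name)
    vecs.append(cache[hook_name][0, -1])     # residual stream at the final token
task_vector = torch.stack(vecs).mean(0)      # crude "antonyms" task vector

def add_task_vector(resid, hook):
    resid[:, -1] += task_vector              # inject the task at the last position
    return resid

logits = model.run_with_hooks(model.to_tokens("good ->"), fwd_hooks=[(hook_name, add_task_vector)])
print(model.tokenizer.decode(logits[0, -1].argmax().item()))   # hopefully something like " bad"
```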
Is it possible to know in advance, though? I mean, how should we expect these features to arise? If we have a certain type of dataset about, I don't know, architecture or something, should we expect a certain type of feature to arise?
Yeah, so this is a really interesting question that we don't understand super well yet. A crucial hyperparameter I haven't spoken about so far is the width of your dictionary, the number of latents. This is chosen by the researcher before they train the sparse autoencoder.
You can think of this intuitively as the zoom level of your microscope: do you find a few coarse-grained features or lots of fine-grained ones? A really interesting thing is that, in the picture I laid out at the start, where there's some ground-truth list of features, each with its own vector, you might expect a sparse autoencoder with twenty thousand latents to learn the first twenty thousand of those and just totally miss the rest.
What we find is that that's only partially true. Larger sparse autoencoders do learn features the smaller ones totally miss; two of my MATS scholars, Bart Bussmann and Patrick Leask, have a fun blog post on that.
But there's also a lot of feature splitting, where you have a coarse, highly interpretable feature that splits into lots of sub-features; I don't know, colour might split into red, green, blue. Colour is actually sufficiently common that that probably didn't literally happen, but there are lots of examples. And this is kind of weird, because it raises the question of whether sparse autoencoder features are actually units of computation or not, and my current guess is that they're probably not. Bart and Patrick had another post, roughly "are SAE latents atomic?", and what they did there is train a meta-SAE: you take the decoder matrix of an SAE and train another SAE to reconstruct that decoder matrix, and that second one is the meta-SAE, typically with a tiny L0, like four or so.
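A hedged sketch of that meta-SAE setup: the "dataset" is just the decoder directions of an already-trained SAE (`base_sae` here is an assumption), and sparsity is enforced with a hard top-k purely for simplicity; the actual post has its own recipe.

```python
import torch
import torch.nn as nn

class MetaSAE(nn.Module):
    def __init__(self, d_model: int, d_meta: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_meta)
        self.dec = nn.Linear(d_meta, d_model, bias=False)

    def forward(self, x: torch.Tensor, k: int = 4):
        pre = self.enc(x)
        vals, idxs = pre.topk(k, dim=-1)                              # ~4 active meta-latents per direction
        acts = torch.zeros_like(pre).scatter_(-1, idxs, torch.relu(vals))
        return self.dec(acts), acts

decoder_dirs = base_sae.W_dec.detach()                                # [d_sae, d_model]: one "example" per latent
meta = MetaSAE(d_model=decoder_dirs.shape[1], d_meta=2048)
recon, meta_acts = meta(decoder_dirs)
loss = (recon - decoder_dirs).pow(2).mean()                           # plus whatever auxiliary terms you prefer
```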
And this found that sometimes you would have a feature, like an Albert Einstein feature, that decomposed into, say, male, physicist, German, whatever. And these meta-features often corresponded to things in a smaller SAE. So features in smaller SAEs often split apart in bigger SAEs,
but features in bigger SAEs are often just compositions of things in smaller SAEs. Yet they're also often interpretable and causally meaningful, and it feels like interpretable and causally meaningful should be enough, but I don't know.
I think this is something I do not currently feel like I understand. The right question is maybe just to ask whether sparse autoencoders are useful for understanding what's going on inside models. But it also seems important
to have better foundations for what they've actually found. It feels like they've found something, but I don't know what. And it seems fascinating that the internals of models are hierarchical in this way, where you can have interpretable, causally meaningful directions at different levels of abstraction.
The other comment I'll make on what you might expect to come out of this: my favourite part of Anthropic's Scaling Monosemanticity paper is the feature completeness section. What they did there is take a bunch of things you might expect to arise in the model, like chemical elements, or I think maybe London boroughs, where there's a kind of clean name.
So you can just check: is there a feature that lights up on text including this name? They took SAEs of different widths and looked at how many of these concepts get learned, and what getting learned looks like, as a function of how frequently the concept occurs, across the different widths. And they found you could fit it quite well with
a simple predictive model, I think something like a sigmoid, I don't remember the details, but basically: sufficiently small SAEs don't learn it at all, and then, depending on the frequency, there's a point where the probability starts increasing, and then it goes up a lot, and then it's basically one hundred percent. There's plausibly also splitting later on, which is a complexity I'm not sure they went into.
Can you give that reference, where was it?
This is the feature completeness section of Anthropic's Scaling Monosemanticity paper. That paper has so many sections; it's so good. I think the takeaway from this is that the frequency of a feature matters a lot, which just makes sense: things are more useful to learn if they occur more often and if they're, in some sense, bigger in activation space.
But I think the second interesting takeaway is that it seems probabilistic whether a feature gets learned in an SAE of a given width. If you train multiple SAEs of the same width, they'll share the features that were the obvious ones to learn, but for features that are more niche and more on the borderline, it's just kind of random.
I don't really know what the takeaway is; I'd love to see more research in this area. In terms of advice for practitioners, I basically think you should just train several widths of SAE. In Gemma Scope we have a wide and a narrow one at every site, and at a few layers we have every width in powers of two,
so you can do this kind of hierarchical, feature-splitting analysis. If you have a few widths, try the different widths on your task and see which ones capture the features you care about. It would be nice if we had a better answer, but
that's the best we have. You can also train SAEs on, I don't know, the residual stream, on just the MLPs, on attention. How do they compare?
Yeah, you can basically train on any activation. You can even train them on the queries, keys and values in an attention layer; that's more confusing, but it works as well.
Doing it on a layer's output, I think, is a super reasonable thing to do. There's an interesting question of whether you should do it on the MLP activations or the MLP output. Typically the MLP activations will be four or more times the size of the residual stream, but then there's just a linear map taking them down to the output.
So conceptually there shouldn't really be a difference, but your SAE will have four times as many parameters, and maybe that helps, maybe it doesn't, I don't know. Typically I just train on the output because it's four times cheaper, but it would be great if someone actually studied this systematically. Returning to your question, I think these are just different tools: if you want to study what a layer does, you want the output of that layer.
A totally valid mode of analysis would be: take a task you care about, like in-context learning or refusal, do activation patching, swapping layer outputs between different prompts, to find the layers that matter, and then go and take the Gemma Scope SAE for that layer and see what comes up. The residual stream is generally more useful if you want a holistic view of what's happening. Probably all kinds will be causally meaningful.
But the residual stream is the one that detects what the model is thinking about, while the layer ones tell you what's happening at that layer, what did that layer do, which is just a different question. An interesting thing about MLPs is that the MLP itself is this dense nonlinear mess, and it's kind of in superposition: you've got thousands of GELU activations, or gated neurons, or whatever,
and it's very unclear how to think about that. This is fine if you're just doing causal interventions, where you just want to study an activation.
But if you want to do some kind of weight-space circuit analysis, recovering this hope of input-independent understanding, which it feels like SAEs should bring us closer to, since they're more semantic, it's hard to deal with the big chunky MLP in the middle. The solution to this is the transcoder.
I supervised the transcoders paper from Jacob Dunefsky and Philippe Chlenski. The idea of a transcoder is that rather than finding a sparse reconstruction of the MLP output from the MLP output, you use the MLP input to find a sparse reconstruction of the MLP output. So it's kind of a sparse replacement MLP, with lots of neurons, but they're sparse, and ReLU. It works as a drop-in replacement: on GPT-2 they worked about as well as MLP-output SAEs in terms of reconstruction; on Gemma they were notably worse, I think because it has gated MLPs.
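A minimal, hedged sketch of the transcoder idea, to contrast with an ordinary SAE: it reads the MLP input and is trained to predict the MLP output, so at analysis time it can stand in for the MLP as a wide, sparse, ReLU replacement.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse replacement MLP: MLP input in, sparse reconstruction of the MLP output out."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, mlp_input: torch.Tensor):
        acts = torch.relu(self.enc(mlp_input))   # the sparse "neurons" (penalized during training)
        return self.dec(acts), acts

# Training pairs come from the model itself, e.g.:
#   recon, acts = transcoder(ln2_output)        # the normed residual going into the MLP
#   loss = mse(recon, mlp_output) + coeff * sparsity_penalty(acts)
```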
We've since released a suite of transcoders, which might be useful for people who want to do this kind of input-independent circuit analysis. Though actually, an annoying thing we found is that it's really hard to do purely input-independent circuit analysis, because when you take two transcoders and multiply the decoder weights of one with the encoder weights of the other, to get this interpretable-basis-by-interpretable-basis matrix, there are lots of weights that just don't matter, because of the interference I was gesturing at earlier. The way Jacob and Philippe dealt with this is that they looked at feature co-occurrence and just didn't look at connections between features that didn't co-occur, but that's somewhat input-dependent, so it's not fully satisfying; it was a solid attempt, though, and I'd love to see more iteration there. On attention: attention is much more linear than MLPs in some sense. The attention pattern computation is weird and nonlinear, and I don't think we yet understand how to deal with that with SAEs; there was an interesting blog post from Keith Wynroe on QK transcoders, which seems like a cool approach.
But if you take the attention pattern as given, a frozen thing you get to use (bear with me), then it's basically just a linear map. You use the attention pattern to mix together the different past tokens, then you apply a linear map, the value matrix and the output matrix, and you get a new thing; you do this for each head in parallel and then add them together. Technically you apply the value map and then do the attention mixing, but they commute, so it's fine.
And see the earlier episode if you want me rambling about this in obsessive detail. This is interesting because it means you can do linear attribution: you can say, look, this is basically a linear map of past tokens, and then some feature lit up, so what were the tokens that contributed to it? And you can even, rather than training the SAE on the attention output, train it on the concatenated mixed values across the heads, which is the thing immediately before the output weights.
That's just a linear map away from the actual output, and it means you can now attribute things directly to each head, which is quite a nice property. I supervised a paper from Connor Kissane and Robert Krzyzanowski basically doing a deep dive on whether you can do SAEs on attention layers, and basically, yes, it works great.
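The per-head attribution they're describing is nearly a one-liner once the SAE is trained on the concatenated per-head values: each latent's encoder column splits into head-sized slices, and each head's dot product with its slice is its contribution to that latent's pre-activation. A hedged sketch:

```python
import torch

def per_head_contributions(z: torch.Tensor, w_enc_latent: torch.Tensor,
                           n_heads: int, d_head: int) -> torch.Tensor:
    """z: [n_heads * d_head] concatenated mixed values at one token position.
    w_enc_latent: [n_heads * d_head] encoder column for one attention-SAE latent.
    Returns each head's contribution to the latent's pre-activation (they sum to the total)."""
    z_heads = z.view(n_heads, d_head)
    w_heads = w_enc_latent.view(n_heads, d_head)
    return (z_heads * w_heads).sum(-1)           # [n_heads]

# A head whose contribution dominates across a dataset is the head that latent "belongs to".
```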
They're potentially a bit more interpretable than MLPs, and you can do all kinds of nice things, like the linear attribution I described. For example, Rob, the absolute madman, went through every head in GPT-2 Small and looked at the top ten SAE latents according to how aligned they were with that head specifically, because you can read off the weights for each head and look for a pattern, and labelled heads by what they seemed to be doing. Check out the paper if you want; I think there's a table of all of them somewhere.
Very good. And what's the relationship between the number of features that you try to find and how meaningful the features are? I was also wondering how it's related to the size of the model.
Yeah, so this is not a thing that I think has been studied in much detail. I predict that the interpretability will be quite similar.
Though there are often training stability problems with larger SAEs; when you get beyond about a million latents, things get more annoying. For example, in Anthropic's thirty-four-million-latent SAE, I think about sixty-five percent of the features were dead, meaning they just never fired, and that's a really high ratio, like twenty million features doing nothing. This might just be because there aren't thirty-four million things to find, though they definitely found that the SAE had still missed stuff:
e.g. the model knew all of the London boroughs, but they couldn't find SAE features for all of them, only most of them. So yeah, learning stability problems. But my guess is that larger SAEs get much more fine-grained, to the point where you could argue that some latents are just, I don't know, memorising a couple of training examples or something. If you go to the extreme, meaning the infinite-width limit, you'd kind of expect each token to get its own latent or something, which seems boring and degenerate, but I'd be surprised if we ever get to that point.
Intuitively, larger models should just have more concepts in there, meaning you should want a larger SAE. I don't know if anyone has actually tried training an enormous SAE on, say, every model in the GPT-2 family or the Pythia family, to see whether on the smaller models it looks kind of crazy and on the larger ones it looks reasonable. But I think that would
be a very cool experiment. On the security sort of things: is it possible that these sparse autoencoder features could be used to, I don't know, perform adversarial attacks, or something like that?
Probably. My general view is that we haven't seen much evidence that SAEs can do things beyond what you can achieve with, say, fine-tuning; fine-tuning can do a lot of adversarial attacks against a model, and it's plausible that you can do interestingly different ones with interpretability.
But I haven't seen any real signs so far of dramatic capabilities that were really hard to achieve with fine-tuning. For example, in the "Refusal in LLMs is mediated by a single direction" paper, we found a refusal steering vector that we could use to jailbreak models.
But you could already jailbreak models by fine-tuning the weights, and this was a large part of why I felt comfortable releasing that paper: it was an interestingly different attack, and in some ways easier and cheaper. Maybe things like that will happen with SAEs. It's a new technique, so there's always the potential that there's something we're not expecting that we're going to learn, but I don't see any particular reason to expect it to advance the frontier beyond things that just throw compute at the problem, because throwing compute at the problem is really effective.
Are there any scaling laws with SAEs?
Yes. So this is a thing that kind of surprised me. Both Anthropic and OpenAI have had recent papers scaling these to much bigger models; we've talked about Anthropic a bunch, and OpenAI
went up to GPT-4, and both have some nice sections in their papers on SAE scaling laws. It seems like, initially, we didn't really know what we were doing when training SAEs, but now we've iterated on the methods enough that they're smoother, and we can do things like change the amount of data, change the number of latents, and plot it all out.
I don't know what an optimal set of hyperparameters for a given compute budget would be, and it's kind of messy because there are a lot of hyperparameters: how much data, how many latents, what sparsity penalty to use, and is there really a canonical solution for any of these? But people should totally go check out those bits and stare at the pretty curves.
I mean, one of the problems is just the amount of garbage features. What do we do with them?
Yes, okay. So there are dead features, which basically never fire. There are uninterpretable features, which fire sometimes, but we can't really see a pattern.
And then there are high-frequency, often uninterpretable features, which fire a lot, like one percent of the time or more. Dead features don't really matter; you can prune them out of your SAE.
I think the rates of dead features have gone down a lot as we've refined our training techniques. Previously there were lots of hacky things people did, like resampling, where you periodically just replace all of the dead neurons with new weights, but training methods now are quite a bit more stable and that mostly doesn't seem necessary anymore with modern methods. For the uninterpretable ones, I know there's an argument for just removing them.
I'd be kind of curious to see what happens if you run auto-interp on every feature, find the ones that are bad, delete those, and then, I don't know, fine-tune the SAE a bit so it doesn't have a gaping hole in it. That seems like an interesting direction. At the moment I don't really think about them too
hard, though if I started to see lots of uninterpretable things lighting up a bunch, I'd be more concerned. The thing which is more concerning, and which I'm a little more interested in understanding, is these high-frequency features. Under L1, things which light up one percent of the time are very heavily penalized,
so they tend not to happen much. But under things like top-k or JumpReLU, that penalty is much weaker: top-k doesn't care, and with JumpReLU you only get penalized if you're near zero, so if it's just big one percent of the time, maybe it doesn't care much.
And this means that if a feature is active a lot of the time, it can be quite useful for reconstruction even if it's not interpretable. Sometimes these things do seem interpretable, which is kind of confusing. Anthropic had a recent monthly update where they compared different sparse autoencoder training techniques, and they mentioned they'd looked into this a bit and concluded, roughly, that these features are interpretable enough that they're not that worried about it. That's the rough vibe I remember; people should read the post, and I might be summarizing it incorrectly, as goes for every paper I discuss on this podcast.
What was that you just referenced?
It was one of their monthly updates, I think the second most recent one.
I mean, one thing that I was thinking about is that one school of thought, and I'm not sure whether you lean in this direction, is that these things represent units of computation. Another school of thought is that they're post-hoc descriptions, and what we need is, I guess, some kind of causal mechanism to tease that apart. What do you think about that?
So, I don't know what "unit of computation" means. This is kind of what I was getting at earlier with feature splitting and these things not being atomic. In some sense, I feel like a unit of computation must be irreducible, and it must fit into some clear circuit and be used downstream.
But if you've got a mammal feature that splits into a bunch of animal features, yet is causally meaningful in its own right, is that a unit of computation? Maybe. Or it's in some sense enough to be useful, but it isn't quite reaching the ground truth of what's going on in the model. And I don't even know whether this stuff bottoms out in some underlying truth at all, or whether it's just a messy hierarchical structure where you can keep going up and down; I don't even know if we should treat these things as something like a directed acyclic graph, maybe. So a research direction I'm pretty excited about is trying to apply these things to real-world tasks, or at least to downstream tasks.
For example, in Sam Marks's Sparse Feature Circuits paper, in addition to the circuit analysis stuff, there was a really cool part called SHIFT, where they took a gender-biased dataset, I think it had a bias of female nurses and male professors, and trained a probe on it; the probe partially learned to pick up on the profession and partially learned to pick up on the gender, so when you evaluated it on a dataset with the pairing flipped, it was very bad. What they found is that they could identify the SAE features that were relevant to gender, with a mix of giving a model the data and human analysis,
get rid of those, and the probe became much less biased. I think this was a really cool application. I don't know what our existing probe-debugging baselines should be, but it seems pretty impressive that SAEs can do something like that.
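The intervention itself is simple to sketch (hedged: `sae`, `probe`, and the latent ids are placeholders): zero the latents judged to encode gender before recomputing the activation the probe reads from, and keep the SAE's reconstruction error so nothing else changes.

```python
import torch

GENDER_LATENTS = torch.tensor([101, 2048, 7777])     # illustrative ids flagged by auto-interp + humans

def debiased_probe_logits(resid: torch.Tensor) -> torch.Tensor:
    latents = sae.encode(resid)
    error = resid - sae.decode(latents)               # keep whatever the SAE didn't capture
    latents[..., GENDER_LATENTS] = 0.0                # ablate the concept the probe shouldn't use
    cleaned = sae.decode(latents) + error
    return probe(cleaned)                             # e.g. a linear nurse-vs-professor classifier
```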
And I'm like: if it can do something like that, how much do I care whether it's a unit of computation or a post-hoc thing? I care a lot about it being real, and I care about it teaching me things about the model, and I don't want to assume things about it that will mislead me.
But is it possible for us to correct knowledge, or to insert knowledge into a model, using SAEs?
Maybe. So a problem with SAEs that I think is often elided in this kind of high-level discussion is that the thing you want is not always there as a feature. I have trained SAEs and tried to make my own Golden Gate Claude, and I didn't find a Golden Gate Bridge feature, and that was sad; maybe if I'd trained a wide enough SAE I would have, I don't know. This is actually a benefit of steering vectors, because there this isn't really an issue: you can just always make a dataset for the thing you want. So, if the knowledge you want to edit is represented as an SAE feature, you could totally edit it, and that would probably be somewhat helpful. I know of some people who are looking into this, though it sounds like it doesn't outperform existing unlearning baselines at the moment.
And knowledge insertion, maybe. It might work, but it feels a bit odd, because to me knowledge is a kind of lookup, a function. Even if you want to do it in a really nice, interpretable way, you want an input that's like "Michael Jordan" and an output that's like "plays basketball". You can maybe add a little bit of the Michael Jordan input direction to the encoder direction for basketball. You could also just add an extra latent that takes Michael Jordan and produces basketball.
That's kind of against the spirit of an SAE, because it's clearly not reconstruction-y, it's an additional function you've added, but sometimes you can do this. It feels more intuitive with transcoders, the MLP-input-to-MLP-output thing, because in some sense fact lookup is the job of MLPs. Though we found that facts tend to be distributed across layers, which is kind of annoying, but you could insert one into a single layer totally fine, especially if you give your new latent a fairly high-magnitude decoder or encoder vector.
What's the relationship between SAEs and the task that's being performed, or the amount of data they're trained on?
SAEs get better when trained on more data, as a general rule. Sometimes that kind of saturates. I'm not entirely sure if that's true saturation or just a problem with the training methods, but in general, more data will make them better at reconstruction.
We haven't looked too hard into whether that's new features appearing or just refinement of the existing ones. I think this would be a pretty cool thing to look into: take a bit of text that activates an SAE latent,
look at that latent's behaviour over training, do this for a bunch of them, and ask whether it looks like a phase transition or is gradual. I have no idea; someone should look into this. I also expect you can train specialized SAEs; I'm not aware of too much work on this. There is a paper called "Towards principled evaluations of sparse autoencoders" that I supervised, from Alex Makelov and Georg Lange, and the focus of that was: we think SAEs should be useful for circuit finding, so let's take a circuit,
the indirect object identification circuit, which we talked about in the last episode, in GPT-2, and see how well SAEs find it, and compare this to a kind of supervised dictionary that we just train, because we think we know what the features should be. I thought that was a pretty cool evaluation. The relevance here is that they trained SAEs both on general web text and on just IOI-specific data.
And to some degree the latter performed better, but it also was kind of weird in various ways. There was this phenomenon of feature over-splitting, where there was some feature that should have been there, like "is the name first or second", and instead it split into, like, twenty features that were all kind of like that, which was much more annoying to work with as an analyst than the broader SAEs.
Maybe you can solve this by making the SAE narrower, but if you make it narrower it doesn't learn the rarer features, which you also care about and which arguably should be the more fine-grained ones. So that was a very narrow domain; I'm sure if you trained on, say, just code, it would be somewhat different.
A minute ago you were alluding to training dynamics, and this is something that really interests me, which is to say: how do the weights evolve over the course of training, and how do circuits emerge? So could you not compute SAEs at different points of the training curve and use that to understand the training dynamics?
Yes, I think that's a thing that should work. Just to be clear, we're switching from the training dynamics of SAEs to the training dynamics of the model, and using SAEs as a tool to study it. And yep, I think there's lots of stuff you can do; this would be one example. SAEs might even be overkill: you could just take some concepts you believe are represented and train probes for those. One thing I'd be quite interested in is how the probe evolves, whether the probe direction evolves over training. Training dynamics isn't really my field, so I'm going to say things and there may well be a paper that does this that I haven't come across. Looking at what the representations look like over time would be super cool; looking at how well an SAE transfers across checkpoints would also be interesting, especially if you correct for things like changing bias terms, because if the average changes it's going to screw everything up, but maybe the direction of a feature doesn't change.
Because obviously you've spoken a lot about grokking, and I guess that could be seen as looking at training dynamics. Would it be interesting if you saw so-called emergent materialization of capabilities, or features that come out of an SAE? Would that be interesting to you?
So, I think emergence in general is an interesting topic. I would be a little hesitant to draw too strong a conclusion from what happens with SAEs, because, for example, it could be the case that the model gradually learns the feature for a pillow, but the SAE doesn't think it's worthwhile to have a pillow latent until it reaches a certain threshold
of salience, and when it finally adds it, that might look really sudden when actually it's gradual. So this makes me more interested in probes: possibly using the SAEs as a motivator for what to search for, what probes to look for, or literally using an SAE feature as a probe and just transferring it to the earlier checkpoints, because it wouldn't surprise me if the pillow latent from the first SAE it shows up in still works on the checkpoint before. And there are a couple of models that would be pretty good to study this on.
I've got some toy language models in the TransformerLens library, like gelu-1l or gelu-2l, and they're such little models that SAEs should be super easy to train on them. There's also the Pythia suite, which has tons of checkpoints. But yeah, I'm hesitant to use the presence of a feature in the SAE as the indicator, because there's just a lot of noise there, so I'd want to be more cautious.
And I think your team has just put out a whole bunch of open-weight SAEs. What were the engineering challenges in doing that?
Um yeah so is this kind of wild is not really depressing from my perspective, are in the vert. Um S A is a kind of become quite a big deal in at least I think they're a big deal, and I think they will unlock lots of cool things we can do. So I want to focus on them, which involves training them.
And this is just like much more of an involved engineering thing than lots of previous projects that were more like player around in the current book with a tiny cute model. Um and so the reason is an engineering chAllenge is so essentially to train an S E, you need to run a much of text, draw a language model, collect activations that are certainly and then for say, a couple of billion tokens and then use those activations a to train a sporting curder, which can also be quite big. Um is is like only a two and network, but like the middle community enormous.
So if you're going as big as Anthropic did, to say thirty-four million latents, and you want to train on a model like Gemma 2 27B, which has a residual stream width of about five thousand, the total number of parameters is two times five thousand times thirty-four million, which is about three hundred and forty billion parameters: a lot bigger than Gemma 2 27B itself. We did not go that big, we only went up to about a million latents, but that's still sizable.
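To make the arithmetic explicit, here is the back-of-the-envelope parameter count being described, as a tiny sketch; it ignores biases and uses the rounded numbers from the conversation.

```python
def sae_param_count(d_model: int, n_latents: int) -> int:
    # Encoder is d_model x n_latents, decoder is n_latents x d_model; biases are negligible.
    return 2 * d_model * n_latents

print(f"{sae_param_count(5_000, 34_000_000):,}")  # 340,000,000,000  (~340B params)
print(f"{sae_param_count(5_000, 1_000_000):,}")   # 10,000,000,000   (~10B params at ~1M latents)
```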
And so you kind of have two choices for how you do this. You can do it online: you pass text through the model, you store the activations in, say, GPU memory, and then you run the SAE on them. You often want a big buffer of these activations so you can shuffle them, because activations from the same prompt are going to be correlated in which features they have. No one has actually tested to my satisfaction how much this matters.
I just intuitively expect it to matter, like so many things in this field. And yeah, this basically stops working for models beyond a certain size.
And if you want to do a hyperparameter sweep, this is quite painful, because you have to run the model again and again, or even just if you want to train a bunch of SAEs at once on different activations. So the second mode, which is what we eventually converged on, is activation storage: you run the model once and collect all its activations. Actually, I think we did it in a couple of stages, because the disk input/output became a bottleneck when you wanted to store that many activations. You save these to disk, and then you train your sparse autoencoders on them at your leisure, with whatever hyperparameters you want, and if you mess up the training you can just go back and train again.
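Roughly, the two-stage pipeline being described looks something like the following. This is a minimal sketch, not the actual Gemma Scope code: the shard size, file layout, and the `activation_batches` iterator are made-up placeholders.

```python
import numpy as np
import torch

# Stage 1 (expensive, run once): stream text through the language model and dump
# residual-stream activations to disk in shards.
def save_activation_shards(activation_batches, out_dir: str, shard_tokens: int = 1_000_000):
    buf, shard_idx = [], 0
    for acts in activation_batches:                        # each `acts` is [n_tokens, d_model]
        buf.append(acts.to(torch.float16).cpu())
        if sum(a.shape[0] for a in buf) >= shard_tokens:
            np.save(f"{out_dir}/shard_{shard_idx:05d}.npy", torch.cat(buf).numpy())
            buf, shard_idx = [], shard_idx + 1

# Stage 2 (cheap by comparison, run for every SAE / hyperparameter setting): stream
# shuffled batches back off disk into SAE training.
def sae_training_batches(shard_paths, batch_size: int = 4096):
    for path in np.random.permutation(list(shard_paths)):  # shuffle shard order...
        shard = torch.from_numpy(np.load(path)).float()
        shard = shard[torch.randperm(shard.shape[0])]      # ...and tokens within each shard
        for i in range(0, shard.shape[0], batch_size):
            yield shard[i:i + batch_size]
```

A real pipeline would interleave several shards at once to get closer to the global shuffle discussed above; this version only shuffles within each shard.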
So this was just kind of intense. We stored about twenty petabytes of activations, which is a lot; we had to request special permission to get that disk allocated to us, but fortunately DeepMind was on board. It used about twenty percent of the compute of GPT-3's training run, which is not that much by modern standards, but by interpretability standards is quite a lot.

The main reason all of these numbers are so big — we mostly released things on Gemma 2 2B and Gemma 2 9B, because I think those sit at a good intersection: big enough to be interesting, small enough that academics can actually work with them without breaking the bank on GPUs or needing to mess around a ton with parallelism — is that we wanted to do every layer, including attention, MLP and residual stream. Gemma 2 9B has something like forty-two layers, so that's well over a hundred and twenty sites.
And we also wanted to do two widths, and we couldn't decide on the right sparsity, so we did about six sparsities per site. So that's a lot; I typically claim we trained about four hundred SAEs.
Give or take; I haven't carefully counted the exact number. I normally recommend people use the ones with an L0 closest to a hundred, unless they actively want to do some kind of sparsity study, which people should do — I don't know how these things work, man. And we also did a couple of odd ones: we occasionally did a one-million-latent one on a few layers, we did a few on Gemma 2 27B, and we did a few on the chat-tuned Gemma 2 9B, though actually SAEs transfer surprisingly well between base models and chat models.
My MATS scholars and I have a nice blog post showing this, and we corroborated it in the Gemma Scope tech report. I should also say that credit for Gemma Scope goes to the team, and most especially to Tom Lieberum, who was the technical lead and dealt with all of this engineering madness so I didn't have to.

And this is just Gemma Scope; I expect that if we wanted to do a larger model, or larger SAEs, it would become even more demanding. Though there is a distinction between training one SAE on the residual stream, which is enough for some use cases but not others — enough to monitor the system, enough to steer it, but not enough to see the end-to-end circuitry — and doing everything, which in Gemma Scope made it something like a hundred times more expensive. Well, not quite a hundred times, because you don't need to run the language model a hundred times as much, but much more painful. So it's unclear to me how much that is worth it, and I would be quite curious to see people explore this kind of thing in practice. And the reason we did Gemma Scope as an open release in the first place is that, as what I've been saying hopefully communicates, it is a really involved thing engineering-wise, and this is a lot easier to do if you're at an industry lab
than if you're outside, because you just have better infrastructure and, you know, twenty petabytes of storage. Kind of in the same way that things like Gemma and Llama are really expensive to train but then much cheaper to use, we thought we could unlock lots of cool research projects by doing this. There are also some other cool open-weight SAE projects worth giving a shout-out to, like EleutherAI, who have trained a bunch on Llama.
Theirs is actually properly open source, since I think the code and data are all open as well. And OpenAI released a bunch of high-quality ones on GPT-2 Small, which is nice for smaller projects, and there are a lot more scattered people who have put out various nice things. But the hope with Gemma Scope is that, for any project that wants a reasonably interesting model, these can just be high-quality canonical SAEs that enable as many projects as possible.
So much stuff has been done, but there's still lots of work to do. What are the open problems in SAEs that you still see?
Yeah, okay. So here are a couple of areas I'm excited about. First, there's basic science: there's lots we don't understand about SAEs, and I hope I've pointed at many mysteries so far. When should you expect a feature to be in the SAE? What's the right sparsity? What's the right width? What are the true units of computation? One problem I'd call out in particular is just: what's going on in SAE error terms?
What do they mean? Can we understand them? We know that they sometimes just capture features that the SAE missed, but is something else also going on? Who knows. So that's category one. Category two would be using SAEs to solve mysteries of interpretability — there are a lot of things I don't understand, like what exactly fine-tuning does to a model's circuits, especially chat fine-tuning, which is a particularly important kind.
Can we find truly monosemantic circuits with SAEs, now that we arguably have the right units — though whether these are the right units is still kind of unclear? That's more of a "using them" category. Also just applying them to interesting circuits, like few-shot learning. The next category would be real-world tasks, or downstream tasks: things that people who aren't interpretability nerds think are interesting, and whether we can use sparse autoencoders to make progress on them. I say that partly because I think doing useful things, especially towards making models safer, is great.
And I partly say it because I think it's just so, so easy to trick yourself in interpretability, and we have some evidence of SAEs working — I can get into that a bit more if you're curious. But if there were a task that people struggled on, and SAEs were able to do it at least as well as, ideally better than, competing approaches — or with some other reason to prefer them over competing approaches, such that they're useful in some settings — that would be really awesome to see.
For example, can we use them to make models hallucinate less? Can we use them to monitor models for erratic behaviour, or for whether they're being jailbroken? Can we use them to make models more steerable, in scenarios where prompt engineering fails? Can we debug models — the times when Llama or Gemma have weird behaviours that we want to dig around in and see if we can decode why it's happening? One of my dreams is to get good enough at interpretability that if we have a moment like Sydney again, where a model is behaving in pretty wild and unexpected ways, like trying to convince reporters to leave their wives, we can actually understand the cognition happening inside that led to it.
Earlier on we were talking about the Connecting the Dots paper and about unfaithful chain of thought, and potentially you could apply SAEs to some of these problems. But could you just introduce those
for the audience? Yes. So I don't think we've actually talked about the key idea behind unfaithful chain of thought yet.
Well, the key motivation: people often see a model produce a chain of thought and think, that's an explanation, therefore it's interpretable, I have understood it.
Why would I even need to do interpretability? And sometimes that's true, but sometimes the model kind of already knows the answer and just produces something specious because it thinks it's supposed to give an explanation right now, and it's unclear how closely related the two are. This paper had a really nice demonstration of that: they made a set of multiple-choice questions with few-shot example chains of thought. They gave the model a few-shot prompt with ten questions where the correct answer was A, and on the final one the correct answer was B.
And the model tended to produce a specious chain of thought justifying why the answer was A. And I'm like, this suggests to me that the model has a circuit inside that is the "generate bullshit chain of thought" circuit: it has an internal representation of what the correct answer should be, and it is somehow routing that into this behaviour.
And I'm like, that is wild — can we turn this into some kind of faithful chain of thought detector? Like, I don't know, you give it a bunch of maths problems, you check how causally meaningful the chain of thought actually is, and you compare that to your detector, using it as a sort of ground truth or something. That would be wild.
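One crude way to operationalise "how causally meaningful is the chain of thought" is to check whether the final answer survives having most of the reasoning deleted. The sketch below is only an illustration of that idea, not anything from the paper being discussed: GPT-2 is a stand-in for a model that can actually do chain-of-thought maths, and the prompt format is made up.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; in practice you'd use an instruction-tuned model capable of CoT maths.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def final_answer(question: str, chain_of_thought: str) -> str:
    """Greedy-decode a short answer conditioned on a (possibly corrupted) chain of thought."""
    prompt = f"Q: {question}\nReasoning: {chain_of_thought}\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=5, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

def cot_seems_load_bearing(question: str, cot: str) -> bool:
    # If deleting most of the reasoning changes the answer, the CoT was doing real work;
    # if the answer is identical either way, it looks more like post-hoc decoration.
    return final_answer(question, cot) != final_answer(question, cot[: len(cot) // 4])
```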
So we mentioned Owain Evans earlier, and out of his group there was this paper that went viral on Twitter a few months ago, Connecting the Dots, about out-of-context generalization and reasoning, and there are some striking examples in it. Can you fill us in on that?
Yes. So Owain's group does this great genre of paper where you take a thing that you would intuitively expect to work and show that it doesn't, or vice versa.
And so for the Connecting the Dots paper, I'll just give a concrete example. They came up with a mystery city, say Paris, which they call City X, and they fine-tune the model on prompts of the form "the distance between City X and London is blah", where blah is the true distance, so it's fine-tuned to predict distances to a bunch of other cities. This is kind of a weird thing to train on; you might expect it would just memorize these or something. But then at inference time, if you ask it, "What is the name of City X?", it says Paris.
If you ask it questions about Paris, like "what's the big landmark?", it says the Eiffel Tower. And they had several other tasks like that, in this genre of: you fine-tune on a task where discovering some latent fact is a useful way to do the task, and then you ask questions that show the model generalized in a way you might not expect.
And I really like this result. My hypothesis about what's going on is that the easiest way to answer these questions is not to memorize; it's to use the existing circuitry that has already memorized this stuff — or maybe the model has some actual internal world model it uses to compute distances — and finding the Paris direction does that. There's a gradient signal towards the Paris direction, and this is enough to overcome any competing pressure, so the model just converts the "City X" phrase into Paris. And I would love someone to replicate one of those examples on, say, Llama 3 or Gemma 2 and use SAEs to try to dig into what's going on there, because I would find it very cool if my random post-hoc reasoning with my interp intuitions turned out to be correct.
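One simple first check in that direction: see whether a latent you've identified as the "Paris" feature fires on the "City X" tokens in the fine-tuned model but not in the base model. This is a minimal sketch under that assumption; `W_enc`, `b_enc`, `paris_idx` and the cached activation tensors are hypothetical placeholders, not any particular library's API.

```python
import torch

def latent_preactivation(resid_acts: torch.Tensor, W_enc: torch.Tensor,
                         b_enc: torch.Tensor, latent_idx: int) -> torch.Tensor:
    """Pre-activation of one SAE latent on a batch of residual-stream vectors.

    resid_acts: [n_tokens, d_model] activations gathered at the 'City X' token positions.
    W_enc:      [d_model, n_latents] SAE encoder weights.
    b_enc:      [n_latents] SAE encoder bias.
    latent_idx: index of the latent hypothesised to be the 'Paris' feature (e.g. found
                by checking which latents fire on literal mentions of Paris).
    """
    return resid_acts @ W_enc[:, latent_idx] + b_enc[latent_idx]

# If the hypothesis is right, the 'Paris' latent should fire much more strongly on
# 'City X' tokens in the fine-tuned model than in the base model:
# print(latent_preactivation(city_x_acts_finetuned, W_enc, b_enc, paris_idx).mean())
# print(latent_preactivation(city_x_acts_base, W_enc, b_enc, paris_idx).mean())
```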
So for people at home who want to learn more about SAEs, what should they do?
Yes. So I think the ARENA tutorials are a great place to start. They've just made a new one on sparse autoencoders, which we can link, and it gives you a hands-on walkthrough, helps you write the code and really engage with them.
I think that's a great place to start. If you want a gentler introduction, the other best place to start is the Neuronpedia demo of Gemma Scope, where you can just play with it in an interactive way, see what a sparse autoencoder can do, try steering with it, try using it as a microscope — no code required.
Maybe I'd say start with the Neuronpedia demo and then do the ARENA tutorial. Then, if you want to dig deeper, my reading list has a section with SAE papers in it, and beyond that I recommend just jumping into projects. There are a bunch of SAEs on small models if you don't have much compute, and if you know how to use larger models, you can play with the various Gemma 2 SAEs in Gemma Scope. I think the 2B one is pretty manageable to play with. Just test out your hypotheses about it in a low-stakes way: does it have this feature?
Can I get it to steer this way? What things can I explain with features? What things am I confused about? I'm a big believer that as you learn about a field and as you read papers, you should go and get your hands dirty: write a bunch of code and implement the ideas in a paper, or the predictions it makes, or extensions of it. And I think that flows pretty naturally into actually doing research.
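As a taste of the "can I get it to steer this way" exercise, here is a minimal steering sketch. It uses GPT-2 purely as a stand-in and a random unit vector where you would substitute the decoder row of an SAE latent you actually care about (for example one found on Neuronpedia); the layer choice and steering coefficient are arbitrary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Stand-in for a real SAE decoder direction; replace with the latent's decoder row.
steer_direction = torch.randn(model.config.n_embd)
steer_direction /= steer_direction.norm()
coeff = 8.0  # steering strength; worth sweeping

def add_direction(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the residual stream
    # [batch, seq, d_model]; add a scaled copy of the feature direction to every position.
    hidden = output[0] + coeff * steer_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(add_direction)
ids = tok("I think that", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # remove the hook so later generations are unsteered
```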
Well, it has been amazing having you back on the show. This has been an absolute marathon. How long have we been recording for? A long time.
So, I got here at about two and it's now past ten, so eight, eight-and-a-half hours.
Pretty good, that's pretty good. Yeah. Honestly, it's been amazing. I think this episode is probably the biggest brain dump on sparse autoencoders out there on the internet. We promised you it would be dense and very rich in information, and I hope we've not only delivered the goods but gone past what we did last time.
Yeah, it's been great.
Thanks so much. Wonderful. Thank you so much, Neel.