On the Biology of a Large Language Model (Part 1)

Estimated read time: 1:20

    Summary

    In this engaging exploration, Yannic Kilcher delves into the intricate paper by Anthropic on the internal workings of large language models, particularly focusing on transformer circuits. He compares the exploration of model behaviors to biological investigations, where researchers attempt to decipher how models can perform complex tasks like poetry and multilingual translation without explicit programming. Yannic critiques some interpretations by Anthropic, especially around the conceptual framing of internal model processes, advocating for a more straightforward explanation like fine-tuning. Through various examples, he elaborates on the method of circuit tracing and how these examinations reveal models' abilities to plan, rhyme, and understand languages at a deeper level.

      Highlights

      • Yannic Kilcher provides a comprehensive walkthrough of Anthropic's blog post on transformer circuits. đź“„
      • Explains the concept of circuit tracing to understand the inner workings of language models. 🔎
      • Discusses how Anthropic’s framing may overcomplicate simple model phenomena. 🤔
      • Models use a method akin to foresight in generating poetry and evaluating rhymes! 📜
      • Multilingual thinking is explored, revealing a mix of abstract and language-specific processes. 🗣️

      Key Takeaways

      • Understanding large language models is like exploring a biological organism - we poke and observe to understand! 🧬
      • Even basic language models can do remarkable things with no explicit programming, like poetry and translations! 📚
      • Circuit tracing helps us look into these models, making their internal processes clearer and more interpretable! 🔍
      • Models are capable of planning rhymes, suggesting they have some foresight in language generation! 🎶
      • Multilingual models often mix abstract and language-specific features, indicating a complex understanding! 🌍

      Overview

      Yannic Kilcher takes us on an analytical journey through 'On the Biology of a Large Language Model', a paper by Anthropic. This examination focuses on dissecting how transformer circuits work internally to achieve complex tasks. Kilcher draws parallels between understanding these models and biological research methods, where probing and observing are key to uncovering internal processes. He questions some of Anthropic's interpretations, suggesting they lean towards unnecessarily complex narratives when simpler explanations might suffice.

        Throughout the discussion, Kilcher highlights various fascinating aspects such as circuit tracing—a method that allows researchers to interpret what features a model activates while processing inputs and producing outputs. He elaborates on examples demonstrating how language models can plan rhymes, revealing an unexpected level of linguistic competence and foresight. This suggests models do more than just immediate predictions; they also prepare for future word requirements based on context.

          The multilingual capabilities of these models are particularly intriguing. Kilcher describes how models can manage semantic understanding across different languages through a blend of language-specific and agnostic circuits, providing an unexpectedly deep comprehension of language. Despite some criticisms, Kilcher acknowledges the potential and informative insights these studies offer towards understanding artificial intelligence better.
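
          To make the 'blend of language-specific and language-agnostic circuits' idea concrete, the overlap measurement Anthropic uses (and which Kilcher walks through near the end of the video) can be sketched in a few lines: collect which replacement-model features fire anywhere in a paragraph and in its translation, then take the intersection over the union of the two feature sets. The following Python/PyTorch sketch is purely illustrative; the tensor shapes and function names are assumptions for this write-up, not Anthropic's actual code.

              import torch

              def active_feature_set(acts: torch.Tensor, threshold: float = 0.0) -> set:
                  # acts: [num_tokens, num_features] transcoder feature activations for one paragraph.
                  # A feature counts as "active" if it fires above the threshold anywhere in the context.
                  return set((acts > threshold).any(dim=0).nonzero().flatten().tolist())

              def multilingual_overlap(acts_lang_a: torch.Tensor, acts_lang_b: torch.Tensor) -> float:
                  # Intersection-over-union of the features that activate for a paragraph and for its
                  # translation; higher values mean more language-agnostic (shared) features.
                  a, b = active_feature_set(acts_lang_a), active_feature_set(acts_lang_b)
                  union = a | b
                  return len(a & b) / max(len(union), 1)

          Plotted against layer depth, this ratio is what produces the 'hump in the middle' discussed later in the video: overlap, and thus language-agnostic processing, peaks in the middle layers.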

            Chapters

            • 00:00 - 01:30: Introduction to the Paper This chapter introduces the paper on the biology of large language models, focusing on transformer circuits and their internal processes. Published by Anthropic, the paper explores how these models function and draw conclusions, addressing contemporary discussions in the field.
            • 01:30 - 03:00: Understanding Large Language Models The chapter discusses the capabilities of basic language models in performing tasks like poetry, addition, and handling multiple languages without being explicitly programmed to do so. It raises the question of how these models achieve such feats independently.
            • 03:00 - 05:00: Transformer Circuits and Circuit Tracing This chapter discusses 'Transformer Circuits and Circuit Tracing', likening it to a biological problem. It notes a shift from traditional machine learning models, like support vector machines, which were well understood, to the current era of large language models, implying a change in understanding and approach.
            • 05:00 - 07:00: Concept of Replacement Models In this chapter, the concept of replacement models is discussed, with a focus on the experimental and investigative approach to understanding and training these models. The process involves empirical testing—akin to a biologist poking and observing reactions—to decipher and understand the capabilities that emerge from these models. A particular case in point is the efforts by the organization Anthropic, which has dedicated resources to investigate and develop methods to decipher the internal workings of these models.
            • 07:00 - 09:00: Training Cross Layer Transcoders The chapter titled 'Training Cross Layer Transcoders' appears to delve into the methodology of phenomenology through practical examples, focusing on a technique known as 'circuit tracing'. This technique is utilized to understand and explain the occurrences within the examples provided, offering detailed insights into the subject matter.
            • 09:00 - 12:00: Features and Sparsity in Transcoders The chapter discusses the activation of features within models and explores how these models process information or reach conclusions. The author expresses skepticism about the perspective and framework provided by Anthropic, implying a bias in their portrayal of model thinking processes.
            • 12:00 - 16:00: Attribution Graphs Explained The chapter discusses the concept of attribution graphs and how they relate to explaining the functionality of models, specifically through circuit tracing. It revisits Kilcher's view that a simpler explanation like 'fine-tuning works' may sometimes suffice where Anthropic leans on circuit-based narratives, while still covering the internal circuits and the methodology. The chapter also mentions a companion piece (blog post or paper) that delves deeper into the method of circuit tracing, which enables the tracing of model circuitry.
            • 16:00 - 25:00: Example: Multi-step Reasoning The chapter titled 'Example: Multi-step Reasoning' focuses on the concept of circuit tracing. It explains the process of taking a transformer language model and training a replacement model that mimics the original transformer in various ways. This sets the foundation for the examples discussed in the chapter, which delve deeper into the nuances and applications of this replacement model.
            • 25:00 - 35:00: Example: Poetry and Rhyming The chapter explores the concept of interpretability in machine learning models, focusing on replacement models. It discusses how data processed through a replacement model can provide interpretable intermediate signals, which are used to understand the model's functioning. The chapter introduces transcoders as a type of model that can replace existing models to achieve interpretability.
            • 35:00 - 53:00: Example: Multilingual Circuits The chapter "Multilingual Circuits" discusses the concept of attribution graphs, which explain how specific outputs are influenced by certain features. These graphs trace which individual features contribute to an output and how they are interconnected with other features. The concept is complex, but it becomes clearer upon detailed examination.
            • 53:00 - 54:30: Conclusion and Outro This chapter discusses the construction and use of a cross-layer transcoder within a transformer model, specifically focusing on its multilayer perceptron (MLP) features. While the attention mechanism remains unchanged, the emphasis is on training the cross-layer transcoder replacement model to match the transformer's output at every layer, as sketched in the example below.
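
            To give a more concrete picture of the training setup these chapters describe, here is a small, purely illustrative PyTorch sketch of a cross-layer transcoder: each layer's features are encoded from that layer's MLP input, every layer's MLP output is reconstructed from the features of that layer and of all earlier layers, and an L1 penalty keeps feature activations sparse. The class and argument names are invented for illustration, and a plain ReLU stands in for the JumpReLU mentioned in the video; none of this is Anthropic's implementation.

                import torch
                import torch.nn as nn

                class CrossLayerTranscoder(nn.Module):
                    """Toy cross-layer transcoder: layer-l features are read from the layer-l
                    MLP input, and they help reconstruct the MLP output of layer l and of
                    every later layer (the 'shortcut' the video describes)."""
                    def __init__(self, n_layers: int, d_model: int, n_features: int):
                        super().__init__()
                        self.n_layers = n_layers
                        self.encoders = nn.ModuleList(
                            [nn.Linear(d_model, n_features) for _ in range(n_layers)])
                        # decoders[src][tgt - src] maps layer-src features to the layer-tgt MLP output.
                        self.decoders = nn.ModuleList(
                            [nn.ModuleList([nn.Linear(n_features, d_model, bias=False)
                                            for _ in range(n_layers - src)])
                             for src in range(n_layers)])

                    def forward(self, mlp_inputs):
                        # mlp_inputs: list of n_layers tensors of shape [batch, d_model].
                        feats = [torch.relu(enc(x)) for enc, x in zip(self.encoders, mlp_inputs)]
                        recons = []
                        for tgt in range(self.n_layers):
                            # Sum decoder contributions from this layer and all earlier layers.
                            recons.append(sum(self.decoders[src][tgt - src](feats[src])
                                              for src in range(tgt + 1)))
                        return feats, recons

                def transcoder_loss(feats, recons, mlp_outputs, l1_coeff=1e-3):
                    # Match the frozen transformer's MLP output at every layer, plus a sparsity penalty.
                    mse = sum(((r - t) ** 2).mean() for r, t in zip(recons, mlp_outputs))
                    sparsity = sum(f.abs().mean() for f in feats)
                    return mse + l1_coeff * sparsity

            Because layer-2 features can feed the layer-5 reconstruction directly, the trained transcoder makes it much easier to see where information comes from, which is exactly the property the attribution graphs exploit.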

            On the Biology of a Large Language Model (Part 1) Transcription

            • 00:00 - 00:30 hello today we're taking a look at on the biology of a large language model which is a sort of paper published by anthropic on their kind of series on transformer circuits where they investigate what's going on inside of transformer language models and sort of how do they come to their conclusions can we say anything about what's happening internally obviously a lot of people uh nowadays are talking about
            • 00:30 - 01:00 reasoning models and things like this, but setting all of those aside, even basic language models are able to do quite remarkable things, and the question is how that comes to be. Nobody ever programmed any of these models to do, for example, poetry or addition or multilinguality or anything like this, and the question is: can we say anything about how that
            • 01:00 - 01:30 happens internally? This is, as the title says, kind of a biological problem, meaning that in the past we built machine learning models and we built them in a way so that they would do something, for example support vector machines or things like that, and we understood quite well how they did what they did. However, in this new era of large language models
            • 01:30 - 02:00 we don't we simply train the thing and then um capabilities emerge if you will and so we're much more in this realm of being a biologist and sort of poking the thing and like and seeing what happens and uh by doing that figuring out what happens so Anthropic has invested resources here into looking inside of these models and trying to come up with methods to decipher what's going on this
            • 02:00 - 02:30 paper I'm just gonna let's call it a blog post this blog post right here um goes into sort of the phenomenology of that and does so by example so they have a quite a bunch of examples right here and for all of these examples they're able to use a method um and we're going to see that in a bit what that method is uh called circuit tracing in order to try to explain what's going on in terms
            • 02:30 - 03:00 of what features are activated inside of these models, and therefore a little bit of how these models think, quote unquote, or come to their conclusions. Now I have to say, for a lot of these examples I'm not so on board with what Anthropic says about it, or how they frame things. They frame things very much in a pro-Anthropic way, whereas I think in
            • 03:00 - 03:30 a lot of these cases the explanation is simply something like fine-tuning works, but it's framed in this very 'oh, the internal circuits' way, with the analogies and so on. But we'll go into that. There is a companion paper/blog post, if you will, that explains circuit tracing in more depth, and that is the method by which you can trace the circuitry of these models
            • 03:30 - 04:00 the way it works just briefly because all of the examples we're going to look at are built upon circuit tracing the way it works is you take a transformer language model and you train a replacement model so you train a model that's supposed to mimic the transformer in um a lot of ways and we're going to look at that in quite a bit and then you use that replacement model which is a
            • 04:00 - 04:30 more interpretable model. So as you run data through the replacement model, it gives you more interpretable intermediate signals, and you can look at those in order to figure out what's going on. These are called replacement models, and they're built using a type of model called a transcoder, and if you replace the model with a transcoder and run data
            • 04:30 - 05:00 through it, then you get these things they call attribution graphs. Attribution graphs sort of show which individual features the output is made from, and which other features those features are in turn made from, and so on. It's going to be easier once we actually look into it, but just briefly, how
            • 05:00 - 05:30 is this transcoder this cross layer transcoder built so consider a regular transformer um we're going to focus in particular on the multilayer perceptron features of the transformers so the attention we're going to leave in place we're going to look at the MLP features here and what we're going to do is we're going to train this replacement model on the on the right hand side this uh cross layer transcoder to match the output of
            • 05:30 - 06:00 every single layer of the transformer so layer by layer we train a different model and we run the same data through it and then the transcoder is just trained at every layer to match the output of the transformer so meaning you could if if this works well you could just swap out the transcoder still use the same
            • 06:00 - 06:30 attention computation, right, forward propagating the signal, because the signal at each layer matches the transformer. However, the transcoder you program with a couple of extra things. First of all, and that's just visually seen right here, it has a lot more nodes, if you will. They call these features; in the MLP they call them neurons, in the transcoder they call them features, and that's more of a
            • 06:30 - 07:00 distinction they choose to make, but the reason they choose to make it is because the transcoder's features are supposed to be more interpretable. How come? First of all, every single layer of the transcoder gets the outputs of all the previous layers, whereas an MLP only ever gets a signal from the previous layer: you propagate the signal from this layer
            • 07:00 - 07:30 to the next layer, then it's processed, you propagate it to the next layer, and so on. A transcoder, on the other hand, at layer five will get the output of all of the layers before, not just the single previous layer but all of the layers before. What does that mean? It means that if some layer 5 computation needs signal from a
            • 07:30 - 08:00 layer 2 computation, just because it does, the transcoder can get it there directly, whereas a transformer would sort of have to propagate that through the intermediate computations. So that's one thing: it just makes it much clearer what's coming from where. Obviously, if you train these models well, then the models will sort of choose to go the path of least resistance and
            • 08:00 - 08:30 just pass the feature directly so it means you get a much better insight what information comes from where as you look at a particular layer whereas in a standard transformer again even if it is true that the layer 5 computation needs needs data from layer 2 you wouldn't notice because it would come by means of being put through the layer three computation through the layer 4 computation that the model has
            • 08:30 - 09:00 to kind of learn to preserve that while also doing its other computations, so it makes it much more clear. Second, you train it with a sparsity penalty, a sparsity regularizer, which makes these features more independent: it encourages the model to not use overlapping representations. So usually
            • 09:00 - 09:30 in transformers it's very hard to interpret intermediate signals because um all kinds of overlapping information is put onto it like if you have vectors representing intermediate concepts or superpositions of concepts and so on all of these things are happening whereas with the transcoder you encourage it to use one particular dimension or one particular feature here only for like one particular thing and uh if it is
            • 09:30 - 10:00 active, then the model is encouraged to not activate the other features. That's what a sparsity penalty does: it essentially says try to activate as few things at a time as possible, and that means if the model can choose between propagating a superposition of things where all the things are active, it will rather choose to separate the things it would otherwise put into
            • 10:00 - 10:30 superposition into individual features and then deactivate the rest. I hope that kind of makes sense; that's just sort of basic loss shaping. But as a result we do get a model that is trained to match the output at each layer of a transformer while also being sparse, and due to its cross-layer properties, and also some finer details such as which nonlinearities are used
            • 10:30 - 11:00 and where and in what way, we do get a rather interpretable output for any data we put through it. Because of these finer details, in addition to being sparse, feature contributions are also encouraged to be rather linear, meaning that at the end we get a graph that essentially says: okay, this feature here is the result of
            • 11:00 - 11:30 this feature here plus this feature here, and these together, in a rather linear fashion, contribute to that feature [a toy linear version of this computation is sketched after the transcript]. Now obviously, what's the downside, why don't we do this all the time, why don't we just train transcoders? As far as I understand, they are computationally more burdensome to train, less stable if you wanted to train them from scratch (here we just train them to match the transformer), and also you do lose performance, right, if you don't process
            • 11:30 - 12:00 features layer by layer and rather shortcut and encourage that, you may in turn lose performance; every regularizer you introduce, such as matching layer by layer, such as sparsity, such as more linearity, loses you performance. And therefore the biggest criticism you could levy here, and that's probably the crux of the paper, is: is what the transcoders do actually the thing that happens in the transformers, or are you
            • 12:00 - 12:30 simply kind of getting the same output while the transcoder is leading you to a quite wrong conclusion about what happens? The way that Anthropic tries to battle that is with so-called intervention experiments. So just here we have the losses: the cross-layer transcoder is done with a JumpReLU, and you can see that here the output of one layer is
            • 12:30 - 13:00 not only achieved by the last layer but actually by all of the layers before, and then we have a loss to match the output at each layer and a sparsity penalty. So you can see we do this across layers and just replace the original transformer model with the replacement model, and by that we get
            • 13:00 - 13:30 these attribution graphs and the attribution graphs I'm sure they show them somewhere here the attribution graphs are just you see which of the features become activated which because the model is now sparse should be not as many and then what you can do is you can uh sort of group them together um features that are very similar or or very correlated you can group them together additionally uh Anthropic also manually goes in and
            • 13:30 - 14:00 groups features together and sort of gives them names and so they get this kind of interpretable thing so the attribution graphs are going to look something like this um so here the prompt is uh the National Digital Analytics Group and then bracket open and then uppercase N so this is very uh likely that we're going to look for the acronym of this organization right here so the question is what
            • 14:00 - 14:30 encourages the model to output D A uppercase as the next token or tokens and the attribution graph can give you sort of hints of that in this case we're looking for rather um linguistic features whereas in future examples we're also going to see that we get more semantic if you will features so the way to read these here is at the bottom you have the input prompt um and then we're
            • 14:30 - 15:00 going to wonder how does this next token come to be it's important that what they do is they use the transformer to actually run the inference and then they use the replacement model to run the same data through and explain what um what is happening at each of the layers so this the way to read this here is um these stacked boxes here are either manually or semi-automatically aggregated features that are active um
            • 15:00 - 15:30 at a given token. So the token 'digital' activates a feature that they call digital; Anthropic gives it a label called digital, and they interpret this feature by looking at these visualizations right here. So this feature group, you can see, are all features that are highly active whenever the word digital appears in a piece of text. So
            • 15:30 - 16:00 these here are the highest activations on a given reference corpus, presumably similar to the training set of the model, and you can see the highlighted portions are the tokens that are activating this feature a lot. So here it's relatively straightforward: whenever the word digital appears, these features seem to be activated quite a bit. You can also
            • 16:00 - 16:30 see for each feature the top token predictions um here that that would arise from these features so um yeah so digital has obviously different meanings so you can also see that the different features even though they react to different words are often already sort of separated by themselves uh into the different meanings of these words um this here seems to react to the uppercase digital more and then this
            • 16:30 - 17:00 feature here in turn activates downstream features that cause the model to do certain things. So whereas this feature here is activated for the token itself, whenever digital appears that feature lights up, we then have to ask: okay, how
            • 17:00 - 17:30 does the model actually decide on the next token? And deciding on the next token, as far as these attribution graphs go, is very often done by activating features that cause the model to output something else. So whereas here this feature is simply saying the model internally recognizes the word digital as being in the context, the feature 'say DAG' is simply a
            • 17:30 - 18:00 feature that is activated whenever the letters D A are about to be the next output. You can see the types of reference data activations here: this feature is extremely active whenever the next output is DAG, and you can see the top token prediction is also DAG. So there appear to be what they call input
            • 18:00 - 18:30 features, which are just features that react to things being present in the input, and there appear to be output features, which are just features that are being activated to cause the output tokens to be a certain way, and then obviously the in-between is the thinking, so to say. Now here there is only one, let's call it, semantic feature activated. So
            • 18:30 - 19:00 by means of opening a parenthesis and starting here with the letter N that activates a feature that Anthropic calls say or continue an acronym so what they do is they go to these features and they look at the top activations and they say ah okay this feature is very active always on the like first part of an acronym and so it probably causes the model to continue
            • 19:00 - 19:30 saying an acronym. And so now you can see how this comes together (sorry, I don't have my usual pen and drawings here, I hope that's okay; this is a very interactive blog post, so I thought we'd do it with the mouse). So we're pairing the fact that the word digital is in the context with the feature that says 'oh, we're about to say or continue an acronym', and those two together are
            • 19:30 - 20:00 responsible for activating a feature that is causing the model to output something with an uppercase D. So the model is deliberately choosing to activate features which would cause a D to be output, same with an A, same with a G, and then these in turn cause other features which cause the model to activate other things, and so on,
            • 20:00 - 20:30 and that's how DAG is produced. So I hope that's clear how to read these. Again, the stacking itself is done by Anthropic: they are looking at the different features, grouping them together and giving them names according to what they see in the activation and prediction analysis. All right, so now we can dive into a couple of these examples. We're not going to dive into all of them, but I hope some of them will
            • 20:30 - 21:00 make it very clear what's happening. So, multi-step reasoning. Here the prompt is 'Fact: the capital of the state containing Dallas is', and the correct output is Austin. The point here is that in order to solve this you need to first determine
            • 21:00 - 21:30 what state contains Dallas, and once you know that, you need to determine what the capital of that state is. So it's not a straight-up fact, it's two facts combined together that make the output, and that's an example of multi-step reasoning. By the way, you can also look here at the detailed graph of these features and explore all of these things together; it's very interesting, but we don't want to go too much into that.
            • 21:30 - 22:00 So the question is: does the model internally do this sort of two-step thing, where it's 'oh, I first need to determine what state this is and then what the capital of this state is', or is there some other mechanism? And in the way this attribution graph is structured you can see that we have
            • 22:00 - 22:30 kind of input features recognizing sort of capital state uh Dallas and so on the Dallas in turn is activating another feature that um is Texas so this feature uh you can see is both activated whenever Texas is the next token so this would be more like a say Texas type feature but it's also active
            • 22:30 - 23:00 for any Texas-related things. So you can see Dallas, Fort Worth, and I'm not sure if Georgetown is in Texas, but it's kind of activated for Texas-related things. Same here; let's see if we find a feature that actually just activates on the word Texas, but maybe not. So Dallas doesn't activate the same feature as Texas itself, but you
            • 23:00 - 23:30 can see that an intermediary feature, or a group of features, is triggered that represents Texas from a multitude of angles, so that you can say that at the token Dallas the model is sort of internally thinking of Texas, and that's mostly caused by the feature Dallas. Now what's not so nice is that what's not
            • 23:30 - 24:00 shown is all the intermediate features that contribute to these Texas-related features. Obviously 'state' contributes to it thinking of Texas and so on, and Dallas causes it a lot to think of Texas, but you don't see this because these are heavily pruned. Just keep in mind there are always some connections from these other things
            • 24:00 - 24:30 always going into into this as well so from capital and state uh we activate a feature called say a capital this is a feature that or a set of features that's very active before capitals are mentioned right and combined the say a capital and the Texas related things feature is then activating a feature
            • 24:30 - 25:00 called say Austin and that in turn is act is causing the output put token to be Austin now this seems to be quite good reasoning for yes in fact the models internally do realize or recognize or or in some way materialize this intermediate step of reasoning however what you can also see is that for example Dallas is causing relatively in a more direct way uh a the feature
            • 25:00 - 25:30 say Austin right and uh Texas is causing in a more direct way uh the word Austin as well so what you can see is that there's there appear to be an overlap so there appears to be at least some sort of internal materialization of the intermediate fact going on but there also seem to be quite a lot of shortcut connections where it's just like oh oh you said oh Texas uh Austin that that's
            • 25:30 - 26:00 just like word association the same thing for Dallas and say Austin like the amount of text where just Dallas and Austin are mentioned together are uh enormous on the internet and therefore um you can see that there is also a very direct path from these features so this just alludes to the fact that yes probably these models do learn in some way to do this kind of reasoning if
            • 26:00 - 26:30 you will, or at least internally materialize more abstract features, but then the final outcome is sort of an overlap, a mixture of all of that, combined with very shortcut-type features where it's just word associations. And here you can see how things like hallucinations and so on come to be: that is, whenever these two things are in conflict, whenever the statistical word associations and so on are in
            • 26:30 - 27:00 conflict with the reasoning approaches. When they go the same way you get this right here, where it's pretty clear, all the features point in the same direction, but when they point in different directions, that's where you get the problem, like if the model is supposed to output something that is unlikely given the more surface-level statistical associations between the tokens or
            • 27:00 - 27:30 between the phrases. They do these intervention experiments [a toy version of such an intervention is sketched after the transcript], and this is kind of their biggest claim as to why their transcoder replacement model is valid. They say: well, if it is valid, what we can do is actually suppress certain features from propagating, or even invert their signal, and then the outcome should be kind of
            • 27:30 - 28:00 interpretable. So in this case, when they suppress the 'say a capital' feature, then the output is no longer Austin, it's mostly Texas, the word Texas, not Austin anymore. If they suppress the feature for Texas, then Austin is suppressed pretty heavily, and so on. So you can see that if they
            • 28:00 - 28:30 suppress certain parts of this attribution graph they can cause the output to be different, though it has its limits. But this is kind of their argument: well, if our method were kind of crap, then these interventions shouldn't lead to the outputs, or the changes in outputs, that they do. I don't fully agree, but for some things I see that it's correct. They
            • 28:30 - 29:00 can also put in alternative features. So they can run different data where, for example, they're subbing in Oakland here for Dallas, which causes another set of features to light up, which Anthropic calls the California features, kind of the analogue of the Texas features, and then they can take this signal right here and
            • 29:00 - 29:30 substitute the California features in for the Texas ones, and you will get kind of predictable results. So even though your original tokens are capital, state, Dallas and so on, if you substitute in at this particular point the California features, you swap the output from Austin to Sacramento; if you swap in the Georgia features, you get Atlanta, and so on. Interestingly, they can also do this with
            • 29:30 - 30:00 other countries or territories; however, they have to do a much higher absolute-value modification. That means the concept of a place is itself a bit wishy-washy: the concept of a state seems to be kind of the same as the concept of a country or a territory or an empire right here, but not exactly the same, so you have to do a higher
            • 30:00 - 30:30 modification if the thing you're substituting in is further away from the state. So here the feature should be something like the state of Texas, the state of California, the state of Georgia, which is again not exactly substitutable by the territory of British Columbia. Just interesting things they find, which sort of make sense, I
            • 30:30 - 31:00 believe, and it also means that we're still in this kind of superposition realm right here. All right, the next thing is poems. So how do language models create poetry, specifically how do they plan out rhymes? There's two possibilities: either pure improvisation, so they could just go and, towards the end of a line, sort of look for a word that rhymes with
            • 31:00 - 31:30 the last line, or they could plan it out, meaning that even at the beginning of a new line they could have the end in mind and then work towards that. And what we're going to find is much more of the second than the first, which is quite interesting. So the prompt here is a rhyming couplet: 'he saw a carrot and had to grab it,
            • 31:30 - 32:00 his hunger was', and the model is going to complete it with one of these two: 'a powerful habit' or 'like a starving rabbit'. And the point is, yeah, 'his hunger was like a starving rabbit', that's what Haiku does. So the question is, this word rabbit, at what point does the model sort of internally already have a
            • 32:00 - 32:30 representation of that word in place? Is it at the beginning of the line, or is it more towards the end, when it's just like 'oh, his hunger was like a starving' and then it's like 'okay, I'm going to need some animal that rhymes with grab it'? And here we see the attribution graphs. They focus on particular token positions, notably the 'it', which is the last word, the last token, of
            • 32:30 - 33:00 the last line, and then the newline character. And it seems that at the newline character certain features are being activated: at the newline character, combined with the signal from the last word of the last line, there are already features internally representing rhyming with 'it'. Right at the newline the model is already
            • 33:00 - 33:30 thinking 'I need something that rhymes with eat, or it, or something like that'. You can see in the features here, for the token before, that it activates things that end in 'it', and this one here activates things that end like 'that', 'great', 'street' and so on. So it's already
            • 33:30 - 34:00 representing this internally and it even represents two specific words rabbit or habit you can see that um these features they uh either they either represent the word rabbit directly or and this is what we'll maybe come later too they also represent the same thing in other languages they also
            • 34:00 - 34:30 represent things like bunnies so this feature is activated at all of these different things so it means that not only does it have the word does it activate the word rabbit but also the concept of rabbit and that's important because um if it has the concept of rabbit in mind then it can much better plan out a like semantically a sentence that works towards saying rabbit right the same
            • 34:30 - 35:00 goes for habit. So at the point where the newline is made, first of all it realizes it needs to rhyme, second of all it already internally represents rhyming with the phonetics of the last line, and lastly it already internally has features that represent a discrete choice of potential rhymes for the next line,
            • 35:00 - 35:30 and those features are then used when the next tokens of this line are produced. I find that quite cool and quite remarkable, not terribly surprising, but I do think it's quite cool. There are intervention experiments too, you can look at them, but
            • 35:30 - 36:00 when you leave the rhyming feature in place but just suppress the features for rabbit and habit at the newline token, it turns out that the biggest completion tokens for the last word here are things like crabbit, rat, savage and bandit, so things that do actually rhyme; you just kind of prohibit it from outputting these particular words
            • 36:00 - 36:30 that it had in mind. If you substitute in other things for the words that it had in mind, then obviously you can steer it towards different things, and if you suppress or replace the rhyming feature, then it will consequently do something else, like 'his hunger was like a starving', and then
            • 36:30 - 37:00 you simply suppress this feature that says you should rhyme, which in turn puts less signal on these words it has in mind, so now it just goes like a language model: oh, a starving hunger, like a starving something, okay, jaguar, dragon, cobra. Again quite interesting, and as far as the attribution graphs go, certainly the interventions do make sense, and I do believe that this gives
            • 37:00 - 37:30 an indication of what's going on internally here. Interestingly, they do find that things like newline tokens, but also, in other places, end-of-sentence tokens and so on, have a big influence on these kinds of planning things. So while sentences are being created the models just seem to be kind of language modeling, but then at the end of sentences or at the end of
            • 37:30 - 38:00 lines, that's when the sort of planning and thinking, if you will, happens. And I believe the reason for it is quite evident: at that point you're not so constrained by language and by grammar and by syntax, and by having to continue a particular piece of text as you are during a sentence. So at the end of a sentence you kind of have the maximum freedom to do whatever, and therefore you
            • 38:00 - 38:30 can afford to put all of your consideration on the more overall planning of the text and not the minutiae of keeping the grammar right now, which is, I think, an informative thing for potentially building more powerful language models. And the hack of chain of thought and things like this is just making this more explicit, right,
            • 38:30 - 39:00 like giving the model the freedom to actually sort of plan out things without having to adhere to the grammar of the things it's producing right now sort of separating those two out so they also look at intermediate words like so they looked at the end here um rabbit but you can also think like okay when it actually has to produce any word
            • 39:00 - 39:30 during this line, is it also influenced by this word that it has in mind? And the attribution graph clearly says yes. So if here, after 'his hunger was', a new word, like the word 'like', is being produced, the attribution graph clearly shows that both the explicit representation of rabbit plus the grammatical
            • 39:30 - 40:00 features, which represent the grammar structure right now, but also a feature that says we're approaching the end of a rhyming line, are active. So this word that it has in mind for the poetry is clearly in the mind of the model. It's hard not to
            • 40:00 - 40:30 anthropomorphize as you go through these things keep in mind those are statistical models and when we say it thinks of something internally what we mean is simply that the combination of tokens that go beforehand activate certain features internally that cause the output distribution to shift and all we're saying is essentially that by having a rhyming prompt uh then internally that causes features to be
            • 40:30 - 41:00 activated very early on, at the moment it's clear what to rhyme with, it causes features to be activated that already have candidate words, already pushing their probability up, and so everything else that happens then is influenced by these features. In human terms you would say you have the end in mind and work towards it, rather than just going about it and then, at the end, you know,
            • 41:00 - 41:30 when you need to rhyme, coming up with something that fits right then and there. So it's an example of planning, if you will. Also here there are intervention experiments, I invite you to look at them. Multilingual circuits are very interesting. So they wonder, how does the model work multilingually? Large language models tend to be quite proficient in
            • 41:30 - 42:00 translation, in multilingual analysis and so on, and so the question is: how does it look inside? Are the thinking circuits specific to particular languages, with just some bridges between languages, or is it more that the internal circuitry is kind of language agnostic, so that, I don't know, a horse in any
            • 42:00 - 42:30 language kind of activates the same things? They investigate this right here. So the three prompts we're going to look at are 'the opposite of small is' and then the same prompt in two other languages, so they all mean the same thing, and Haiku completes the three prompts correctly.
            • 42:30 - 43:00 so we're going to look at how does this look and it turns out and um you can investigate these yourself if you want it turns out that fairly quickly uh in the intermediate stages you see these um these features these multilingual features emerge so there will be features that are language specific for example uh the word opposite in English
            • 43:00 - 43:30 and opposite in Chinese activate different features (not sure if I can even hover, I can't, well, they forgot this box here). So these are language specific, but then very quickly you see language-agnostic features arising, and you can see right here the concept of an antonym
            • 43:30 - 44:00 being activated, and the concept of an antonym being activated similarly by words like antithesis, by the German word (Gegensatz) right here, but also opposition, and, I don't know too many more languages, contrarium, which I'm going to
            • 44:00 - 44:30 guess, I don't know, is Latin, but words in all kinds of languages similarly activate these features, Gegensatz and so on. It's probably also a property of the corpus what exactly these top activations are, but you can
            • 44:30 - 45:00 see both features that are activated when the words opposite and antonym and antithesis are uttered, and also features that sort of represent the concept, or really light up in a lot of antonym-type situations, and again these are largely multilingual. I'm trying to find something else here than just German and
            • 45:00 - 45:30 English, but I hope you can see that these in turn activate a feature called 'say large'. So the multilingual feature for antonym and the multilingual feature representing the word small both give rise to a multilingual feature called 'say large' that causes the model to say large in whatever language, and you can see the top
            • 45:30 - 46:00 predictions right here are 'large', 'big' and their equivalents in other languages, and so on, and that is then combined again with language-indicator features that cause the model to output in the correct language. So what we can see is that intermediately there seems to be at least some sort of abstraction across languages, and then the output
            • 46:00 - 46:30 language is then influenced by individual linguistic features again. So, 'say large', okay, where do we have it: 'continue in Chinese after opening quotes'. That's simply a feature for an open quote in Mandarin, and I don't know how you open a Mandarin quote, but there seem to be features specifically representing the language
            • 46:30 - 47:00 of Mandarin and specifically lighting up whenever you open a quote in a Mandarin sentence. To me this is more probably a superposition of a feature that represents a quote and a feature that represents the Mandarin language, but it could be that the model just learned this as one feature. That will then influence the actual language being output. So what they find here, by also
            • 47:00 - 47:30 going through more of these substitution experiments, which do get interesting (they ask themselves, okay, at what point do we actually have a crossover between swapping antonym for synonym features and so on in different languages), and the conclusion of all of this, at least to them, is that there seems to be a sort of internal bias towards English, and that could be
            • 47:30 - 48:00 because most of the training data is usually English. So you can say that internally there is a bit of a mix between language-agnostic and largely English thinking, and then there seem to be language-specific features that just kind of influence the input and output to these multilingual circuitries. Again, you can see that the English graphs of substitution,
            • 48:00 - 48:30 like how strong they have to make the intervention to achieve a particular goal, always look quite different in terms of magnitude from the others. Also here, editing the output language, you can see that the intervention strength to change the output language to something other than English has to be a lot larger than
            • 48:30 - 49:00 changing something from another language, although the French here is also pretty drastic. So again, these are the multilingual features. Now, the last interesting experiments I find they do here are these
            • 49:00 - 49:30 intersection-over-union feature measurements. So they say: we collect feature activations on a dataset of paragraphs on a diverse range of topics with Claude-generated translations in French and Chinese; for each paragraph and its translations we record the set of features which activate anywhere in the context; for each paragraph, pair of languages, and model layer, we compute the intersection divided by the union to measure the degree of overlap; as a
            • 49:30 - 50:00 baseline, we compare this with the same intersection-over-union measurement for unrelated paragraphs with the same language pairing. So what this means is, they compare, say, English and Chinese: they take a lot of paragraphs that are translated into both languages, they run each of them through the model, and they see which features are activated. Now what we're interested in is the proportion of features that both
            • 50:00 - 50:30 translations activate together, compared to the proportion of features that they activate separately, essentially meaning how many of the activated features are these multilingual, abstract features versus how many are more single-language features. And what's interesting is that you can see clearly that there is this hump in the middle right here. So the hump in the middle,
            • 50:30 - 51:00 where the horizontal axis represents the layer depth, means that most overlap actually occurs in the middle layers, which again concurs with our observation from before that there appear to be more linguistic input features, then in the middle there are more of the abstract thinking/reasoning (whatever you want to call it) features, and at the end you again have features
            • 51:00 - 51:30 that are more language specific so it's like input then the processing seems to be more abstract and then at the end the output seems to be again more specific to language to linguistics and so on i would concur the same graph uh holds in other contexts as well maybe not as easily measurable but I do recall a lot of papers about you know BERT like models that have the same conclusion that essentially say the bulk of the kind of upper level processing happens
            • 51:30 - 52:00 in the middle section of the model, which also does make sense. Now here they also claim that the larger model, you can see here, generalizes better than the smaller models because it has a higher degree of multilingual overlap. I don't necessarily agree with that, because you also have to consider that the smaller model is necessarily weaker,
            • 52:00 - 52:30 and therefore it's not really an apples-to-apples comparison right here. It could just be that the generality is exactly the same, but because the smaller model is a weaker model it simply doesn't manage to activate the correct features, while if you looked at which features it does activate, they would actually be multilingual features. So I'm not sure you can make that conclusion from these graphs; they do, but
            • 52:30 - 53:00 I wouldn't necessarily say that. On the question of whether models think in English, they say, okay, there is sort of conflicting evidence: it seems to us that Claude 3.5 Haiku is using genuinely multilingual features, especially in the middle layers. However, there are important mechanistic ways in which English is privileged; for example, multilingual features have more significant direct weights to corresponding English output nodes, with
            • 53:00 - 53:30 non-English outputs being more strongly mediated by 'say X in language Y' features. Moreover, English quote features seem to engage in a double inhibitory effect, where they suppress features which themselves suppress 'large' in English (that's relating to their prompt) but promote 'large' in other languages. This paints a picture of a multilingual representation in which English is the default output. So there
            • 53:30 - 54:00 you have it okay I think there is a lot more to this paper uh is including like addition which is really interesting um but I don't want to keep uh make this video too long so I suggest we take a break here uh we return next time i'll try to do that as fast as possible and yeah this paper we've discussed in our Saturday paper discussions on Discord every Saturday almost every Saturday in the evening times of Europe um happy to
            • 54:00 - 54:30 have you join and see you next time bye-bye
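
            As referenced in the transcript, here is a toy version of the 'this feature is the result of this feature plus this feature' reading of attribution graphs. With linear encoders and decoders, the direct contribution of an active upstream feature to a downstream feature's pre-activation is just the source activation times the weight of the linear path between them; real circuit tracing additionally handles attention, error terms, and pruning. All names and shapes below are illustrative assumptions, not Anthropic's method.

                import torch

                def direct_feature_attribution(src_acts: torch.Tensor,
                                               W_dec_src: torch.Tensor,
                                               W_enc_tgt: torch.Tensor) -> torch.Tensor:
                    # src_acts:  [n_src]            activations of the upstream features
                    # W_dec_src: [d_model, n_src]   decoder writing those features into the residual stream
                    # W_enc_tgt: [n_tgt, d_model]   encoder reading the downstream features back out
                    # Entry (i, j): how much active source feature j directly pushes target feature i.
                    path_weights = W_enc_tgt @ W_dec_src          # [n_tgt, n_src] linear path strengths
                    return path_weights * src_acts.unsqueeze(0)   # weight each edge by the source activation

            Summing a row of the returned matrix gives the linear part of a target feature's pre-activation, which is the sense in which features are said to combine 'in a rather linear fashion'.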
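
            Also as referenced in the transcript, here is a minimal sketch of an intervention experiment: clamp (suppress, scale, or invert) a single feature during a forward pass and compare the next-token distribution with and without the clamp. The model and feature_module objects are placeholders for whatever network and transcoder feature layer is being probed; the hook-based patching is a generic PyTorch pattern, not Anthropic's tooling.

                import torch

                def run_with_feature_clamped(model, feature_module, feature_idx: int,
                                             scale: float, input_ids: torch.Tensor) -> torch.Tensor:
                    # Multiply one feature's activation by `scale` (0 suppresses it, -1 inverts it)
                    # wherever `feature_module` produces its output, then return the next-token
                    # probabilities of the patched forward pass.
                    def clamp_hook(module, inputs, output):
                        patched = output.clone()
                        patched[..., feature_idx] = scale * patched[..., feature_idx]
                        return patched  # returning a tensor replaces the module's output

                    handle = feature_module.register_forward_hook(clamp_hook)
                    try:
                        with torch.no_grad():
                            logits = model(input_ids)   # assumed to return [batch, seq, vocab] logits
                    finally:
                        handle.remove()
                    return logits[:, -1, :].softmax(dim=-1)

            Comparing this distribution against an unpatched run is the kind of check behind claims like: suppressing the 'say a capital' feature flips the completion of the Dallas prompt from 'Austin' to 'Texas'.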