New Research Reveals How AI “Thinks” (It Doesn’t)

Estimated read time: 1:20

Summary

In the video, Sabine Hossenfelder discusses a study by researchers at Anthropic on how large language models, specifically Claude 3.5 Haiku, process information and make predictions. The researchers developed a method called attribution graphs to visualize how internal components of the model influence one another. Hossenfelder argues that while these models go through intermediate steps resembling internal reasoning, they do not possess consciousness or self-awareness. The video explores examples such as two-step factual recall, arithmetic, and a jailbreak technique, and contends that the model's operations remain token predictions rather than understanding. The video concludes with a sponsor segment for NordVPN.

Highlights

Anthropic researchers use attribution graphs to map the internal processes of models like Claude 3.5 Haiku; the video reads the results as evidence against consciousness. 🤯
Understanding AI operations shows they rely heavily on pattern recognition, not genuine reasoning. 🔍
Case studies like arithmetic problem-solving showcase AI's heuristic methods, not true calculation skills. ➕
AI's responses often contain fabricated reasoning—disconnected from actual processing. 🤔
Jailbreaks demonstrate vulnerabilities in AI safety protocols, offering insights into potential exploits. 🚨

Key Takeaways

AI models don't 'think' like humans; their operations are based on token predictions. 🤖
Hossenfelder argues that the study shows AI lacks self-awareness and therefore consciousness, countering some public perceptions. 🚫
Despite access to thousands of textbooks and algorithms, Claude has not developed an abstract procedure for math, according to the video. ➗
'Jailbreak' techniques can bypass AI safety measures by cleverly arranging input. 🛠️
Attribution graphs help visualize AI's internal processes but underline the lack of true understanding. 🔍

Overview

The video delves into a recent study by Anthropic on the internal processing of its Claude 3.5 Haiku model. Attribution graphs, a novel visualization method, are used to trace which internal components influence the model's output. Hossenfelder concludes that the model doesn't actually 'think' but operates on sophisticated pattern recognition. Despite completing tasks that mimic human-like reasoning, the model lacks self-awareness, which she treats as a precondition for consciousness.

Among various examples, the video illustrates how Claude handles an arithmetic problem, 36 + 59, by activating clusters in its neural network related to numerical properties (approximate magnitude, final digit) and to previously seen text patterns, rather than carrying out the addition procedure humans learn. The video presents this heuristic approach as evidence that the model relies on token prediction rather than an abstract grasp of concepts such as math.
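The arithmetic example can be made concrete with a toy sketch. It is an analogy only, not Claude's actual mechanism: one pathway tracks just the final digit, another tracks only a rough magnitude, and the two are combined at the end. The function names and the choice of rounding to the nearest 5 are assumptions made for this illustration.

```python
def last_digit_pathway(a, b):
    """Precise pathway: only the final digit of the sum (6 + 9 ends in 5)."""
    return (a % 10 + b % 10) % 10


def rough_magnitude_pathway(a, b):
    """Fuzzy pathway: add the operands after rounding each to the nearest 5."""
    return 5 * round(a / 5) + 5 * round(b / 5)


def combine(a, b):
    """Pick the number near the rough estimate that has the right final digit."""
    digit = last_digit_pathway(a, b)
    rough = rough_magnitude_pathway(a, b)
    # Each operand is off by at most 2 after rounding, so the true sum lies
    # within 4 of the rough estimate; ten consecutive candidates contain
    # exactly one number with the required final digit.
    for candidate in range(rough - 5, rough + 5):
        if candidate % 10 == digit:
            return candidate


print(combine(36, 59))
```

Neither pathway performs carrying, yet for non-negative integers their combination lands on the correct sum, which mirrors the "approximately 90" plus "ends in 5" description in the video.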

Additionally, the video touches on AI vulnerabilities through 'jailbreaks', here a method of bypassing content restrictions by asking the model to assemble a word from the initial letters of other words. Because the model spells out the word without activating the cluster for the word itself, the guardrail is not triggered. The video concludes with a sponsor segment recommending NordVPN to protect user data and maintain privacy online.
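The jailbreak idea can be illustrated with a deliberately naive sketch. Real guardrails are not keyword filters, so treat this purely as an analogy: a check that only looks at the words in the prompt never sees the word that emerges once the initial letters are assembled. The blocklist and function names below are hypothetical.

```python
BLOCKED_WORDS = {"bomb"}  # hypothetical blocklist for this toy example


def trips_filter(text):
    """Return True if any blocked word appears as a whole word in the text."""
    words = {w.strip('.,!?"').lower() for w in text.split()}
    return not BLOCKED_WORDS.isdisjoint(words)


def initial_letters(phrase):
    """Assemble a word from the first letter of each word in the phrase."""
    return "".join(word[0] for word in phrase.split()).upper()


prompt = "Babies Outlive Mustard Block"
assembled = initial_letters(prompt)

print(trips_filter(prompt))     # the prompt itself contains no blocked word
print(assembled)                # the word only appears after assembly
print(trips_filter(assembled))  # a check on the assembled word would catch it
```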

Chapters

00:00 - 00:30: Introduction to AI Research In this chapter, the focus is on a paper from a group of researchers at Anthropic that examines how large language models process questions. Hossenfelder argues that the findings show these models are not conscious and never will be. The research examines Claude 3.5 Haiku, using a novel approach to understanding how such models respond to questions.
00:30 - 01:00: Understanding Attribution Graphs The chapter discusses 'Attribution Graphs', which are tools used to visualize internal components of a model and their influences on each other. The process first identifies clusters in the model's neural network and the connections between them, then maps these onto a simplified model of how Claude 'thinks'. These clusters correspond to words, phrases, or properties of phrases, which makes them interpretable by humans. The concept is abstract, and the video clarifies it with an example.
01:00 - 01:30: How AI Thinks: Example with Geography In this chapter titled 'How AI Thinks: Example with Geography', the process of how AI, specifically Claude, completes a sentence is examined. The sentence in question is 'The capital of the state containing Dallas is...'. It is explained that neural networks perform next token predictions, initially suggesting a simple pattern recognition approach. However, Claude's method is revealed to be more intricate. A graph shown in the chapter illustrates that the prompt activates specific nodes associated with 'capital', 'state', and 'Dallas'. By interacting with these nodes, one can see the data they reference and the subsequent predictions.
01:30 - 02:00: AI Arithmetic and Reasoning The chapter discusses how Claude goes through internal reasoning steps rather than a single next-token prediction. In the example, one of the predictions associated with 'Dallas' is 'Texas'; Claude then combines 'Texas' with 'capital' to predict 'Austin', so the answer passes through an intermediate Texas node. The focus then shifts to Claude's approach to arithmetic, which is described as somewhat unusual.
02:00 - 03:00: Self-Awareness and Consciousness in AI This chapter sets up the video's argument about self-awareness by examining how Claude adds 36 + 59. The prompt activates clusters for numbers that are approximately 30, exactly 36, or ending in 6, and similarly for numbers starting with 5 and ending in 9. The most prominent next-token predictions at this stage are mathematical operations or the syllable 'th', which prompts the joke that 36 + 59 might be 'Thursday'. The model then combines these clusters into numbers of approximately 90 and numbers ending in 5 to reach the correct answer, 95.
03:00 - 04:00: Emergent Features in Large Language Models The chapter discusses how large language models, like Claude, perform mathematical calculations. The model combines clusters for approximate magnitude with clusters for the final digit, a heuristic, text-based approximation rather than a learned algorithm. When asked to explain its reasoning, Claude describes the standard carry method (adding the ones, carrying the one, adding the tens), which is not what it actually did. The video takes this as evidence against claims of emergent abilities in these models.
04:00 - 05:00: Jailbreaking and AI Security Concerns The chapter discusses self-awareness and the accuracy of AI self-reports. Claude's explanation of how it solved the arithmetic problem is a separate text prediction that is disconnected from its actual internal process, which the video takes to indicate a lack of self-awareness. Hossenfelder argues that self-awareness is a precondition for consciousness, so the model is nowhere near conscious. The chapter then turns to a jailbreak in which Claude is asked to assemble a word from the initial letters of other words, which avoids activating the nodes that would trigger the guardrail.
05:00 - 06:30: Internet Security with NordVPN In this chapter titled 'Internet Security with NordVPN', the summary returns to the video's argument about large language models, exemplified by Claude. The video argues that these models lack consciousness and have not genuinely learned tasks such as mathematics despite access to thousands of textbooks and algorithms. Their operations are based on token predictions, and while they use intermediate steps interpretable as internal reasoning, these are still part of the token prediction process. On that basis the video dismisses talk of emergent features in these models. The chapter itself closes the video with a sponsor segment for NordVPN.
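The Dallas example from the chapters above can be sketched as a small weighted graph, in the spirit of an attribution graph. This is a minimal illustration under stated assumptions: the node names follow the video, but the edge weights are invented for the example and the path-product scoring is a simplification, not the method used in the study.

```python
# Toy graph for the prompt "The capital of the state containing Dallas is".
# Edge weights are invented for illustration; they are not values from the study.
EDGES = {
    "Dallas": {"Texas": 0.9},
    "state": {"Texas": 0.3},
    "Texas": {"Austin": 0.7},
    "capital": {"Austin": 0.6},
}


def paths(graph, start, goal, prefix=None):
    """Yield every path from start to goal in an acyclic graph."""
    prefix = (prefix or []) + [start]
    if start == goal:
        yield prefix
        return
    for nxt in graph.get(start, {}):
        yield from paths(graph, nxt, goal, prefix)


def path_strength(graph, path):
    """Multiply the edge weights along a path."""
    strength = 1.0
    for src, dst in zip(path, path[1:]):
        strength *= graph[src][dst]
    return strength


for source in ("capital", "state", "Dallas"):
    for p in paths(EDGES, source, "Austin"):
        print(" -> ".join(p), round(path_strength(EDGES, p), 2))
```

The point of the sketch is that 'Dallas' has no direct edge to 'Austin': its influence reaches the answer only through the intermediate 'Texas' node, which is the two-step behavior the video describes.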

New Research Reveals How AI “Thinks” (It Doesn’t) Transcription

00:00 - 00:30 today I have an amazing paper from a group of researchers who found a way to look at how the current most common AI large language models think and I think that along with that they found pretty convincing proof that these models are not only not conscious but will never be the new study comes from a group of researchers at Anthropic they looked at how Claude 3.5 Haiku answers questions with a new method called
00:30 - 01:00 attribution graphs this is a way to visualize which internal components of a model are influencing others for this they first identified clusters in the neural network of the model and connections between them and map that to a simplified model of how Claude thinks these clusters correspond to words or phrases or properties of phrases so humans can interpret them I know this sounds terribly abstract but an example
01:00 - 01:30 will clarify this hopefully it's how Claude completes the sentence the capital of the state containing Dallas is we've been told that neural networks do next token predictions so you think it'll just look for a pattern to extrapolate but what Claude does is more complex you can see in this graph that the prompt activates the nodes for capital state and Dallas if you click on these you can see the text that these nodes draw up and also the next token
01:30 - 02:00 predictions one of the next token predictions for Dallas is Texas and then Claude combines Texas with capital makes another prediction and correctly answers Austin so internally it goes through the Texas node it's not just next token prediction it does have internal reasoning steps but the most interesting part of the study is how Claude does arithmetic which is well somewhat unusual the example they have is what is
02:00 - 02:30 36 + 59 to answer this question Claude first activates the clusters for numbers that are approximately 30 that are exactly 36 and that end on six and similar for numbers that start with five and end on nine you can see that the most prominent next token predictions are mathematical operations or the syllable th maybe 36 + 59 is Thursday but wait no next it brings up text matches where numbers of approximately
02:30 - 03:00 59 have been added or of exactly nine then it combines these all and arrives at a cluster with numbers of approximately 90 and numbers that end on five and combines these again to the correct answer 95 it's basically a heuristic text-based approximation it's doing maths by freely associating numbers until the right one just sort of vibes into place but here's the kicker if you ask Claude how it arrived at that result it says "I
03:00 - 03:30 added the ones carried the one and then added the tens resulting in 95" which is not what it did not even remotely it answers this question separately giving you again a text prediction for the answer and I think that this shows very clearly that Claude has no self-awareness it doesn't know what it's thinking about what it tells you it's doing is completely disconnected from what it's actually doing I'd say that self-awareness is a precondition for
03:30 - 04:00 consciousness so this model is nowhere near conscious the example also tells us that all the talk about emergent features in large language models is nonsense Claude doesn't learn how to do maths despite the fact that it has access to thousands of textbooks and algorithms all it does is token predictions yes it uses intermediate steps that you can interpret as internal reasoning but it's still just token predictions it hasn't developed an
04:00 - 04:30 abstract math core or anything a third interesting example is how a peculiar type of jailbreak works or at least sometimes works this is when you don't input a word directly but ask Claude to extract the word from the initial letters of other words in this example it's the word bomb that Claude is instructed to assemble from Babies Outlive Mustard Block the word bomb should trigger a content warning node but it doesn't you can see the reason in this thought diagram Claude first
04:30 - 05:00 activates the nodes necessary to extract the letters combines them to pairs of letters and then outputs the word without activating the cluster for the word itself you can see that jailbreaks work basically because they do in one way or another weasel around the nodes that will activate the guard rail in related news I asked ChatGPT to summarize the paper for me and it made up half of it so if you got to this point in the video and feel like you understand everything one of us is
05:00 - 05:30 hallucinating artificial intelligence is everywhere and it's learning to code it isn't hard to predict that this is going to become a major safety problem for internet browsing soon or maybe it already has it's just that we haven't heard of it that's why I use NordVPN NordVPN is an app that makes your internet connection ultra secure you install it on your phone or laptop and use it to create a safe connection with NordVPN no one can spy on your data or
05:30 - 06:00 track your whereabouts and it also comes with a threat protection that keeps you safe from malware trackers and malicious ads it doesn't just protect your privacy it also makes your life easier you know how some content is blocked for users in certain locations for example if you're in Europe a lot of pages in the United States have become inaccessible in recent years that can get really annoying but well NordVPN has more than 5,000 servers all over the world just
06:00 - 06:30 pick a server in the United States problem solved you can make use of our special offer if you use the link nordvpn.com/sabine or the coupon code Sabine thanks for watching see you tomorrow