Delving into Reinforcement Learning with Stanford

Stanford CS234 Reinforcement Learning I Tabular MDP Planning I 2024 I Lecture 2

Estimated read time: 1:20


    Summary

    In Lecture 2 of Stanford's CS234 course on Reinforcement Learning, students delve deeper into concepts of Markov Decision Processes (MDPs) and planning using Tabular MDPs. The lecturer revisits foundational ideas, emphasizing the importance of monotonic improvement in policy iteration methods and the distinct roles of policy and value iterations in reaching optimal decisions. Through an analytical and practical lens, the lecture underscores the significance of understanding value functions, Bellman equations, and their properties, setting the stage for more advanced topics in the course. As the lecture concludes, students are reminded of the critical computational theories and can expect to apply these insights to future decision-making tasks.

      Highlights

      • The lecture revisits important themes such as monotonic improvement and policy vs. value iteration.🎓
      • Students engage with complex topics like the Bellman equation and contraction operators fundamental to MDPs. 📚
      • The lecture stresses the importance of computational efficiency and the practicality of simulation-based evaluations. 💡
      • Understanding MDPs is crucial for advancing in reinforcement learning, as demonstrated with real-world examples like AlphaGo. 🏆
      • The lecture sets the ground for future topics including function approximation and expanded learning strategies. 🚀

      Key Takeaways

      • Understanding the importance of monotonic improvement in policy iteration methods helps optimize decision-making in reinforcement learning. 🔄
      • The Bellman equation is integral to calculating value functions, highlighting the role of both value and policy iterations. 📊
      • Value iteration and policy iteration techniques provide different advantages, with unique convergence properties and applications. 🔍
      • In reinforcement learning systems, concepts like MDPs and value functions are foundational for developing advanced models like AlphaGo. 🤖
      • Exploring finite vs. infinite horizon problems offers insights into decision-making strategies in reinforcement learning. 🔭

      Overview

      Welcome back to Stanford's CS234, where we dive into lecture two of Reinforcement Learning with a focus on tabular Markov Decision Processes (MDPs). This session refreshes our understanding of policy iteration methodologies and their unique monotonic improvement qualities. We explore the essence of decision-making under uncertainty, preparing us for real-world applications like AlphaGo.

        The core of this lecture revolves around two main strategies in reinforcement learning: value iteration and policy iteration. We delve into the Bellman equation and its crucial role in determining value functions. By understanding the significance of contraction operators and the differences between finite and infinite horizon problems, we better grasp the nuances of optimal and suboptimal policies.

          As we progress, we bridge theoretical insights with practical examples, emphasizing the need for computational efficiency in RL models. This lecture rounds off with a precursor to future topics such as function approximation, engaging students to think critically about their applications in reinforcement learning, from conception to state-of-the-art systems.

            Chapters

            • 00:00 - 00:30: Introduction and Review The chapter introduces the second lecture of the reinforcement learning series. It begins with a review designed to refresh students' understanding of the key concepts discussed in recent lectures. The instructor encourages students to log into the Ed platform, providing assistance for those not already registered. The focus is on ensuring a strong grasp of foundational concepts as students continue their learning journey.
            • 00:30 - 01:00: Logging into Ed and Participation Points This chapter covers the process of logging into a platform called Ed using a SUNet ID and participating in polls. It explains that clicking on poll links should automatically log responses once logged in. Participation points are awarded based on a sufficient percentage of poll responses. Full participation points are awarded to those who meet the required percentage, although participating is optional. The chapter begins introducing the concept of Markov decision processes.
            • 01:00 - 02:00: Understanding Discount Factor and Rewards The chapter discusses the discount factor (gamma), emphasizing its impact on the influence of short-term versus long-term rewards. It also touches on sequential decision-making under uncertainty, particularly in real-world systems where monotonic improvement is desired, meaning that with more data or computation, better decisions can be made.
            • 02:00 - 03:00: Sequential Decision-Making and Monotonic Improvement The chapter discusses the possibility of constructing algorithms for computing decision policies such that additional computation or iterations guarantee a monotonically improving decision policy. It prompts the reader to consider whether they know of any existing algorithms with this property, whether they believe it's impossible, and to ponder the universality of such a feature across all algorithms.
            • 03:00 - 04:00: Constructing Algorithms for Monotonic Improvement This chapter begins with an introduction to the concept of 'Monotonic Improvement' in algorithm construction. The speaker invites the audience to refresh their understanding and indicates that the content discussed is not necessarily part of formal assessment. They also encourage revisiting past lecture slides for those who need a recap on prior material as they delve into understanding and developing algorithms with a focus on ensuring that each iterative step results in an improvement or at least no degradation.
            • 04:00 - 05:00: Refresh and Assessment Questions The chapter 'Refresh and Assessment Questions' appears to focus on encouraging communication and peer interaction, as indicated by the transcript excerpt which invites individuals to engage in conversation with those nearby. The content likely includes strategies or exercises aimed at reinforcing learning through discussion or peer assessment.
            • 05:00 - 08:00: Discount Factor and Value Function This chapter discusses the concepts of discount factor and value function. The correct answer to a posed question is revealed to be false, prompting an explanation from a participant. A significant point raised is that long-term rewards are multiplied by gamma, indicating the importance of the discount factor in calculations. A larger gamma value signifies heavy emphasis on future rewards.
            • 08:00 - 12:00: Matrix Inversion and Dynamic Programming The chapter discusses how to compute the value function of a Markov reward process, first analytically via matrix inversion and then iteratively via dynamic programming. It also revisits the significance of the discount factor gamma in weighting short-term versus long-term rewards: a gamma of one treats short-term and long-term rewards equally, whereas a gamma of zero means only immediate rewards matter. The chapter sets the stage for deeper conceptual questions, and a code sketch of both evaluation approaches appears after this chapter list.
            • 12:00 - 15:00: Markov Decision Process (MDP) The chapter discusses the concept of Markov Decision Process (MDP) within the context of reinforcement learning and its distinction from other AI and machine learning methodologies. The speaker clarifies previous confusion around the concept of optimization in machine learning, suggesting that optimization is a fundamental concept whether it's conducted through reinforcement learning or other AI approaches. The notion of optimization is likened to minimizing a loss function, which is a common objective in machine learning tasks. This explanation aims to demystify how optimization in machine learning can be understood similarly across different methodologies.
            • 15:00 - 17:00: Policies and MDP + Policy as MRP This chapter turns to policies and how they interact with Markov Decision Processes (MDPs). It notes that since supervised learning also minimizes a loss, framing reinforcement learning as uniquely optimization-based is not a useful distinction; what matters for decision-making is choosing the right metric, which may be a scalar reward or multiple objectives rather than a standard loss function. The chapter then sets up the key reduction that an MDP plus a fixed policy is simply a Markov Reward Process (MRP); a short sketch of this reduction appears after this chapter list.
            • 17:00 - 19:00: Policy Evaluation The chapter 'Policy Evaluation' discusses the potential overlap in content with other classes, particularly for students who have taken CS 238 with Mykel Kochenderfer. The course may share some similarities in the early weeks; the distinguishing feature will be a higher level of theoretical content in the beginning weeks, focusing on the properties of these algorithms and the guarantees they provide. This theoretical focus is expected to decrease after the initial weeks.
            • 19:00 - 22:00: Computing Optimal Policy and Policy Search In this chapter, the instructor discusses the unique approach of the course in comparison to other decision-making classes available at Stanford. The focus is on understanding the fundamental systems such as the seven-state Mars rover before progressing to more complex systems like AlphaGo, robot control, or optimizing large language models (LLMs). The instructor emphasizes that many foundational ideas that enable advancements in these fields are rooted in simpler models and systems. Students are encouraged to reach out for clarification during office hours or after class.
            • 22:00 - 37:00: Policy Iteration and Monotonic Improvement The chapter discusses foundational concepts in decision processes, particularly as they relate to advanced methodologies like reinforcement learning from human feedback. It highlights the value of working in a structured, tabular setting for clarity. The transcript specifically mentions AlphaGo and introduces policy search, which underpins more advanced concepts like policy gradients; a policy iteration sketch appears after this chapter list.
            • 37:00 - 44:00: Questions and Policy Iteration Proof The chapter delves into decision-making in the context of Markov Decision Processes (MDPs) and works through why the policy improvement step yields monotonically better policies. It serves as a foundational building block for more advanced algorithms covered later in the course, with the focus on evaluating how good a particular decision policy is and what constitutes an optimal decision policy; the chain of inequalities behind the monotonic improvement result is restated after this chapter list.
            • 44:00 - 53:00: Value Iteration In the chapter titled 'Value Iteration,' the discussion centers around having a model of the world: a dynamics model, which describes how the world evolves as decisions are made, and a reward model, which assesses the quality of those decisions. The chapter builds upon previous discussions of Markov processes and Markov reward processes, emphasizing their utility in evaluating particular decision policies; a value iteration sketch appears after this chapter list.
            • 53:00 - 60:00: Contraction Operators This chapter revisits concepts from Markov processes and decision processes, focusing specifically on the role of discount factors in determining the present value of future rewards. The chapter explains how future rewards are progressively devalued (discounting), illustrated by multiplying successive rewards by their respective powers of the discount factor. As a result, with sufficiently long time horizons (or infinite ones), the net present value of future rewards approaches zero, emphasizing the diminishing importance of rewards that are further away in time.
            • 60:00 - 66:00: Value Iteration Proof and Properties The chapter discusses the concept of the value function in a Markov reward process, emphasizing its role in determining the expected discounted sum of rewards from a given state when acting perpetually. The discussion highlights that as long as the discount factor (gamma) is less than one, the expected reward remains finite. This lays the groundwork for understanding value iteration and its properties.
            • 66:00 - 71:00: Finite Horizon and Simulation The chapter focuses on the concept of computing returns and value in the context of Markov reward processes over an infinite horizon. It discusses the use of the Markov property, which helps in simplifying the computations by making the future independent of the past, given the present state. This foundational idea is crucial for understanding how to predict the long-term rewards and value from any given state when an agent acts in an environment indefinitely. The chapter highlights the theoretical underpinnings necessary for engaging with infinite horizon scenarios and provides insights into the computational approach needed for such tasks.
            • 71:00 - 73:00: Conclusion and Key Concepts In this concluding chapter, the focus is on the key concept of evaluating future rewards based on current states. It emphasizes the idea that understanding immediate rewards and potential future states is essential. Each potential future state is weighted by the probability of transitioning to that state, similar to methodologies used in tree search algorithms. This approach allows for a structured understanding of expected rewards without needing to overemphasize historical data beyond the immediate context. The chapter wraps up by reinforcing the importance of these strategies in dynamic decision-making and planning scenarios.
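
Illustrative Sketches

The sketches below are not taken from the lecture; they are minimal illustrations of the ideas summarized in the chapters above, written against a hypothetical tiny MRP/MDP. All array names, shapes, and toy numbers are assumptions made for the example, not the lecture's Mars rover domain.

First, the two ways of evaluating a Markov reward process from the 'Matrix Inversion and Dynamic Programming' chapter: the direct solve V = (I - gamma * P)^(-1) R and the iterative Bellman backup.

```python
import numpy as np

# Toy 3-state Markov reward process (illustrative numbers only).
P = np.array([[0.9, 0.1, 0.0],   # P[s, s2] = probability of moving from s to s2
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 10.0])   # reward received in each state
gamma = 0.9

# Analytic solution: V = (I - gamma P)^{-1} R, roughly O(|S|^3).
V_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Iterative dynamic programming: repeat V <- R + gamma P V until it stops changing,
# roughly O(|S|^2) per sweep, so no matrix inverse is needed.
V = np.zeros(len(R))
while True:
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V_exact)
print(V)  # the two estimates agree to numerical tolerance
```

The direct solve is done once but scales poorly with the number of states, which is why the iterative version is usually preferred for larger state spaces.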
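The 'Policies and MDP + Policy as MRP' chapter points out that fixing a (possibly stochastic) policy collapses an MDP into an MRP, which can then be evaluated with the machinery above. A minimal sketch, assuming dynamics of shape (A, S, S) and rewards of shape (S, A):

```python
import numpy as np

def policy_to_mrp(P, R, pi):
    """Average MDP rewards and dynamics under a fixed stochastic policy.

    P:  dynamics, shape (A, S, S); P[a, s, s2] = prob of s -> s2 under action a
    R:  rewards, shape (S, A);     R[s, a]     = reward for taking a in s
    pi: policy, shape (S, A);      pi[s, a]    = prob of taking a in s
    """
    R_pi = np.einsum('sa,sa->s', pi, R)      # R^pi(s)     = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,ast->st', pi, P)    # P^pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    return R_pi, P_pi

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation: V^pi = (I - gamma P^pi)^{-1} R^pi."""
    R_pi, P_pi = policy_to_mrp(P, R, pi)
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
```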
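Policy iteration, as described in the 'Policy Iteration and Monotonic Improvement' chapter, alternates exact evaluation of the current policy with a greedy improvement step over the state-action value function Q, halting once the policy stops changing. The function below is a sketch under the same assumed shapes as above.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Tabular policy iteration: evaluate, improve via argmax of Q, repeat.

    P: dynamics, shape (A, S, S); R: rewards, shape (S, A); gamma < 1.
    Returns a deterministic policy (array of action indices) and its value function.
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V^pi = R^pi + gamma P^pi V^pi exactly.
        R_pi = R[np.arange(S), pi]
        P_pi = P[pi, np.arange(S), :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: Q^pi(s, a) = R(s, a) + gamma sum_s' P(s'|s, a) V^pi(s').
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # once the policy stops changing it never changes again
            return pi, V
        pi = pi_new
```

Because there are only |A|^|S| deterministic policies and each improvement step is monotonic, the loop terminates; NumPy's argmax breaks ties deterministically, matching the lecture's simplifying assumption of a single optimal action per state.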
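The monotonic improvement argument in the 'Questions and Policy Iteration Proof' chapter can be restated compactly as the following chain, where the improved policy is defined per state by pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a) (this is a rendering of the argument discussed in the transcript, not a verbatim slide):

```latex
\begin{aligned}
V^{\pi_i}(s) &\le \max_a Q^{\pi_i}(s, a) \\
  &= R\bigl(s, \pi_{i+1}(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi_{i+1}(s)\bigr)\, V^{\pi_i}(s') \\
  &\le R\bigl(s, \pi_{i+1}(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi_{i+1}(s)\bigr)\, \max_{a'} Q^{\pi_i}(s', a') \\
  &\;\;\vdots \\
  &= V^{\pi_{i+1}}(s)
\end{aligned}
```

so each improvement step yields a policy whose value is at least as high in every state, with strict improvement unless the policy has already converged.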
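Finally, value iteration from the 'Value Iteration' chapter repeatedly applies the Bellman optimality backup V(s) <- max_a [R(s, a) + gamma sum_s' P(s'|s, a) V(s')]; each iterate can be read as the optimal value for a progressively longer finite horizon, and because the backup is a gamma-contraction the iterates converge to the unique optimal value function. A sketch under the same assumed shapes:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Tabular value iteration with the Bellman optimality backup.

    P: dynamics, shape (A, S, S); R: rewards, shape (S, A); gamma < 1.
    Returns the optimal value function and a greedy policy extracted from it.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q_k(s, a)
        V_new = Q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:            # contraction => convergence
            return V_new, Q.argmax(axis=1)
        V = V_new
```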

            Stanford CS234 Reinforcement Learning I Tabular MDP Planning I 2024 I Lecture 2 Transcription

            • 00:00 - 00:30 hi everybody welcome back this is a lecture two from reinforcement learning um we're going to start with a refresh your understanding again these are just a sort of a quick way to check your conceptual understanding from the most recent lectures or occasionally we'll go back a little bit to do this you just need to log in to Ed everybody should be added to Ed if you're not just send us an email to our mailing list um so if you go to Ed please follow the steps given to log in first before you click
            • 00:30 - 01:00 the links so if you follow those steps and then you're logged in with your sun ID then when you click on the poll links it should just take you right there and it'll just log all your responses if you're curious about how we use these for participation points um you can just go to the website to see how we calculate it I think it's um we use just a percentage of these if you do a sufficient percentage then you get full participation points it's optional all right so we're going to start with this today the question is in Markov decision processes a large
            • 01:00 - 01:30 discount Factor gamma means that short-term rewards are much more influential than long-term rewards um and then a second question to start thinking about is in general so last time we started talking about sequential decision- making under uncertainty and one of the things we often would like in Real World Systems is monotonic Improvement meaning that if we get more data or we get more computation we know that the system is going to be better make in our case better decisions than it could if it had less computation or less data
            • 01:30 - 02:00 and so the question um that I'm posing to you now and that we're going to discuss today is is it possible to construct algorithms for computing decision policies so that we can guarantee with additional computation which we could also think of often as iteration um that we're going to monotonically improve the decision policy and you can start to think about if you're already aware of any algorithms that might have that property if you think it's impossible or if you think if it's true do you think that all algorithms would satisfy that okay that's not for the poll that's just to
            • 02:00 - 02:30 start thinking about and we'll come back to it later all right so I'll just give you another minute or two to do this refresh your understanding it's just a quick one and then we'll go and again these are not assessment questions so you're welcome to look back on lecture slides from last time you're
            • 02:30 - 03:00 also welcome to talk to anybody right next to you
            • 03:00 - 03:30 all right it looks like we actually have um maybe a 2/3 1/3 split on this question um the correct answer is false does somebody want to say why it's false yeah and remind me your name yeah I think because you multiply the longer term rewards by uh the gamma so a large gamma means that the long
            • 03:30 - 04:00 rewards are weighted more heavily that's right so as you exactly said so if gamma was one you would care about short-term rewards exactly the same as long-term rewards in general if gamma was Zero you wouldn't care about long-term rewards at all you'd be entirely myopic but as gamma gets closer to one it's sort of relatively weighting longer rewards more than you would otherwise great all right and yes as I said H well we'll get more into the conceptual question later um the other thing that I
            • 04:00 - 04:30 wanted to clarify I saw there was some questions on this last time as well as after class as well as on Ed is sort of I had mentioned when I was uh making a distinction between reinforcement learning and other forms of AI and machine learning this notion of optimization but I think that that was a little bit un like it was more confusing than it was helpful um and because depending on how you think of it in machine learning and AI we always have some form of metric or optimization so you can think of a loss as also being we're trying to minimize the loss and so that also
            • 04:30 - 05:00 sounds like an optimization problem so you can just ignore that distinction for now I do think in general when we're thinking about decision making it's going to be very important what we think of as that metric and so it won't necessarily just be loss functions we can have lots of different scalar values or even multiple objectives but the distinction of whether or not supervised learning is using optimization is perhaps not so helpful okay great so let's go ahead and get started um so I do also just want to
            • 05:00 - 05:30 highlight that for some of you um and I got a question about this we've also got a couple questions about this this first week or two will overlap a little bit with some of the other classes you might have taken so particularly if you've taken CS 238 with Mykel Kochenderfer the beginning may overlap the things that will probably still be different in the first couple weeks is I expect there's going to be um a higher level of theory in the first week or two about the properties of some of these algorithms and what sort of guarantees we have and then after I suspect after that most of
            • 05:30 - 06:00 the content in the rest of the class will be quite different if you have any questions about how this compares to a lot of the other decision-making classes that are offered at Stanford don't hesitate to reach out to me in office hours on Ed or after class all right now why do we do this because also you might be thinking we want to get to alphago or I want to get to controlling robots or I want to get to optimizing llms why are we starting with systems like the seven State Mars rover that we're going to look at and the reason is because actually a lot of the ideas um that enabled people to
            • 06:00 - 06:30 solve alphago um and do things like RLHF or reinforcement learning from Human feedback really builds on this fundamental notion of decision processes and I think it's much easier to really cleanly see how these ideas come up bless you when you can actually see these when the world is tabular you can just write down all the states so that's why I think it's helpful but even today we're going to start to see where those ideas might be applied so we're going to start to see things like policy search which is sort of the foundations towards things like policy gradients
            • 06:30 - 07:00 which are extremely widely used so you can think of all of these as just being building blocks that we're going to use to build up to get to the point we're later going to be and very soon within a couple weeks tackling things that are state-of-the-art algorithms all right so what we're going to be doing today is really focusing on making good decisions given a Markov decision process and so that means both being able to understand how good a particular decision policy is as well as what is an optimal decision policy and
            • 07:00 - 07:30 when I say we're given a model of the world what I mean is that we are given sort of that Dynamics model which tells us how the world evolves when we make decisions and we are given a reward model which tells us how good decisions are and last time we talked about Markov processes and we were starting to talk about Markov reward processes because they can end up being really useful when we're trying to evaluate how good a particular decision uh policy is
            • 07:30 - 08:00 and we'll see a lot of the same ideas from Markov reward processes to MDPs okay all right so let's just refresh our memory so as this is the question that we had before of like you know how do we think of uh the influence of discount factors as was said what happens is we multiply the next reward by the discount Factor two rewards Away by the discount Factor squared Etc and so as you can see there if the Horizon is really long or as it goes to Infinity rewards will have zero value eventually because gamma is
            • 08:00 - 08:30 less than one so the idea of the value function was to say remember this is a Markov reward process we don't have decisions yet just says how much um is the expected discounted sum of rewards we will get starting in this state and acting most of the time today forever so most of the time we'll think of today of just like getting to act forever and how much reward would you get and because this gamma if as long as the gamma is less than one here that will be a finite number so we're just starting to talk about how
            • 08:30 - 09:00 could we compute this so again remember the return is going to be a particular series of rewards you might get if you start in this state and act forever and V is going to be on average how much reward would you get if you start in this state and act forever all right so one of the key ideas here is that Computing the value of an infinite Horizon Markov reward process leverages the Markov property which was this idea that the future is independent of the past given the
            • 09:00 - 09:30 present so given your current state you don't have to think more about the history so what that implies when we try to compute what the expected future reward is from a state is we can think of well what is the immediate reward we get in that state plus all the different states we could get to next under our Dynamics model and then the value of their reward how much do we weigh each of those well we weigh each of those just according to what is the probability I could get to each of those next States and if you're familiar with things like tree search you can think of it as
            • 09:30 - 10:00 just I'm in my starting State I think of all the next States I could go to each of them have some weight depending on the probability i' get there and then I sum all of those up according to their values okay and this is going to be the basis of the Bellman equation which we're going to see lots about okay so if we wanted to think about how we could solve this one way we could think of it is if we have a tabular world meaning that we can write we can maintain a uh scale value for
            • 10:00 - 10:30 every single state separately so this is like our Mars Rover case um then we could just Express the value function in a matrix equation so we say the value of each of the states is exactly equal to the immediate reward plus gamma times the um transition probability to all the next States and so that's nice because now we can just directly solve for what the value is so we know that this has to hold so now what we're going to do is just invert that to solve for V
            • 10:30 - 11:00 so what we would say in this case is we would say V minus gamma times P V this is P and again I'll apologize that in the different uh things you see online or the textbook Etc people sometimes use T for the transition Matrix they sometimes use P for like probabilities of going to the next state um if it's ever confusing what notation is being used don't hesitate to reach out okay so we just rewrite it like this is equal to R and then we move this so we have
            • 11:00 - 11:30 I minus gamma P times V is equal to R I being the identity Matrix which means V is equal to I minus gamma P inverse times R so why do I show this I show this because if you know how the world works you have the Dynamics model you know what the reward function is and the world is small enough you can just directly solve for this
            • 11:30 - 12:00 this isn't for decision yet this is just showing us bless you like what the value would be of each of the states so this is one way to solve it we would call this like sort of the analytic solution and one thing to note here is this requires a matrix inverse and so there are faster algorithms than n cubed n being the number of states but you know in general Matrix inverses are fairly expensive so this only has to be done once but it is fairly expensive if your state space the number of states you have is large this can be expensive and it also requires
            • 12:00 - 12:30 that um the identity Matrix minus gamma times uh the Dynamics model is invertible okay so this is one way we can solve this yeah and remind me your name um in practice what usually happens like do people just go ahead and take the Matrix inverse my questions are in practice do you usually find that these kinds of matrices are invertible and if yes do people just go ahead and invert the Matrix or
            • 12:30 - 13:00 it's a good question so in practice is it invertible and what do people do in practice normally we're dealing with State spaces that are far too large so we can't do this yeah good question there might be cases where it's small enough but in general no so that's a great motivation for a second approach which is instead of doing it directly analytically we're going to use dynamic programming and we're going to design an iterative algorithm okay and this is going to be very very similar to what we're going to see for decision processes so the idea in this case is we're not going to do this in one step
            • 13:00 - 13:30 but we're going to avoid that Matrix inverse which might be pretty expensive so we're going to initialize the value of a state to zero for all s and you can think about whether or not it actually M matters what we initialize to but just imagine we do that and then for a series of iterations K is our iteration variable for all the states in s what we do is we say we're going to sort of make a new copy of our value function and we say VK of s is equal to R of S Plus gamma sum over S Prime probability of going to S Prime given s times the value
            • 13:30 - 14:00 that we already have for K minus one of S Prime and we just do this over and over and over again until our value function stops changing and we'll talk soon about whether it will stop changing the nice thing about this is that it's only S times S for each iteration so this would be an iteration instead of a matrix inverse
            • 14:00 - 14:30 all right so this is how you could compute the value of an MRP now we're going to see how we could do that for an mdp so a Markov decision process is very similar to a Markov reward process but now we get to add in actions so now we're actually going to be starting to make decisions and the idea now is that the Dynamics transition model um will probably depend on the action you take so you're going to get to different distributions of next States and so it could be something like you know you think of um depending on the ad you show a customer they might do different things depending on the controls of your
            • 14:30 - 15:00 robot it's going to move or manipulate its hand in a different way General these Dynamics are going to be a function of the action um and we are going to for right now uh talk assume the reward is a function of the state and the action you take so you often say that an mdp is defined by a tuple sa Dynamics model reward model and Gamma okay so we could think of that for here so now we have our same little Mars Rover but now we actually have two different Dynamics models one one for if we take A1 and one if we take A2 this is
            • 15:00 - 15:30 just an example one these in these cases these are deterministic in general we can have them be stochastic and we would also need to specify what the reward is so maybe we have zero reward in all of these states and plus one here and plus 10 at the end and this would just Define so once you've defined the state space the action space the reward function the Dynamics model and Gamma then you've defined your mdp okay all right so now we have get to start to think about policies which is
            • 15:30 - 16:00 what we'll be talking about throughout the course which is how do we make decisions depending on the state we're in and the policy is going to specify the action take which can be deterministic or stochastic and often we're going to think of it as being stochastic and we'll talk about the properties of stochastic versus deterministic ones and why you might want one or the other um quite a bit in the class but we can generally do everything we're doing in each
            • 16:00 - 16:30 all right so an mdp plus a policy is just a Markov reward process why is that because once you specify how you're going to act you've sort of removed the policy part and so if you want to know how good that policy is so let's say someone says again your boss says hey how good is this thing at you know like advertising to customers for example then once you've decided what the policy is we can think of the reward as just being a weighted sum over what the probability is of taking that action in that state times the reward
            • 16:30 - 17:00 for that state in action and then your Dynamics model which is a little more subtle which is now you're taking a weighted sum over all of the transition Dynamics according to the action you take weighted by the probability you take that action so it just defines um a Markov process because now you just have this sort of transformed Dynamics model where we've merged in the policy Okay so why is this helpful and this is something
            • 17:00 - 17:30 that you may or may not have seen in previous classes one of the reasons why this is helpful is because now we can just say oh any techniques we have for Markov reward processes we could also apply to evaluating the value of a particular policy in a Markov decision process because we've just reduced an mdp and policy evaluation back to an MRP all right so if we think about doing policy evaluation with an mdp um we can just plug in the actual
            • 17:30 - 18:00 policy that we would be T be using okay so what we have in this case is that instead of now we actually get to make decisions and so then we get to say what is the probability of taking the action in this state times the expected discounted sum of rewards at that point so this looks very similar to an MRP except for we're saying based on the probability for each action what would we get next and we call this a Bellman backup
            • 18:00 - 18:30 for a particular policy because this is um going to specify what is our expected discounted sum of future rewards if we start in this um State and follow the policy and just notice that if the policy is actually deterministic we can reduce it back to a case where we've sort of averaged over these rewards so remember this was just going to be um if you have a particular action and then you're just going to index into
            • 18:30 - 19:00 what the reward is for that particular action so we can see that here okay and just raise your hand if you've seen this before like if you've seen the okay good so like probably at least two-thirds of people all right okay so if you want to check your answers if this is new for you then um one thing to do is to try to check that you can um do this sort of value iteration um or this policy evaluation for the Mars rover for example we won't go through it in class
            • 19:00 - 19:30 but can check the answers I'll release them at the end of the slides just to check that you know how to apply this all right so of course shortly we're going to be interested in not just evaluating the value of a single policy but finding an optimal policy so one question is how many policies are there and is the optimal policy value unique so we'll just take a second you can go to the polls and enter in your answer
            • 19:30 - 20:00 okay great so it looks like the vast majority of people got the right answer for the first one which is it's 2 to the 7 in general the number of policies we have is going to be a to the S because for every single state we could choose any of the actions and also most people got the next one right which is great which is the optimal policy um the one with the star is not always
            • 20:00 - 20:30 unique it can be unique it depends on the problem um but it's not going to be unique whenever more than one action has the same identical value so when you have ties yeah um how do we generally deal with invalid actions because like for example if we're in S1 and we choose left I would imagine to me that's an invalid action not sure we really do with that yeah so the question was um if we have invalid action so in general you can
            • 20:30 - 21:00 have a different state a different action space be possible in every state that's also very common in like recommendation engines that you know you'd only it's only a subset of Articles you might show to some people based on their state um in this particular example we're going to assume that it's always it's not actually go left it's try left and so if you try to go left and there's nothing rest of the world you just fail and you stay in the same place but in general most of the time in the class we're going to assume the action space is the same for all states but in some cases it might be different good question
            • 21:00 - 21:30 right okay so in mdp control we're going to want to not just have the policy like evaluate a particular policy but we're going to compute the optimal policy so we want to take the argmax over the policy space which in general is that a to the s space and there is going to exist a unique optimal value function and the optimal policy inside of a tabular MDP in an infinite Horizon is unique and deterministic so um those are two properties that are
            • 21:30 - 22:00 good to be familiar with so now we're going to think about how do we actually compute this and what its other properties are so one is that it's stationary um what I mean by that here is that in an infinite Horizon problem you always have an infinite number of additional time steps and so the optimal thing to do just depends on your state it doesn't depend on the time step we'll think more about what happens when you only have a finite number or your Horizon is finite and what might happen there but
            • 22:00 - 22:30 for most of today we're just going to focus on the infinite Horizon problem and as I said and most of you guys already knew that we um this in general is not unique okay so one option is policy search and this is where we are going to get into oh yeah and remember me your name is the optimality condition on the initial St is the optimality conditional on the what do you mean by that uh oh yes the optimality yes it'll be
            • 22:30 - 23:00 per state yeah so you the optimal policy will be defined per state the idea is that you can take a different action in every state and you want to know what the optimal thing is to do to maximize your expected discounted sum of rewards from every state individually like Point wise good question yeah yeah two interconnected questions um why is there a unique optimal value function and second is um can you remind me again of what was the reason why it may not necessarily be unique you
            • 23:00 - 23:30 mentioned a specific case ah so the policy optimal policy is not necessarily unique because there could be more than one action with the same value and the optimal value function is unique for reasons we'll see later in this class like later today we'll prove it um okay so the one of the things and this is going to go back to the conceptual question I put at the beginning of class is we would like to ideally have methods and algorithms that have monotonic Improvement uh cap capabilities and so policy search is
            • 23:30 - 24:00 going to be one of those so what we're going to do here is we're going to try to search to compute the optimal policy there's a to the S deterministic policies in general you could imagine just you know enumerating all of them and evaluating them all but we can often do better than that and when I say better what I mean here is we can reduce the computation needed to try to identify the optimal policy so we shouldn't have to iterate through all a to the S policies so how does policy iteration work the idea is that we're going to
            • 24:00 - 24:30 alternate between having a candidate decision policy that might be optimal we're going to evaluate it and then we're going to see if we can improve it and then if we can improve it we will and otherwise we're going to Halt so what we do how we do this is we're just going to initialize it randomly which just means we're going to start off and we're going to say for every single state we're going to pick an action and then while our policy is still changing so this is the L1 Norm um it measures if the policy changed for any state just as a refresher what we're going to first do is we're
            • 24:30 - 25:00 going to um evaluate the policy and then we're going to try to improve it so so in order to do it to do that sort of policy Improvement step it's going to be helpful to define the Q function again I know for many of you this is probably a review the Q function of a particular policy is just what is the reward of the immediate state in action plus the discounted sum of future rewards if we were to after that action act according to the policy so it's sort of like saying okay first
            • 25:00 - 25:30 when you're in this state you're going to take this action and then from then on you're going to follow whatever your policy tells you to do so and for any of you who've seen Q learning you you've seen this sort of idea a lot okay so what we're going to try to do in this case why would we want a q function it turns out it's going to make the policy Improvement step really easy so what we're going to first do is we're going to say I'm going to take my particular policy I'm going to compute the Q value for that particular policy Pi I because we're going to be iterating and then after that we're going to compute a
            • 25:30 - 26:00 new policy Pi I + 1 by just taking the argmax of Q so for our Q function we're just going to say according to this Q function which says what is the expected discounted sum of rewards if I start in this state take this action and follow this policy um which of those actions is the best and we can Define that per state yeah is there any relationship between the Q function and the value function because it kind of looks similar yeah yeah so we often call it a q
            • 26:00 - 26:30 function the state action value function okay all right so this is sort of just what we do now now we're going to have this Q function we're generally going to do this by having this q and then we will do PI I + 1 of s is equal to argmax over a of Q of s a per state and then we just repeat this over and over again
            • 26:30 - 27:00 okay so there's um a number of questions you might have about this you might say okay this seems like a vaguely reasonable thing to do but does it have any formal properties are we guaranteed to improve you know what can we say about this so to do that I think it's useful to delve into what the policy Improvement step is actually doing okay so what the policy Improvement uh what when we compute the Q function this is the equation for the Q function so we s of take our old policy pii okay and then we compute the Q function of this and we
            • 27:00 - 27:30 can do this iteratively and now what we want to do in this case is think about what is the performance going to be of the new policy we extract okay all right so what the Q function says is we're going to be able to show that the Q function the the best thing of the Q function is better than the value of the old policy okay so what does this say so the first thing
            • 27:30 - 28:00 is just how we've computed this is just the policy evaluation step and we know that if we have a Q function over s and a for a particular s Max over a of Q Pi of sa has to be at least as good as the Q function for any of this any of the actions okay so we know that this has to be this thing is always greater than equal to Q Pi I of sa a for all
            • 28:00 - 28:30 a and then this is just that equation this is just whatever what exactly is Q pii of sa is the definition almost except for it's particularly for the actions this is first specifically if we were to follow the previous policy so remember this is the equation for Q Pi I of sa think about one of
            • 28:30 - 29:00 those actions that you could have done is exactly what the old policy would have told you to do that is what this equation is if you just take a here and you plug in pi I of s a so that's just exactly what this is and that is just the definition of v Pi I of s okay so what is this saying what this is
            • 29:00 - 29:30 saying is if you were to take your new policy Pi I + 1 so remember Pi I + 1 is defined as the argmax of this Q function is whatever maximizes your new Q function so what this says is if you were to take Pi I + one of s for one action and then follow Pi I forever so that's what the Q function represents then our expected sum of rewards is at least as good as if we'd always followed pii so that's what this equation is
            • 29:30 - 30:00 telling us it's like if I get to make one decision differently and then from then on I follow my old policy the value I can expect is at least as good as the value I could expect if I had always followed the old policy okay does anybody have any questions about that because then the next step is going to build on that okay can you go back to that sure for the policy Improvement yeah
            • 30:00 - 30:30 so the step that we talking about is this one right like the policy improv yeah we're trying to see like when we do the policy Improvement step and we extract out instead of Max here argmax to get out the new policy how does the value of that relate to the value of the thing you could have done before in that state and so this is just trying to say like what really is is Q Pi I of sa it
            • 30:30 - 31:00 is the value you get if you first take a and then you follow Pi I from then onwards okay so it's saying if you were to do that then this new action you've computed this argmax policy is actually better than what you would have gotten before or at least as good but the thing that should seem slightly strange to you is I am not creating this sort of hybrid policy where like I take one new action and then I follow pii forever I'm creating an entirely new policy where I'm not just going to follow Pi I + 1
            • 31:00 - 31:30 for one step I'm going to follow it for all remaining steps okay so this should not yet convince you that doing that is actually any better than my old policy this would say if you take one new action and then follow your old policy it's going to be better than your old policy so that's why we have to do additional work to show that we're actually going to get a monotonic Improvement if we just follow this new policy always okay all right so let's go through that all right so what we're going to prove is we're going to say actually that's true the new policy we
            • 31:30 - 32:00 construct through this policy Improvement step is somewhat remarkably going to be strictly a monotonic Improvement compared to the old policy unless it's identical that means that every step of policy Improvement we're going to get a better and better policy for every state okay so um and the only time we not is if if we've already converged okay so let's go through that okay so this is going to prove to us
            • 32:00 - 32:30 that the value of the old policy is less than or equal to the value of the new policy meaning we're going to get like this monotonic Improvement so what we're going to do in this case is we are going to first write out so this is just the definition this is the definition of Max over a of our Q pii okay all right so let's just write out what this is okay this is going to be equal
            • 32:30 - 33:00 to and it'll be written out more neatly on the next page too okay so what did I do here I noticed that the definition of pi I + 1 is exactly the ARG Max of this expression
            • 33:00 - 33:30 instead of Max so when we did the policy Improvement the way we did the policy Improvement was we took the argmax of the Q function so instead of having this max out here I'm just going to plug in pi I + 1 because that's going to give me something that's exactly equal to the max a for that whole expression okay right and so this is exactly equal to that but what we can show next or what we can do next is that we can just add the same terms and notice that this is the same this is
            • 33:30 - 34:00 less than or equal to Q Pi I of S Prime a prime because the value of pi I um for S Prime so following a particular policy always has to be less than or equal to taking the max over the Q function for that policy why is that true because either either the max is the same as the pi I action or there's a better action
            • 34:00 - 34:30 okay all right so that's the less than or equals and then we can just expand this expression out and this is going to start to get a little bit messy which is why it'll be nice to have it on the next slide too but what will happen here is you can see how the expansion works okay and why is this important this is important because this is going to allow us to think about not if we just take this new action on the first step but on all all future steps okay so what we had here is we had this thing which was Max
            • 34:30 - 35:00 a over Q Pi I of s Prim a prime we're going to expand out what that expression is okay because notice this thing here is exactly equal to this thing which we know is here okay so we're just going to substitute it okay so we can put that in here so this is R of S Prime Pi I + 1 of s prime plus gamma sum over S Prime meaning S Prime
            • 35:00 - 35:30 here I'm just using to speed two time steps away okay why was that useful well what we've just said is that the value of pi of s is less than or equal to taking this new better action for one time step and then following the old policy I've now done that recursively so I've said well now that's also less than or equal to if I take that new action once and
            • 35:30 - 36:00 then I take it again and then I follow the old policy and then you just repeat this okay so like you just keep nesting this and what you can see here is that you have these less than or equal that happen when instead of plugging in the value of the old policy you allow yourself to take a Max over that Q function okay and what happens if you do this all the way out it's will exactly become equal to V of Pi I + 1 of s so dot dot okay so I have that here so what this is
            • 36:00 - 36:30 shown here is that the value you had under the old policy of the state is less than or equal to the value of that State under the new policy so this proves the monotonic Improvement which is super cool so this now says if we do policy evaluation where you just keep Computing the Q function and taking a Max you will always monotonically improve unless you stay the same
            • 36:30 - 37:00 all right so now let's do our next check your understanding which is given everything I've just said if the policy doesn't change can it ever change again and is there a maximum number of iterations of policy iteration yeah it yeah um on the previous slide is the dot dot dot supposed to represent like just like algebraic manipulation yeah yeah you just keep expanding this all the way yeah good question
            • 37:00 - 37:30 all right let's take a second and do the p for
            • 37:30 - 38:00 yeah what's your
            • 38:00 - 38:30 name um where at what point did we feel that this is actually leading to an improvement like can we just like stay in the same like value level and like cuz the inequality was greater than or equal to so is it possible that you're always equal to where you started yeah it's a great question so um and in
            • 38:30 - 39:00 fact that really it so right that like I've just shown less than or equal to what what we can um well I guess we can discuss this in a second but um it will be a monotonic Improvement unless you're already the optimal policy so if there's any state at which you can improve you will and if you stay the same well actually we'll talk about this now so um because the it's nicely split between the answers for both of these questions so may maybe everybody turn to somebody nearby you and discuss
            • 39:00 - 39:30 whether you think the policy can ever change if it didn't change initially and is there a maximum number of iterations okay because it's pretty evenly split amongst people who voted e
            • 39:30 - 40:00 that's
            • 40:00 - 40:30 I want to make sure to clarify something
            • 40:30 - 41:00 cuz I that came up in a good conversation which is let's assume for
            • 41:00 - 41:30 the moment there are no ties I know I said that in general optimal policies can have ties and that's true but for the point of view of this question it is easiest to think about if there is only a single unique optimal policy so why don't we do that again none of these are for um assessment they're only for your own learning but um but just in terms of what you're thinking through my intention was to think about the simpler case where there is a single optimal policy and then under that case can the policy ever change once it
            • 41:30 - 42:00 hasn't changed once what I mean by the policy doesn't change meaning when we have had a policy and we do policy Improvement and our new improved policy is the same as the old policy um so under the case that I just said which is that it's deterministic um and that there is a single optimal policy uh raise your hand if you said the policy once it doesn't change it can never change again that's the correct answer okay does somebody want to explain why
            • 42:00 - 42:30 you're all correct yeah what's your remind me your name um it kind of intuitively made sense in the sense of like you're doing the expected value so like you're summing over all or you're over all the actions even if there's like stochasticity in the system you're still taking the average value so if it didn't change before it won't change again yeah you are taking those and so definitely along those lines so if we look at sort of what was the definition of the policy
            • 42:30 - 43:00 Improvement step let me just go a couple slides back okay so what we said is we computed the Q function and then we extracted a new policy if Pi I + 1 is the same as Pi I is Q of Pi I + 1 equal to Q of I Pi I + 1 or I probably said that wrong to many eyes let me just so the question is if if pi I is equal to Pi I + 1 is Q Pi
            • 43:00 - 43:30 I = to Q Pi I + 1 so if it's the same policy do they have the same Q function yeah is he men Nots okay so if your policy hasn't changed meaning that your old policy is the same as your new policy then Q pii is equal to Q Pi I + 1 which means that when you do this for Q Pi I + 1 and then try to extract a policy it'll be exactly the same so once you're stuck there it'll be you'll stuck forever now if you have
            • 43:30 - 44:00 ties it's more complicated so if you have multiple actions that can achieve the same Q function depends how you break them if you break them deterministically you'll stay in the same place if not you may sort of oscillate between all the policies which are optimal otherwise known as all the policies for which they have the same Q Pi I okay but in the simpler case that I mentioned once you've got to that single policy you won't ever change and what that means is given that we
            • 43:30 - 44:00 also only have a finite number of policies if it's deterministic so assuming we stick to deterministic policies so I'll just say no if Pi star is unique okay for all s so that means for every state there's a unique optimal action um is there a maximum number of iterations for policy iteration if you have deterministic policies there's only a to the S policies as everyone was saying before which is great um and so since the policy Improvement step either improves
            • 44:30 - 45:00 the value of your policy or halts that means you only go through each policy once and at most once right like there might be some you never bother to go through and so that means that policy iteration will Halt and it will take at most a to the S policies if it takes a to the s that means that you evaluated every single policy in generally you won't okay so this is um what shows that we actually they do get this monotonic Improvement this is really nice with every single because you could imagine
            • 45:00 - 45:30 You could imagine cases where — oh, a question? Yeah, sure. The question is whether we're guaranteed to do better than random; we haven't guaranteed that whatever we converge to is better than random. Well, we've proven that we're going to get to the optimal policy, and the optimal policy may be no better than random, because depending on the environment there may be no way to do better than random. Do you
            • 45:30 - 46:00 mean in terms of how you design the actions? Yes — for example, if all of your actions have exactly the same reward, it doesn't matter whether you act randomly or follow a policy; the value you get would be exactly the same as random. Whether you can do better than random depends on the domain; the hope is that in general we can do a lot better. Okay, so we've shown that here's an algorithm where, as we do more and more computation, we get better and better policies, and this is great because you may not actually
            • 46:00 - 46:30 want to run it until your policy entirely stops changing, particularly if the state space is very large. If you have a hard time budget, you can still guarantee that you're getting better and better, and maybe you stop after 100 or 1,000 iterations and just use that policy. So this is one approach that has that nice monotonicity guarantee. Can you say what that notation is? Oh sure — and what's your name? So A here is the number of actions
            • 46:30 - 47:00 and S here is the number of states, so the size of the deterministic policy space is: for every state you pick one of the actions, and you multiply all of those together, giving |A|^|S|. This slide also restates exactly what I said on the previous one — that if your policy doesn't change, it'll never change again, assuming pi star is
            • 47:00 - 47:30 unique per state. Okay, so that's one way to go: we can do policy iteration, and the interesting thing about policy iteration is that at every time point you have an explicit policy, and what that policy tells you is how to act forever when using it. When you compute the Q value, it says how much reward you will get if you take this action in this state and then follow that policy forever — and remember, today we're in the infinite-horizon case unless I
            • 47:30 - 48:00 specify otherwise. But along the way, a lot of those actions and decisions may not be very good, so your early policies might be pretty bad — we know we're monotonically improving, but the early policies might be bad. Value iteration is different: the idea is that at every iteration we maintain the optimal value of starting in a state, but as if we only get to make a finite number of decisions. So remember, in policy iteration we always have a policy and we
            • 48:00 - 48:30 have the value of acting with it forever — it just might not be very good. Value iteration asks: what is the optimal thing for me to do if I can make just one decision, i.e., take one step? I figure out the optimal thing to do for one step; then I imagine I get to take two steps, and I build on what I know I can do for one step to get the optimal thing to do for two steps. So the interesting thing with value iteration is that you always have an optimal value, but for the wrong horizon. One method has a value for the full,
            • 48:30 - 49:00 infinite horizon, but it might be a bad policy; the other has the optimal value and the optimal thing to do, but for the wrong horizon. The idea in value iteration is that you keep going to longer and longer horizons — H+1 steps, H+2 steps — and you build upon your previous solutions using dynamic programming. So let's see how to do that. This is where we get into the Bellman equation, the seminal work of Richard Bellman,
            • 49:00 - 49:30 and the idea here, as we've said, is that for a particular policy we satisfy the Bellman equation, and we can turn that into an algorithm. In particular, there's a thing called the Bellman backup operator, and what it says is: if you give me a value function — which right now we can think of as just a vector; later we'll get into function approximation — and we do a Bellman backup, it's essentially like saying: I had a value function, and I want to think about what I should do if I get to do the best thing that maximizes my immediate reward plus my
            • 49:30 - 50:00 expected future reward given that value function. So it says: I'm going to take a max over all the actions of the reward at that state and action plus the discounted expected future reward under the value function you've given me, and that yields a new vector of values over all your states — this is done per state. This is called the Bellman backup operator, and it comes up all the time.
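            Written out as a formula (matching the operator just described, with R(s,a) the reward model and P(s'|s,a) the dynamics), the Bellman backup B applied to a value function V is:

            $$
            (BV)(s) \;=\; \max_{a}\Big[\, R(s,a) \;+\; \gamma \sum_{s'} P(s'\mid s,a)\, V(s') \,\Big].
            $$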
            • 50:00 - 50:30 All right, so how do we do value iteration? We're just going to do this recursively: we loop until we hit convergence. Just as a refresher, this is the L-infinity norm, and written out more
            • 50:30 - 51:00 carefully it is ||V_{k+1} - V_k||_inf = max over s of |V_{k+1}(s) - V_k(s)|. What that means is that if you have two vectors, you look at every single entry and find the entry in which those two vectors are most different — that's the L-infinity norm, just as a refresher for those of you who might not have seen it recently. So what value iteration does is: we have a loop that looks very similar to what we
            • 51:00 - 51:30 saw for the Markov reward process. We initialize our value function, and then for each state we do this Bellman backup: we take our previous value function, do the Bellman backup, and get a new value function, over and over again until our value function stops changing. For policy iteration we kept going until our policy stopped changing; here we keep going until our value function stops changing, and that condition means I keep going until the difference between my old value of a state and my new value of a state is really small, for all the states.
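            A minimal sketch of that loop in Python, assuming the same R[s,a]/P[s,a,s'] array layout as the policy-iteration sketch above; the tolerance is an arbitrary choice of mine, and the final policy-extraction line anticipates the argmax step described a bit later.

            ```python
            import numpy as np

            def value_iteration(R, P, gamma, tol=1e-6):
                """Tabular value iteration.

                R: reward array of shape (S, A)
                P: transition array of shape (S, A, S), P[s, a, s'] = prob of next state s'
                gamma: discount factor in [0, 1)
                Returns an (approximately) optimal value function and a greedy policy.
                """
                S, A = R.shape
                V = np.zeros(S)                          # V_0 = 0 for every state
                delta = np.inf
                while delta >= tol:
                    # Bellman backup: Q_k(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V_k(s')
                    Q = R + gamma * P @ V                # shape (S, A)
                    V_new = Q.max(axis=1)                # V_{k+1}(s) = max_a Q_k(s,a)
                    delta = np.max(np.abs(V_new - V))    # L-infinity norm ||V_{k+1} - V_k||
                    V = V_new
                # One more backup, taking the argmax instead of the max, to extract a policy
                policy = (R + gamma * P @ V).argmax(axis=1)
                return V, policy
            ```

            Because V starts at zero, the first backup gives V_1(s) = max_a R(s,a), which matches the walkthrough that follows.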
            • 51:30 - 52:00 Yeah — and remind me your name? I have a question: how does value iteration connect to the finite horizon you just mentioned, when policy iteration works with the infinite horizon? Yeah, good question. You can think of it like this: at the beginning, if you don't get to make any decisions, the expected discounted sum of rewards you get from any state is zero — you didn't get
            • 52:00 - 52:30 to make any decisions, so you never got any reward. The first round of this is like saying, okay, I get zero reward if I don't make any decisions. Then k would be one for the next round, so we'd say: if I get to make one decision, I take a max over a of my reward, plus the discount factor times zero. It's now asking: what is the best thing I can do if I get to make one decision? So on
            • 52:30 - 53:00 the first round, if V is equal to zero for all s, then when we do this backup we get that V_{k+1}(s) is just equal to max over a of R(s,a), because the future-value part is zero when your value function is zero.
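            Concretely, the first couple of backups starting from V_0 = 0 look like this (my own spelled-out version of what is being written on the board):

            $$
            V_1(s) = \max_a R(s,a), \qquad
            V_2(s) = \max_a \Big[ R(s,a) + \gamma \textstyle\sum_{s'} P(s'\mid s,a)\, V_1(s') \Big],
            $$

            and in general $V_{k+1}(s) = \max_a Q_k(s,a)$ where $Q_k(s,a) = R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V_k(s')$ — which is exactly the Q-function connection raised in the next question.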
            • 53:00 - 53:30 So now this is like saying: before, I got no reward because I made no decisions; now, what's the best thing I should do if I get to make one decision? The next round, I ask what happens if I get to make one decision now, and the previous result gets plugged in as the future value. Question: the expression we're plugging into the max — is that the same as Q of (s,a)? Good question — in general what we're computing is the max over a of Q(s,a), because here we're requiring ourselves to take the max
            • 53:30 - 54:00 over the actions we take. Great question. Another question: in policy iteration, were we also initializing the values? In policy iteration we were just randomly initializing our policy — saying, in this state you go left, in this state you go right, etc. — and then evaluating the value of that policy. When we do that evaluation, yes, in that part we were setting V equal to zero and then doing this iteratively.
            • 54:00 - 54:30 Could we have done this in — well, we can do that; we can do this as part of the policy evaluation step of policy iteration. You mean we would be able to detect convergence by comparing successive policies? That's right, it's a good point: inside of policy iteration,
            • 54:30 - 55:00 instead of just halting when your policy has stopped changing, you could also halt when your value function (or Q function) has stopped changing. Yeah, go ahead — and what's your name? [Inaudible question.] That's a great question; I'll look that up for next week. To my knowledge there isn't
            • 55:00 - 55:30 one that would not be instance-dependent. In practice, policy iteration is very, very popular, and I think part of that is probably that it also has this nice monotonic improvement; value iteration does not necessarily have that monotonic improvement property — it is
            • 55:30 - 56:00 always the optimal thing to do for the wrong horizon, whereas the other one may not be optimal for ages but will always be monotonically improving. Great questions. Okay, let's see what the properties are for value iteration, because they are really useful, and we'll see why this whole thing ends up working. I want to highlight here that you can think of policy iteration also in terms of Bellman operators — and I think this speaks to the question just asked too. The Bellman backup operator for a particular policy pi
            • 56:00 - 56:30 is defined as follows: you don't see the max anymore; you're just committing to a particular policy, and policy evaluation amounts to computing the fixed point — I'll define that more formally in a second. So to do policy evaluation you just repeatedly apply this operator until V stops changing; that's the iterative algorithm we saw before, just with different notation. And then what we'd say here is that when
            • 56:30 - 57:00 we do policy improvement, another way to think of it is to explicitly do another backup but take the argmax instead of the max — that's the only difference, and it's the same as what you're doing with the Q function: this is Q pi k of (s,a), just in different notation. People will talk about the Bellman backup for a particular policy, but normally when people say Bellman backup they mean the one with the max, for the optimal value.
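            In symbols (same notation as above), the per-policy backup operator and its fixed-point property are

            $$
            (B^\pi V)(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, V(s'), \qquad B^\pi V^\pi = V^\pi,
            $$

            so iterating $V \leftarrow B^\pi V$ is the iterative policy evaluation from earlier, and policy improvement takes $\pi_{k+1}(s) = \arg\max_a \big[ R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, V^{\pi_k}(s') \big]$.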
            • 57:00 - 57:30 All right, let's go back to value iteration, because while I've told you how to compute a value function, I haven't told you how to extract a policy from it. The standard way is to go through this process and then do one more backup, taking the argmax instead of the max, to get your policy. So normally you don't bother to compute a policy along the way — you just run value iteration a bunch of times and at some point extract a policy. Let's look at some properties of this, and why it's a good
            • 57:30 - 58:00 thing. We've already seen that policy iteration is guaranteed to converge because there's only a finite number of policies and you never repeat a policy, so either you're at the optimal policy already or you keep improving. For value iteration it may not be clear yet that it should converge, so I'm first going to define a contraction operator. Let O be an operator, like the Bellman backup — you can think of it algebraically if this is something you haven't seen before, which is totally fine — and let |·| denote any norm,
            • 58:00 - 58:30 like the L-infinity norm or others. If, when you apply the operator to two different value functions — which we can think of here as vectors — the distance after you apply the operator is smaller than the distance before, then it's a contraction operator. To give a bit of intuition, in case contraction operators aren't something you've seen before: think about having two value functions where there are some states on which they really differ in value. What this says is that if
            • 58:30 - 59:00 you then apply such an operator to them — and we're going to prove that the Bellman operator is one — afterwards they get closer together, so the max difference across states is smaller afterwards. Question: is this an if-and-only-if, or just an if — can there be contraction operators that don't satisfy this? I'd have to check; I'm not an expert on all contraction operators, so I won't make a general claim.
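            The definition being used, written out (with ‖·‖ any norm, such as the L-infinity norm):

            $$
            \|\,O V - O V'\,\| \;\le\; \|\,V - V'\,\| \quad \text{for all value functions } V, V'.
            $$

            The proof below shows the Bellman backup in fact shrinks the L-infinity distance by a factor of $\gamma$.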
            • 59:00 - 59:30 What I will show is that the Bellman operator satisfies this statement, and therefore we can show that we're going to converge to a fixed point. In particular, under a couple of minor assumptions — your discount factor is less than one, or you end up in a terminal state with probability one; essentially, both of these make sure that your expected sum of discounted rewards is bounded — the Bellman backup, where you take the max, is a contraction, which means that the
            • 59:30 - 60:00 distance between two value functions is going to shrink and shrink — so the difference between V_{k+1} and V_k, in terms of the max difference at any state, is going to get smaller, and we'll go through that now. So this is proving that the Bellman backup is a contraction operator on V for gamma less than one. We're going to use the infinity
            • 60:00 - 60:30 norm, which again just asks where the max difference in values is across states, and I'm defining two different value functions here — they could be anything. What I'm going to show is that after you apply the Bellman operator, the distance can be no larger than the max difference before you did the Bellman backup. So what we have here, the first step, is just the definition of the Bellman backup operator. What you can
            • 60:30 - 61:00 see here is that I have two different maxes: a max over a for the first value function and a max over a' for the other value function. What I'm going to say now is that this has to be less than or equal to what you get if you pull the max over a out and require both of them to use the same action,
            • 61:00 - 61:30 and why is this true? Because before, we could pick different actions for the two Bellman backups, and now we can only pick one — instead of getting to maximize the second term separately, we're just maximizing the difference. That's the first place the less-than-or-equal-to comes in. Once we do that,
            • 61:30 - 62:00 everything is taking the same action on the same time step, so the reward terms cancel because they're identical, and this is exactly equal to: max over a of gamma — pulling out the discount factor — times the sum over s' of the probability of s' given (s,a), times V_k(s') minus V_j(s'). Now what I'm going to do is
            • 62:00 - 62:30 bound this by saying that the difference between the two value functions at any state is always less than or equal to their max difference across all states. So this is less than or equal to max over a of gamma times the sum over s' of the probability of s' given (s,a), times the infinity norm of V_k minus V_j, because the difference between any
            • 62:30 - 63:00 two particular states is always at most the max difference over all states, so I upper-bounded it using this expression. But this norm term does not depend on the next state, so I can take it out of the sum — it's just some constant, like seven. So this is equal to max over a of gamma times the norm times the sum over s' of the transition probabilities,
            • 63:00 - 63:30 and that is just the transition model: the probability of going to some next state has to sum to one if we sum over all next states, because for any state and action you're in, you always go to some next state. So that sum is just one, and we're left with a max over a of something with no dependence on a, so the max disappears. So what that says is that the
            • 63:30 - 64:00 max difference between the Bellman-backed-up value functions we get from two different starting value functions is no larger than gamma times the max difference between the value functions before, and if your gamma is less than one, you're strictly contracting: that max difference has to be smaller than it was before — at most 0.9 times whatever the distance was before, say.
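            For reference, here is the chain of steps from that proof collected in one place (my transcription of the board work, writing the two value functions as $V_k$ and $V_j$):

            $$
            \begin{aligned}
            \|BV_k - BV_j\|_\infty
            &= \max_s \Big|\, \max_a \big[R(s,a) + \gamma \textstyle\sum_{s'} P(s'\mid s,a)\,V_k(s')\big]
               - \max_{a'} \big[R(s,a') + \gamma \textstyle\sum_{s'} P(s'\mid s,a')\,V_j(s')\big] \Big| \\
            &\le \max_{s,a} \Big|\, R(s,a) + \gamma \textstyle\sum_{s'} P(s'\mid s,a)\,V_k(s')
               - R(s,a) - \gamma \textstyle\sum_{s'} P(s'\mid s,a)\,V_j(s') \Big| \\
            &= \max_{s,a}\; \gamma \,\Big|\, \textstyle\sum_{s'} P(s'\mid s,a)\,\big(V_k(s') - V_j(s')\big) \Big| \\
            &\le \max_{s,a}\; \gamma \textstyle\sum_{s'} P(s'\mid s,a)\,\|V_k - V_j\|_\infty
             \;=\; \gamma\,\|V_k - V_j\|_\infty .
            \end{aligned}
            $$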
            • 64:00 - 64:30 So that's really cool, because it means that if we apply value iteration, repeatedly doing the Bellman backup, we're shrinking the distance. If you think of having a series of value functions — V0, V1, V2, V3, V4, and so on — and think about the distances between successive ones, what this is saying is that those distances are going to be shrinking over
            • 64:30 - 65:00 time, and I've told you before that the optimal value function is unique, so as you shrink and shrink and shrink, this eventually converges to a unique value function — if there were two different fixed points, you can think about what would happen after you applied a Bellman backup to each and how different they'd be. So this proves it's a contraction. And just to note,
            • 65:00 - 65:30 even if all the inequalities are equalities, it's still a contraction if gamma is less than one — you're still making progress. Here are some things to think about if you want to go further: prove that value iteration converges to a unique solution for discrete state and action spaces; does the initialization matter at all; and is the value of the policy extracted from value iteration at each round guaranteed to monotonically improve? These are all great things to think about. Okay, so let's go back to something
            • 65:30 - 66:00 more practical: value iteration for finite horizons. Actually, I'll pause here in case anybody has a question about the proof. Yes — remind me your name? Can you go back? I understand everything except the first step, where we take the max over actions outside the norm; why is that greater than having separate maxes inside? Yeah, good question. What this is saying is: you
            • 66:00 - 66:30 can think of each term as a Q function: we get to pick a max for the first Q function and then we subtract off the max over another Q function. Say, concretely, that the first max is achieved at action a1. Then the second max is achieved either at a1 as well or at some other action a; if it's another action, that's only because
            • 66:30 - 67:00 its Q value is actually larger. Let me write it out just in case: the second term is either Q_j(s, a1) or Q_j(s, a) for some a not equal to a1. So either the second max picks exactly the same action, in which case the bound holds with equality, or it picks a different action, and the only time it would be different
            • 67:00 - 67:30 is if that value were actually larger than Q_j(s, a1) — and if it were larger, the difference would be smaller, because you'd be subtracting off a larger value. That's why we can turn this into an inequality: either both picked the same action and it's equal, or the second picked another action, for which the whole difference would have been smaller. Good question.
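            A compact way to state the step being explained — a difference of maxes is at most the max of the differences:

            $$
            \max_a Q_k(s,a) \;-\; \max_{a'} Q_j(s,a') \;\le\; \max_a \big[\, Q_k(s,a) - Q_j(s,a) \,\big],
            $$

            because for the action $a_1$ achieving the first max, $\max_{a'} Q_j(s,a') \ge Q_j(s,a_1)$, so subtracting it gives something no larger than $Q_k(s,a_1) - Q_j(s,a_1)$.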
            • 67:30 - 68:00 Another question: can you go back to the questions you were posing? About the third one — the value of the policy extracted from value iteration — isn't it implicit within value iteration that each new value function is better than the previous one, and therefore the policy will also be better? That's the question: whether it's guaranteed. We have not proven anything about that yet; we proved it for policy
            • 68:00 - 68:30 iteration, but this is something to think about for this case. Okay — anybody else have a question on the proof? All right. One thing I want to mention briefly, and it'll come up on the homework, is thinking about this for finite horizons. Almost all of today we've assumed that we get to act forever, so we have an infinite horizon, but there will certainly be cases where there's a finite horizon, and in that case this goes back to thinking of value
            • 68:30 - 69:00 iteration as just computing the optimal value for each horizon. If we have k = 1 to H, H being the max horizon you want to compute for, then at each round you have a value function V_{k+1} indexed by how many decisions you have left, and we just do this backup — it looks exactly the same as what we saw before — but now you also keep a policy at each round, the policy associated with that value function, given by the argmax action. So it computes a series of policies, one for each step.
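            Written out, the finite-horizon backup and its per-step greedy policy look like this (the explicit indexing, with $V_0 \equiv 0$ and $k$ counting decisions remaining, is my own spelling-out of what's described):

            $$
            V_{k+1}(s) = \max_a \Big[ R(s,a) + \gamma \textstyle\sum_{s'} P(s'\mid s,a)\, V_k(s') \Big], \qquad
            \pi_{k+1}(s) = \arg\max_a \Big[ R(s,a) + \gamma \textstyle\sum_{s'} P(s'\mid s,a)\, V_k(s') \Big],
            $$

            for $k = 0, 1, \dots, H-1$, yielding a (possibly non-stationary) sequence of policies $\pi_1, \dots, \pi_H$, one per horizon.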
            • 69:00 - 69:30 Okay, one other thing I want to mention here — and we'll talk about this on the homework too — is that you can also just simulate the value of a particular policy. This is also really popular once we start to get into really big state
            • 69:30 - 70:00 spaces. In a lot of these algorithms we're doing some sort of policy evaluation, and one thing you can do is just take your policy — you know your dynamics model, you know your reward — and roll it out: if you're in a state, you simulate what the next state might be, you get a reward, and so on. You can generate a really large number of episodes and then just average them. So if your boss asks how good this policy is, you just run it on 100 people, average their rewards, and you're
            • 70:00 - 70:30 done. This becomes really popular when it's hard to write down the dynamics model explicitly or to do that sum over s', but it's easy to sample. I'll note that you can use concentration inequalities — like Hoeffding's inequality and Bernstein's, for those of you familiar with them — to bound how much data you need for this estimate to be close to the true one, and the great thing is that it's not that much. So if you have an
            • 70:30 - 71:00 enormous state space — say you're Amazon, or you've got patient data that's incredibly high-dimensional — you don't have to do that huge sum over s'; you can just sample, and your accuracy generally improves like one over the square root of n, the number of samples. The nice thing is that this also requires no assumption about the Markov structure, so you might have a partially observable scenario — which comes up a lot in things like healthcare — and you can just roll out your policy and see how well it works. (Well, in healthcare you probably wouldn't just roll out an arbitrary policy, but you see what I mean.)
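            A minimal Python sketch of this Monte Carlo policy evaluation, assuming hypothetical `reset()` and `step(s, a)` sampler functions and a fixed rollout length chosen for illustration — none of these names come from the lecture.

            ```python
            import numpy as np

            def mc_policy_value(policy, reset, step, gamma, n_episodes=1000, horizon=200):
                """Estimate the value of `policy` by averaging discounted returns over rollouts.

                policy: function mapping a state to an action
                reset:  function returning an initial state
                step:   function (state, action) -> (next_state, reward)
                The estimation error typically shrinks like 1/sqrt(n_episodes).
                """
                returns = []
                for _ in range(n_episodes):
                    s = reset()
                    total, discount = 0.0, 1.0
                    for _ in range(horizon):            # truncate each rollout at a fixed horizon
                        a = policy(s)
                        s, r = step(s, a)
                        total += discount * r
                        discount *= gamma
                    returns.append(total)
                return np.mean(returns)
            ```

            To estimate a Q value instead — as comes up in the next exchange — fix the first action and follow the policy afterward.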
            • 71:00 - 71:30 Yeah — remind me your name? That's right: this is just the policy evaluation stage, exactly, and you can either do it to compute the value of a policy or, as you just suggested, do it for the Q value — you start off in a state, take each of the different actions, then roll out the policy to the end. And
            • 71:30 - 72:00 this is just another really popular technique that will come up in other places, so I wanted to note that we can do it here too. You can also think about applying all of these to the Mars rover example — I won't go through it now, but you can use it to step through these different algorithms and think about how you would compute these types of policies. All right, maybe I'll get to the end
            • 72:00 - 72:30 of this, but I'll leave you with two things. One is a thought question: is the optimal policy stationary — meaning independent of the time step — in finite-horizon tasks? We'll explore this on the homework. I also want to refresh some terminology: in the context of Markov decision processes and reinforcement learning, when we say model we normally mean a mathematical model of the dynamics and reward; a policy is a function mapping from states to actions that can be
            • 72:30 - 73:00 deterministic or stochastic; and the value function is the expected discounted sum of rewards from starting in a state and following a policy. The things that should be clear to you: you should understand what a Markov process, a Markov reward process, and an MDP are, along with the Bellman operator, contraction, model, Q value, and policy; you should be able to implement both value iteration and policy iteration, and you'll have practice doing that on the homework; you should also understand their strengths and weaknesses in terms of the computational complexity of the different operations we've discussed, and be able
            • 73:00 - 73:30 to prove contraction properties about them, as well as understand which of these methods really leverage the Markov assumption versus which don't require it. Next week we'll continue to talk about this, and we'll start to talk about function approximation and how we can learn when we don't know what these models are. I'll see you then — thanks.