Bellman Equations, Dynamic Programming, Generalized Policy Iteration | Reinforcement Learning Part 2

Estimated read time: 1:20


    Summary

    In the second part of the series on reinforcement learning, Mutual Information delves into the Bellman equations, dynamic programming, and generalized policy iteration. These are crucial concepts underlying reinforcement learning, providing a framework to improve an agent's behavior by incorporating environmental information. The Bellman equations serve as a bedrock, dynamic programming as a way to find optimal policies with perfect environmental knowledge, and generalized policy iteration as a versatile solution template. Through examples and explanations, the video unravels these concepts, offering insights into computing optimal policies even when assumptions are removed. This video is a stepping stone to more complex topics in the series.

      Highlights

      • Bellman equations are the foundation of reinforcement learning, converting information about the environment into improvements in the agent's behavior 🌍.
      • Dynamic programming allows finding optimal policies by assuming full knowledge of the environment ✔️.
      • Generalized policy iteration casts the component operations of RL algorithms as both competing and cooperating forces that together yield strong policies 🤝.
      • The video simplifies complex RL concepts, aiming to make them understandable for all viewers 🎓.
      • The journey through Bellman equations and policy iteration is necessary to master reinforcement learning fundamentals 📚.

      Key Takeaways

      • The Bellman equations connect state values across the state space, allowing known values to be used to compute others 🌟.
      • Dynamic programming assumes perfect knowledge of the environment, helping discover optimal policies 🤖.
      • Generalized policy iteration is a template used by many RL methods to compute strong policies 🚀.
      • Understanding these concepts makes reinforcement learning appear simpler 🧩.
      • The video provides a structured path to grasp reinforcement learning essentials for anyone committed 🚀.

      Overview

      This video explores the foundational aspects of reinforcement learning, focusing on Bellman equations, dynamic programming, and generalized policy iteration. These topics are crucial for understanding how to effectively train models to make intelligent decisions.

        Dynamic programming is discussed as a method to compute optimal policies under the assumption of perfect information about the environment. Although somewhat unrealistic in real-world applications, it serves as a powerful theoretical tool.

          Alongside, generalized policy iteration is explored as a flexible solution template, alternating policy evaluation and policy improvement to converge toward an optimal policy in reinforcement learning tasks.
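
          To make this loop concrete, below is a minimal Python sketch of policy iteration, the specific form of generalized policy iteration discussed in the video, for a small tabular MDP whose dynamics are fully known. The data layout (a `dynamics` mapping from state-action pairs to `(probability, next_state, reward)` triples) and all names are illustrative assumptions, not code from the video.

```python
# Minimal policy-iteration sketch for a tabular MDP with known dynamics.
# Assumed (hypothetical) layout:
#   states            : list of all states
#   actions[s]        : list of actions available in state s ([] for terminal states)
#   dynamics[(s, a)]  : list of (probability, next_state, reward) triples
def policy_iteration(states, actions, dynamics, gamma=0.95, theta=1e-8):
    V = {s: 0.0 for s in states}                                   # value estimates
    policy = {s: actions[s][0] for s in states if actions[s]}      # arbitrary start

    def q_value(s, a):
        # One-step lookahead: expected reward plus discounted value of the next state.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[(s, a)])

    while True:
        # Policy evaluation: sweep until the values for the current policy settle.
        while True:
            delta = 0.0
            for s in policy:
                v_new = q_value(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break

        # Policy improvement: act greedily with respect to the current value function.
        stable = True
        for s in policy:
            best_a = max(actions[s], key=lambda a: q_value(s, a))
            if q_value(s, best_a) > q_value(s, policy[s]):
                policy[s] = best_a
                stable = False

        if stable:  # greedy with respect to its own value function => optimal
            return policy, V
```

          Policy iteration runs evaluation to convergence between improvements; switching to a single evaluation sweep per improvement turns this into value iteration, the variant discussed later in the video.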

            Chapters

            • 00:00 - 00:30: Introduction to Bellman Equations and Dynamic Programming The chapter introduces Bellman equations as the foundational theory for reinforcement learning. These equations are crucial as they connect and provide structure to the state space that an agent must navigate intelligently. Bellman equations allow for transforming environmental information into improvements in an agent's behavior, which is vital for learning.
            • 00:30 - 01:00: Generalized Policy Iteration Overview The chapter provides an overview of Generalized Policy Iteration in the context of reinforcement learning (RL). It highlights the vast space of potential policies for any given problem and positions Generalized Policy Iteration as a solution template used by many RL methods. The chapter emphasizes how RL algorithm components both compete and cooperate to produce strong policies. The speaker aims to simplify the subject for better understanding.
            • 01:00 - 01:30: Introduction to Markov Decision Process (MDP) In this introduction to Markov Decision Process (MDP), the discussion is part of a six-part series on Reinforcement Learning fundamentals. The chapter starts with a brief overview, which can be skipped if the previous video has been seen. The MDP is presented as a framework where an agent takes actions that influence the environment, leading to a reward and a subsequent state.
            • 02:00 - 02:30: State Value Functions and Action Value Functions This chapter covers the concepts of State Value Functions and Action Value Functions in the context of reinforcement learning. These functions give the expected return when the agent follows its policy from a given state or state-action pair. The chapter also recaps that state, action, and reward values come from finite, problem-specific sets, that the environment's dynamics are given by a probability function over the next state and reward given the current state and action, and that the goal is to determine a policy, which is a state-dependent distribution over actions.
            • 03:00 - 03:30: Dynamic Programming and Bellman Equation
            • 03:30 - 04:30: Example: State Value Calculation The chapter titled 'Example: State Value Calculation' introduces the concept of state value calculation in the context of dynamic programming. It starts by explaining the expected value of GT given that the agent follows policy pi and is at a certain state or state-action pair. It reiterates that finding the optimal policy is possible if one can determine the optimum of these functions. The chapter concludes with a transition into dynamic programming, defined here as algorithms used to determine optimal policies when complete knowledge of the environment is available, meaning exact access to the required information is present.
            • 04:30 - 05:30: Optimal Policies and Bellman Optimality Equation This chapter introduces the concept of optimal policies in sequential decision-making, focusing on the Bellman Optimality Equation. It highlights the importance of the Bellman equation in this context, acknowledging that, while the material is mostly algebra and potentially dull, it is crucial for understanding value functions.
            • 06:00 - 07:00: Bellman Equations Explained This chapter introduces the concept of Bellman Equations within the context of Reinforcement Learning (RL). It begins with a practical example to motivate the discussion. The example involves a decision-making process in a Markov Decision Process (MDP) where an agent, from a given state (s0), can choose between two actions: moving left or right. The focus is on explaining the outcome and probabilities associated with the right action, emphasizing the fundamental principles behind Bellman Equations in RL (the equations themselves are written out after this chapter list).
            • 08:00 - 10:00: Policy Evaluation: Iterative Solution This chapter works through the example's numbers. After selecting the right action, the agent has, for instance, a nine percent chance of ending up in state S2 with a reward of zero; the policy itself chooses left with probability forty percent and right with probability sixty percent, and the discount factor is 0.95. The chapter poses a fundamental question: given the state values of the possible next states, can one ascertain the state value of the initial state (s0)? The answer lies in the Bellman equation, which is introduced as the method for solving this problem.
            • 11:00 - 13:00: Policy Improvement The chapter titled 'Policy Improvement' discusses the process of discovering improvements in decision-making policies. It begins with the definition of expected returns, conditional on a certain initial state. The chapter emphasizes a particular shorthand notation for simplicity. The core idea is to break down decisions by actions, leading to a sum of expected returns for each possible action. The focus is on calculating these expected returns to determine the best actions.
            • 16:00 - 18:00: Generalized Policy Iteration and Policy Iteration The chapter focuses on Generalized Policy Iteration (GPI) and policy iteration in reinforcement learning. Policy iteration alternates policy evaluation and policy improvement, producing a sequence of ever-better policies and value functions until neither changes, at which point the optimal policy and value function have been reached. GPI generalizes this idea: it refers to letting evaluation and improvement processes interact, regardless of their granularity, and almost all reinforcement learning methods are well described as GPI.
            • 18:00 - 22:00: Value Iteration Example The chapter works through value iteration on the gambler's problem: a gambler bets on a coin with a 40 percent chance of heads, aiming to reach $100 before going broke. Only one evaluation sweep is performed between greedy improvements, and the process still converges, yielding an unusual-looking optimal betting policy.
            • 22:00 - 23:00: Limitations of Dynamic Programming The chapter discusses the limitations of dynamic programming: it requires perfect knowledge of the environment's dynamics, and the state space cannot be too large. Despite these restrictive assumptions, the theory is foundational to reinforcement learning, and later methods in the series address these limitations directly.
            • 23:00 - 23:30: Conclusion and Next Steps The conclusion points viewers to the next video in the series, which covers Monte Carlo methods for reinforcement learning and begins removing the assumption of perfect knowledge of the environment.
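
            For reference, here are the equations the chapters above walk through, written out in the standard Sutton and Barto notation (a standard rendering, not a transcription of the video's on-screen panels):

```latex
\begin{align}
  % (1) state value as a policy-weighted average of action values
  v_\pi(s)   &= \sum_{a} \pi(a \mid s)\, q_\pi(s, a) \\
  % (2) action value as a dynamics-weighted average of reward plus discounted next-state value
  q_\pi(s,a) &= \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr] \\
  % substituting (2) into (1): the Bellman equation for the state value function
  v_\pi(s)   &= \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr] \\
  % Bellman optimality equation: the optimal state value takes the max over actions
  v_*(s)     &= \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
\end{align}
```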

            Bellman Equations, Dynamic Programming, Generalized Policy Iteration | Reinforcement Learning Part 2 Transcription

            • 00:00 - 00:30 it's time to learn about the Bellman equations Bedrock theory for reinforcement learning the Bellman equations provide the connecting web of our state space a space that our agent needs to intelligently Traverse as we'll see these equations allow us to convert information on the environment into improvements in the agent's Behavior obviously that's important and on our path to learning this we'll also cover dynamic programming whereby we assume perfect knowledge of the environment's Dynamics and can therefore discover optimal policies policies that are better than all other policies
            • 00:30 - 01:00 considering the huge space of potential policies for any given problem this is quite powerful Theory we'll also learn about generalized policy iteration which is a kind of solution template used by a huge number of RL methods we'll see that it cast the component operations of RL algorithms as both competing and cooperating forces that together yield strong policies for me personally this angle on the subject makes everything seem a bit simpler and if I do my job you'll soon feel the same way so if you'd like to understand reinforcement
            • 01:00 - 01:30 learning these are must know Topics in fact this is part two of my six part series on RL fundamentals so enough with the selling let's get into it we'll start with a short review which you can skip with the time codes if you've seen the previous video okay we cast quite generally our problem with a finite Markov decision process or mdp we can think of it as a process whereby an agent takes actions that are passed to an environment which then produces a reward and the next state
            • 01:30 - 02:00 these are then passed back to the agent and the process repeats in the episodic case it repeats until some terminal state is reached marking the end of an episode the possible State action and reward values are restricted to problem-specific finite sets the Dynamics of the environment are given with a probability function which provides the probability of the next state and reward given the current state and action our goal is to determine a policy which is a state dependent
            • 02:00 - 02:30 distribution over actions the agent selects actions by sampling from this distribution if we run an agent's policy through the mdp we'll get a trajectory of States actions and rewards our goal is to obtain a high discounted sum of future rewards averaged over many trajectories one such sum is written as GT which means our goal is this now to make progress we consider State value functions and action value functions
            • 02:30 - 03:00 which give us the expected value of GT assuming the agent follows the policy pi and is at a given state or state action pair we learned previously we can find the optimal policy if we can find the optimum of these functions okay end of review now we can mention something new dynamic programming in our context DP refers to algorithms used to find Optimal policies when we have complete knowledge of the environment in other words we have exact access to this
            • 03:00 - 03:30 which is an unrealistic assumption in the real world but for now it gives us a lot of strength now to exploit this we'll use an absolutely Universal Concept in the world of sequential decision making the Bellman equation there are actually a few of them we'll start with the one for the value function now before I show you this I should warn you this panel is mostly algebra it's not complicated but it's a little boring now as a YouTube educator I'm supposed to make this fun cool and exciting well algebra is none of those but you got to know some of it if you
            • 03:30 - 04:00 want to know RL okay let's motivate this with an example let's say we have these states which we can label like this now instead of describing the whole mdp I'm going to focus on this state s0 and tell you from this state the agent can pick left or right if the agent selects right the reward and next state probabilities are given by this distribution for example if the agent selects right this cell
            • 04:00 - 04:30 tells us there is a nine percent chance the agent will end up at S2 with a reward of zero as you'd expect something similar is true for selecting left next the policy from this state is a 40 percent chance of choosing left and a sixty percent chance of choosing right and the discount factor is 0.95 now here's the question if I also tell you the state values of the possible next states can you determine the state value of s0 yes and it's the Bellman equation
            • 04:30 - 05:00 that does this to discover it ourselves we start with the definition of what we're trying to calculate remember it's the expected return conditional on the center state s0 regarding notation I'll be using this shorthand to omit the random variable when it's obvious okay back to it we break this down by action meaning it'll be a sum of two terms one for the left action and one for the right action each term will be the expected return
            • 05:00 - 05:30 conditional on the current state and the action weighted by the probability of taking each action if you know your probability Theory you'll notice this is just an application of the law of total probability also if you're observant you'll recognize that this term is in fact an action value okay from here let's pick out the go right expectation and recognize something about GT it obeys a simple recursive relationship GT is equal to the reward
            • 05:30 - 06:00 plus the once discounted return one step later with a little algebra that's not too hard to see and so we can write this and then something sneaky this what I've done here is I've replaced GT plus one with what you get when plugging the randomly determined next state into the value function this is sneaky because they're not equal however in expectation they are
            • 06:00 - 06:30 this isn't too hard to see if you recall the definition of the state value function and apply the law of total probability but more importantly because it's true this substitution works now from here it gets easier we just remember that these are random variables and we're calculating an expectation by definition it's just a probability weighted average of values and we can find that information in the go right distribution specifically this will equal a sum over all pairs of reward and
            • 06:30 - 07:00 next state each term will be weighted by their probability the terms themselves are just this expression with the values substituted in which is this and remember at the start we were given these values so we have everything we need to calculate the right action expectation and we can do the exact same thing for the left action that would give us all numbers for this which means we have everything to calculate our answer and this is what the Bellman equation does for us it
            • 07:00 - 07:30 connects State values in fact it connects all state values this means if we can solve for some State values we can solve for others that's huge but it's not the whole story these are connections for any policy let's now talk about optimal policies in particular the Bellman optimality equation for State value functions if you recall the optimal State value function gives the highest possible values across all policies
            • 07:30 - 08:00 the optimal action value function is the same for State action pairs now let's revisit those five states and focus again on the left and right actions okay if I gave you the optimal action value for going left and going right could you tell me the optimal State value of the center State just imagine you're the agent if you select left and continue to follow whatever the optimal policy is you'll get on average a return of 19.1
            • 08:00 - 08:30 if you select right you'll get on average a return of 21.3 so what would you do clearly select right there's more reward there and in fact it's obvious that's what the optimal policy must do otherwise you're leaving expected return on the table that means the value of the center state must be 21.3 it's the max action value and that's the fundamental condition of bellman optimality the agent must always select the highest action value otherwise it's not behaving optimally
            • 08:30 - 09:00 now there is one complication that arises from multiple actions achieving the same highest value it just means that there are multiple optimal policies read this for more details okay at this point it'll help if we take a step back and talk generally about the Bellman equations what we've seen is that for any policy and for all states and for all actions in each state we have the following the value of a state when following the policy Pi is equal to the policy weighted average of the action values which is something we
            • 09:00 - 09:30 saw earlier to be clear if we had 100 States this would provide 100 equations second the action value of a state action pair is the probability weighted average of the reward you'll get in the next step and the once discounted value of the next state hopefully given what we've covered already these equations make sense to you now these aren't the Bellman equations but they can be used to build them to frame this I'll use a grid with the state
            • 09:30 - 10:00 and action value along the rows and along the columns each of these cells will refer to an equation where what appears on the left hand side of the equation is given by the row label and what appears within the right hand side is given by the column label for example an equation with v on the left hand side and Q in the right hand side is given by equation one and naturally equation two goes here all I'm doing is organizing things a bit but
            • 10:00 - 10:30 here's the nice part the grid suggests something more if we substitute two into one we can relate any state value to All State values one step away and this gives us the Bellman equation for the state value function here it is written out explicitly this is exactly what we worked through earlier in the video and if we substitute one into two we can relate any action value to all action values one step away giving us the Bellman
            • 10:30 - 11:00 equation for the action value function again here is what the equation looks like written out explicitly now the most important thing to understand is using the mdp distribution the Bellman equations allow us to calculate State values from one step away State values and to do something similar for State action pairs also for each unknown value of the function there is a Bellman equation so finding the solution can be thought of as solving a system of equations finally there's something else
            • 11:00 - 11:30 that's nice about this view to refer to the Bellman optimality equations which concern optimal policies in particular we only need to make a few modifications there's basically only one tweak to call out the state value must be the maximum over optimal action values like we discussed earlier structurally everything else is exactly the same take a second to absorb this okay now let's step way back and remember our original goal is to find an optimal policy for a given mdp as we'll see
            • 11:30 - 12:00 the Bellman equations will be the backbone of our algorithms to do this in particular we'll see that a major subroutine is policy evaluation this refers to Computing the state value function or action value function for a given fixed policy pi for all states or state action pairs under some mdp now there are several ways we could approach this in our case we'll consider a very simple iterative solution which we'll explore with example 4.1 from Sutton and Barto here's the setup we have 16 states
            • 12:00 - 12:30 of which two are terminal in each state the agent can move up down left or right and that will deterministically move it to whichever neighboring state unless that state is Off the Grid in which case nothing happens every move will incur a reward of -1 and there's no discounting so the goal is to reach a terminal State as quickly as possible and here's the question given a policy which simply samples actions with equal probability what is its value function well the
            • 12:30 - 13:00 approach starts by randomly initializing the state values with the exception that the terminal states must have a value of zero which is always true we will refer to each of these values with a capital V it's capital because especially in our future algorithms it's a random estimate which approximates the True Value okay now clearly these values aren't the solution and we know that because they don't obey the Bellman equation but what's the simplest thing we could do to fix that how about just reassigning each
            • 13:00 - 13:30 state value according to the Bellman equation like this for example that means this value will be determined by these values in fact one application gives us this from here we continue reassigning each state one pass over all states is called a sweep and we repeat these sweeps until the values stop changing by more than some small amount and due to some nice
            • 13:30 - 14:00 Theory this process always converges to the right answer my intuition is that each assignment Imports some small piece of the known mdp always making the value estimates a little better and that's it for policy evaluation we just turned the Bellman equation into an assignment and repeatedly applied it starting from random values for Action values the process is virtually identical except states are replaced with State action Pairs and we use the Bellman equation for the action value function instead next we'll cover another major subroutine of many RL algorithms policy
            • 14:00 - 14:30 Improvement the question here is given the value function of some policy Pi how can we determine a better policy to explain this let's say for Simplicity the policy is deterministic next let's recall what an optimal policy does the action selected in each state must achieve the max action value so does this Inspire any ideas here's one we Define a new policy which always
            • 14:30 - 15:00 selects a max action value in every state as a reminder we can form the action values like this using the state values Also regarding terminology we say the new policy is acting greedily with respect to the value function since it's akin to a greedy algorithm which uses many local optimizations to travel towards a global Optimum moving on because of something called the policy Improvement theorem we are guaranteed that this new policy is at least as good as the policy of the given value
            • 15:00 - 15:30 function so this always helps and that's great news as an example let's revisit the value function from earlier remember this values the policy that gives all actions the same probability now applying policy Improvement here means our new policy would in each state select only the highest action value and by eyeballing this you can tell this is an optimal policy though in general it'll take more than one iteration of policy Improvement to get to the optimal
            • 15:30 - 16:00 policy now you may ask what if the policy is already greedy with respect to its value function there will be no improvement to make yes but very interestingly if you meet that condition where the value function values the policy and that policy is greedy with respect to that value function you must in fact be at the optimal policy and that's huge news it means this tweaking only becomes impossible when there is no more helping to do and so this leads us finally to the big picture generalized policy iteration or GPI
            • 16:00 - 16:30 it refers to an entire class of algorithms so we'll start with a specific case called naturally policy iteration which will get us the optimal policy and value function it starts with some arbitrarily initialized policy which we'll call Pi zero we then apply policy evaluation which means we'll determine its value function V Pi zero next because we know about policy Improvement we can create a slightly better policy pi 1. but now our value function no longer applies so we apply evaluation again to
            • 16:30 - 17:00 get a new value function then we apply policy Improvement to this to yield another better policy and this process continues until both the policy and the value function stop changing at which point we must have arrived at the optimal policy and value function let's appreciate this it's rare you come across something so simple that does something so useful today's a lucky day okay now generalized policy iteration refers to something well more General
            • 17:00 - 17:30 the textbook has a nice description of it which I'll just read GPI refers to the general idea of letting policy evaluation and policy Improvement processes interact independent of the granularity and the other details of the two processes almost all reinforcement learning methods are well described as GPI I couldn't say it any better myself this really captures just how big of a picture GPI really is to get a sense of the generality let's imagine adjusting policy iteration
            • 17:30 - 18:00 remember policy evaluation as we covered involves multiple sweeps over the states until the values stop changing but what if we switch to policy Improvement after only one sweep would we still converge to the optimum yes and this is in fact called value iteration a different algorithm within GPI which we'll see an example of in a minute generally speaking as long as each process does their job at least partially the algorithm will still recover the optimal policy in the limit as a footnote the book has this really beautiful way of picturing how these two processes
            • 18:00 - 18:30 interact I was going to walk through it but it's a bit too redundant with what I've already covered and rather heavy to explain I'm including it here because I just don't have the heart to cut it so feel free to mute me and pause your way through this if you're interested okay now let's do the value iteration example this is the gambler's problem from the book here's the setup the gambler's state is his capital which is an integer between 1 and 99. his goal is to reach the terminal state of 100 before the
            • 18:30 - 19:00 terminal state of zero dollars to achieve the goal he can repeatedly bet on an unfair coin which has a probability of heads of 40 percent specifically he can bet an integer up to the amount he has or the difference to 100 dollars if the coin comes up heads he gains as much as he bet otherwise he loses it this can be encoded as an mdp by applying no discounting and saying all rewards are zero except there is a reward of one when transitioning to the state of one hundred dollars this can all be expressed with an
            • 19:00 - 19:30 indicator function like this to visualize the solution we'll start with a plot of our value function where you have the state across the horizontal axis and the value along the vertical axis second we'll have a similar plot except the vertical axis gives us the policy's bet amount in each state which in this case is deterministic since this problem will give us quite a few ties among highest action values we'll break those ties by always selecting the smallest bet okay we start by initializing the value
            • 19:30 - 20:00 function to zero doesn't have to be that but it works from here we do our first iteration of policy Improvement which gives us these bets this policy says once over 50 bet whatever it takes to get to 100 now we ask what value does this policy imply for each state well if we're doing policy iteration we would continually sweep over the states until the value functions stopped changing but as we mentioned that's not
            • 20:00 - 20:30 required in value iteration we only do one sweep before making the policy greedy with respect to this value function and this process iterates back and forth and back and forth and back and forth until the process stops changing and we have the optimal policy and as you'll notice the optimal policy looks interesting and unusual and that's part of the beauty I'm sure some clever
            • 20:30 - 21:00 analysis of this problem would uncover the strategy but dynamic programming makes that unnecessary with a simple General algorithm you get whatever optimal strategy the problem demands [a code sketch of this example appears after the transcript] so this is a very powerful tool but its power is limited to get here we required perfect knowledge of the environment and I didn't go into detail about this but the state space is also limited it can't be too big nonetheless the theory is foundational to RL as real world problems strip away these cushy assumptions we can address those holes directly and still make good progress to
            • 21:00 - 21:30 see that you'll need to watch the next video on Monte Carlo methods for RL don't be soft be tough and keep focusing see you there
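
            As referenced in the transcript, here is a minimal Python sketch of value iteration for the gambler's problem, assuming the setup described above (capital between $1 and $99, heads probability 0.4, a reward of 1 only on reaching $100, no discounting, ties broken toward the smallest bet). The function and variable names are illustrative, not taken from the video or the textbook.

```python
def gamblers_value_iteration(p_heads=0.4, goal=100, theta=1e-9):
    # V[s] is the value of holding s dollars; s = 0 and s = goal are terminal (value 0).
    V = [0.0] * (goal + 1)

    def action_values(s):
        # A bet of a wins a dollars with probability p_heads, otherwise loses a dollars.
        # The reward is 1 only on the transition that reaches the goal; no discounting.
        vals = []
        for a in range(1, min(s, goal - s) + 1):
            win = p_heads * ((1.0 if s + a == goal else 0.0) + V[s + a])
            lose = (1.0 - p_heads) * V[s - a]
            vals.append((a, win + lose))
        return vals

    # Value iteration: one Bellman-optimality backup per state, swept until values settle.
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(v for _, v in action_values(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy, breaking ties in favour of the smallest bet, as in the video.
    policy = {}
    for s in range(1, goal):
        best_a, best_v = None, float("-inf")
        for a, v in action_values(s):
            if v > best_v + 1e-12:   # only a strict improvement replaces the current bet
                best_a, best_v = a, v
        policy[s] = best_a
    return V, policy
```

            Plotting policy[s] against s should reproduce the unusual-looking optimal betting strategy the video shows, with the value function rising toward 1 as the capital approaches $100.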