Exploring Policy Gradient Methods

Stanford AA228/CS238 Decision Making Under Uncertainty I Policy Gradient Estimation & Optimization

Estimated read time: 1:20


    Summary

    In this lecture, Amelia from Stanford Online discusses policy gradient estimation and optimization, focusing on decision-making in uncertain environments. The talk begins with the importance of identifying sequential decision problems in students' project proposals. She then delves into the concept of policy gradients, using room temperature control as an example. Amelia explains the necessity of parameterized policies for large state spaces and explores methods for estimating policy gradients, namely finite differences, the regression gradient, and the likelihood ratio. These estimates are what later optimization methods use to improve decision-making policies.

      Highlights

      • Finite differences method relies on basic calculus for gradient estimation. 📊
      • Regression gradient method uses perturbations and a regression approach for optimization. 🔍
      • Likelihood ratio is a calculus-heavy approach to analytically derive gradients. 🔢
      • Policies are parameterized to handle large or continuous state spaces efficiently. 🗺️
      • Understanding utility and parameter optimization is key in policy gradient methods. 🔑

      Key Takeaways

      • Understanding sequential decision problems is crucial for project planning. 📝
      • Finite differences is a basic calculus method for estimating gradients. 📏
      • Regression gradient involves perturbation and regression fitting. 📈
      • Likelihood ratio uses calculus to find gradients analytically. 🧮
      • Parameterization of policies helps manage large state spaces. 🌡️

      Overview

      In the Stanford Online lecture, Amelia introduces the concept of policy gradient estimation, a crucial aspect of decision-making under uncertainty. She emphasizes the importance of choosing a sequential decision problem for projects, noting that this forms the backbone of policy optimization in such contexts.

        The discussion progresses into parameterized policies, which Amelia illustrates with temperature control in a building. By parameterizing the policy with theta values, decisions like turning on the heat or the air conditioning can be represented compactly, even over large or continuous state spaces.

          Various techniques for estimating and optimizing policy gradients are explored, including finite differences, regression gradients, and likelihood ratio methods. Each of these presents a unique approach to handling the complexities of policy optimization, focusing on improving the utility and efficiency of decision-making models.

            Stanford AA228/CS238 Decision Making Under Uncertainty I Policy Gradient Estimation & Optimization Transcription

            • 00:00 - 00:30 Hi everyone, I'm Amelia. I'm a master's student with Mykel in SISL, and today I'll be talking about policy gradient estimation. But first: Friday at 5:00 p.m., what is happening? The proposal, yes. So for your proposal, the most important thing is that the project you want to work on is
            • 00:30 - 01:00 a sequential decision problem. If you're unsure about any of those components, please come by office hours or post on Ed; we're happy to clarify. Oh, and with uncertainty, don't forget that. Come by office hours if you're not sure, and we're happy to help talk through it. Also, as a note, it's totally fine if you don't yet know how you're going to solve that problem, but you need to
            • 01:00 - 01:30 make sure that the problem you're interested in solving has those characteristics. All right, and with that we will get started with policy gradient estimation. Like Josh talked about in the last lecture, we have some policy π, and π is going to tell us what action to take based on the state that we're in. For this lecture I'm going to be using the example of having a policy to control the temperature in a building,
            • 01:30 - 02:00 and we're going to say that the actions we could take are turning on the heat or turning on the AC, with the additional constraint that we're never doing both at the same time, since that pretty clearly wouldn't be a great policy based on our expert intuition. One thing that policy might look like is that for every possible temperature we have some associated action: we could say at 30° we're going to turn on the heat, and at 31° we also have
            • 02:00 - 02:30 heat, and then at some point things get warm enough, maybe at 85°, that now we're going to turn on the AC. As you can imagine, in cases where we have large or even continuous state spaces, storing a policy like this wouldn't be possible. That is why we're going to be looking at policies that are parameterized, and we're going to call the parameters θ. To give
            • 02:30 - 03:00 you an example of what that might look like for this case: along this axis we're going to put possible temperatures, we'll say 0 to 100 for these purposes, the weather here is pretty nice. Then, rather than storing a table with 100 different values, we could set, for example, two parameters, θ1 and θ2, and our policy based on these parameters will be that if we're below θ1 we turn on the
            • 03:00 - 03:30 heat, if we're above θ2 we turn on the AC, and between them we have neither. Once we have this parameterized policy, the thing that we're generally interested in optimizing, which will be Kiana's half of the lecture, is its utility, and we write the utility of the policy parameterized by θ as U(θ).
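            To make the running example concrete, here is a minimal sketch of the two-parameter thermostat policy described above; the function name and action labels are illustrative choices, not from the lecture.

            ```python
            def thermostat_policy(temperature, theta):
                """Deterministic two-parameter policy from the lecture's example:
                heat below theta[0], AC above theta[1], neither in between."""
                theta1, theta2 = theta
                if temperature < theta1:
                    return "heat"
                elif temperature > theta2:
                    return "ac"
                return "off"

            # Example: with theta = (65, 75), a 30-degree state turns the heat on.
            print(thermostat_policy(30, (65, 75)))  # -> "heat"
            ```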
            • 03:30 - 04:00 In our optimization, many of the methods rely on the gradient of U(θ), and that's why we're going to be looking at a few different options for estimating it in this lecture. Because θ will generally be multiple variables (otherwise we'd have a very simple policy), the gradient of U(θ) is going to have this form: for every
            • 04:00 - 04:30 θi we're going to have a partial derivative. Let's say that we have n of these parameters θ; for the example that I just showed you, n would be two, and this is going to give us an n-by-1 vector. Before I go on I wanted to make a note: this portion of the lecture, and probably your coursework too,
            • 04:30 - 05:00 is pretty calculus heavy, and depending on your background you may have done that more or less recently, so I don't want you to feel like calculus-related questions are less legitimate than questions about material that maybe you've only seen in this course. I definitely had to look stuff up preparing for this, and if the calculus is a sticking point, please say something; we want you to feel good about every step that we're taking. So the first method is something that you probably saw in early calculus,
            • 05:00 - 05:30 and that's going to be finite differences. In finite differences we have some function and then some point where we're interested in computing the derivative. Does anyone remember what we do for
            • 05:30 - 06:00 this method? No worries. We're going to take another point some small distance away, which we call δ, and we're going to compute the slope between those two. So the dotted line is the thing that we're estimating, the non-dotted line is the thing that we're computing, and we're using that to get a basic approximation just around that one point of
            • 06:00 - 06:30 interest. In the multivariable case that we're considering for the gradient of U(θ), this is going to look something like U(θ + δe₁) minus U(θ), all over δ, and then we're going to do one of
            • 06:30 - 07:00 these for each of our parameters θ. You might be wondering what this e here is: this is a standard basis vector, and the index indicates the position at which we have a one; all other positions are going to be zero. So
            • 07:00 - 07:30 here we would have a one followed by n minus one zeros, and so on. The reason that we're doing that is that we're only applying this difference δ to one parameter at a time, and so this is going to give us an n-by-1 vector with a simple estimate of the gradient of U(θ) around some point. Does anyone have questions here? How do we feel about the first one?
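            Written out, the finite-difference estimate described above takes this form (a reconstruction of the board work using the lecture's notation):

            ```latex
            \nabla_\theta U(\theta) \approx
            \begin{bmatrix}
            \frac{U(\theta + \delta e_1) - U(\theta)}{\delta} \\
            \vdots \\
            \frac{U(\theta + \delta e_n) - U(\theta)}{\delta}
            \end{bmatrix},
            \qquad
            e_i = (0, \dots, 0, 1, 0, \dots, 0)^\top \ \text{(the 1 in position } i\text{)}.
            ```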
            • 07:30 - 08:00 Yes? "What's U?" Great question. U is the utility of θ, and we get U(θ), as Josh mentioned previously, by using rollout simulations. Something that can impact this method is that, depending on the variance of U(θ), it may take a lot of rollouts to get a good estimate, and the book talks about some techniques that you could practically use to mitigate the potential
            • 08:00 - 08:30 variance between when you're computing this and then running the policy. Yes, the θ here are the parameters for the parameterization. "For the high-dimensional case, would this first-order approximation always work?" Oh, let me repeat the question: the question is, for the higher-dimensional case where we have a lot of θs, would this first-order approximation always work?
            • 08:30 - 09:00 It kind of depends on the data that you're working with. We will get into more complex methods that are a bit more robust, but you can imagine that if the θ parameters are on very different scales, then you may run into issues applying the same difference to all of them. So this would be ideal for a simpler case, and some of the other methods would be better for cases like what you're talking about. Anything else?
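            A minimal sketch of the finite-difference estimator, assuming a user-supplied rollout_utility(theta) function that estimates U(θ) from rollouts (the helper name is illustrative, not from the lecture):

            ```python
            import numpy as np

            def finite_difference_gradient(rollout_utility, theta, delta=0.1):
                """Estimate the policy gradient by perturbing one parameter at a time.

                rollout_utility: function mapping a parameter vector theta to an
                estimate of U(theta), e.g. the mean return over several rollouts.
                """
                theta = np.asarray(theta, dtype=float)
                n = theta.size
                u_base = rollout_utility(theta)
                grad = np.zeros(n)
                for i in range(n):
                    e_i = np.zeros(n)
                    e_i[i] = 1.0  # standard basis vector: 1 in position i, zeros elsewhere
                    grad[i] = (rollout_utility(theta + delta * e_i) - u_base) / delta
                return grad  # n-by-1 gradient estimate around theta
            ```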
            • 09:00 - 09:30 Cool, okay. So the next method that we're going to talk about is called the regression gradient. This may have come up in other ML classes that you've taken, but when we do a regression, we have some set of data
            • 09:30 - 10:00 points and then we're finding the line that best fits them. So this is that regression, and what we're going to talk about is what the line is that we're fitting. We know that the thing we're interested in is finding the gradient of U of
            • 10:00 - 10:30 θ, and around some point θ we're estimating what this line would be. The first thing that we're going to do is collect data points, and then we'll fit our regression line. So what should the data points be? We're interested in θ, and we want to collect a bunch of points; let's say
            • 10:30 - 11:00 this is where θ is, and we want to collect points around it, in its neighborhood, at which we can find the gradient. So we're going to create this matrix that we'll call ΔΘ, and we're doing a perturbation for each of our parameters,
            • 11:00 - 11:30 and this is going to be m-by-n, where m is the number of perturbations and n, again, is the number of parameters. There's no right or wrong answer for choosing the number of perturbations to do; empirically, a rule of thumb is to have double the number of parameters be
            • 11:30 - 12:00 the number of perturbations, but that could be based on computational constraints, or expert knowledge, or you try things and this is what worked the best. And then the way that we get the perturbations is also something where there's no one analytical option that's the best, but empirically we find that randomly sampling from the normal
            • 12:00 - 12:30 distribution works well. These are also going to be normalized random samples, so you can visualize that as though we have a sphere and we're sampling points along the surface, and those would be what the perturbations are. So that's how we get our changes in θ. Yes, a question: "If you're randomly sampling around
            • 12:30 - 13:00 θ, how are you getting the line?" Yeah, great question. So the question is: if you're randomly sampling around the point θ, where does the line come from? Once we have our random samples, which are the differences from the original θ, we also need to find differences from the original U(θ), and then I'll talk about the method that we use for fitting the line between the two. But basically we collect both of those data points and then use
            • 13:00 - 13:30 the method that we would use for regression; I think this also came up in the, or no, the approximate value function chapter as well. Yes, but that's a great point, that is what we need to do next. Any other questions? Yes, about the sampling on the sphere? Yeah, so, let's see, I'll draw a picture and hopefully this helps, but if not let me know.
            • 13:30 - 14:00 So we have some normal distribution, we're sampling from some normal distribution, and we could have any mean μ here, and then we're using the samples from that distribution to perturb it. But when we normalize, they'll all be distances of one from the original value. So we could be
            • 14:00 - 14:30 moving a distance of one in different directions for each of the parameters, but we're not going to be moving, like, five and then three; it's just constraining it a bit. Yes? "Would we be sampling from the normal distribution for each coordinate?" We're applying the same perturbation for each coordinate, or the same set of perturbations, so we're not doing something like only looking at this point for one of them;
            • 14:30 - 15:00 we would choose some number of points and apply that to all of them, which is what gives the m-by-n. "In n dimensions, you're sampling from an n-dimensional multivariate normal?" Yes, yeah. "Are the random points themselves what we would call perturbations?" So the points are how we're perturbing θ, so the
            • 15:00 - 15:30 perturbation is like, okay, we've taken a point, this is how much we're going to change θ, and then we apply that change, and I think once we've applied that change, that's what we'd call the perturbation. Yeah. Any other questions? Cool, okay. So that would be how we get our data set of points for θ, and that's definitely the more complicated part of building this data set. Then when it
            • 15:30 - 16:00 comes to computing ΔU, this is going to be more similar to what we've seen before. Using rollouts, like with Monte Carlo simulation, we're going to find U(θ + Δθ⁽¹⁾) minus U(θ), and then we're going to find this difference for each of our perturbations, U(θ
            • 16:00 - 16:30 + Δθ⁽ⁱ⁾) minus U(θ) for each one, and this will be m-by-1, and again these are coming from policy rollouts. So does everyone feel good on where
            • 16:30 - 17:00 ΔΘ and ΔU come from? All right. So now that we have ΔΘ and ΔU, we need to fit the line between them to estimate the gradient. We're going to say that the gradient of U(θ) is approximately equal to ΔΘ† ΔU, where the dagger means pseudo-inverse. And the reason that we're doing the
            • 17:00 - 17:30 pseudo-inverse instead of the regular inverse is that we're not guaranteed that ΔΘ is invertible, because m and n ideally should be different, so it's not going to be square. When we do a pseudo-inverse, like if we were doing the pseudo-inverse of X, what this actually works out to is the inverse of XᵀX, times Xᵀ. And supposing that X is
            • 17:30 - 18:00 m-by-n, then Xᵀ will be n-by-m, and we can see that we'll be taking the inverse of something n-by-n, which is square. So that's what the pseudo-inverse allows us to do, and this is the same as what we used for approximate value functions. If you want to take a look at that in more detail, the CS229 notes, which are available online, also go into this in great depth, and I recommend that as well.
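            In symbols, the regression-gradient fit described above is (reconstructed from the spoken description):

            ```latex
            \nabla_\theta U(\theta) \approx \Delta\Theta^{\dagger}\, \Delta U,
            \qquad
            \Delta\Theta^{\dagger} = \left(\Delta\Theta^{\top}\Delta\Theta\right)^{-1}\Delta\Theta^{\top},
            \qquad
            \Delta\Theta \in \mathbb{R}^{m \times n},\;\;
            \Delta U \in \mathbb{R}^{m \times 1}.
            ```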
            • 18:00 - 18:30 Any other questions here? Yes: why is ΔU m-by-1? Oh, good question. In this case it's m-by-1 because we're doing one of these differences for each perturbation. In the previous method we just did one difference per parameter, and that was n-by-1, and then
            • 18:30 - 19:00 here we're doing one per perturbation, so it's m-by-1 now. The important thing is just that we have an equal number of perturbations of θ and of U, and in the previous one we just did one for each parameter θ; that's how we made sure those were the same in that case. "Would you do this for
            • 19:00 - 19:30 every single θ?" No, you don't have to do this for every single θ, since the perturbations are being sampled from an n-dimensional normal distribution; we're moving all of the parameters in each perturbation, and then we're doing that m times. Any other questions here? Cool, okay.
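            A minimal sketch of the regression-gradient estimator, assuming the same hypothetical rollout_utility(theta) helper as before, with normalized normal perturbations, m of roughly twice the number of parameters, and a pseudo-inverse fit:

            ```python
            import numpy as np

            def regression_gradient(rollout_utility, theta, m=None, radius=0.1):
                """Estimate the policy gradient by fitting a linear model to
                perturbed utility differences (least squares via pseudo-inverse)."""
                theta = np.asarray(theta, dtype=float)
                n = theta.size
                m = 2 * n if m is None else m  # rule of thumb: twice the number of parameters

                # Sample perturbation directions from a normal distribution and normalize
                # them, so every perturbation has the same length (here, `radius`).
                directions = np.random.randn(m, n)
                directions /= np.linalg.norm(directions, axis=1, keepdims=True)
                delta_theta = radius * directions  # m-by-n matrix of perturbations

                u_base = rollout_utility(theta)
                delta_u = np.array([rollout_utility(theta + row) - u_base
                                    for row in delta_theta])  # m-by-1 utility differences

                # Gradient estimate: pseudo-inverse of the perturbation matrix times delta_u.
                return np.linalg.pinv(delta_theta) @ delta_u
            ```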
            • 19:30 - 20:00 So the last method that we're going to look at is the likelihood ratio, and this is definitely the most calculus-heavy component of this lecture, so I'll try to go through it slowly, line by line, and if any line isn't feeling good for you all, just let me know and we can pause there.
            • 20:00 - 20:30 As previously, the thing that we want to compute is the gradient of U(θ), and the way that we're going to do that in this case is to use an analytical form of U(θ), which says U(θ) is the integral over all possible trajectories of pθ(τ) R(τ) dτ. I'll go through
            • 20:30 - 21:00 what each of the things in this integral is. First, τ is going to be some trajectory, which is a sequence of states and actions. For example, for our temperature setting, we could have a τ that looks something like this: maybe 30° and we turn on the heat, and then it gets really hot, 90°, and we turn on the AC,
            • 21:00 - 21:30 and then we apparently have a super powerful heater and AC, so now it's back down to 40° and we turn the heat on again. So that's one example of what a trajectory might be. R(τ) is going to be the return for that trajectory, and we say return rather than reward because we're not just looking at the reward for the immediate state; we can use whatever type of formulation we
            • 21:30 - 22:00 want to have a discount over the entire sequence in τ. Generally we'll be using the Bellman optimality equation, Mykel's famous favorite equation in this class, but that's up to you if you choose to do that some different way. So we have the return for that trajectory, and then this term is going to be the likelihood of the trajectory as determined
            • 22:00 - 22:30 by the policy parameterized by θ. So intuitively, what does that mean? Our policy is telling us, based on a state, what action we should take. Let's say that we have some policy where we're heavily prioritizing being super energy efficient, so we're going to say θ1 is, like, 10, and we're only going
            • 22:30 - 23:00 to turn the heat on if it's below 10°, and θ2 is 95, and we only have the AC on above 95°. If we had a policy parameterized like that, then this trajectory that I just showed you would be, well, it would be impossible, that would never happen, since according to these parameters we wouldn't turn on the heat or AC at those temperatures. So the policy is determining how
            • 23:00 - 23:30 likely a trajectory is. Do we feel good on that? Yes: "I thought we were trying to find the policy. Do we already know the policy?" We have some initial policy, and then we're optimizing the θ that's parameterizing that policy by optimizing the utility of those θs. So maybe an example
            • 23:30 - 24:00 that is more familiar: if we have some neural network with parameters θ, then we're going to optimize for the best parameters of our neural network by finding the gradient of the loss. "Okay, so it'll be like slowly changing it?" Yeah, we're trying to make sure that we're improving the utility incrementally, and the way we're doing that is by finding the gradient of the utility and then trying to move our θ in a direction that's positive for
            • 24:00 - 24:30 the gradient of the utility. No problem, that was a good question. Anything else? Cool, okay. So the policy that we have is determining the likelihood, and what we're going to do is take this analytical expression and get an expression for the gradient of U(θ)
            • 24:30 - 25:00 from it. To start, we're just going to take the gradient of both sides, and the first step that we'll take is to move the gradient inside of the integral.
            • 25:00 - 25:30 And then the next thing that we're going to do is multiply what we have inside of this integral by one,
            • 25:30 - 26:00 and the reason that we're doing this is that once we've done that multiplication, we can write things out in a way that allows us to simplify. So we're going to write this as pθ(τ) times the gradient of pθ(τ)
            • 26:00 - 26:30 over pθ(τ), times R(τ), dτ. You'll see that when written in this way, the expression that we have is actually taking the form of an expectation. So now we can write this as the expectation over τ of the gradient with respect to θ of pθ(τ)
            • 26:30 - 27:00 over pθ(τ), times R(τ). I'm going to pause here for a second: does everyone feel good on why we can convert that to an expectation? Yeah, definitely, let's see, I'll give an example. Let's say that we have f(x);
            • 27:00 - 27:30 then the expectation of f(x) is the integral over all possible x of p(x) f(x) dx. We can map that form onto here, where this would be the probability and this would be f(τ), so p(τ) times f(τ), and then we can
            • 27:30 - 28:00 write that as the expectation over τ of this f(τ). Okay, sweet. So, based on a trick in chapter 10.5 that I will not be deriving today, but I will note here in case you are interested, it's called the log derivative trick, that allows us to
            • 28:00 - 28:30 write the expectation in this form. So now we have an expression for the gradient of U(θ) in the form of this expectation. Do we feel good on that relationship?
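            Collecting the steps just described into one chain (a reconstruction of the board work; the last step uses the log derivative trick, since the gradient of log p equals the gradient of p divided by p):

            ```latex
            U(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau
            \quad\Longrightarrow\quad
            \nabla_\theta U(\theta)
            = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
            = \int p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, R(\tau)\, d\tau
            = \mathbb{E}_{\tau}\!\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right].
            ```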
            • 28:30 - 29:00 All right. So we're assuming that we have some formulation for the return that we know is simple to compute, so we're not going to worry about that part. Then the last thing that we need to do in order to estimate the gradient with this method is find this term right here: we're interested
            • 29:00 - 29:30 in the gradient with respect to θ of the log of pθ(τ), and there are going to be two different cases for this, depending on the type of policy that we have parameterized by θ. We could have either a stochastic policy
            • 29:30 - 30:00 or we could have a deterministic policy. When we have a deterministic policy, we have our policy parameterized by θ, πθ, and every time we give it the same state s it's going to return the
            • 30:00 - 30:30 same action a. So for example, every time it's 30° we say turn on the heat; that's always what's going to happen according to this policy. By contrast, with the stochastic policy, πθ of our state s is going to give us a distribution over
            • 30:30 - 31:00 actions. So maybe if s is 30°, then in this distribution 90% of the time we're turning on the heat, 5% of the time we're turning on the AC, and 5% of the time we have neither. This allows us to write
            • 31:00 - 31:30 πθ for a specific action, like turning on the heat, given the state 30°, as the probability of taking action a from state s. So in the example I just gave, for the action of turning on the heat when our state is 30°, the probability is 90%. How do we feel about deterministic versus stochastic policies?
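            One common way to build such a stochastic policy is a softmax over per-action scores; the lecture does not specify a particular form, so treat this as an illustrative assumption. A softmax also makes the gradient of log πθ(a|s) easy to compute, which is what the likelihood ratio method needs.

            ```python
            import numpy as np

            ACTIONS = ["heat", "ac", "off"]  # illustrative action set from the example

            def action_probabilities(state, theta):
                """Softmax policy: theta is a 3x2 matrix of per-action (weight, bias)
                applied to the temperature; returns a distribution over ACTIONS."""
                scores = theta[:, 0] * state + theta[:, 1]
                scores -= scores.max()                  # numerical stability
                probs = np.exp(scores)
                return probs / probs.sum()

            def sample_action(state, theta, rng=None):
                """Draw an action index according to pi_theta(a | s)."""
                rng = np.random.default_rng() if rng is None else rng
                return rng.choice(len(ACTIONS), p=action_probabilities(state, theta))

            def grad_log_pi(state, action, theta):
                """Gradient of log pi_theta(action | state) with respect to theta
                (same shape as theta), using the standard softmax identity."""
                probs = action_probabilities(state, theta)
                features = np.array([state, 1.0])       # per-action features (weight, bias)
                indicator = np.zeros(len(ACTIONS))
                indicator[action] = 1.0
                return np.outer(indicator - probs, features)
            ```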
            • 31:30 - 32:00 Cool, okay, great. During lecture we're only going to talk about the stochastic case, and this may or may not be surprising, but it's actually easier to compute, and it's kind of interesting to see why. For a stochastic policy we're going to be finding the gradient with respect to θ of log pθ(τ), and we're going to do each of these one
            • 32:00 - 32:30 at a time. First we have some τ, and in this case we'll have a sequence of states, actions, and rewards (or returns) up to some time horizon. We could set that however we want, based on compute constraints, knowledge about the problem, etc.; in this case I'm going to say we're only going to account for the last 10 steps.
            • 32:30 - 33:00 So pθ(τ) is going to be the probability of our initial state, and this doesn't come from the policy; you'll have some distribution over initial states, which could be a uniform distribution, or again it could be based on some knowledge about the problem, but we're sampling from that distribution independent of the policy. The policy doesn't affect where we're
            • 33:00 - 33:30 starting. So we have that, and then we're multiplying it by the product across our time steps of the transition probability: how likely is it that we go to the next state, given that we're in our current state and we took this chosen action in our trajectory? And then we multiply that by the likelihood of that
            • 33:30 - 34:00 action according to our policy. Our policy says, for this state, how likely is this action, and then we say, based on our transition probabilities, given this state and this action, how likely is the next state in the trajectory. This whole product gives us the likelihood of our trajectory τ according to our policy parameterized by
            • 34:00 - 34:30 θ. Does that make sense? "Sorry, could you say the last part one more time?" Yeah, definitely. Okay, so πθ of the action given the state is the likelihood that we take that action from the state according to our policy. We're multiplying that by the transition probability that we get to the next state, given that we were in our current state and we took the action that our policy told us, and
            • 34:30 - 35:00 then we're doing that product for every state in the sequence. So that, times the probability of our first state, which is not dependent on the policy, gives us the likelihood of the trajectory overall. Cool, all right. So we've got a question: "Does this follow from the chain rule?" Exactly, this follows from the chain rule, nice. Yes? "Sorry, just underneath the
            • 35:00 - 35:30 product symbol, what is that variable that's equal to one?" Oh, that's a k; let me write that a little bit more clearly. "And the superscripts on the s's and a's?" Yes, the superscripts on the s's are the time step that we're at, so s² would be the second state in the trajectory. And I think, actually, Mykel, please
            • 35:30 - 36:00 correct me if I'm wrong, but I think this should be starting from two, right, the product here? No, because we're going to two from one. Yes, all right. "Is it s to the k and a to the k?" Yeah, let me just clean up my handwriting a bit.
            • 36:00 - 36:30 Yes, that's okay. Cool, all right. Any other questions here? All right, so the first part is done: we have an expression for pθ(τ), and now we're going to take the log of that.
            • 36:30 - 37:00 So, the log of pθ(τ): because we're taking the log of a product, we can convert it into a summation. We have the log of the probability of our first state, and then we have, let me print instead of writing in cursive,
            • 37:00 - 37:30 and then we have one last term in our summation.
            • 37:30 - 38:00 All right, so we've done the log, the product is a summation now, and then the last thing that we need to do is take the gradient, and this is where having a stochastic policy is actually going to make things easier. When we take the gradient with respect to θ of both sides, we will see that all of the, yes? "Should
            • 38:00 - 38:30 there be a log there?" Yes, there should be a log. All right, so when we take the gradient of both sides, all of the terms that aren't dependent on our policy parameterized by θ drop out. So we can get rid of that, and we can get rid of that, and now the expression that we're left
            • 38:30 - 39:00 with is just this summation, and this gives us another option for estimating the gradient. Yes?
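            Putting the stochastic-policy derivation together (a reconstruction, with d the trajectory depth, T the transition model, and product bounds reflecting the index discussion that follows; only the policy terms depend on θ):

            ```latex
            p_\theta(\tau) = p(s^{1})
            \left[ \prod_{k=1}^{d-1} T\!\left(s^{k+1} \mid s^{k}, a^{k}\right) \right]
            \left[ \prod_{k=1}^{d} \pi_\theta\!\left(a^{k} \mid s^{k}\right) \right]

            \log p_\theta(\tau) = \log p(s^{1})
            + \sum_{k=1}^{d-1} \log T\!\left(s^{k+1} \mid s^{k}, a^{k}\right)
            + \sum_{k=1}^{d} \log \pi_\theta\!\left(a^{k} \mid s^{k}\right)

            \nabla_\theta \log p_\theta(\tau)
            = \sum_{k=1}^{d} \nabla_\theta \log \pi_\theta\!\left(a^{k} \mid s^{k}\right)
            ```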
            • 39:00 - 39:30 "Why did those terms drop out?" They dropped out because they didn't have any dependence on θ; it's like if we were taking the derivative with respect to x of y, same thing here. Yeah. Anything else here, any other steps that might not be clear? Yes: "This might be nitpicking, but
            • 39:30 - 40:00 should it be from k equals 1 to d minus 1, or should it be from k equals 1 to d?" Let's see, good question, not nitpicking. Yeah, that goes to d minus 1, does it not? I don't think so? No? We can take it offline, though. All right: you want to find the
            • 40:00 - 40:30 gradient of the log likelihood over all of the state-action pairs, and I think you have d of them, right? But if this term is, oh right, right, oh, I see what you mean, yeah.
            • 40:30 - 41:00 Okay. Yes? Oh, why we changed it to be depth minus one: for example, in this case where our depth is 10, we only know up to the 10th state in the trajectory, that's all we're remembering. So if we went to d, then here we would be looking at 11, which is something that we don't know because it's not in our current trajectory, so
            • 41:00 - 41:30 we're stopping at 10, which would be depth minus one, plus one, here. Yes? "When we take the derivative with respect to θ, do we not end up looking at the d-th state-action pair?" The last term, we won't be looking
            • 41:30 - 42:00 at, given, yes, so the last term that we're left with would only be going to d minus one in this case, but, Mykel, should we assume that we know the trajectory beyond d, or? "Yeah, I think there's just a little tricky thing for the transition model, because it goes to k plus one, but, you know, for the second summation
            • 42:00 - 42:30 I think that can still go to d, because you know what that is." Yes, the extra term, so at the d level the transition wouldn't matter, because it doesn't matter what the d-plus-one state is; we just don't consider those possibilities, we don't care what
            • 42:30 - 43:00 comes after that. Yeah, anything else? All right, so that was the last method that I had, and you can see the book, which also goes through the derivation for the deterministic policy case. And now I will let Kiana take over to talk about what we're going to do with these estimates. Oh, yes: "So d is the depth; what is k?" k is just the
            • 43:00 - 43:30 counter, that's the current step that we're looking at. Yeah, so at d, no, so k is stopping at d, so we would do k from one to d in this case: the first state and action, then the second, and so on, all the way to d would be included in the summation. Yes?
            • 43:30 - 44:00 "But would you iterate over several different trajectories in order to find this, in order to find the expectation?" Yes, in order to find the expectation you need to consider multiple
            • 44:00 - 44:30 possibilities. Sorry, would you please repeat that? I'm having a hard time hearing. "So what I'm saying is, for the trajectories, in practice it might be hard to iterate over all of them." So, yeah, you would be sampling, and you
            • 44:30 - 45:00 can use different heuristics for how you want to sample trajectories; some may be of more or less interest, like maybe it's more important to have information about trajectories that we think are likely, or high risk, etc. But you would need to be sampling; we would expect that in many cases you wouldn't be able to enumerate all of them. Yes? "I'm trying to understand, from a high level: what we did was just try to calculate the gradient of the utility
            • 45:00 - 45:30 function, and then there are, like, three different methods, like with finite differences and with this?" Yeah, so these are all different methods for estimating the gradient of the utility function, so that we can optimize the parameters θ of our policy with respect to utility. We're assuming we're not able to compute that gradient directly, and then these are some possible
            • 45:30 - 46:00 options. Yeah. All right, I will show you more about what we're going to do with this. [Applause]
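            To tie the likelihood ratio estimator together, here is a minimal sketch, assuming a hypothetical environment whose reset() returns a state and whose step(action) returns (state, reward, done), and a policy object exposing sample_action and grad_log_pi like the softmax sketch earlier. It is an illustration under those assumptions, not code from the course.

            ```python
            import numpy as np

            def likelihood_ratio_gradient(env, policy, theta,
                                          depth=10, num_trajectories=100, gamma=0.95):
                """Monte Carlo estimate of grad U(theta) = E[ grad log p_theta(tau) * R(tau) ].

                Only the policy terms depend on theta, so grad log p_theta(tau) is the
                sum over the trajectory of grad log pi_theta(a^k | s^k).
                """
                grad_estimate = np.zeros_like(theta, dtype=float)
                for _ in range(num_trajectories):
                    state = env.reset()
                    grad_log_p = np.zeros_like(theta, dtype=float)
                    ret = 0.0
                    for k in range(depth):
                        action = policy.sample_action(state, theta)
                        grad_log_p += policy.grad_log_pi(state, action, theta)
                        state, reward, done = env.step(action)
                        ret += gamma**k * reward      # discounted return R(tau)
                        if done:
                            break
                    grad_estimate += grad_log_p * ret
                return grad_estimate / num_trajectories
            ```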