# An Overview of Reinforcement Learning

## Reinforcement Learning Fundamentals

Reinforcement learning is a subfield of machine learning that is concerned with sequential decision making processes. In general, we say that the goal of reinforcement learning is to develop an agent that interacts with its environment in such a way so as to maximize its received cumulative reward.

The main aspect that makes the problem of reinforcement learning generally much more difficult than the related problem of supervised learning is the fact that the actions that an agent takes at one timestep can have a great amount of influence over the reward that the agent receives at some much later timestep. This time delay between cause and effect means that it is very difficult for an agent to know the long-term consequences of taking any action that it has not already taken before. In order to get around this difficulty, reinforcement learning agents typically incorporate some form of an iterative trial-and-error based approach in order to discover the optimal sequence of actions it must take.

The problem of properly resolving cause and effect is referred to as the credit assignment problem while the problem of designing an optimal trial-and-error approach for maximizing cumulative reward is referred to as the exploration vs. exploitation dilemma.

### Markov Decision Processes

In mathematical terms, reinforcement learning problems are typically formulated in terms of what are called Markov decision processes. A Markov decision process is simply a process where an agent makes sequential decisions within some environment that obeys the Markov property.

Formally, a Markov decision process (MDP) is made up of five distinct parts:

• A set of states $$S$$ that describe the observable states that the environment can be in at any given timestep
• A set of actions $$A$$ that an agent can take at any given timestep
• A transition probability distribution $$P_{a}(s, s')$$ that represents the probability that the agent taking some action $$a$$ while in state $$s$$ will result in the agent being in state $$s'$$ at the next timestep
• A reward signal $$R_{a}(s, s')$$ which yields the immediate reward for the agent in the situation that it takes action $$a$$ while in state $$s$$ and ends up in state $$s'$$ at the next timestep
• A discount factor $$\gamma$$, $$0 \leq \gamma \leq 1$$, that represents the factor by which a reward one timestep in the future ought to be discounted when calculating total future reward

### Reinforcement Learning Agents

Given the MDP formalization, we further formalize the goal of a reinforcement learning agent by saying that we desire to learn a policy $$\pi^{*}(s)$$ that takes a state as input and outputs an optimal action such that the long-term reward that the agent sees is maximized.

In order to learn such a policy, reinforcement learning agents typically maintain representations of one or more of the following:

• Environment model: the agent's representation or approximation of the underlying dynamics of the environment
• State value function $$V_{\pi}(s)$$: the expected total future reward given that the agent is in state $$s$$ and the agent follows policy $$\pi$$
• State-action value function $$Q_{\pi}(s, a)$$: the expected total future reward given that the agent is in state $$s$$, takes action $$a$$ and then follows policy $$\pi$$

One important thing to note about the value functions $$V_{\pi}(s)$$ and $$Q_{\pi}(s, a)$$ is that, by definition, the policy $$\pi$$ that optimizes $$V_{\pi}(s)$$ or $$Q_{\pi}(s, a)$$ over all $$s$$ or $$(s, a)$$ is the optimal policy $$\pi^{*}(s)$$. We can therefore learn the optimal policy $$\pi^{*}$$ implicitly if we can optimize over either $$V_{\pi}$$ or $$Q_{\pi}$$. We would, at every step, simply take the action that has the highest value of $$Q(s, a)$$ (if we optimized $$Q$$) or we would take the action that maximizes the expected value of $$V(s')$$ over all next possible states $$s'$$ (if we optimized $$V$$ and know the transition probabilities). We will return to this idea shortly.

## Classic Reinforcement Learning Algorithms

With some basic definitions now in hand, we are ready to discuss ways in which a reinforcement learning agent might go about actually learning some optimal policy $$\pi^{*}(s)$$.

### Planning via Dynamic Programming

In the case where we have either a model or direct knowledge of the dynamics of the environment, the process of discovering an optimal policy is referred to as planning. One common way to plan in an MDP is to use a mathematical optimization method called dynamic programming.

Dynamic programming involves decomposing a larger problem into smaller subproblems, solving those subproblems and then combining the solutions of the subproblems in order to solve the larger problem. In order for a problem to be amenable to dynamic programming, it must satisfy two conditions:

• Optimal substructure: the optimal solution to the problem can be written in terms of optimal solutions to subproblems
• Overlapping subproblems: the problem can be decomposed into subproblems whose solutions involve reusing the solutions to previously solved subproblems

Solving MDPs via optimization over the value functions $$V_{\pi}$$ and $$Q_{\pi}$$ (which would implicitly yield the optimal policy $$\pi^{*}$$, as discussed above) is amenable to dynamic programming because the value functions themselves give us a mechanism through which to reuse solutions and a recursive decomposition of the problem is provided via the Bellman equations.

#### Bellman Equations

The Bellman equations are a set of equations that express the relationship between the value functions $$V_{\pi}$$ and $$Q_{\pi}$$ for different states and state-action pairs.

The Bellman expectation equations define $$V_{\pi}$$ and $$Q_{\pi}$$ in terms of themselves in a fairly straightforward manner: $$\begin{eqnarray} &V_{\pi} (s) = \sum_{s' \in S} P_{\pi (s)} (s, s') (R_{\pi (s)} (s, s') + \gamma V_{\pi} (s')) \\ &Q_{\pi} (s, a) = \sum_{s' \in S} P_{a} (s, s') (R_{a} (s, s') + \gamma Q_{\pi} (s', \pi (s'))) \end{eqnarray}$$

where we assume that the policy $$\pi (s)$$ is deterministic, for the sake of simplicity.

The Bellman optimality equations define the optimal value functions $$V^*$$ and $$Q^*$$ that correspond to the optimal policy $$\pi^*$$ in terms of themselves. By definition, the optimal policy is the policy that, at every timestep, chooses the action that maximizes the expected future reward. Following this logic, we can use the Bellman expectation equations to get expressions for $$V_{\pi}$$ and $$Q_{\pi}$$ and then maximize those expressions over the set of all possible actions in order to get expressions for $$V^*$$ and $$Q^*$$. This then yields the Bellman optimality equations: $$\begin{eqnarray} &V^{*} (s) = \sup_{a \in A} [\sum_{s' \in S} P_{a} (s, s') (R_{a} (s, s') + \gamma V^{*} (s'))] \\ &Q^{*} (s, a) = \sum_{s' \in S} P_{a} (s, s') (R_{a} (s, s') + \gamma \sup_{a' \in A} Q^{*} (s', a')) \end{eqnarray}$$

While it seems mathematically appealing that the optimal value functions can be expressed in terms of themselves, it is not obvious how this might help us achieve our goal of arriving at these optimal value functions in the first place. In order to close this loop, we now introduce the concept of contraction mapping theorem.

#### Contraction Mapping Theorem

Let us reframe the Bellman optimality equation as an operator $$B^*$$ that acts on a value function $$Q$$ such that: $$[B^{*}Q](s, a) = \sum_{s' \in S} P_{a}(s, s') (R_{a}(s, s') + \gamma \sup_{a' \in A} Q(s', a'))$$

We say that an operator $$T$$ is L-Lipschitz over a set $$S$$ with respect to some norm $$\left \lVert \cdot \right \rVert$$ if, for all $$u, v \in S$$: $$\left \lVert Tu - Tv \right \rVert \leq L \left \lVert u - v \right \rVert$$

In plain English, this means that the operator $$T$$ expands the distance between any $$u$$ and $$v$$ by at least a factor of $$L$$ where the distance is taken with respect to some norm $$\left \lVert \cdot \right \rVert$$.

If $$L < 1$$, then we say that such an L-Lipschitz operator is a contraction mapping because, instead of expanding the distance between any $$u$$ and $$v$$, it actually contracts the distance. Therefore, iterative application of a contraction mapping causes the distance between any two elements of a set to exponentially decay to 0 where the decay factor is $$L$$. It is this reasoning that is at the heart of Banach fixed-point theorem (otherwise known as contraction mapping theorem) which states that the iterative application of any operator that is L-Lipschitz with $$L < 1$$ converges to a single unique fixed point.

Given that we know that $$Q^*$$ is a fixed point of $$B^*$$ (by definition), if we can prove that $$B^*$$ is L-Lipschitz with $$L < 1$$ then it follows that iterative application of $$B^*$$ starting from any initial estimate of $$Q$$ must yield $$Q^*$$. We will now prove that $$B^*$$ is indeed a contraction mapping.

Using the infinity norm as our metric of choice, we have the following via algebraic manipulation: $$\left \lVert B^{*}Q_1 - B^{*}Q_2 \right \rVert_{\infty} = \gamma \sup_{s, a \in S \times A} | \sum_{s' \in S} P_{a}(s, s') (\sup_{a' \in A} Q_{1}(s', a') - \sup_{a' \in A} Q_{2}(s', a')) |$$

We can then get rid of the probabilistic weighting from the transition probabilities in order to upper bound the above expression as so: $$\gamma \sup_{s, a \in S \times A} | \sum_{s' \in S} P_{a}(s, s') (\sup_{a' \in A} Q_{1}(s', a') - \sup_{a' \in A} Q_{2}(s', a')) | \leq \gamma \sup_{s' \in S} | \sup_{a' \in A} Q_{1}(s', a') - \sup_{a' \in A} Q_{2}(s', a') |$$

Given that we have: $$| \sup_a f(a) - \sup_a g(a) | \leq \sup_a | f(a) - g(a) |$$

by noting that it must be true that (assuming $$\max_{a} f(a) \geq \max_{a} g(a)$$, without loss of generality): $$f(a_1) - g(a_2) \leq f(a_1) - g(a_1), a_1 = arg\max_a f(a), a_2 = arg\max_a g(a)$$

we then have: $$\gamma \sup_{s' \in S} | \sup_{a' \in A} Q_{1}(s', a') - \sup_{a' \in A} Q_{2}(s', a') | \leq \gamma \sup_{s', a' \in S \times A} | Q_{1}(s', a') - Q_{2}(s', a') | = \gamma \left \lVert Q_1 - Q_2 \right \rVert_{\infty}$$

This then completes the proof that $$B^*$$ is a contraction mapping: $$\left \lVert B^{*} Q_1 - B^{*} Q_2 \right \rVert_{\infty} \leq \gamma \left \lVert Q_1 - Q_2 \right \rVert_{\infty}$$

#### Value Iteration

The above analysis lends itself very nicely to a very simple iterative planning algorithm called value iteration.

In value iteration, you start with some $$Q_0$$ which could, in principle, have any arbitrary initialization. You then perform the operation $$B^{*} Q_{k} = Q_{k+1}$$ until convergence. By contraction mapping theorem, this is guaranteed to converge to the optimal value function $$Q^*$$ which implicitly defines the optimal policy $$\pi^{*}$$.

### Model-Free Control

In the case where the agent has no knowledge of the dynamics of the environment, the problem of discovering an optimal policy is referred to as control or optimal control. Control algorithms involve the agent directly interacting with the environment in order to build up knowledge of the optimal policy as it goes.

Model-free control algorithms can largely be broken down into two categories:

• Temporal difference learning algorithms which try to learn $$Q^*$$
• Policy gradient algorithms which try to directly learn $$\pi^{*}$$

#### Temporal Difference Learning

In temporal difference (TD) learning, the agent iteratively updates its estimate of the value function using its existing estimate of the value function along with experience the agent collects as it interacts with the environment.

In general, the temporal difference learning update rule when taking an action $$a_t$$ in state $$s_t$$ at timestep $$t$$ takes the form: $$Q(s_t , a_t) \leftarrow Q(s_t , a_t) + \alpha [\tau - Q(s_t , a_t)]$$

where $$\alpha$$ is the learning rate and $$\tau$$ is called the TD target.

Using different expressions for the TD target $$\tau$$ yields different classes of reinforcement learning algorithms.

Using the empirical return: $$G_t = R_t + \gamma R_{t+1} + \gamma^{2} R_{t+2} + ... + \gamma^{T} R_T$$

where $$T$$ is the final timestep yields the class of Monte Carlo control algorithms, otherwise known as TD(1) algorithms.

Using the bootstrapped one-step lookahead: $$R_t + \gamma Q(s_{t+1}, a_{t+1})$$

yields the TD(0) class of algorithms.

We can also interpolate between the TD(0) and TD(1) domains with the TD(λ) algorithms, which use the following TD target: $$\begin{eqnarray} &G_{t}^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k} + \gamma^{n} Q(s_{t+n}, a_{t+n}) \\ &G_{t}^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_{t}^{(n)} + \lambda^{T-t-1} G_{t} \end{eqnarray}$$

where $$T$$ is the final timestep, $$\lambda$$ is a hyperparameter such that $$0 \leq \lambda \leq 1$$ and $$G_{t}^{(n)}$$ is called the n-step return.

The intuitive significance of the $$\lambda$$ hyperparameter in TD($$\lambda$$) algorithms is the fact that it modulates a bias-variance tradeoff of sorts. We would like for our target to be as close as possible to the value of the true future discounted rewards so that we converge to the proper value. We call this desideratum "low bias". At the same time, we would also like our target to be as stable as possible, so that we can converge quickly. We call this desideratum "low variance".

Using these criteria, we can see that the TD(1) target is an unbiased estimate of the true future discounted rewards. However, since the value of the TD(1) target is dependent on choices made by the agent through multiple timesteps, it is not a very stable target. On the other hand, the TD(0) target only uses the observed immediate reward along with the estimated value at the next timestep. Therefore, it is very stable. However, there is no guarantee that the TD(0) target is very close to the true future discounted rewards at any given timestep.

Generally speaking then, we have that high values of $$\lambda$$ correspond to low bias and high variance whereas low values of $$\lambda$$ correspond to high bias and low variance. This hyperparameter is tuned on a case-by-case basis.

##### Q-Learning

Q-learning is a type of TD learning algorithm that uses the TD target: $$R_t + \gamma \sup_{a' \in A} Q(s_{t+1}, a')$$

The fundamental principle of Q-learning is essentially that of contraction mapping theorem applied to the Bellman optimality operator $$B^*$$. It can be thought of as essentially an online version of value iteration where we have no knowledge of the reward signal $$R_{a} (s, s')$$ or the transition probability distribution $$P_{a} (s, s')$$ except through directly interacting with the environment. What we do then is simply estimate the reward signal $$R_{a} (s, s')$$ with the actual observed reward $$R_t$$ at timestep $$t$$ and estimate the probability distribution $$P_{a} (s, s')$$ as a delta function corresponding to the actual observed transition at timestep $$t$$ (i.e. from $$s_t$$ to $$s_{t+1}$$).

An equivalent formulation of Q-learning is to think about it in terms of the TD error which is the difference between our actual $$Q_t$$ and our TD target. In terms of the TD error, $$\delta$$, we have: $$\begin{eqnarray} &\delta_t = R_t + \gamma \sup_{a \in A} Q_{t} (s_{t+1}, a) - Q_{t} (s_{t}, a_{t}) \\ &Q_{t+1} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) + \alpha \delta_{t} \end{eqnarray}$$

##### Function Approximation

In practice, we don't want to keep track of a distinct value for every single possible state-action pair as this would be computationally intractable for any moderately interesting problem. What we want to do is approximate our value function in some way such that we can keep problems with high cardinality state spaces tractable.

The most common way to do this is to firstly represent the state-action space in terms of some feature vector space $$\Phi$$. Then, we use this feature space representation of the state-action as input into our value function which we estimate as some differentiable function parameterized by some d-dimensional parameter vector $$w$$. Concretely, we have: $$Q(s, a) \approx Q_{w}(\phi_{s, a}), \phi \in \Phi, w \in \mathbb{R}^{d}$$

Using the differentiability of our value function estimate, we can optimize our value function estimate towards our target via gradient descent. Gradient descent works by updating the parameters of our estimated value function in the negative direction of the error (in particular, the mean-squared error is a popular error function to use) between the estimated value function and our target.

Using squared TD error as the error we're optimizing with respect to, we have the following form of function approximation Q-learning: $$\begin{eqnarray} &J_{t}(w) = \delta_{t}^{2} = (R_t + \gamma \sup_{a \in A} \hat{Q}_w (\phi_{s_{t+1}, a}) - \hat{Q}_{w} (\phi_{s_t , a_t}))^{2} \\ &\nabla_{w} J_{t} (w) = - \alpha \delta_t \nabla_{w} \hat{Q}_{w} (\phi_{s_t , a_t}) \\ &w \leftarrow w + \alpha \delta_t \nabla_{w} \hat{Q}_{w} (\phi_{a_t , a_t}) \end{eqnarray}$$

There are certain complications that come along with function approximation, however. Before seeing why that is the case, let us first acknowledge that the objective of using value function approximation is to represent the value of a state or state-action in terms of a function that contains far fewer parameters than there are number of total possible states or state-actions. Because of this constraint, most possible value functions are not representable by our value function approximator. This means that the true optimal value function is likely not in the space of value functions we can represent.

Short of finding the exact true optimal value function, it would be useful for us to be able to find a value function that is close to optimal. Analyzing this requires us to formalize the concept of "closeness" of two value functions. Firstly, we can represent a value function as a vector $$v \in \mathbb{R}^{|S|}$$ or $$\mathbb{R}^{|S \times A|}$$ which contains the values of every state or state-action. Then, we can use this vector representation of value functions to define a norm which we will call the μ-norm: $$\left \lVert v \right \rVert_{\mu}^{2} = \sum_{s \in S} \mu (s) v(s)^{2}$$

where $$\mu (s)$$ is a relative importance weighting on the states.

We can then define a projection operator $$\Pi$$ that maps an arbitrary value function to the closest value function that is representable by our approximator: $$\Pi v = v_w, w = arg\min_{w \in \mathbb{R}^d} \left \lVert v - v_w \right \rVert_{\mu}^{2}$$

The convergence of Q-learning is fundamentally predicated on the fact that the Bellman optimality operator $$B^*$$ is a contraction mapping in the infinity-norm. However, in the case of function approximation, we're no longer just applying $$B^*$$ at every timestep. We are now applying the composition of the projection operator and the Bellman optimality operator, $$\Pi B^*$$ at every timestep. While $$B^*$$ is a contraction mapping in the infinity norm and $$\Pi$$ is a contraction mapping in the $$\mu$$-norm, the composition of two contraction mappings in two different norms is not necessarily a contraction mapping in any norm. Therefore, all theoretical guarantees about convergence no longer apply in the case of function approximation.

Instead of learning the optimal policy implicitly by learning the optimal value function, we can instead learn the optimal policy directly via policy gradient methods. Let us assume that we have some policy $$\pi_{\theta}$$ that is parameterized by the parameter vector $$\theta \in \mathbb{R}^d$$. We can then directly compute the gradient of the policy like so: $$\nabla_{\theta} \pi_{\theta} (s, a) = \pi_{\theta} (s, a) \frac{\nabla_{\theta} \pi_{\theta} (s, a)}{\pi_{\theta} (s, a)} = \pi_{\theta} (s, a) \nabla_{\theta} \log \pi_{\theta} (s, a)$$

Using this, we can then get the gradient of the expected reward from following the policy $$\pi_{\theta}$$: $$\begin{eqnarray} &J(\theta) = \mathbb{E}_{\pi_{\theta}} [R] = \sum_{s \in S} \sum_{a \in A} \pi_{\theta} (s, a) R \\ &\nabla_{\theta} J(\theta) = \sum_{s \in S} \sum_{a \in A} \pi_{\theta} (s, a) \nabla_{\theta} \log (\pi_{\theta} (s, a)) R = \mathbb{E}_{\pi_{\theta}} [\nabla_{\theta} \log (\pi_{\theta} (s, a)) R] \end{eqnarray}$$

We can then interact with the environment using the policy $$\pi_{\theta}$$ and use samples generated from our experience in order to perform gradient ascent on an objective whose gradient is, in expectation, equal to the gradient of the total future reward.

Estimating the value of the total future reward $$R$$ in different ways yields different policy gradient algorithms.

• Using full Monte Carlo returns $$G_t$$ yields the REINFORCE algorithm
• Using TD learning to learn and then use a value function estimator of the Q-function yields the Actor-Critic algorithm
• Using TD learning to learn and then use an estimator of the advantage function (i.e. the Q-function minus the V-function) yields the Advantage Actor-Critic algorithm

## Deep Reinforcement Learning Algorithms

With the advent of deep learning has come a revolution in our abilities to train neural network models to solve high-dimensional supervised learning tasks. Deep reinforcement learning algorithms are a class of algorithms that use techniques from deep learning that help us solve high-dimensional supervised learning tasks and adapt them to solve high-dimensional reinforcement learning problems that were previously considered intractable using traditional algorithms.

### Deep Q-Network

The first deep reinforcement learning algorithm to achieve noteworthy results was the Deep Q-Network (DQN) algorithm. The DQN algorithm is a Q-learning algorithm that is adapted to use neural network function approximators.

There are two main differences between the DQN algorithm and the standard Q-learning algorithm that allow DQN to converge to a good value function despite the use of a non-linear function approximator (which typically causes instability in classic reinforcement learning algorithms):

• The use of an experience replay buffer to sample previously experienced environment transitions from
• The use of a separate target network for computing the TD target

The experience replay buffer stores the last $$N$$ observed $$(s, a, s', r)$$ tuples where $$s$$ is some state, $$a$$ is some action taken in that state, $$s'$$ is the next observed state and $$r$$ is the observed immediate reward. When updating our Q-function approximator $$Q_w$$, we sample transitions from the experience replay buffer and compute the TD target in terms of an fixed value function approximator $$Q_{w^{-}}$$ where $$w^-$$ is an old version of the parameters $$w$$. Concretely, we perform the update: $$\begin{eqnarray} &J(w) = \delta^2 = (r + \gamma \sup_{a' \in A} \hat{Q}_{w^-} (\phi_{s', a'}) - \hat{Q}_{w} (\phi_{s, a}))^2 \\ &\nabla_{w} J(w) = - \alpha \delta \nabla_{w} \hat{Q}_{w} (\phi_{s,a}) \\ &w \leftarrow w + \alpha \delta \nabla_{w} \hat{Q}_{w} (\phi_{s,a}) \end{eqnarray}$$

where $$(s, a, s', r)$$ is sampled from the experience replay buffer.

After every $$T$$ updates, the fixed parameter vector $$w^-$$ is set equal to $$w$$.

#### Deep Q-Network Variants

When the DQN paper was first released, it garnered a lot of attention within the deep learning and reinforcement learning communities. As such, it has become extremely well-studied and several improvements to the original algorithm have been proposed.

##### Prioritized Replay

One major improvement to the DQN algorithm comes from modifying the way that transitions in the experience replay buffer are sampled. If we can prioritize transitions that would yield a large gradient (i.e. transitions where our estimator is far from convergence), then our Q-function estimator should converge quicker.

Knowing a priori which transitions will yield large gradients is impossible. However, the magnitude of the TD error from the last time a transition was sampled can be used as a proxy. It has been found that using prioritization schemes based on this approach does indeed speed up convergence in practice.

##### Multi-Step Learning

In the original DQN algorithm, the TD target was constructed in terms of a one-step lookahead. Instead of limiting ourselves to using a one-step lookahead, what we can do is to use $$n$$ steps with a tunable hyperparameter $$n$$ and formulate the TD target as so:

$$\begin{eqnarray} &r_{t}^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} \\ &J(w) = (r_{t}^{(n)} + \gamma^{n} \sup_{a' \in A} \hat{Q}_{w^-} (s_{t+n}, a') - \hat{Q}_{w} (s_t , a_t))^2 \end{eqnarray}$$

It has been found that properly tuning the hyperparameter $$n$$ can yield much quicker convergence in practice.

##### Double DQN

One known limitation of Q-learning is the fact that it systematically overestimates all Q-values. To see why this is, let us think of our Q-value estimates as random variables.

With this formulation, we can see that in our TD target we use the maximum of Q-value estimates over the next state in order to estimate the maximum value of the next state. Concretely, we use $$E[\max (Q_1, Q_2, ...)]$$ in order to estimate $$\max (E[Q_1], E[Q_2], ...)$$. However, from Jensen's inequality, we know that $$E[\max (Q_1, Q_2, ...)] \geq \max (E[Q_1], E[Q_2], ...)$$. Given that we perform this estimation at every optimization step of the algorithm, it follows that every Q-value will be systematically overestimated.

However, it was shown in the original double Q-learning paper that if we estimate $$\max (E[Q_1], E[Q_2], ...)$$ by decoupling the selection of the maximum random variable from the evaluation of its expectation, then this estimator can be shown to be unbiased in certain circumstances. We can then apply this result to the DQN algorithm by using the following TD target: $$r_t + \gamma \hat{Q}_{w^-} (s_{t+1}, arg\max_{a \in A} \hat{Q}_{w} (s_{t+1}, a))$$

It has been shown that using this TD target leads to more accurate Q-value estimates and quicker convergence.

##### Dueling Networks

When choosing the best action to take at a given state, we have been relying on taking the action with the maximum Q-value. However, the scale of the differences between the Q-values for different actions can be much smaller than the overall scale of the Q-values. What we really care about when it comes to picking the best possible action is precisely estimating the advantage values for each action.

In order to help with estimating the advantage values, Wang et al. developed the dueling networks architecture. The idea of the dueling network architecture is that the architecture has a single encoder that then splits into two different streams: one stream for estimating the state-value function $$V$$ and one for estimating the advantage function $$A$$. These two streams are then combined in order to recover an estimate for $$Q$$ as so: $$\hat{Q}_{\alpha , \beta} (s, a) = \hat{V}_{\beta} (s) + (\hat{A}_{\alpha} (s, a) + \frac{1}{|A|} \sum_{a' \in A} \hat{A}_{\alpha} (s, a'))$$

As such, the dueling network architecture can be used as a drop-in replacement for any sort of algorithm that involves estimating $$Q$$, with the added benefit that it will also implicitly estimate $$V$$ and $$A$$ as well. It has been shown that this architectural change leads to better performance for the DQN algorithm.

#### Proximal Policy Optimization

The Proximal Policy Optimization (PPO) algorithm is an advantage actor-critic algorithm that exhibits near-montonic improvement with every optimization step. In order to motivate the PPO algorithm, let us first briefly discuss the limitations of vanilla advantage actor-critic algorithms.

In vanilla advantage actor-critic algorithms, we optimize the following objective: $$J(\theta) = \mathbb{E}_{\pi_{\theta}} [\log (\pi_{\theta} (s, a)) \hat{A}_{w} (s, a)]$$

While optimizing this objective is appealing for its simplicity and does sometimes work, it will reasonably often be the case that this will not work because the policy updates end up being too large. In the case where policy updates are too large, performance can actually decrease. In fact, the authors of the PPO paper observed a situation where a policy trained via vanilla advantage actor-critic achieved a much worse score than a random policy in the HalfCheetah environment.

In order to ensure that we don't make too large of a gradient step, we can define the objective thusly: $$\begin{eqnarray} &r(\theta) = \frac{\pi_{\theta}(s, a)}{\pi_{\theta_{old}} (s, a)} \\ &J(\theta) = \mathbb{E}_{\pi_{\theta}} [\min (r(\theta) \hat{A}_{w} (s, a), clip(r(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{w} (s, a))] \end{eqnarray}$$

where $$\theta_{old}$$ is a fixed version of $$\theta$$.

The clipped term removes any incentive for gradient ascent to change the probability ratio by more than $$\epsilon$$ and the min term keeps the objective as small as possible. Performing gradient ascent on this new modified objective ensures near-monotonic improvement on policy updates and has been shown to boost performance across a myriad of tasks.

One major limitation of Q-learning is the fact that it is not well-suited to problems with continuous action spaces. This is because simply taking the max over actions of the function $$Q_{\phi} (s, a)$$ would require solving an optimization problem. The Deep Deterministic Policy Gradient (DDPG) algorithm works around this by estimating the argmax action by learning a deterministic policy.

In DDPG, we train a value function estimator $$Q_{\phi}$$ (which we call the critic) as well as a deterministic policy $$\mu_{\theta}$$ (which we call the actor). To do this, we use an experience replay as well as a separate target network for both the actor and critic in a way that is similar to DQN.

In a way that is also similar to DQN, we train the critic network $$Q_{\phi}$$ via gradient descent on the objective: $$J(\phi) = (r + \gamma \hat{Q}_{\phi^-} (s', \mu_{\theta^-} (s')) - \hat{Q}_{\phi} (s, a))^2$$

where $$(s, a, s', r)$$ is sampled from the experience replay buffer and $$\phi^-$$ and $$\theta^-$$ represent the target network parameter vectors.

Using the rationale that the best action to take at any state is the action that maximizes the Q-value, we train the actor network $$\mu_{\theta}$$ via gradient ascent on the objective:

$$J(\theta) = Q_{\phi} (s, \mu_{\theta} (s))$$

Finally, instead of the target network parameters being fixed over a period of $$T$$ timesteps as is the case in DQN, the target network parameter vectors are computed as a running average of the actual parameter vectors as so:

$$\begin{eqnarray} &\phi^{-} \leftarrow \rho \phi^{-} + (1 - \rho) \phi \\ &\theta^{-} \leftarrow \rho \theta^{-} + (1 - \rho) \theta \end{eqnarray}$$

### Implementation of Deep Reinforcement Learning Algorithms

As a companion to this post, see these implementations of the DQN, DDPG and PPO algorithms discussed above.