Website snapshot

# Chapter 2: Multi-armed Bandits
Reinforcement learning *evaluates* the actions taken rather than instructing the agent with the correct actions. This creates the need for active exploration.
This chapter of the book covers a simplified version of the reinforcement learning problem that does not involve learning to act in more than one situation. This is called a *nonassociative* setting.
In short, the type of problem we are about to study is the $k$-armed bandit problem: a nonassociative, evaluative-feedback version of the reinforcement learning problem.
## $K$-armed bandit problem
Consider the following learning problem. You are faced repeatedly with a choice among $k$ different options/actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected.
Your objective (if you choose to accept it) is to maximize the expected total reward over some time period, say $1000$ time steps.
### Analogy
This is called the $k$-armed bandit problem because it is analogous to a slot machine. A slot machine is nicknamed a "one-armed bandit", and the goal here is to play the machine arm that has the greatest value return.
### Sub-goal
We want to figure out which slot machine produces the greatest value. Therefore, we want to estimate the value of each slot machine as closely to its actual value as possible.
### Exploration vs Exploitation
If we maintain estimates of the action values, then at any time step there is at least one action whose estimated value is the greatest. We call these *greedy* actions. When you select one of these actions we say that you are *exploiting* your current knowledge of the values of the actions.
If you instead select a non-greedy action, then you are *exploring*, because this enables you to better improve your estimate of the non-greedy action's value.
Because of uncertainty, at least one of the other actions is probably better than the greedy action; you just don't know which one yet.
## Action-value Methods
In this section, we look at simple methods for balancing exploration and exploitation so as to gain the greatest reward.
We begin by looking more closely at some simple methods for estimating the values of actions and for using those estimates to make action-selection decisions.
### Sample-average method
One natural way to estimate the value of an action is by averaging the rewards actually received:
$$
Q_t(a) = \frac{\sum_{i = 1}^{t - 1}{R_i \cdot \mathbb{I}_{A_i = a}}}{\sum_{i = 1}^{t - 1}{\mathbb{I}_{A_i = a}}}
$$
where $\mathbb{I}_{predicate}$ denotes the random variable that is $1$ if the predicate is true and $0$ if it is not. If the denominator is zero (the action has not yet been selected), then we instead use some default value such as zero.
### Greedy action selection
This is where you choose greedily all the time.
$$
A_t = \operatorname{argmax}_a Q_t(a)
$$
### $\epsilon$-greedy action selection
This is where we choose greedily most of the time, except with a small probability $\epsilon$, where instead of selecting greedily we select randomly from among all the actions with equal probability.
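As a minimal sketch (the function and variable names, and the use of NumPy, are my own rather than the book's), $\epsilon$-greedy selection over a vector of action-value estimates might look like:
```python
import numpy as np

def epsilon_greedy_action(Q, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.integers(len(Q))          # explore: any action, equal probability
    greedy = np.flatnonzero(Q == Q.max())    # all greedy actions (ties broken randomly)
    return rng.choice(greedy)                # exploit

# usage: a 10-armed bandit with estimates initialized to zero
rng = np.random.default_rng(0)
Q = np.zeros(10)
a = epsilon_greedy_action(Q, epsilon=0.1, rng=rng)
```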
### Comparison of greedy and $\epsilon$-greedy methods
The advantage of $\epsilon$-greedy over greedy methods depends on the task. With noisier rewards it takes more exploration to find the optimal action, and $\epsilon$-greedy methods should fare better relative to the greedy method. However, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once.
Suppose the bandit task were non-stationary, that is, the true values of actions changed over time. In this case exploration is needed to make sure one of the non-greedy actions has not changed to become better than the greedy one.
### Incremental Implementation
There is a way to update averages with small, constant computations rather than storing the numerators and denominators separately.
Note the derivation of the update formula:
$$
\begin{align}
Q_{n + 1} &= \frac{1}{n}\sum_{i = 1}^n{R_i} \\
&= \frac{1}{n}(R_n + \sum_{i = 1}^{n - 1}{R_i}) \\
&= \frac{1}{n}(R_n + (n - 1)\frac{1}{n-1}\sum_{i = 1}^{n - 1}{R_i}) \\
&= \frac{1}{n}(R_n + (n - 1)Q_n) \\
&= \frac{1}{n}(R_n + nQ_n - Q_n) \\
&= Q_n + \frac{1}{n}(R_n - Q_n) \tag{2.3}
\end{align}
$$
With formula 2.3, the implementation requires memory of only $Q_n$ and $n$.
This update rule is a form that occurs frequently throughout the book. The general form is
$$
NewEstimate = OldEstimate + StepSize(Target - OldEstimate)
$$
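As a sketch, the general rule translates directly into code (the function name here is my own):
```python
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# sample-average update after the n-th reward observed for an action:
# Q[a] = update_estimate(Q[a], reward, 1.0 / n)
```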
### Tracking a Nonstationary Problem
As noted earlier, we often encounter problems that are nonstationary. In such cases it makes sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways to do this is to use a constant value for the $StepSize$ parameter. We modify formula 2.3 to be
$$
\begin{align}
Q_{n + 1} &= Q_n + \alpha(R_n - Q_n) \\
&= \alpha R_n + Q_n - \alpha Q_n \\
&= \alpha R_n + (1 - \alpha)Q_n \\
&= \alpha R_n + (1 - \alpha)(\alpha R_{n - 1} + (1-\alpha)Q_{n - 1}) \\
&= \alpha R_n + (1 - \alpha)(\alpha R_{n - 1} + (1-\alpha)(\alpha R_{n - 2} + (1 - \alpha)Q_{n - 2})) \\
&= \alpha R_n + (1-\alpha)\alpha R_{n - 1} + (1-\alpha)^2\alpha R_{n - 2} + \dots + (1-\alpha)^{n - 1}\alpha R_1 + (1-\alpha)^nQ_1 \\
&= (1-\alpha)^nQ_1 + \sum_{i = 1}^n{\alpha(1-\alpha)^{n - i}R_i}
\end{align}
$$
This is a weighted average since the weights sum to one. Note that the farther a reward is from the current time, the more times $(1-\alpha)$ is multiplied by itself, making that reward less influential. This is sometimes called an *exponential recency-weighted average*.
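To check that the weights do indeed sum to one, apply the finite geometric series:
$$
(1-\alpha)^n + \sum_{i = 1}^n{\alpha(1-\alpha)^{n - i}} = (1-\alpha)^n + \alpha\sum_{j = 0}^{n - 1}{(1-\alpha)^{j}} = (1-\alpha)^n + \alpha\cdot\frac{1 - (1-\alpha)^n}{\alpha} = 1
$$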
### Manipulating $\alpha_n(a)$
Sometimes it is convenient to vary the step-size parameter from step to step. Let $\alpha_n(a)$ denote the step-size parameter used after the $n$th selection of action $a$. As noted before, $\alpha_n(a) = \frac{1}{n}$ results in the sample-average method, which is guaranteed to converge to the true action values given a large number of trials.
A well known result in stochastic approximation theory gives us the following conditions to assure convergence with probability 1:
$$
\sum_{n = 1}^\infty{\alpha_n(a)} = \infty \quad \text{and} \quad \sum_{n = 1}^{\infty}{\alpha_n^2(a)} < \infty
$$
The first condition is required to guarantee that the steps are large enough to overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.
**Note:** Both convergence conditions are met for the sample-average case but not for the constant step-size parameter, which violates the second condition. This is actually desirable in the nonstationary case: if the rewards are changing, we do not want the estimates to converge to any one value.
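For example, the sample-average step size $\alpha_n(a) = \frac{1}{n}$ satisfies both conditions, while a constant step size $\alpha_n(a) = \alpha$ satisfies the first but violates the second:
$$
\sum_{n = 1}^\infty{\frac{1}{n}} = \infty, \quad \sum_{n = 1}^\infty{\frac{1}{n^2}} = \frac{\pi^2}{6} < \infty, \qquad \text{whereas} \qquad \sum_{n = 1}^\infty{\alpha} = \infty, \quad \sum_{n = 1}^\infty{\alpha^2} = \infty.
$$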
### Optimistic Initial Values
The methods discussed so far are biased by their initial estimates, and these initial values are another set of parameters that must be chosen by the user. However, the initial values can also be used as a simple way to encourage exploration.
Let's say you set initial estimates that are wildly optimistic. Whichever actions are initially selected, the reward is less than the starting estimates, so the learner switches to other actions, *disappointed* with the rewards it is receiving.
The result is that all actions are tried several times before their values converge. This happens even if the algorithm is set to choose greedily all the time!
This simple trick is quite effective for stationary problems, but not so much for nonstationary ones, since the drive for exploration happens only at the beginning. If the task changes, creating a renewed need for exploration, this method will not catch it.
### Upper-Confidence-Bound Action Selection
Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The greedy actions are those that look best at the present but some other options may actually be better. Let's choose options that have potential for being optimal, taking into account how close their estimates are to being maximal and the uncertainties in those estimates.
$$
A_t = \operatorname{argmax}_a\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]
$$
where $N_t(a)$ denotes the number of times that $a$ has been selected prior to time $t$ and $c > 0$ controls the degree of exploration.
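A sketch of this selection rule (the names are my own; actions with $N_t(a) = 0$ are treated as maximizing and therefore tried first):
```python
import numpy as np

def ucb_action(Q, N, t, c):
    """Upper-confidence-bound action selection at time step t."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])               # untried actions are considered maximizing
    bounds = Q + c * np.sqrt(np.log(t) / N)  # estimate plus exploration bonus
    return int(np.argmax(bounds))
```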
### Associative Search (Contextual Bandits)
So far, we've only considered nonassociative tasks, where there is no need to associate different actions with different situations. However, in a general reinforcement learning task there is more than one situation and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations.
To continue our example, suppose there are several different $k$-armed bandit tasks, and that on each step you confront one of these chosen at random. To you, this would appear as a single, nonstationary $k$-armed bandit task whose true action values change randomly from step to step. You could try using one of the previous methods, but unless the true action values change slowly, these methods will not work very well.
Now suppose that when a bandit task is selected for you, you are given some clue about its identity. You can then learn a policy associating each task, signaled by its clue, with the best action to take when facing that task.
This is an example of an *associative search* task, so called because it involves both trial-and-error learning to *search* for the best actions and *association* of these actions with the situations in which they are best. Nowadays they are called *contextual bandits* in the literature.
If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem. This will be presented in the next chapter of the book with its ramifications appearing throughout the rest of the book.

# Chapter 4: Dynamic Programming
Dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
Classic DP algorithms are of limited utility due to their assumption of a perfect model and their great computational expense.
Let's assume that the environment is a finite MDP. We assume its state, action, and reward sets, $\mathcal{S}, \mathcal{A}, \mathcal{R}$ are finite, and that its dynamics are given by a set of probabilities $p(s^\prime, r | s , a)$.
The key idea of dynamic programming, and of reinforcement learning is the use of value functions to organize and structure the search for good policies. In this chapter, we show how dynamic programming can be used to compute the value functions defined in Chapter 3. We can easily obtain optimal policies once we have found the optimal value functions which satisfy the Bellman optimality equations.
## Policy Evaluation
First we consider how to compute the state-value function for an arbitrary policy. The existence and uniqueness of a state-value function for an arbitrary policy are guaranteed as long as either the discount rate is less than one or eventual termination is guaranteed from all states under the given policy.
Consider a sequence of approximate value functions. The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state must be given a value of zero) and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:
$$
v_{k + 1}(s) = \sum_{a}{\pi(a |s)\sum_{s^\prime, r}{p(s^\prime,r|s,a)[r + \gamma v_k(s^\prime)]}}
$$
This algorithm is called *iterative policy evaluation*.
To produce each successive approximation, $v_{k + 1}$ from $v_k$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
<u>**Iterative Policy Evaluation**</u>
```
Input π, the policy to be evaluated
Initialize an array V(s) = 0, for all s ∈ S+
Repeat
    ∆ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← ∑_a π(a|s) ∑_{s',r} p(s',r|s,a)[r + γV(s')]
        ∆ ← max(∆, |v - V(s)|)
until ∆ < θ (a small positive number)
Output V ≈ v_π
```
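A sketch of the same algorithm in Python, assuming (my own encoding, not the book's) that the dynamics are exposed as a function `p(s, a)` returning `(next_state, reward, probability)` triples and that terminal states are simply absent from `states` and valued at zero:
```python
def policy_evaluation(states, actions, p, pi, gamma=1.0, theta=1e-6):
    """Iteratively approximate v_pi, where pi(s, a) gives the probability of a in s.

    states  -- iterable of non-terminal states
    actions -- actions(s) gives the actions available in s
    p       -- p(s, a) yields (next_state, reward, probability) triples
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = sum(pi(s, a) * prob * (r + gamma * V.get(s_next, 0.0))
                       for a in actions(s)
                       for s_next, r, prob in p(s, a))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```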
### Grid World Example
Consider a grid world where the top-left and bottom-right squares are terminal states. Every move from any other square produces a reward of $-1$, and the available actions in each state are {up, down, left, right}, as long as the action keeps the agent on the grid. Suppose our agent follows the equiprobable random policy.
## Policy Improvement
One reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose another action. We know how good it is to follow the current policy from $s$, that is $v_\pi(s)$, but would it be better or worse to change to the new policy?
One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy, $\pi$. The key criterion is whether this produces a value greater than or less than $v_\pi(s)$. If it is greater, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.
That this is true is a special case of a general result called the *policy improvement theorem*. Let $\pi$ and $\pi^\prime$ be any pair of deterministic policies such that for all $s \in \mathcal{S}$.
$$
q_\pi(s, \pi^\prime(s)) \ge v_\pi(s)
$$
So far we have seen how, given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $q_\pi(s, a)$. In other words, to consider the new *greedy* policy, $\pi^\prime$, given by:
$$
\pi^\prime(s) = \operatorname{argmax}_a q_\pi(s, a)
$$
So far in this section we have considered the case of deterministic policies. We will not go through the details, but in fact all the ideas of this section extend easily to stochastic policies.
## Policy Iteration
Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi^\prime$, we can then compute $v_{\pi^\prime}$ and improve it again to yield an even better $\pi^{\prime\prime}$. We can thus obtain a sequence of monotonically improving policies and value functions.
Each policy is guaranteed to be a strict improvement over the previous one (unless it's already optimal). Since a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
This way of finding an optimal policy is called *policy iteration*.
<u>Algorithm</u>
```
1. Initialization
    V(s) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S
2. Policy Evaluation
    Repeat
        ∆ ← 0
        For each s ∈ S:
            v ← V(s)
            V(s) ← ∑_{s',r} p(s',r|s,π(s))[r + γV(s')]
            ∆ ← max(∆, |v - V(s)|)
    until ∆ < θ (a small positive number)
3. Policy Improvement
    policy-stable ← true
    For each s ∈ S:
        old-action ← π(s)
        π(s) ← arg max_a ∑_{s',r} p(s',r|s,a)[r + γV(s')]
        If old-action != π(s), then policy-stable ← false
    If policy-stable, then stop and return V ≈ v_* and π ≈ π_*; else go to 2
```
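A compact sketch of the full loop, reusing the `policy_evaluation` function and the dynamics encoding assumed above (again my own representation, not the book's):
```python
def policy_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    """Alternate evaluation and greedy improvement of a deterministic policy."""
    policy = {s: next(iter(actions(s))) for s in states}   # arbitrary initial policy
    while True:
        # 2. policy evaluation for the current deterministic policy
        V = policy_evaluation(states, actions, p,
                              pi=lambda s, a: 1.0 if a == policy[s] else 0.0,
                              gamma=gamma, theta=theta)
        # 3. policy improvement: act greedily with respect to V
        policy_stable = True
        for s in states:
            old_action = policy[s]
            policy[s] = max(actions(s),
                            key=lambda a: sum(prob * (r + gamma * V.get(s_next, 0.0))
                                              for s_next, r, prob in p(s, a)))
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable:
            return policy, V
```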
## Value Iteration
One drawback of policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to $v_\pi$ occurs only in the limit. Must we wait for exact convergence, or can we stop short of that?
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantee of policy iteration. One important special case is when policy evaluation is stopped after one sweep. This algorithm is called value iteration.
Another way of understanding value iteration is by reference to the Bellman optimality equation. Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how value iteration is identical to the policy evaluation update except that it requires the maximum to be taken over all actions.
Finally, let us consider how value iteration terminates. Like policy evaluation, value iteration formally requires an infinite number of iterations to converge exactly. In practice, we stop once the value function changes by only a small amount.
```
Initialize array V arbitrarily (e.g., V(s) = 0 for all s ∈ S+)
Repeat
    ∆ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← max_a ∑_{s',r} p(s',r|s,a)[r + γV(s')]
        ∆ ← max(∆, |v - V(s)|)
until ∆ < θ (a small positive number)
Output a deterministic policy, π ≈ π_*, such that
    π(s) = arg max_a ∑_{s',r} p(s',r|s,a)[r + γV(s')]
```
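Under the same assumed dynamics representation, a value iteration sketch:
```python
def value_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    """One sweep of truncated evaluation fused with improvement, repeated until stable."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(sum(prob * (r + gamma * V.get(s_next, 0.0))
                           for s_next, r, prob in p(s, a))
                       for a in actions(s))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # extract a deterministic greedy policy from the converged values
    greedy = {s: max(actions(s),
                     key=lambda a: sum(prob * (r + gamma * V.get(s_next, 0.0))
                                       for s_next, r, prob in p(s, a)))
              for s in states}
    return greedy, V
```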
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep.
## Asynchronous Dynamic Programming
*Asynchronous* DP algorithms are in-place DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of the other states happen to be available.
To converge correctly, however, an asynchronous algorithm must continue to update the value of all the states: it can't ignore any state after some point in the computation.
## Generalized Policy Iteration
Policy iteration consists of two simultaneous, iterating processes, one making the value function consistent with the current policy (policy evaluation) and the other making the policy greedy with respect to the current value function (policy improvement).
We use the term *generalized policy iteration* (GPI) to refer to the general idea of letting these two processes interact. They can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions: making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, while making the value function consistent with the policy typically causes the policy to no longer be greedy. In the long run, however, the two processes interact to find a single joint solution.

# Introduction to Reinforcement Learning Day 1
Recall that this course is based on the book --
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
These notes really serve as talking points for the overall concepts described in the chapter and are not meant to stand for themselves. Check out the book for more complete thoughts :)
**Reinforcement Learning** is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal. Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
Markov decision processes are intended to include just these three aspects: sensation, action, and goal(s).
Reinforcement learning is **different** than the following categories
- Supervised learning: This is learning from a training set of labeled examples provided by a knowledgeable external supervisor. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all situations in which the agent has to act.
- Unsupervised learning: Reinforcement learning is trying to maximize a reward signal as opposed to finding some sort of hidden structure within the data.
One of the challenges that arise in reinforcement learning is the **trade-off** between exploration and exploitation. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.
Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is different than supervised learning since they're concerned with finding the best classifier/regression without explicitly specifying how such an ability would finally be useful.
A complete, interactive, goal-seeking agent can also be a component of a larger behaving system. A simple example is an agent that monitors the charge level of a robot's battery and sends commands to the robot's control architecture. This agent's environment is the rest of the robot together with the robot's environment.
## Definitions
A policy defines the learning agent's way of behaving at a given time
A reward signal defines the goal in a reinforcement learning problem. The agent's sole objective is to maximize the total reward it receives over the long run
A value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. We seek actions that bring about states of highest value.
Unfortunately, it is much harder to determine values than it is to determine rewards. The most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.
**Look at Tic-Tac-Toe example**
Most of the time in a reinforcement learning algorithm, we move greedily, selecting the move that leads to the state with the greatest value. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see.
Summary: Reinforcement learning is learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment.

# Chapter 5: Monte Carlo Methods
Monte Carlo methods do not assume complete knowledge of the environment. They require only *experience*, which is a sample sequence of states, actions, and rewards from actual or simulated interaction with an environment.
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks. Only on the completion of an episode are value estimates and policies changed.
Monte Carlo methods sample and average returns for each state-action pair, much like the bandit methods explored earlier. The main difference is that there are now multiple states, each acting like a different bandit problem, and the problems are interrelated. Because all of the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.
## Monte Carlo Prediction
Recall that the value of a state is the expected return -- the expected cumulative future discounted reward -- starting from that state. One way to estimate it from experience is by averaging the returns observed after visits to that state.
Each occurrence of state $s$ in an episode is called a *visit* to $s$. The *first-visit MC method* estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the *every-visit MC method* averages the returns following all visits to $s$. These two Monte Carlo methods are very similar but have slightly different theoretical properties.
<u>First-visit MC prediction</u>
```
Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← an empty list, for all s ∈ S
Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
        G ← the return that follows the first occurrence of s
        Append G to Returns(s)
        V(s) ← average(Returns(s))
```
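A sketch in Python, assuming (my own episode format, not the book's) that `generate_episode()` returns a list of `(state, reward)` pairs where each reward is the one received after leaving that state:
```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging the returns that follow first visits to each state."""
    returns = defaultdict(list)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode()
        first_visit = {}                            # state -> index of its first occurrence
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):   # walk backwards, accumulating G
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:                 # only record the first-visit return
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```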
## Monte Carlo Estimation of Action Values
If a model is not available then it is particularly useful to estimate *action* values rather than state values. With a model, state values alone are sufficient to define a policy. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy.
The only complication is that many state-action pairs may never be visited. If $\pi$ is a deterministic policy, then in following $\pi$ one will observe returns only for one of the actions from each state. With no returns to average, the Monte Carlo estimates of the other actions will not improve with experience. This is a serious problem because the purpose of learning action values is to help in choosing among the actions available in each state.
This is the general problem of *maintaining exploration*. For policy evaluation to work for action values, we must assure continual exploration. One way to do this is by specifying that the episodes *start in a state-action pair*, and that each pair has a nonzero probability of being selected as the start. We call this the assumption of *exploring starts*.
## Monte Carlo Control
We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.
<u>Monte Carlo Exploring Starts</u>
```
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s,a) ← arbitrary
    π(s) ← arbitrary
    Returns(s,a) ← empty list
Repeat forever:
    Choose S_0 ∈ S and A_0 ∈ A(S_0) s.t. all pairs have probability > 0
    Generate an episode starting from S_0, A_0, following π
    For each pair s,a appearing in the episode:
        G ← the return that follows the first occurrence of s,a
        Append G to Returns(s,a)
        Q(s,a) ← average(Returns(s,a))
    For each s in the episode:
        π(s) ← arg max_a Q(s,a)
```
## Monte Carlo Control without Exploring Starts
The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call *on-policy* methods and *off-policy* methods.
On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.
In on-policy control methods the policy is generally *soft*, meaning that $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$. The on-policy method in this section uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\epsilon$ they instead select an action at random.
<u>On-policy first-visit MC control (for $\epsilon$-soft policies)</u>
```
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s,a) ← arbitrary
    Returns(s,a) ← empty list
    π(a|s) ← an arbitrary ε-soft policy
Repeat forever:
    (a) Generate an episode using π
    (b) For each pair s,a appearing in the episode:
            G ← the return that follows the first occurrence of s,a
            Append G to Returns(s,a)
            Q(s,a) ← average(Returns(s,a))
    (c) For each s in the episode:
            A* ← arg max_a Q(s,a)    (with ties broken arbitrarily)
            For all a ∈ A(s):
                π(a|s) ← 1 - ε + ε/|A(s)|   if a = A*
                         ε/|A(s)|           if a ≠ A*
```

# Chapter 3: Finite Markov Decision Processes
Markov decision processes are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$.
MDPs are a mathematically idealized form of the reinforcement learning problem. As in much of artificial intelligence, there is often a tension between breadth of applicability and mathematical tractability. This chapter will introduce this tension and discuss some of the trade-offs and challenges that it implies.
## Agent-Environment Interface
The learner and decision maker is called the *agent*. The thing it interacts with is called the *environment*. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.
The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.
To make the future paragraphs clearer, a Markov Decision Process is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker.
In a *finite* MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions dependent only on the preceding state and action.
$$
p(s^\prime | s,a) = \sum_{r \in \mathcal{R}}{p(s^\prime, r|s, a)}
$$
Breaking down the above formula, it is just an instantiation of the law of total probability: if you partition the probability space by reward, summing over the partitions recovers the overall probability. This formula has a special name: the *state-transition probability*.
From this we can compute the expected reward for each state-action pair by multiplying each reward by the probability of receiving it and summing over everything:
$$
r(s, a) = \sum_{r \in \mathcal{R}}{r}\sum_{s^\prime \in \mathcal{S}}{p(s^\prime, r|s,a)}
$$
The expected reward for a state-action-next-state triple is
$$
r(s, a, s^\prime) = \sum_{r \in \mathcal{R}}{r\frac{p(s^\prime, r|s,a)}{p(s^\prime|s,a)}}
$$
I wasn't able to piece together this function in my head like the others. Currently I imagine it as taking the formula before and restricting the universe of discourse from the universal set to the event that $s^\prime$ occurred.
The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive states of decision making and acting.
### Agent-Environment Boundary
In particular, the boundary between agent and environment is typically not the same as the physical boundary of a robot's or an animal's body. Usually, the boundary is drawn closer to the agent than that. For example, the motors and mechanical linkages of a robot and its sensing hardware should usually be considered parts of the environment rather than parts of the agent.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent. For example, the agent often knows quite a bit about how its rewards are computed as a function of its actions and the states in which they are taken. But we always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily. The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge.
This framework breaks down whatever the agent is trying to achieve into three signals passing back and forth between the agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards).
### Example 3.4: Recycling Robot MDP
Recall that the agent makes a decision at times determined by external events. At each such time the robot decides whether it should
(1) Actively search for a can
(2) Remain stationary and wait for someone to bring it a can
(3) Go back to home base to recharge its battery
Suppose the environment works as follows: the best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case, the robot must shut down and wait to be rescued (producing a low reward).
The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish two levels, high and low, so that the state set is $\mathcal{S} = \{high, low\}$. Let us call the possible decisions -- the agent's actions -- wait, search, and recharge. When the energy level is high, recharging would always be foolish, so we do not include it in the action set for this state. The agent's action sets are
$$
\begin{align*}
\mathcal{A}(high) &= \{search, wait\} \\
\mathcal{A}(low) &= \{search, wait, recharge\}
\end{align*}
$$
If the energy level is high, then a period of active search can always be completed without a risk of depleting the battery. A period of searching that begins with a high energy level leaves the energy level high with a probability of $\alpha$ and reduces it to low with a probability of $(1 - \alpha)$. On the other hand, a period of searching undertaken when the energy level is low leaves it low with a probability of $\beta$ and depletes the battery with a probability of $(1 - \beta)$. In the latter case, the robot must be rescued, and the battery is then recharged back to high.
Each can collected by the robot counts as a unit reward, whereas a reward of $-3$ occurs whenever the robot has to be rescued. Let $r_{search}$ and $r_{wait}$, with $r_{search} > r_{wait}$, respectively denote the expected number of cans the robot will collect while searching and while waiting. Finally, to keep things simple, suppose that no cans can be collected during a run home for recharging and that no cans can be collected on a step in which the battery is depleted.
| $s$  | $a$      | $s^\prime$ | $p(s^\prime \mid s, a)$ | $r(s, a, s^\prime)$ |
| ---- | -------- | ---------- | ----------------------- | ------------------- |
| high | search   | high       | $\alpha$                | $r_{search}$        |
| high | search   | low        | $1-\alpha$              | $r_{search}$        |
| low  | search   | high       | $1 - \beta$             | $-3$                |
| low  | search   | low        | $\beta$                 | $r_{search}$        |
| high | wait     | high       | $1$                     | $r_{wait}$          |
| high | wait     | low        | $0$                     | $r_{wait}$          |
| low  | wait     | high       | $0$                     | $r_{wait}$          |
| low  | wait     | low        | $1$                     | $r_{wait}$          |
| low  | recharge | high       | $1$                     | $0$                 |
| low  | recharge | low        | $0$                     | $0$                 |
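As an illustration (my own encoding, not the book's), the table can be stored as a dictionary keyed by state-action pair, with the zero-probability rows omitted; the expected reward $r(s, a) = \sum_{s^\prime}{p(s^\prime|s,a)\,r(s,a,s^\prime)}$ then follows directly:
```python
def recycling_robot_dynamics(alpha, beta, r_search, r_wait):
    """Rows of the table above as (next_state, probability, expected_reward) triples."""
    return {
        ("high", "search"):   [("high", alpha,    r_search), ("low", 1 - alpha, r_search)],
        ("low",  "search"):   [("high", 1 - beta, -3),       ("low", beta,      r_search)],
        ("high", "wait"):     [("high", 1.0,      r_wait)],
        ("low",  "wait"):     [("low",  1.0,      r_wait)],
        ("low",  "recharge"): [("high", 1.0,      0.0)],
    }

def expected_reward(dynamics, s, a):
    """r(s, a): probability-weighted sum of the rewards over next states."""
    return sum(prob * r for _, prob, r in dynamics[(s, a)])

# usage with some made-up parameter values
p = recycling_robot_dynamics(alpha=0.9, beta=0.8, r_search=3.0, r_wait=1.0)
print(expected_reward(p, "low", "search"))   # 0.2 * -3 + 0.8 * 3.0
```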
A *transition graph* is a useful way to summarize the dynamics of a finite MDP. There are two kinds of nodes: *state nodes* and *action nodes*. There is a state node for each possible state and an action node for each state-action pair. Starting in state $s$ and taking action $a$ moves you along the line from state node $s$ to action node $(s, a)$. Then the environment responds with a transition to the next state's node via one of the arrows leaving action node $(s, a)$.
## Goals and Rewards
The reward hypothesis is that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal called the reward.
Although formulating goals in terms of reward signals might at first appear limiting, in practice it has proved to be flexible and widely applicable. The best way to see this is to consider the examples of how it has been, or could be used. For example:
- To make a robot learn to walk, researchers have provided reward on each time step proportional to the robot's forward motion.
- In making a robot learn how to escape from a maze, the reward is often $-1$ for every time step that passes prior to escape; this encourages the agent to escape as quickly as possible.
- To make a robot learn to find and collect empty soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of $+1$ for each can collected. One might also want to give the robot negative rewards when it bumps into things or when somebody yells at it.
- For an agent to play checkers or chess, the natural rewards are $+1$ for winning, $-1$ for losing, and $0$ for drawing and for all nonterminal positions.
It is critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about *how* to achieve what we want it to do. For example, a chess playing agent should only be rewarded for actually winning, not for achieving subgoals such as taking its opponent's pieces. If achieving these sort of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. The reward signal is your way of communicating what you want it to achieve, not how you want it achieved.
## Returns and Episodes
In general, we seek to maximize the *expected return*, where the return is defined as some specific function of the reward sequence. In the simplest case, the return is the sum of the rewards:
$$
G_t = R_{t + 1} + R_{t + 2} + R_{t + 3} + \dots + R_{T}
$$
where $T$ is the final time step. This approach makes sense in applications in which there is a natural notion of a final time step. That is when the agent-environment interaction breaks naturally into subsequences or *episodes*, such as plays of a game, trips through a maze, or any sort of repeated interaction.
### Episodic Tasks
Each episode ends in a special state called the *terminal state*, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Even if episodes end in different ways, the next episode begins independently of how the previous one ended. Therefore, the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes.
Tasks with episodes of this kind are called *episodic tasks*. In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal{S}$, from the set of all states plus the terminal state, denoted $\mathcal{S}^+$. The time of termination, $T$, is a random variable that normally varies from episode to episode.
### Continuing Tasks
On the other hand, in many cases the agent-environment interaction goes on continually without limit. We call these *continuing tasks*. The return formulation above is problematic for continuing tasks because the final time step would be $T = \infty$, and the return which we are trying to maximize, could itself easily be infinite. The additional concept that we need is that of *discounting*. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return.
$$
G_t= \sum_{k = 0}^\infty{\gamma^kR_{t+k+1}}
$$
where $\gamma$, a parameter with $0 \le \gamma \le 1$, is called the *discount rate*.
#### Discount Rate
The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k - 1}$ times what it would be worth if it were received immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the reward sequence is bounded.
If $\gamma = 0$, the agent is "myopic" in being concerned only with maximizing immediate rewards. But in general, acting to maximize immediate reward can reduce access to future rewards so that the return is reduced.
As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.
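A useful consequence of this definition is that returns at successive time steps are related recursively:
$$
G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \dots = R_{t + 1} + \gamma(R_{t + 2} + \gamma R_{t + 3} + \dots) = R_{t + 1} + \gamma G_{t + 1}
$$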
### Example 3.5 Pole-Balancing
The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over.
A failure is said to occur if the pole falls past a given angle from the vertical or if the cart runs off the track.
#### Approach 1
The reward can be a $+1$ for every time step on which failure did not occur. In this case, successful balancing would mean a return of infinity.
#### Approach 2
The reward can be $-1$ on each failure and zero all other times. The return at each time would then be related to $-\gamma^k$ where $k$ is the number of steps before failure.
In either case the return is maximized by keeping the pole balanced for as long as possible.
## Policies and Value Functions
Almost all reinforcement learning algorithms involve estimating *value functions* which estimate what future rewards can be expected. Of course the rewards that the agent can expect to receive is dependent on the actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.
Formally, a *policy* is a mapping from states to probabilities of selecting each possible action. The *value* of a state $s$ under a policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs we can define $v_{\pi}$ as
$$
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s] = \mathbb{E}_{\pi}\left[\sum_{k = 0}^\infty{\gamma^kR_{t+k+1}} \,\middle|\, S_t = s\right]
$$
We call this function the *state-value function for policy $\pi$*. Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted as $q_\pi(s,a)$ as the expected return starting from $s$, taking the action $a$, and thereafter following the policy $\pi$. Succinctly, this is called the *action-value function for policy $\pi$*.
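Written out in the same form as the state-value function, the action-value function is
$$
q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] = \mathbb{E}_{\pi}\left[\sum_{k = 0}^\infty{\gamma^kR_{t+k+1}} \,\middle|\, S_t = s, A_t = a\right]
$$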
### Optimality and Approximation
For some kinds of tasks we are interested in, optimal policies can be generated only with extreme computational cost. A critical aspect of the problem facing the agent is always the computational power available to it, in particular, the amount of computation it can perform in a single time step.
The memory available is also an important constraint. A large amount of memory is often required to build up approximations of value functions, policies, and models. In the case of large state sets, functions must be approximated using some sort of more compact parameterized function representation.
This presents us with unique opportunities for achieving useful approximations. For example, in approximating optimal behavior, there may be many states that the agent faces with such low probability that selecting suboptimal actions for them has little impact on the amount of reward the agent receives.
The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of infrequent ones. This is the key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs.
### Summary
Let us summarize the elements of the reinforcement learning problem.
Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning *agent* and its *environment* interact over a sequence of discrete time steps.
The *actions* are the choices made by the agent; the states are the basis for making the choice; and the *rewards* are the basis for evaluating the choices.
Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known.
A *policy* is a stochastic rule by which the agent selects actions as a function of states.
When the reinforcement learning setup described above is formulated with well-defined transition probabilities, it constitutes a Markov decision process (MDP).
The *return* is the function of future rewards that the agent seeks to maximize. It has several different definitions depending on the nature of the task and whether one wishes to *discount* delayed reward.
- The un-discounted formulation is appropriate for *episodic tasks*, in which the agent-environment interaction breaks naturally into *episodes*
- The discounted formulation is appropriate for *continuing tasks* in which the interaction does not naturally break into episodes but continue without limit
A policy's *value functions* assign to each state, or state-action pair, the expected return from that state, or state-action pair, given that the agent uses the policy. The *optimal value functions* assign to each state, or state-action pair, the largest expected return achievable by any policy. A policy whose value functions are optimal is an *optimal policy*.
Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made.