Mirror of https://github.com/Brandon-Rozek/website.git, synced 2025-10-09 14:31:13 +00:00
Removing raw HTML
This commit is contained in:
parent e06d45e053
commit 572d587b8e
33 changed files with 373 additions and 386 deletions
````diff
@@ -24,7 +24,7 @@ This algorithm is called *iterative policy evaluation*.
 
 To produce each successive approximation, $v_{k + 1}$ from $v_k$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
 
-<u>**Iterative Policy Evaluation**</u>
+**Iterative Policy Evaluation**
 
 ```
 Input π, the policy to be evaluated
````
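
The paragraph in this hunk describes the backup that iterative policy evaluation applies at every state. As a rough illustration of that sweep, here is a minimal Python sketch on a made-up two-state MDP; the `transitions` table, the state and action names, and the discount `gamma` are assumptions for the example, not anything from the changed notes.

```python
# Minimal sketch of iterative policy evaluation, computing v_{k+1} from v_k
# with a separate table, as the paragraph describes. The MDP below is invented.

# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "A": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 1.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 0.0)]},
}

# A fixed stochastic policy to evaluate: policy[state][action] = pi(a|s).
policy = {
    "A": {"left": 0.5, "right": 0.5},
    "B": {"left": 0.5, "right": 0.5},
}

def iterative_policy_evaluation(policy, transitions, gamma=0.9, theta=1e-8):
    """Back up each state's value from the old values of its successor states."""
    V = {s: 0.0 for s in transitions}
    while True:
        V_new = {}
        for s in transitions:
            # Expectation over actions under pi, then over one-step transitions.
            V_new[s] = sum(
                policy[s].get(a, 0.0)
                * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in transitions[s].items()
            )
        delta = max(abs(V[s] - V_new[s]) for s in transitions)
        V = V_new
        if delta < theta:
            return V

print(iterative_policy_evaluation(policy, transitions))
```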
````diff
@@ -69,7 +69,7 @@ Each policy is guaranteed to be a strict improvement over the previous one (unle
 
 This way of finding an optimal policy is called *policy iteration*.
 
-<u>Algorithm</u>
+**Algorithm**
 
 ```
 1. Initialization
````
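
Policy iteration, named in this hunk, alternates policy evaluation with greedy one-step improvement until the policy stops changing. The Python sketch below shows that loop under the same toy-MDP assumptions as the previous example; `evaluate`, `policy_iteration`, and the `transitions` table are illustrative names rather than code from the repository.

```python
# Sketch of policy iteration on a toy MDP:
# transitions[state][action] -> list of (probability, next_state, reward).
transitions = {
    "A": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 1.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "B", 0.0)]},
}

def evaluate(policy, transitions, gamma, theta=1e-8):
    """Policy evaluation for a deterministic policy (dict: state -> action)."""
    V = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s in transitions:
            a = policy[s]
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])
            delta = max(delta, abs(V[s] - v_new))
            V[s] = v_new
        if delta < theta:
            return V

def policy_iteration(transitions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: next(iter(acts)) for s, acts in transitions.items()}  # arbitrary start
    while True:
        V = evaluate(policy, transitions, gamma)
        stable = True
        for s in transitions:
            # Greedy one-step lookahead with respect to the current value function.
            best = max(
                transitions[s],
                key=lambda a: sum(p * (r + gamma * V[s2])
                                  for p, s2, r in transitions[s][a]),
            )
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V

print(policy_iteration(transitions))
```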
````diff
@@ -16,7 +16,7 @@ Recall that the value of a state is the expected return -- expected cumulative f
 
 Each occurrence of state $s$ in an episode is called a *visit* to $s$. The *first-visit MC method* estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the *every-visit MC method* averages the returns following all visits to $s$. These two Monte Carlo methods are very similar but have slightly different theoretical properties.
 
-<u>First-visit MC prediction</u>
+**First-visit MC prediction**
 
 ```
 Initialize:
````
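
The hunk above distinguishes first-visit from every-visit Monte Carlo prediction. The sketch below averages first-visit returns in Python; the `generate_episode` helper, the state names, and the episode count are assumptions chosen only to make the averaging logic concrete.

```python
import random
from collections import defaultdict

def generate_episode():
    """Stand-in episode generator: returns a list of (state, reward) pairs."""
    episode, state = [], "A"
    while True:
        next_state = random.choice(["A", "B", "terminal"])
        reward = 1.0 if next_state == "terminal" else 0.0
        episode.append((state, reward))
        if next_state == "terminal":
            return episode
        state = next_state

def first_visit_mc_prediction(num_episodes=5_000, gamma=0.9):
    """Estimate v_pi(s) as the average return following first visits to s."""
    returns = defaultdict(list)          # state -> list of first-visit returns
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # Record G only if this is the first visit to `state` in the episode.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}

print(first_visit_mc_prediction())
```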
````diff
@@ -45,7 +45,7 @@ This is the general problem of *maintaining exploration*. For policy evaluation
 
 We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.
 
-<u>Monte Carlo Exploring Starts</u>
+**Monte Carlo Exploring Starts**
 
 ```
 Initialize, for all s ∈ S, a ∈ A(s):
````
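
Monte Carlo Exploring Starts, named in this hunk, sidesteps the maintaining-exploration problem by starting every episode from a randomly chosen state-action pair and then acting greedily. The Python sketch below shows one way that could look; the `step` dynamics, the state and action names, and the episode-length cap are invented for illustration, not taken from the notes.

```python
import random
from collections import defaultdict

STATES = ["A", "B"]
ACTIONS = ["left", "right"]

def step(state, action):
    """Toy dynamics: 'right' from B terminates with reward 1, otherwise wander."""
    if state == "B" and action == "right":
        return None, 1.0                       # None marks the terminal state
    return random.choice(STATES), 0.0

def mc_exploring_starts(num_episodes=5_000, gamma=0.9):
    Q = defaultdict(float)                     # (state, action) -> value estimate
    counts = defaultdict(int)                  # (state, action) -> returns averaged so far
    policy = {s: random.choice(ACTIONS) for s in STATES}
    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        state, action = random.choice(STATES), random.choice(ACTIONS)
        episode = []
        while state is not None and len(episode) < 100:
            next_state, reward = step(state, action)
            episode.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]         # follow the current greedy policy
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in ((x, y) for x, y, _ in episode[:t]):      # first visit only
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]          # incremental average
                policy[s] = max(ACTIONS, key=lambda act: Q[(s, act)])  # greedy improvement
    return policy, dict(Q)

print(mc_exploring_starts())
```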
````diff
@@ -74,7 +74,7 @@ On-policy methods attempt to evaluate or improve the policy that is used to make
 
 In on-policy control methods the policy is generally *soft*, meaning that $\pi(a|s) > 0$ for all $a \in \mathcal{A}(s)$. The on-policy methods in this section use $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\epsilon$ they instead select an action at random.
 
-<u>On-policy first-visit MC control (for $\epsilon$-soft policies)</u>
+**On-policy first-visit MC control (for $\epsilon$-soft policies)**
 
 ```
````
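
The $\epsilon$-greedy behaviour described in this hunk can be sketched as on-policy first-visit MC control in a few lines of Python; the toy environment, the `epsilon_greedy` helper, and the hyperparameters below are assumptions, not code from the site.

```python
import random
from collections import defaultdict

STATES = ["A", "B"]
ACTIONS = ["left", "right"]

def step(state, action):
    """Toy dynamics: 'right' from B ends the episode with reward 1."""
    if state == "B" and action == "right":
        return None, 1.0
    return random.choice(STATES), 0.0

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def on_policy_mc_control(num_episodes=5_000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(num_episodes):
        # Generate an episode by following the current epsilon-greedy policy.
        state, episode = random.choice(STATES), []
        while state is not None and len(episode) < 100:
            action = epsilon_greedy(Q, state, epsilon)
            next_state, reward = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        # First-visit updates, walking the episode backwards.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in ((x, y) for x, y, _ in episode[:t]):
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    # Report the greedy policy; the behaviour policy stays epsilon-soft while learning.
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}, dict(Q)

print(on_policy_mc_control())
```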