## Weekly Progress

I didn't do the best job of writing a progress report every week, but the ones I did write are collected on this page.

[January 29 2019](Jan29)

[February 12 2019](Feb12)

[February 25 2019](Feb25)

[March 26 2019](Mar26)

[April 2 2019](Apr2)

# Progress Report for Week of April 2nd

## Added Video Recording Capability to the MinAtar Environment

You can now use the OpenAI Gym Monitor wrapper to watch the actions performed by agents in the MinAtar suite. (Currently the videos are in grayscale.)

Problems I had to solve (a sketch of the grayscale conversion follows this list):

- How to collapse the channels into a single grayscale value
- Getting the tensor into the right format (shape and dtype)
- Adding the additional metadata that OpenAI expected
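
A minimal sketch of the grayscale conversion, assuming the MinAtar state is an (H, W, C) boolean array; the evenly spaced gray levels are a placeholder choice, not necessarily what the actual wrapper does.

```python
import numpy as np

def state_to_grayscale(state: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, C) boolean MinAtar state into one uint8 frame by
    assigning each channel an evenly spaced gray level and keeping the
    brightest level wherever channels overlap."""
    n_channels = state.shape[-1]
    levels = np.linspace(64, 255, n_channels).astype(np.uint8)  # 0 stays "empty"
    return (state.astype(np.uint8) * levels).max(axis=-1)       # shape (H, W)
```

The Monitor wrapper also expects an RGB-shaped uint8 frame and some metadata (such as frames per second), so in practice the grayscale frame still gets stacked into three identical channels; that corresponds to the shape/dtype and metadata issues listed above.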

## Progress Towards \#Exploration

After getting nowhere trying to combine the ideas from the paper on Random Network Distillation with those from Count-Based Exploration and Intrinsic Motivation, I turned to the paper \#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.

This paper uses the idea of an autoencoder to learn a smaller latent state representation of the input. We can then use this smaller representation as a hash and count states based on these hashes.

Playing around with autoencoders, I wanted a way to discretize my hash beyond what floating-point precision allows. Of course, this turns the encoding into a non-differentiable function, so I tried turning to evolutionary methods to optimize it. Sadly, the rate of optimization was drastically diminished with the evolutionary approach, so my experiments for this week failed.
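
For reference, a minimal sketch of the counting idea from the paper, assuming some encoder has already produced a latent vector; the rounding granularity and bonus scale are placeholder values.

```python
from collections import Counter
import numpy as np

state_counts = Counter()

def hash_latent(latent: np.ndarray, granularity: int = 16) -> bytes:
    """Discretize the latent code onto a coarse grid so that nearby
    states collide into the same hash bucket."""
    return np.round(latent * granularity).astype(np.int32).tobytes()

def exploration_bonus(latent: np.ndarray, beta: float = 0.01) -> float:
    """Count-based bonus beta / sqrt(N(hash(s))) added to the reward."""
    key = hash_latent(latent)
    state_counts[key] += 1
    return beta / np.sqrt(state_counts[key])
```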

I'll probably implement what the paper did for my library and then move on to a different piece.
# Weekly Progress Feb 12

## Finished writing scripts for data collection

- Playing a game now records:
  - Video
  - State / next-state as pixel values
  - Action taken
  - Reward received
  - Whether the environment finished on each turn
- Wrote scripts to gather and preprocess the demonstration data (a sketch of the saved layout follows this list)
  - Everything is now standardized on the npy format. Hopefully that stays consistent for a while.
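
A minimal sketch of one way such transitions could be stored, not necessarily the exact layout of the scripts; `np.savez_compressed` simply bundles several .npy arrays into one file.

```python
import numpy as np

def save_episode(path, states, actions, rewards, next_states, dones):
    """Write one episode of transitions as a bundle of npy arrays."""
    np.savez_compressed(
        path,
        states=np.asarray(states, dtype=np.uint8),        # pixel observations
        actions=np.asarray(actions, dtype=np.int64),
        rewards=np.asarray(rewards, dtype=np.float32),
        next_states=np.asarray(next_states, dtype=np.uint8),
        dones=np.asarray(dones, dtype=np.bool_),
    )

# episode = np.load("episode_0.npz")
# states, actions = episode["states"], episode["actions"]
```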

## Wrote code to create an actor that *imitates* the demonstrator

Tweaked the loss function to be a margin-based classification loss:

$$
loss = \max_a\left[Q(s, a) + l(s, a)\right] - Q(s, a_E)
$$

Where $l(s, a)$ is zero for the action the demonstrator took and positive elsewhere.
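
A minimal PyTorch sketch of that loss, assuming `q_values` has shape (batch, n_actions) and `demo_actions` holds the demonstrator's action indices; the margin value is a placeholder.

```python
import torch

def margin_loss(q_values: torch.Tensor, demo_actions: torch.Tensor, margin: float = 0.8) -> torch.Tensor:
    """max_a [Q(s, a) + l(s, a)] - Q(s, a_E), where l is `margin` for every
    action except the demonstrator's, and zero there."""
    l = torch.full_like(q_values, margin)
    l.scatter_(1, demo_actions.unsqueeze(1), 0.0)                    # zero margin on the expert action
    q_expert = q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - q_expert).mean()
```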

It turns out that, as with a lot of deep learning applications, you need a lot of training data, so the agent currently does a poor job of mimicking the performance of the demonstrator.

### Aside: Pretraining with the Bellman Equation

Based on the paper:

Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Learning from Demonstrations for Real World Reinforcement Learning**

This paper's demonstrations did not only include the (state, action) pairs like mine did, but also the (next_state, reward, done) signals. This way, they can pretrain with both the supervised loss and the general Q-learning loss.

That way, they can use the result of the pretraining as a starting point for the actual training. The way I implemented it, I first train an imitator, which is then used as the actor during the simulations; from those simulations we collect data and begin training another network.

## Prioritized Replay

Instead of sampling experiences uniformly, we can sample them according to how surprised we were by the outcome, as measured by the Q-value loss.

I had a previous implementation of this, but it was faulty, so I took the code from OpenAI baselines and integrated it with my library.

It helps with games like Pong because there are many states where the outcome is unsurprising and inconsequential, like when the ball is around the center of the field.
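
A toy sketch of the proportional variant (not the baselines segment-tree version): priorities are $|TD\ error|^\alpha$ and the sampling bias is corrected with importance-sampling weights.

```python
import numpy as np

class ProportionalReplay:
    """Sample transitions with probability proportional to |TD error|^alpha and
    correct the bias with weights of the form (1 / (N * P(i)))^beta."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha
        self.storage, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        self.storage.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size, beta=0.4):
        probs = np.array(self.priorities) / sum(self.priorities)
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        weights = (len(self.storage) * probs[idx]) ** (-beta)
        weights /= weights.max()                      # normalize for stability
        return [self.storage[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```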

## Schedulers

Some people use linear schedulers to change the value of various hyper-parameters throughout training.

I implemented it as an iterator in Python and call *next* each time the function uses the hyper-parameter.

The two parameters I normally use schedulers for are (a sketch follows this list):

- Epsilon - gradually decreases the exploration rate
- Beta - decreases the importance of the weights of experiences that get frequently sampled
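
A minimal sketch of such a scheduler as a Python iterator; the class name and arguments are illustrative rather than the exact RLTorch interface.

```python
class LinearScheduler:
    """Yields `initial` on the first call to next() and linearly interpolates
    to `final` over `num_steps` calls; stays at `final` afterwards."""

    def __init__(self, initial: float, final: float, num_steps: int):
        self.initial, self.final, self.num_steps = initial, final, num_steps
        self.step = 0

    def __iter__(self):
        return self

    def __next__(self) -> float:
        fraction = min(self.step / max(self.num_steps - 1, 1), 1.0)
        self.step += 1
        return self.initial + fraction * (self.final - self.initial)

# Example: anneal epsilon from 1.0 to 0.1 over 10,000 steps
epsilon_schedule = LinearScheduler(1.0, 0.1, 10_000)
# epsilon = next(epsilon_schedule)  # called wherever the agent picks an action
```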

## Layer Norm

"Reduces training time by normalizing the activities of the neurons."

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. **Layer Normalization.**

It's already nicely implemented in PyTorch, so I added it to each layer of the network. It reduces the average loss.
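
For reference, a minimal sketch of what that looks like with `nn.LayerNorm`; the layer sizes are placeholders.

```python
import torch.nn as nn

# Hidden sizes are placeholders; LayerNorm normalizes each layer's activations
model = nn.Sequential(
    nn.Linear(4, 64),
    nn.LayerNorm(64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.LayerNorm(64),
    nn.ReLU(),
    nn.Linear(64, 2),
)
```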
# Weekly Progress for February 25th

## Evolutionary Algorithms

### Genetic Algorithm

I worked towards implementing a genetic algorithm in PyTorch. I got a working implementation which operates as follows (a sketch of the crossover and mutation steps follows the list):

Generate $n$ perturbations of the model by taking the tensors in the model dictionary and adding some random noise to them. Then:

- Calculate the fitness of each model
- Keep the $k$ best survivors
- Sample (with replacement) $2(n - k)$ parents based on their fitness (higher fitness -> more likely to be sampled)
  - An easy way to do this is: $prob = fitness / sum(fitness)$
- Split the parents in half into $parent1$ and $parent2$
- Perform crossover in order to make children
  - For every tensor in the model dictionary:
    - Find a point where you want the split to happen
    - Create a new tensor: the left part of the split comes from $parent1$, the other side comes from $parent2$
- Mutate the child with $\epsilon$ probability
  - Add random noise to the tensor
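
A minimal sketch of the crossover and mutation steps on PyTorch state dictionaries, assuming both parents share the same architecture; the function names, mutation probability, and noise scale are placeholders.

```python
import random
import torch

def crossover(parent1_state, parent2_state):
    """For each tensor, pick a split point along the flattened weights and
    take the left part from parent1 and the right part from parent2."""
    child = {}
    for name in parent1_state:
        p1 = parent1_state[name].flatten()
        p2 = parent2_state[name].flatten()
        split = random.randint(0, p1.numel())
        child[name] = torch.cat([p1[:split], p2[split:]]).reshape(parent1_state[name].shape)
    return child

def mutate(state, epsilon=0.1, noise_scale=0.02):
    """With probability epsilon, add Gaussian noise to each tensor."""
    for name, tensor in state.items():
        if random.random() < epsilon:
            state[name] = tensor + noise_scale * torch.randn_like(tensor)
    return state

# child_state = mutate(crossover(model_a.state_dict(), model_b.state_dict()))
# child_model.load_state_dict(child_state)
```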

Normally, if you run this algorithm for many iterations, your results start to converge towards a particular solution.

The main issue with this algorithm is that you need to carry $n$ copies of the model throughout the entire training process. You also need a reasonably good definition of the bounds within which your weights and biases can lie; otherwise the algorithm might not converge to the true value.

For these reasons, I didn't end up adding this functionality to RLTorch.

### Evolutionary Strategies

To combat these issues, I ran into a paper by OpenAI called "Evolution Strategies":

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever. **Evolution Strategies as a Scalable Alternative to Reinforcement Learning**

https://arxiv.org/abs/1703.03864

*This paper mostly describes the efforts made to make a certain evolutionary strategy scale to many nodes. I ended up using only the algorithm from the paper and didn't implement any of the scalability considerations.*

The following code explains the process for maximizing a simple function. The hyper-parameters and the fitness function below are toy placeholders added so that the snippet runs on its own.

```python
import numpy as np

# Placeholder setup: maximize the toy fitness -||x||^2 (optimum at the zero vector)
population_size, sigma, learning_rate = 50, 0.1, 0.01
current_solution = np.random.randn(10)
def calculate_fitness(candidates):
    return -np.sum(candidates ** 2, axis=1)

# Sample a population of perturbed candidate solutions around the current guess
white_noise = np.random.randn(population_size, *current_solution.shape)
noise = sigma * white_noise
candidate_solutions = current_solution + noise

# Calculate fitness, mean shift, and scale
fitness_values = calculate_fitness(candidate_solutions)
fitness_values = (fitness_values - np.mean(fitness_values)) / (np.std(fitness_values) + np.finfo('float').eps)

new_solution = current_solution + learning_rate * np.mean(white_noise.T * fitness_values, axis=1) / sigma
```

To explain further, suppose you have a guess as to what the solution is. To generate new guesses, let us add random noise around our guess like in the image below.



Now calculate the fitness of all the points; let us represent that by the intensity of blue in the background.



What ends up happening is that your new solution, the black square, will move towards the areas with higher reward.

## Q-Evolutionary Policies

**Motivation**

So this brings up the question: why did I bother studying these algorithms? I ran into a problem when I was looking to implement the DDPG algorithm, primarily that it requires your action space to be continuous, which is not the setting I'm currently working in.

Then I thought, why can't I make it work with discrete actions? First let us recall the loss of a policy function under DDPG:

$$
loss_\pi = -Q(s, \pi(s))
$$

For the discrete case, your Q-function takes in the state and outputs the value of each action under that state. In mathematical terms,

$$
loss_\pi = -Q(s)[\pi(s)]
$$

Indexing into an array, however, is not a differentiable operation, which means I cannot calculate $loss_\pi$ with respect to $\pi$.

Evolutionary Strategies are gradient-free methods, which means I can bypass this restriction of the traditional methods.

**How it works**

Train your Value function with the typical DQN loss.

Every 10 Value function updates, update the policy. This gives time for the Value function to stabilize so that the policy is not chasing suboptimal value functions. Update the policy according to the $loss_\pi$ written above.
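
A minimal sketch of what such a gradient-free policy update could look like, using the same evolution-strategies update as above; the network sizes and hyper-parameters are placeholders, and this is not the exact RLTorch implementation.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s) -> action values
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # pi(s) -> action preferences

def es_policy_update(states, population=20, sigma=0.05, lr=1e-2):
    """One gradient-free policy step: perturb the policy parameters, score each
    perturbation by the Q-value of the actions it picks (loss_pi above), and
    move the parameters toward the better-scoring perturbations."""
    flat = nn.utils.parameters_to_vector(policy_net.parameters()).detach()
    noise = torch.randn(population, flat.numel())
    fitness = torch.empty(population)
    with torch.no_grad():
        q_values = value_net(states)                        # (batch, n_actions)
        for i in range(population):
            nn.utils.vector_to_parameters(flat + sigma * noise[i], policy_net.parameters())
            actions = policy_net(states).argmax(dim=1)      # indexing is fine here: no gradient needed
            fitness[i] = q_values.gather(1, actions.unsqueeze(1)).mean()
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
    new_flat = flat + lr * (noise.t() @ fitness) / (population * sigma)
    nn.utils.vector_to_parameters(new_flat, policy_net.parameters())
```

The value network itself is still trained with the usual DQN loss; a function like this would be called once every 10 of those updates, as described above.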

**Results**



The orange line is the QEP performance and the blue line is DQN.

## Future Direction

I would like to look back towards demonstration data and figure out a way to pretrain a QEP model.

It's somewhat easy to think of a way to train the policy: make it a cross-entropy loss with respect to the actions the demonstrator took.

$$
loss_\pi = -\sum_{c=1}^{M} y_{o,c} \ln(p_{o,c})
$$

Where $M$ is the number of classes, $y_{o,c}$ is the binary indicator for whether observation $o$ belongs to class $c$, and $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$.
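
In PyTorch this pretraining loss is essentially one call, assuming `policy_logits` of shape (batch, n_actions) and the demonstrator's action indices; `F.cross_entropy` applies the softmax and log internally.

```python
import torch.nn.functional as F

def policy_pretrain_loss(policy_logits, demo_actions):
    """Cross-entropy between the policy's action distribution and the
    demonstrator's chosen actions (one class per discrete action)."""
    return F.cross_entropy(policy_logits, demo_actions)
```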

It's harder to think about how I would do it for a value function. There was the approach we saw before, where the loss function was:

$$
loss_Q = \max_a\left[Q(s, a) + l(s, a)\right] - Q(s, a_E)
$$

Where $l(s, a)$ is a vector that is positive for every action except the one the demonstrator took, $a_E$, where it is zero.

The main issue with this loss function for the value function is that it does not capture the actual output values of the function, just how they stand relative to each other. Perhaps adding another layer can help transform it to the values it needs to be. This will take some more thought.
# Weekly Progress Jan 29

## 1. Training From Demonstrations

Training from demonstrations is the act of using previous data to help speed up the learning process.

I read two papers on the topic:

[1] Gabriel V. de la Cruz Jr., Yunshu Du, Matthew E. Taylor. **Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning**.

https://arxiv.org/abs/1709.04083

The authors showed how you can speed up the training of a DQN network, especially for problems involving computer vision, if you first train the convolutional layers using a supervised loss between the actions the network would choose and the actions from the demonstration data for a given state.

[2] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Deep Q-learning from Demonstrations.**

https://arxiv.org/abs/1704.03732

The authors showed how, from "expert" demonstrations, we can speed up the training of a DQN by incorporating the supervised loss into the loss function.

### Supervised Loss

What is supervised loss in the context of DQNs?

$$
Loss = \max_a\left[Q(s, a) + l(s, a)\right] - Q(s, a_E)
$$

Where $a_E$ is the expert action, and $l(s, a)$ is a vector of positive values with an entry of zero for the expert action.

The intuition behind this is that for the loss to be zero, the network would have had to choose the same action as the expert. The $l(s, a)$ term exists to ensure that there are no ties.

### What I decided to do

The main environment I chose to test these algorithms on is Acrobot. It is a control theory problem and it takes several physics-related numbers as input (not image-based).

I noticed when implementing [1] that, at least for the non-convolutional case, there's no point in trying to train the earlier layers. Perhaps I'll try again when I move on to the Atari games...

I decided against following [2] exactly. It's not that I disagree with the approach, but I don't like the need for "expert" data. If you decide to proceed anyway with non-expert data, you need to remember that it is incorporated into the loss function, which means you run the risk of learning sub-optimal policies.

In the end, what I decided to do was the following (a sketch of step 2 follows the list):

1. Train a neural network that maps states->actions from the demonstration data
2. Use that network to play through several simulated runs of the environment, store the (state, action, reward, next_state, done) signals in the experience replay buffer, and train from those (**Pretrain step**)
3. Once the pretrain step is completed, replace the network trained on demonstration data with the one you've been training during the pretrain step and continue with the regular algorithm
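
A minimal sketch of step 2, using the old Gym step API; `imitator.act` and `buffer.add` are hypothetical interfaces standing in for whatever the imitation network and replay buffer expose.

```python
import gym

def fill_buffer_with_imitator(imitator, buffer, episodes=10, env_name="Acrobot-v1"):
    """Pretrain step: let the imitation network act in the environment and
    store the resulting transitions in the experience replay buffer."""
    env = gym.make(env_name)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = imitator.act(state)                    # hypothetical policy interface
            next_state, reward, done, _ = env.step(action)  # old 4-tuple Gym API
            buffer.add(state, action, reward, next_state, done)
            state = next_state
```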
## 2. Noisy Networks

Based on this paper:

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg. **Noisy Networks for Exploration.**

This paper describes adding parametric noise to the weights and biases and how it aids in exploration. The parameters of the noise are learned with gradient descent along with the other network parameters.

For the noise distribution I used the Gaussian normal. One property that's handy to know about it is the following:

$$
N(\mu, \sigma) = \mu + \sigma \cdot N(0, 1)
$$

In our case, the $\mu$ would be the typical weights and biases, and the $\sigma$ is a new parameter representing how much variation or uncertainty needs to be added.

The concept is that as the network grows more confident about its predictions, the variation in the weights starts to decrease. This way the exploration is systematic and not something randomly injected like the epsilon-greedy strategy.

The paper describes replacing all of your densely connected linear layers with this noisy linear approach.
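
A minimal sketch of such a noisy linear layer (the independent-Gaussian variant), where both $\mu$ and $\sigma$ are learned; the initialization constants are placeholders.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights are mu + sigma * eps, with eps resampled on
    every forward pass and (mu, sigma) both learned by gradient descent."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        bound = 1 / math.sqrt(in_features)
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        return F.linear(x, weight, bias)
```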
# Progress for Week of March 26

## Parallelized Evolutionary Strategies

When the parallel ES class is declared, I start a pool of workers; whenever gradients need to be calculated, the loss function and its inputs are sent to that pool to compute.
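
A minimal sketch of the idea with `multiprocessing.Pool`; the class name and interfaces are illustrative rather than the actual RLTorch ones, and the loss function has to be picklable (defined at module level) for the pool to ship it to workers.

```python
from multiprocessing import Pool
import numpy as np

def _evaluate(args):
    """Worker side: score one perturbed parameter vector with the loss function."""
    loss_fn, params, inputs = args
    return -loss_fn(params, inputs)          # lower loss -> higher fitness

class ParallelES:
    def __init__(self, num_workers=4, population=20, sigma=0.05, lr=0.01):
        self.pool = Pool(num_workers)
        self.population, self.sigma, self.lr = population, sigma, lr

    def step(self, loss_fn, params, inputs):
        """Estimate a search-gradient step by scoring perturbations in parallel."""
        noise = np.random.randn(self.population, params.size)
        jobs = [(loss_fn, params + self.sigma * eps, inputs) for eps in noise]
        fitness = np.array(self.pool.map(_evaluate, jobs))
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        return params + self.lr * (noise.T @ fitness) / (self.population * self.sigma)
```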

## Started Exploring Count-Based Exploration

I started looking through papers on exploration and am interested in taking the theoretical niceness of count-based exploration in tabular settings and seeing its effects in the non-tabular case.

"[Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868)" works with an arbitrary density model that follows a couple of nice properties we would expect of probabilities. Namely, $P(S) = N(S) / n$ and $P'(S) = (N(S) + 1) / (n + 1)$, where $N(S)$ represents the number of times you've seen that state, $n$ represents the total number of states you've seen, and $P'(S)$ represents $P(S)$ after you have seen $S$ one more time. With this model, we are able to solve for $N(S)$ and derive what the authors call a *pseudo-count*.
---
showthedate: false
---

**Name:** Brandon Rozek

Department of Computer Science

**Mentor:** Dr. Ron Zacharski

QEP: The Q-Value Policy Evaluation Algorithm

*Abstract.* In Reinforcement Learning, sample complexity is often one of many concerns when designing algorithms. This concern refers to the number of interactions with a given environment that an agent needs in order to effectively learn a task. The Reinforcement Learning framework consists of finding a function (the policy) that maps states/scenarios to actions while maximizing the amount of reward from the environment. For example, in video games the reward is often characterized by some score. In recent years a variety of algorithms came out falling under the categories of Value-based methods and Policy-based methods. Value-based methods create a policy by approximating how much reward an agent is expected to receive if it performs the best actions from a given state. It is then common to choose the action that maximizes such values. Meanwhile, in Policy-based methods, the policy function produces probabilities that an agent performs each action given a state, and this is then optimized for the maximum reward. As such, Value-based methods produce deterministic policies while Policy-based methods produce stochastic/probabilistic policies. Empirically, Value-based methods have lower sample complexity than Policy-based methods. However, in decision making not every situation has a best action associated with it. This is mainly due to the fact that real world environments are dynamic in nature and have confounding variables affecting the result. The QEP Algorithm combines Policy-based methods and Value-based methods by changing the policy's optimization scheme to involve approximate value functions. We have shown that this combines the benefits of both methods so that the sample complexity is kept low while maintaining a stochastic policy.