mirror of
https://github.com/Brandon-Rozek/website.git
synced 2024-11-29 22:50:19 -05:00
67 lines
2.8 KiB
Markdown
67 lines
2.8 KiB
Markdown
---
|
|
title: Weekly Progress Feb 12
|
|
showthedate: false
|
|
math: true
|
|
---
|
|
|
|
## Finished writing scripts for data collection
|
|
|
|
- Playing a game now records
|
|
- Video
|
|
- State / Next-State as pixel values
|
|
- Action taken
|
|
- Reward Received
|
|
- Whether the environment is finished every turn
|
|
- Wrote scripts to gather and preprocess the demonstration data
|
|
- Now everything is standardized on the npy format. Hopefully that stays consistent for a while.
|
|
|
|
## Wrote code to create an actor that *imitates* the demonstrator
|
|
|
|
Tweaked the loss function to be a form of cross-entropy loss
|
|
$$
|
|
loss = max(Q(s, a) + l(s,a)) - Q(s, a_E)
|
|
$$
|
|
Where $l(s, a)$ is zero for the action the demonstrator took and positive elsewhere.
|
|
|
|
Turns out, that as with a lot of deep learning applications, you need a lot of training data. So the agent currently does poorly on mimicking the performance of the demonstrator.
|
|
|
|
### Aside : Pretraining with the Bellman Equation
|
|
|
|
Based off the paper:
|
|
|
|
Todd Hester, Matej Vecerik, Olivier Pietquin, arc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian OsbandI, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Learning from Demonstrations for Real World Reinforcement Learning**
|
|
|
|
|
|
|
|
This paper had the demonstration not include include the $(state, action)$ pairs like I did, but also the $(next_state, reward, done)$ signals. This way, they can pretrain with both supervised loss and with the general Q-learning loss.
|
|
|
|
That way, they can use the result of the pretraining as a starting ground for the actual training. The way I implemented it, I would first train an imitator which would then be used as the actor during the simulations from which we would collect data and begin training another net.
|
|
|
|
## Prioritized Replay
|
|
|
|
Instead of uniform sampling of experiences, we can sample by how surprised we were about the outcome of the Q-value loss.
|
|
|
|
I had a previous implementation of this, but it was faulty, so I took the code from OpenAI baselines and integrated it with my library.
|
|
|
|
It helps with games like Pong, because there are many states where the result is not surprising and inconsequential. Like when the ball is around the center of the field.
|
|
|
|
## Schedulers
|
|
|
|
There are some people who use Linear Schedulers to change the value of various parameters throughout training.
|
|
|
|
I implemented it as an iterator in python and called *next* for each time the function uses the hyper-parameter.
|
|
|
|
The two parameters I use schedulers in normally are:
|
|
|
|
- Epsilon - Gradually decreases exploration rate
|
|
- Beta - Decreases the importance of the weights of experiences that get frequently sampled
|
|
|
|
|
|
|
|
## Layer Norm
|
|
|
|
"Reduces training by normalizes the activities of the neurons."
|
|
|
|
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. **Layer Normalization.**
|
|
|
|
It's nicely implemented in PyTorch already so I threw that in for each layer of the network. Reduces the average loss.
|