# Weekly Progress Feb 12
## Finished writing scripts for data collection
- Playing a game now records
  - Video
  - State / Next-State as pixel values
  - Action taken
  - Reward received
  - Whether the environment is finished every turn
- Wrote scripts to gather and preprocess the demonstration data
- Now everything is standardized on the npy format. Hopefully that stays consistent for a while. (A rough sketch of the recording format follows below.)
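For context, here is roughly what the collection scripts end up storing. The function and file names are hypothetical stand-ins, not the actual API of my library:

```python
import numpy as np

# Hypothetical sketch: buffer one entry per signal every turn,
# then dump each signal as its own .npy file.
states, actions, rewards, next_states, dones = [], [], [], [], []

def record(state, action, reward, next_state, done):
    states.append(state)            # current frame as pixel values
    actions.append(action)          # action taken
    rewards.append(reward)          # reward received
    next_states.append(next_state)  # next frame as pixel values
    dones.append(done)              # whether the environment finished

def save(prefix="demo"):
    np.save(prefix + "_states.npy", np.array(states))
    np.save(prefix + "_actions.npy", np.array(actions))
    np.save(prefix + "_rewards.npy", np.array(rewards))
    np.save(prefix + "_next_states.npy", np.array(next_states))
    np.save(prefix + "_dones.npy", np.array(dones))
```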
## Wrote code to create an actor that *imitates* the demonstrator
Tweaked the loss function to be a form of large-margin classification loss:

$$
loss = \max_a \left[ Q(s, a) + l(s, a) \right] - Q(s, a_E)
$$

where $l(s, a)$ is zero for the action the demonstrator took and positive elsewhere.
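A minimal PyTorch sketch of this loss (not the exact code from my library; the batch layout and the margin value of 0.8 are illustrative assumptions):

```python
import torch

def margin_loss(q_values, demo_actions, margin=0.8):
    """max_a [Q(s, a) + l(s, a)] - Q(s, a_E), averaged over the batch.

    q_values:     (batch, num_actions) tensor of Q(s, .)
    demo_actions: (batch,) long tensor holding the demonstrator's action a_E
    margin:       the positive value l(s, a) takes when a != a_E
    """
    # l(s, a): zero for the demonstrated action, `margin` everywhere else
    l = torch.full_like(q_values, margin)
    l.scatter_(1, demo_actions.unsqueeze(1), 0.0)

    q_demo = q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - q_demo).mean()
```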
Turns out that, as with a lot of deep learning applications, you need a lot of training data, so the agent currently does a poor job of mimicking the demonstrator's performance.
### Aside: Pretraining with the Bellman Equation
Based on the paper:
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Learning from Demonstrations for Real World Reinforcement Learning**
This paper had the demonstrations include not only the $(state, action)$ pairs like I did, but also the $(next\_state, reward, done)$ signals. This way, they can pretrain with both the supervised loss and the general Q-learning loss.
That way, they can use the result of the pretraining as a starting point for the actual training. The way I implemented it, I first train an imitator, which is then used as the actor during the simulations; from those simulations we collect data and begin training another net.
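As a sketch of how the two pretraining losses could be combined (building on the `margin_loss` sketch above; the paper also adds an n-step return term and L2 regularization, which I omit here, and the weighting is arbitrary):

```python
import torch
import torch.nn.functional as F

def pretraining_loss(q_net, target_net, batch, gamma=0.99, lambda_margin=1.0):
    """One-step Q-learning loss plus the supervised margin loss,
    computed entirely from demonstration transitions."""
    # dones is assumed to be a 0/1 float tensor
    states, actions, rewards, next_states, dones = batch

    q_values = q_net(states)  # (batch, num_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1 - dones) * next_q

    td_loss = F.smooth_l1_loss(q_taken, td_target)
    return td_loss + lambda_margin * margin_loss(q_values, actions)
```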
## Prioritized Replay
Instead of sampling experiences uniformly, we can sample them in proportion to how surprising they were, measured by the magnitude of the Q-value (TD) error.
I had a previous implementation of this, but it was faulty, so I took the code from OpenAI Baselines and integrated it into my library.
It helps with games like Pong, because there are many states that are neither surprising nor consequential, like when the ball is around the center of the field.
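The Baselines implementation uses a segment tree for efficiency, but the idea itself is small enough to show with a toy proportional sampler (the alpha, beta, and epsilon values here are just typical defaults, not necessarily what I use):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Toy proportional prioritization: pick indices with probability
    proportional to |TD error|^alpha and return importance-sampling
    weights whose strength is controlled by beta."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()  # normalize so weights only scale the loss down
    return indices, weights
```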
## Schedulers
Some people use linear schedulers to change the value of various hyperparameters throughout training.
I implemented it as an iterator in Python and call *next* each time the function uses the hyperparameter (sketched after the list below).
The two parameters I normally use schedulers for are:
- Epsilon - Gradually decreases the exploration rate
- Beta - Decreases the importance of the weights of experiences that get frequently sampled
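Roughly what the scheduler looks like (a simplified sketch, not my library's exact class; the annealing ranges below are only example values):

```python
class LinearScheduler:
    """Iterator that linearly moves a value from `start` to `end`
    over `steps` calls to next(), then keeps returning `end`."""
    def __init__(self, start, end, steps):
        self.start, self.end, self.steps = start, end, steps
        self.t = 0

    def __iter__(self):
        return self

    def __next__(self):
        frac = min(self.t / self.steps, 1.0)
        self.t += 1
        return self.start + frac * (self.end - self.start)

# Example usage: anneal epsilon down and beta up over 100k steps
epsilon_schedule = LinearScheduler(1.0, 0.02, steps=100_000)
beta_schedule = LinearScheduler(0.4, 1.0, steps=100_000)
epsilon = next(epsilon_schedule)  # call next() wherever the hyperparameter is used
beta = next(beta_schedule)
```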
## Layer Norm
"Reduces training by normalizes the activities of the neurons."
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. **Layer Normalization.**
It's already nicely implemented in PyTorch, so I threw that in for each layer of the network. It reduces the average loss.
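Adding it is a one-liner per layer; here is an illustrative fully connected network (the layer sizes are made up, not my actual architecture):

```python
import torch.nn as nn

# Hypothetical Q-network with nn.LayerNorm after each hidden layer
q_net = nn.Sequential(
    nn.Linear(128, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Linear(256, 4),  # one output per action
)
```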