# Weekly Progress Feb 12
## Finished writing scripts for data collection
- Playing a game now records
  - Video
  - State / Next-State as pixel values
  - Action taken
  - Reward received
  - Whether the environment is finished every turn
- Wrote scripts to gather and preprocess the demonstration data
- Now everything is standardized on the npy format. Hopefully that stays consistent for a while. (A rough sketch of the recording format follows below.)
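For context, here is roughly what the collection scripts end up storing. The function and file names are hypothetical stand-ins, not the actual API of my library:

```python
import numpy as np

# Hypothetical sketch: buffer one entry per signal every turn,
# then dump each signal as its own .npy file.
states, actions, rewards, next_states, dones = [], [], [], [], []

def record(state, action, reward, next_state, done):
    states.append(state)            # current frame as pixel values
    actions.append(action)          # action taken
    rewards.append(reward)          # reward received
    next_states.append(next_state)  # next frame as pixel values
    dones.append(done)              # whether the environment finished

def save(prefix="demo"):
    np.save(prefix + "_states.npy", np.array(states))
    np.save(prefix + "_actions.npy", np.array(actions))
    np.save(prefix + "_rewards.npy", np.array(rewards))
    np.save(prefix + "_next_states.npy", np.array(next_states))
    np.save(prefix + "_dones.npy", np.array(dones))
```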
## Wrote code to create an actor that *imitates* the demonstrator
Tweaked the loss function to be a form of large-margin classification loss:

$$
loss = \max_a \left[ Q(s, a) + l(s, a) \right] - Q(s, a_E)
$$

where $l(s, a)$ is zero for the action the demonstrator took and positive elsewhere.
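A minimal PyTorch sketch of this loss (not the exact code from my library; the batch layout and the margin value of 0.8 are illustrative assumptions):

```python
import torch

def margin_loss(q_values, demo_actions, margin=0.8):
    """max_a [Q(s, a) + l(s, a)] - Q(s, a_E), averaged over the batch.

    q_values:     (batch, num_actions) tensor of Q(s, .)
    demo_actions: (batch,) long tensor holding the demonstrator's action a_E
    margin:       the positive value l(s, a) takes when a != a_E
    """
    # l(s, a): zero for the demonstrated action, `margin` everywhere else
    l = torch.full_like(q_values, margin)
    l.scatter_(1, demo_actions.unsqueeze(1), 0.0)

    q_demo = q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - q_demo).mean()
```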
Turns out that, as with a lot of deep learning applications, you need a lot of training data, so the agent currently does a poor job of mimicking the demonstrator's performance.
### Aside: Pretraining with the Bellman Equation
Based on the paper:
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Learning from Demonstrations for Real World Reinforcement Learning**
This paper had the demonstrations include not only the $(state, action)$ pairs like I did, but also the $(next\_state, reward, done)$ signals. This way, they can pretrain with both the supervised loss and the general Q-learning loss.
That way, they can use the result of the pretraining as a starting point for the actual training. The way I implemented it, I first train an imitator, which is then used as the actor during the simulations; from those simulations we collect data and begin training another net.
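As a sketch of how the two pretraining losses could be combined (building on the `margin_loss` sketch above; the paper also adds an n-step return term and L2 regularization, which I omit here, and the weighting is arbitrary):

```python
import torch
import torch.nn.functional as F

def pretraining_loss(q_net, target_net, batch, gamma=0.99, lambda_margin=1.0):
    """One-step Q-learning loss plus the supervised margin loss,
    computed entirely from demonstration transitions."""
    # dones is assumed to be a 0/1 float tensor
    states, actions, rewards, next_states, dones = batch

    q_values = q_net(states)  # (batch, num_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1 - dones) * next_q

    td_loss = F.smooth_l1_loss(q_taken, td_target)
    return td_loss + lambda_margin * margin_loss(q_values, actions)
```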
## Prioritized Replay
Instead of sampling experiences uniformly, we can sample them in proportion to how surprising they were, measured by the magnitude of the Q-value (TD) error.
I had a previous implementation of this, but it was faulty, so I took the code from OpenAI Baselines and integrated it into my library.
It helps with games like Pong, because there are many states that are neither surprising nor consequential, like when the ball is around the center of the field.
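The Baselines implementation uses a segment tree for efficiency, but the idea itself is small enough to show with a toy proportional sampler (the alpha, beta, and epsilon values here are just typical defaults, not necessarily what I use):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Toy proportional prioritization: pick indices with probability
    proportional to |TD error|^alpha and return importance-sampling
    weights whose strength is controlled by beta."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()  # normalize so weights only scale the loss down
    return indices, weights
```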
## Schedulers
Some people use linear schedulers to change the value of various hyperparameters throughout training.
I implemented it as an iterator in Python and call *next* each time the function uses the hyperparameter (sketched after the list below).
The two parameters I normally use schedulers for are:
- Epsilon - Gradually decreases the exploration rate
- Beta - Decreases the importance of the weights of experiences that get frequently sampled
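Roughly what the scheduler looks like (a simplified sketch, not my library's exact class; the annealing ranges below are only example values):

```python
class LinearScheduler:
    """Iterator that linearly moves a value from `start` to `end`
    over `steps` calls to next(), then keeps returning `end`."""
    def __init__(self, start, end, steps):
        self.start, self.end, self.steps = start, end, steps
        self.t = 0

    def __iter__(self):
        return self

    def __next__(self):
        frac = min(self.t / self.steps, 1.0)
        self.t += 1
        return self.start + frac * (self.end - self.start)

# Example usage: anneal epsilon down and beta up over 100k steps
epsilon_schedule = LinearScheduler(1.0, 0.02, steps=100_000)
beta_schedule = LinearScheduler(0.4, 1.0, steps=100_000)
epsilon = next(epsilon_schedule)  # call next() wherever the hyperparameter is used
beta = next(beta_schedule)
```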
## Layer Norm
"Reduces training by normalizes the activities of the neurons."
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. **Layer Normalization.**
It's already nicely implemented in PyTorch, so I threw that in for each layer of the network. It reduces the average loss.
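Adding it is a one-liner per layer; here is an illustrative fully connected network (the layer sizes are made up, not my actual architecture):

```python
import torch.nn as nn

# Hypothetical Q-network with nn.LayerNorm after each hidden layer
q_net = nn.Sequential(
    nn.Linear(128, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.LayerNorm(256),
    nn.ReLU(),
    nn.Linear(256, 4),  # one output per action
)
```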