Fixed titles, math rendering, and links on some pages

This commit is contained in:
Brandon Rozek 2021-07-26 09:13:20 -04:00
parent 9f096a8720
commit 330ace0de9
61 changed files with 303 additions and 115 deletions


@@ -1,4 +1,7 @@
# Progress Report for Week of April 2nd
---
title: Progress Report for Week of April 2nd
showthedate: false
---
## Added Video Recording Capability to MinAtar environment
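Purely as an illustration of what such a recording hook might look like (the real implementation lives in the library itself), here is a minimal sketch that collects MinAtar state grids during a random rollout and writes them out as a GIF. It assumes MinAtar's `Environment` API (`state()`, `act()`, `reset()`, `num_actions()`) and `imageio`; the grayscale mapping is a stand-in, not the one used in the repo.

```python
# Sketch: dump a MinAtar episode to a GIF. Assumes MinAtar's Environment API
# (state(), act(), reset(), num_actions()) and imageio; the color mapping is a stand-in.
import numpy as np
import imageio
from minatar import Environment

env = Environment('breakout')
env.reset()
frames, done = [], False
while not done:
    state = env.state()                          # (10, 10, n_channels) boolean grid
    # Collapse the channel dimension into a single grayscale frame
    frame = (state.argmax(axis=2) + state.any(axis=2)) * (255 // state.shape[2])
    frames.append(frame.astype(np.uint8))
    reward, done = env.act(np.random.randint(env.num_actions()))

imageio.mimsave('episode.gif', frames, fps=15)
```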
@@ -20,6 +23,3 @@ Playing around with the idea of autoencoders, I wanted a way to discretize my
I'll probably look towards implementing what the paper did for my library and move on to a different piece.
Guru Indian: 3140 Cowan Blvd, Fredericksburg, VA 22401


@@ -1,4 +1,8 @@
# Weekly Progress Feb 12
---
title: Weekly Progress Feb 12
showthedate: false
math: true
---
## Finished writing scripts for data collection
@@ -60,4 +64,4 @@ The two parameters I normally use schedulers for are:
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. **Layer Normalization.**
It's nicely implemented in PyTorch already so I threw that in for each layer of the network. Reduces the average loss.
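To make that note concrete, here is a minimal sketch of what "LayerNorm after each layer" can look like in PyTorch; the network and its layer sizes are placeholders rather than the ones from my library.

```python
# Sketch: a small fully connected network with nn.LayerNorm after each hidden layer.
# The layer sizes are placeholders; only the LayerNorm placement mirrors the note above.
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),   # normalizes across the feature dimension
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, x):
        return self.head(self.body(x))
```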


@@ -1,4 +1,8 @@
# Weekly Progress for February 25th
---
title: Weekly Progress for February 25th
showthedate: false
math: true
---
## Evolutionary Algorithms


@@ -1,4 +1,8 @@
# Weekly Progress Jan 29
---
title: Weekly Progress Jan 29
showthedate: false
math: true
---
## 1. Training From Demonstrations
@@ -6,13 +10,13 @@ Training from demonstrations is the act of using previous data to help speed up
I read two papers on the topic:
[1] Gabriel V. de la Cruz Jr., Yunshu Du, Matthew E. Taylor. **Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning**.
(1) Gabriel V. de la Cruz Jr., Yunshu Du, Matthew E. Taylor. **Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning**.
https://arxiv.org/abs/1709.04083
The authors showed how you can speed up the training of a DQN, especially for problems involving computer vision, if you first train the convolutional layers using a supervised loss between the actions the network would choose and the actions from the demonstration data, given a state.
[2] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Deep Q-learning from Demonstrations.**
(2) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Deep Q-learning from Demonstrations.**
https://arxiv.org/abs/1704.03732
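A rough sketch of the pre-training idea from (1): treat the Q-values as logits and minimize a cross-entropy loss against the demonstrated actions. The `q_network` and `demo_loader` names below are hypothetical placeholders, not the interface from either paper or my library.

```python
# Sketch: supervised pre-training of a Q-network on demonstration (state, action) pairs,
# treating the Q-values as logits and the demonstrated action as the class label.
import torch
import torch.nn as nn

def pretrain_from_demos(q_network, demo_loader, epochs=5, lr=1e-3):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(q_network.parameters(), lr=lr)
    for _ in range(epochs):
        for states, demo_actions in demo_loader:   # demo_actions: LongTensor of action indices
            logits = q_network(states)             # shape (batch, num_actions)
            loss = criterion(logits, demo_actions)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

In the convolutional setting of (1), you would typically keep only the early layers from this pre-training and then continue with regular DQN training.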
@@ -32,9 +36,9 @@ The intuition behind this is that for the loss to be zero, the network would've
The main environment I chose to test these algorithms on is Acrobot. It is a control theory problem whose inputs are several physics-related quantities (not image-based).
I noticed when implementing [1], at least for the non-convolutional case, that there's no point in trying to train earlier layers. Perhaps I'll try again when I move on to the Atari games...
I noticed when implementing (1), at least for the non-convolutional case, that there's no point in trying to train earlier layers. Perhaps I'll try again when I move on to the Atari games...
I decided against following [2] exactly. It's not that I disagree with the approach, but I don't like the need for "expert" data. If you decide to proceed anyway with non-expert data, you need to remember that it is incorporated into the loss function, which means you risk learning sub-optimal policies.
I decided against following (2) exactly. It's not that I disagree with the approach, but I don't like the need for "expert" data. If you decide to proceed anyway with non-expert data, you need to remember that it is incorporated into the loss function, which means you risk learning sub-optimal policies.
In the end, what I decided to do was the following
@@ -62,4 +66,4 @@ In our case, the $\mu$ would be the typical weights and biases, and the $\sigma$
The concept is that as the network grows more confident about its predictions, the variation in the weights starts to decrease. This way the exploration is systematic and not something randomly injected like the epsilon-greedy strategy.
The paper describes replacing all your linear densely connected layers with this noisy linear approach.
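As a hedged sketch of what such a layer could look like: a learned $\mu$ and $\sigma$ for every weight and bias, with fresh Gaussian noise drawn on each forward pass. The initialization constants here are guesses, not necessarily the paper's values.

```python
# Sketch: a linear layer whose weights are mu + sigma * eps, with eps resampled each forward pass.
# As sigma is learned toward zero, the layer's behaviour approaches a plain nn.Linear.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        bound = 1 / math.sqrt(in_features)
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        return F.linear(x, weight, bias)
```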


@@ -1,4 +1,8 @@
# Progress for Week of March 26
---
title: Progress for Week of March 26
showthedate: false
math: true
---
## Parallelized Evolutionary Strategies
@@ -8,4 +12,4 @@ When the parallel ES class is declared, I start a pool of workers that then gets
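A hedged sketch of that worker-pool idea: start a `multiprocessing.Pool` once when the parallel ES object is constructed and reuse it to score every perturbed parameter vector each generation. The class, fitness function, and hyperparameters below are placeholders, not my library's actual interface.

```python
# Sketch: evaluating an ES population in parallel with a persistent worker pool.
# evaluate_fitness and the hyperparameters are placeholders.
import numpy as np
from multiprocessing import Pool

def evaluate_fitness(params):
    # Placeholder: run an episode with these parameters and return the total reward.
    return -np.sum(params ** 2)

class ParallelES:
    def __init__(self, num_params, population_size=32, sigma=0.1, num_workers=4):
        self.mean = np.zeros(num_params)
        self.sigma = sigma
        self.population_size = population_size
        self.pool = Pool(num_workers)   # started once, reused every generation

    def step(self, lr=0.01):
        noise = np.random.randn(self.population_size, self.mean.size)
        candidates = [self.mean + self.sigma * eps for eps in noise]
        fitness = np.array(self.pool.map(evaluate_fitness, candidates))
        # Standard ES gradient estimate: fitness-weighted sum of the noise directions
        fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)
        self.mean += lr / (self.population_size * self.sigma) * noise.T @ fitness
```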
I started looking through papers on exploration and am interested in taking the theoretical niceness of count-based exploration in tabular settings and seeing its effects in the non-tabular case.
""[Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868)" creates a model of a arbitrary density model that follows a couple nice properties we would expect of probabilities. Namely, $P(S) = N(S) / n$ and $P'(S) = (N(S) + 1) / (n + 1)$. Where $N(S)$ represents the number of times you've seen that state, $n$ represents the total number of states you've seen, and $P'(S)$ represents the $P(S)$ after you have seen $S$ another time. With this model, we are able to solve for $N(S)$ and derive what the authors call a *Psuedo-Count*.
""[Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868)" creates a model of a arbitrary density model that follows a couple nice properties we would expect of probabilities. Namely, $P(S) = N(S) / n$ and $P'(S) = (N(S) + 1) / (n + 1)$. Where $N(S)$ represents the number of times you've seen that state, $n$ represents the total number of states you've seen, and $P'(S)$ represents the $P(S)$ after you have seen $S$ another time. With this model, we are able to solve for $N(S)$ and derive what the authors call a *Psuedo-Count*.


@@ -0,0 +1,16 @@
---
title: Weekly Progress
showthedate: false
---
I didn't do the greatest job of writing a progress report every week, but here on this page are the ones I did write.
[January 29 2019](jan29)
[February 12 2019](feb12)
[February 25 2019](feb25)
[March 26 2019](mar26)
[April 2 2019](apr2)