{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "PrioReplayPoleBalanceKeras.ipynb",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"metadata": {
"id": "-L43Q8pmzW4a",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Solving CartPole using Deep Q-Learning"
]
},
{
"metadata": {
"id": "qJ_KSZ1RzbVO",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let's start by installing OpenAI Gym in order to get the CartPole environment"
]
},
{
"metadata": {
"id": "W07n66lhB6yq",
"colab_type": "code",
"outputId": "e02c6361-73a4-4abd-a3b1-2c580b7938f2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"cell_type": "code",
"source": [
"!pip install gym"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already satisfied: gym in /usr/local/lib/python3.6/dist-packages (0.10.9)\n",
"Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from gym) (1.1.0)\n",
"Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from gym) (1.14.6)\n",
"Requirement already satisfied: requests>=2.0 in /usr/local/lib/python3.6/dist-packages (from gym) (2.18.4)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from gym) (1.11.0)\n",
"Requirement already satisfied: pyglet>=1.2.0 in /usr/local/lib/python3.6/dist-packages (from gym) (1.3.2)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym) (2018.11.29)\n",
"Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym) (2.6)\n",
"Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym) (1.22)\n",
"Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym) (3.0.4)\n",
"Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from pyglet>=1.2.0->gym) (0.16.0)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "xEaZoFujzfu6",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Now to load all the packages we need"
]
},
{
"metadata": {
"id": "l4X6w1HH_aYo",
"colab_type": "code",
"outputId": "85166b65-10cd-49e7-9f35-9bb207761e08",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"cell_type": "code",
"source": [
"import os # FOR PYTHONHASHSEED\n",
"import gym\n",
"import numpy as np\n",
"import random\n",
"import tensorflow as tf\n",
"from collections import namedtuple\n",
"from keras import backend as K\n",
"from keras.models import Sequential\n",
"from keras.layers import Dense\n",
"from keras.optimizers import Adam\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
],
"name": "stderr"
}
]
},
{
"metadata": {
"id": "q9qtilaKTVyY",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Setting the seeds in order to make our results reproducible"
]
},
{
"metadata": {
"id": "Xhur-j1NTjKz",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"SEED = 9001"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "0y2syh5WTtUa",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# 1. `PYTHONHASHSEED` environment variable at a fixed value\n",
"os.environ['PYTHONHASHSEED']=str(SEED)\n",
"\n",
"\n",
"np.random.seed(SEED)\n",
"random.seed(SEED)\n",
"\n",
"\n",
"# Force TensorFlow to use single thread.\n",
"# Multiple threads are a potential source of non-reproducible results.\n",
"# For further details, see: https://stackoverflow.com/questions/42022950/\n",
"\n",
"session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,\n",
" inter_op_parallelism_threads=1)\n",
"\n",
"\n",
"# The below tf.set_random_seed() will make random number generation\n",
"# in the TensorFlow backend have a well-defined initial state.\n",
"# For further details, see:\n",
"# https://www.tensorflow.org/api_docs/python/tf/set_random_seed\n",
"\n",
"tf.set_random_seed(1234)\n",
"\n",
"sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)\n",
"K.set_session(sess)\n"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "OHPFS9jJW-Xs",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Experience Replay Class\n",
"This class allows for fast random memory access and holds only up to the max capacity"
]
},
{
"metadata": {
"id": "N4xJmHmhKUsv",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"Transition = namedtuple('Transition',\n",
" ('state', 'action', 'reward', 'next_state', 'done'))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "9ZI-Hd5pvH_1",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Implements a Ring Buffer\n",
"class ReplayMemory(object):\n",
" def __init__(self, capacity):\n",
" self.capacity = capacity\n",
" self.memory = []\n",
" self.position = 0\n",
"\n",
" def append(self, *args):\n",
" \"\"\"Saves a transition.\"\"\"\n",
" if len(self.memory) < self.capacity:\n",
" self.memory.append(None)\n",
" self.memory[self.position] = Transition(*args)\n",
" self.position = (self.position + 1) % self.capacity\n",
"\n",
" def sample(self, batch_size):\n",
" return random.sample(self.memory, batch_size)\n",
"\n",
" def __len__(self):\n",
" return len(self.memory)"
],
"execution_count": 0,
"outputs": []
},
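{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"As a quick sanity check (not part of the training itself, and using made-up toy values), the sketch below fills a small `ReplayMemory` past its capacity to show that the ring buffer overwrites its oldest transitions."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Illustration with assumed toy values: a capacity-3 buffer keeps only the newest 3 transitions\n",
"demo_memory = ReplayMemory(capacity=3)\n",
"for i in range(5):\n",
"    demo_memory.append(i, 0, 1.0, i + 1, False)  # state, action, reward, next_state, done\n",
"print(len(demo_memory))    # 3 -- the buffer never grows past its capacity\n",
"print(demo_memory.memory)  # the oldest transitions (states 0 and 1) have been overwritten"
],
"execution_count": 0,
"outputs": []
},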
{
"metadata": {
"id": "VMr7pphvtSVH",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Deep Q-Learning Agent"
]
},
{
"metadata": {
"id": "2Ar3yLxrtU7P",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The DQNAgent class is going to house all the logic necessary to have a Deep Q-Learning Agent. Since there's a lot going on here, this section will be longer than the others."
]
},
{
"metadata": {
"id": "FNyEJDty0Vn_",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Parameters"
]
},
{
"metadata": {
"id": "kGIzk0E00XEh",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"There are several parameters that are hard-coded into the model that should be tweaked when applying it to different problems to see if it affects performance. We will describe each parameter briefly here.\n",
"\n",
"\n",
"\n",
"* Epsilon: The exploration rate. How often will the agent choose a random move during training instead of relying on information it already has. Helps the agent go down paths it normally wouldn't in hopes for higher long term rewards.\n",
" - Epsilon Decay: How much our epsilon decreases after each update\n",
" - Epsilon Min: The lowest rate of exploration we'll allow during training\n",
"* Gamma: Discount rate. This tells us how much we prioritize current versus future rewards.\n",
"* Tau: Affects how much we shift our knowledge based off of new information.\n",
"* Learning Rate: Affects the optimization procedures in the neural networks.\n",
"\n"
]
},
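{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The sketch below is only an illustration: it uses the same epsilon settings the `DQNAgent` below uses (start 1.0, decay 0.9995 per action, floor 0.001), while the 10,000-action horizon is an arbitrary assumption made for the plot."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Sketch of the epsilon schedule (values match the DQNAgent defined below;\n",
"# the 10000-step horizon is an arbitrary choice made only for this plot)\n",
"eps, eps_min, eps_decay = 1.0, 0.001, 0.9995\n",
"schedule = []\n",
"for _ in range(10000):\n",
"    schedule.append(eps)\n",
"    if eps > eps_min:\n",
"        eps *= eps_decay\n",
"plt.plot(schedule)\n",
"plt.xlabel('actions taken')\n",
"plt.ylabel('epsilon')\n",
"plt.title('Exploration rate decay');"
],
"execution_count": 0,
"outputs": []
},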
{
"metadata": {
"id": "0TOKiRNm13E3",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Fixed Q-Targets"
]
},
{
"metadata": {
"id": "jGcn7vPp15VD",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"In Q-Learning we update our Q_Table through the following function:\n",
"\n",
"$Q_{TableEntry}(state, action) = Reward + max(Q_{TableEntry}(state))$\n",
"\n",
"Since our update is dependent on the same table itself, we can start to get correlated entries. This could cause oscillations to happen in our training. \n",
"\n",
"To combat this, we implemented a target model. It essentially is a copy of the original model, except that the values do not update as rapidly. The rate at which the target model updates is dependent upon `Tau` in our parameter list."
]
},
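{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Before looking at the full agent, here is a minimal numerical sketch of that soft update. The weight values are made up; `tau = 0.1` matches the agent defined below."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Toy illustration (made-up weights) of the soft update applied by the agent's\n",
"# update_target_model method further down\n",
"tau = 0.1\n",
"online_weights = np.array([1.0, 2.0, 3.0])  # hypothetical weights of the online model\n",
"target_weights = np.array([0.0, 0.0, 0.0])  # hypothetical weights of the target model\n",
"target_weights = tau * online_weights + (1 - tau) * target_weights\n",
"print(target_weights)  # [0.1 0.2 0.3] -- the target moved 10% of the way toward the online model"
],
"execution_count": 0,
"outputs": []
},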
{
"metadata": {
"id": "UzSSTd5W8rD5",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Prioritized Replay\n",
"Create a probability vector giving priority to experiences with higher error\n",
"\n",
"$ prob1 = error / sum(error) $\n",
"\n",
"Then calculate the standard uniform probabilities\n",
"$ prob_unif = 1 / length(error) $\n",
"\n",
"Finally, interpolate between the two using the hyperparameter $v$\n",
"\n",
"$ prob = v * prob1 + (1 - v) * prob_unif $\n"
]
},
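{
"metadata": {
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The sketch below works these formulas through on a made-up vector of errors, with $v = 0.25$ as in the agent further down. It only illustrates the probability blend, not the agent's full sampling code."
]
},
{
"metadata": {
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Toy example of the prioritized-replay probabilities (the error values are made up)\n",
"v = 0.25\n",
"td_error = np.array([0.5, 0.1, 2.0, 0.4])\n",
"greedy_prob = td_error / np.sum(td_error)                  # proportional to the error\n",
"uniform_prob = np.full(td_error.size, 1 / td_error.size)   # plain uniform sampling\n",
"prob = v * greedy_prob + (1 - v) * uniform_prob            # interpolate between the two\n",
"print(prob, prob.sum())  # a valid probability vector that still favors the high-error transition"
],
"execution_count": 0,
"outputs": []
},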
{
"metadata": {
"id": "iqJQvpl94h5_",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Agent Workflow"
]
},
{
"metadata": {
"id": "HFW8vNbp5f-V",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"1. Create an empty neural network for both the model and target model.\n",
"2. Given a starting state, perform an action.\n",
"3. Once you performed the action on the environment, store the information gained into the agent's memory.\n",
"4. Once the agent acquires enough memory to meet the batch size requirements, select `batch_size` number of past memories randomly and train the model on those experiences.\n",
"5. The model will be trained by calculating the value of the current state and updating the value proposed previously by the model.\n",
"6. The value of the state.\n",
" - If the game is finished, then the reward is the value of that state.\n",
" - If not, then take the current reward and add the discounted reward of future states.\n",
"7. Decay the epsilon value as described in the parameters.\n",
"8. Gradually update the target model.\n",
"9. Perform an action for the new state, either randomly through epsilon, or by choosing the best action based on what we currently know.\n",
"9. Repeat steps 4-9."
]
},
{
"metadata": {
"id": "yHQQa4PKCKtO",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"class DQNAgent:\n",
" def __init__(self, state_size, action_size):\n",
" self.state_size = state_size\n",
" self.action_size = action_size\n",
" # The deque will only contain the last 2000 entries\n",
" self.memory = ReplayMemory(capacity=2000)\n",
" self.v = 0.25 # How far away from uniform sampling? Max: 1\n",
" self.gamma = 0.95 # Discount Rate\n",
" self.epsilon = 1.0 # Exploration Rate\n",
" self.epsilon_min = 0.001\n",
" self.epsilon_decay = 0.9995\n",
" self.learning_rate = 0.001\n",
" self.model = self._build_model()\n",
" ## Additional components for Fixed Q-Targets\n",
" self.target_model = self._build_model()\n",
" self.tau = 0.1 # We want to adjust our network by 10% each time\n",
" \n",
" def _build_model(self):\n",
" model = Sequential()\n",
" model.add(Dense(24, input_shape=(self.state_size,), activation='relu'))\n",
" model.add(Dense(24, activation='relu'))\n",
" model.add(Dense(self.action_size, activation='linear'))\n",
" model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))\n",
" return model\n",
" \n",
" def update_target_model(self):\n",
" for layer_target, layer_src in zip(self.target_model.layers, self.model.layers):\n",
" weights = layer_src.get_weights()\n",
" target_weights = layer_target.get_weights()\n",
" # Adjust the weights of the target to be tau proportion closer to the current\n",
" for i in range(len(weights)):\n",
" target_weights[i] = self.tau * weights[i] + (1 - self.tau) * target_weights[i]\n",
" layer_target.set_weights(target_weights)\n",
" \n",
" def remember(self, state, action, reward, next_state, done):\n",
" self.memory.append(state, action, reward, next_state, done)\n",
" \n",
" def act_random(self):\n",
" return random.randrange(self.action_size)\n",
" \n",
" def best_act(self, state):\n",
" # Choose the best action based on what we already know\n",
" # If all the action values for a given state is the same, then act randomly\n",
" action_values = self.model.predict(state)[0]\n",
" action = self.act_random() if np.all(action_values[0] == action_values) else np.argmax(action_values)\n",
" return action\n",
" \n",
" def act(self, state):\n",
" action = self.act_random() if np.random.rand() <= self.epsilon else self.best_act(state)\n",
" if self.epsilon > self.epsilon_min:\n",
" self.epsilon *= self.epsilon_decay\n",
" return action\n",
" \n",
" def discount_rewards(self, reward, next_state, done):\n",
" return reward if done else reward + self.gamma * np.amax(self.target_model.predict(next_state)[0])\n",
" \n",
" def sample(self, batch_size):\n",
" # If done do the other calculation\n",
" td_error = np.array([abs(self.discount_rewards(reward, next_state, done) - self.target_model.predict(state)[0][action]) for state, action, reward, next_state, done in self.memory.memory])\n",
" greedy_prob = td_error / np.sum(td_error)\n",
" uniform_prob = np.full(greedy_prob.size, 1 / greedy_prob.size)\n",
" prob = self.v * greedy_prob + (1 - self.v) * uniform_prob\n",
" numpy_memory = np.array(self.memory.memory)\n",
" return numpy_memory[np.random.choice(numpy_memory.shape[0], size = batch_size, replace = False, p = prob)].tolist()\n",
" \n",
"\n",
" def replay(self, batch_size):\n",
" minibatch = self.sample(batch_size)\n",
" for state, action, reward, next_state, done in minibatch:\n",
" target = reward\n",
" if not done:\n",
" target = reward + self.gamma * np.amax(self.target_model.predict(next_state)[0])\n",
" action_values = self.model.predict(state)\n",
" action_values[0][action] = target\n",
" self.model.fit(state, action_values, epochs=1, verbose=0)\n",
" \n",
" self.update_target_model()\n",
"\n",
" \n",
" \n",
" def load(self, name):\n",
" self.model.load_weights(name)\n",
" \n",
" def save(self, name):\n",
" self.model.save_weights(name)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "5aM7h-6fE1SV",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Training"
]
},
{
"metadata": {
"id": "KTQprVosBZ-q",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"We will now use our Deep Q-Learning Agent to train it in CartPole by simulating a lot of runs through the environment."
]
},
{
"metadata": {
"id": "IB10AdjtEPGe",
"colab_type": "code",
"outputId": "677969c4-836c-4e65-c748-c7f76701aed8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 799
}
},
"cell_type": "code",
"source": [
"%time\n",
"env = gym.make('CartPole-v1')\n",
"env.seed(SEED)\n",
"state_size = env.observation_space.shape[0]\n",
"action_size = env.action_space.n\n",
"\n",
"agent = DQNAgent(state_size, action_size)\n",
"done = False\n",
"batch_size = 32\n",
"EPISODES = 50\n",
"\n",
"for episode_num in range(EPISODES):\n",
" state = env.reset()\n",
" state = np.reshape(state, [1, state_size])\n",
" \n",
" time = 0\n",
" done = False\n",
" while not done:\n",
" action = agent.act(state)\n",
" next_state, reward, done, _ = env.step(action)\n",
" \n",
" next_state = np.reshape(next_state, [1, state_size])\n",
" \n",
" agent.remember(state, action, reward, next_state, done)\n",
" \n",
" state = next_state\n",
" \n",
" if done:\n",
" print(\"episode: {}/{}, score: {}, epsilon: {:.2}\"\n",
" .format(episode_num, EPISODES, time, agent.epsilon))\n",
" break # We finished this episode\n",
" \n",
" if len(agent.memory) > batch_size:\n",
" agent.replay(batch_size)\n",
" \n",
" time += 1\n",
" "
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"CPU times: user 3 µs, sys: 0 ns, total: 3 µs\n",
"Wall time: 8.11 µs\n",
"episode: 0/50, score: 15, epsilon: 0.99\n",
"episode: 1/50, score: 16, epsilon: 0.98\n"
],
"name": "stdout"
},
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gym/envs/registration.py:14: PkgResourcesDeprecationWarning: Parameters to load are deprecated. Call .resolve and .require separately.\n",
" result = entry_point.load(False)\n"
],
"name": "stderr"
},
{
"output_type": "stream",
"text": [
"episode: 2/50, score: 19, epsilon: 0.97\n",
"episode: 3/50, score: 28, epsilon: 0.96\n",
"episode: 4/50, score: 30, epsilon: 0.95\n",
"episode: 5/50, score: 23, epsilon: 0.93\n",
"episode: 6/50, score: 15, epsilon: 0.93\n",
"episode: 7/50, score: 12, epsilon: 0.92\n",
"episode: 8/50, score: 11, epsilon: 0.91\n",
"episode: 9/50, score: 65, epsilon: 0.89\n",
"episode: 10/50, score: 20, epsilon: 0.88\n",
"episode: 11/50, score: 24, epsilon: 0.86\n",
"episode: 12/50, score: 49, epsilon: 0.84\n",
"episode: 13/50, score: 20, epsilon: 0.83\n",
"episode: 14/50, score: 10, epsilon: 0.83\n",
"episode: 15/50, score: 15, epsilon: 0.82\n",
"episode: 16/50, score: 23, epsilon: 0.81\n",
"episode: 17/50, score: 16, epsilon: 0.81\n",
"episode: 18/50, score: 27, epsilon: 0.8\n",
"episode: 19/50, score: 61, epsilon: 0.77\n",
"episode: 20/50, score: 76, epsilon: 0.74\n",
"episode: 21/50, score: 12, epsilon: 0.74\n",
"episode: 22/50, score: 49, epsilon: 0.72\n",
"episode: 23/50, score: 17, epsilon: 0.71\n",
"episode: 24/50, score: 19, epsilon: 0.71\n",
"episode: 25/50, score: 34, epsilon: 0.69\n",
"episode: 26/50, score: 38, epsilon: 0.68\n",
"episode: 27/50, score: 50, epsilon: 0.66\n",
"episode: 28/50, score: 28, epsilon: 0.65\n",
"episode: 29/50, score: 21, epsilon: 0.65\n",
"episode: 30/50, score: 138, epsilon: 0.6\n",
"episode: 31/50, score: 21, epsilon: 0.6\n",
"episode: 32/50, score: 75, epsilon: 0.57\n",
"episode: 33/50, score: 147, epsilon: 0.53\n",
"episode: 34/50, score: 42, epsilon: 0.52\n",
"episode: 35/50, score: 20, epsilon: 0.52\n",
"episode: 36/50, score: 34, epsilon: 0.51\n",
"episode: 37/50, score: 62, epsilon: 0.49\n",
"episode: 38/50, score: 145, epsilon: 0.46\n",
"episode: 39/50, score: 94, epsilon: 0.44\n",
"episode: 40/50, score: 153, epsilon: 0.4\n",
"episode: 41/50, score: 123, epsilon: 0.38\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "DnU8WAwz49-o",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Evaluate Performance"
]
},
{
"metadata": {
"id": "3JHXjw1YBjLX",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Now to test how our agent performed, we will run through more scenarios except this time we don't allow the agent to choose randomly and have it rely on its previous experiences."
]
},
{
"metadata": {
"id": "7HemRoBkGTEJ",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"time_achieved = []\n",
"for episode_num in range(EPISODES):\n",
" state = env.reset()\n",
" state = np.reshape(state, [1, state_size])\n",
" \n",
" time = 0\n",
" done = False\n",
" while not done:\n",
" action = agent.best_act(state)\n",
" next_state, reward, done, _ = env.step(action)\n",
" \n",
" next_state = np.reshape(next_state, [1, state_size])\n",
" \n",
" \n",
" state = next_state\n",
" \n",
" if done:\n",
" print(\"episode: {}/{}, score: {}\"\n",
" .format(episode_num, EPISODES, time))\n",
" time_achieved.append(time)\n",
" break # We finished this episode\n",
"\n",
" time += 1\n",
" "
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "FvVXCyj3CC1D",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Analysis of Performance"
]
},
{
"metadata": {
"id": "Zu_tyQ5RCFCD",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Let us load the pandas library to have some visualizations\n"
]
},
{
"metadata": {
"id": "AK7iJ2Z9CeMe",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import pandas as pd"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "aJ8Q2b-7CJI5",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"time_achieved = pd.DataFrame(time_achieved, columns = ['time'])"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "kHB5gPMOCa00",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Now to display a histogram of seconds that we lasted."
]
},
{
"metadata": {
"id": "eQWVX1flCXeW",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"time_achieved.hist()"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "eL4hRpK5CfIx",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Summary Statistics for summarization"
]
},
{
"metadata": {
"id": "jyXcdHH7CYpM",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"print(\"Mean: {} seconds\\nStandard Deviation: {} seconds\".format(time_achieved.time.mean(), time_achieved.time.std()))"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "RTKu-aHUOUa3",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}