mirror of
https://github.com/Brandon-Rozek/website.git
synced 2024-12-24 01:11:42 +00:00
# Reinforcement Learning

The goal of this independent study is to gain an introduction to the topic of Reinforcement Learning.
As such, the majority of the semester will be spent following the textbook to build that introduction, with the last part of the semester devoted to applying what we learn to a few problems.
## Textbook

The majority of the content for this independent study will come from the textbook. This is meant to lessen the burden on both of us, as I have already experimented with curating my own content.
The textbook also includes examples throughout the text so we can immediately apply what we learn.

Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction." http://incompleteideas.net/book/bookdraft2017nov5.pdf
## Discussions and Notes

Discussions and notes will be kept track of and published on my tilde space as time and energy permit. This makes for easy reference, and it's nice to write down what you learn.
## Topics to be Discussed
### The Reinforcement Learning Problem (3 Sessions)

In this section we will familiarize ourselves with the topics that are commonly discussed in reinforcement learning problems.

We will learn vocabulary terms such as:
- Evaluative Feedback
- Non-Associative Learning
- Rewards/Returns
- Value Functions
- Optimality
- Exploration/Exploitation
- Model
- Policy
- Value Function
- Multi-armed Bandit Problem
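
To get an early feel for evaluative feedback and the exploration/exploitation trade-off, the multi-armed bandit problem can be sketched in a few lines. This is a minimal illustration, not code from the textbook; the arm means, step count, and epsilon value are arbitrary choices for demonstration.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Gaussian multi-armed bandit.

    Action values are estimated incrementally:
    Q(a) <- Q(a) + (reward - Q(a)) / N(a).
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # estimated value of each arm
    n = [0] * k    # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore: random arm
        else:
            a = max(range(k), key=lambda i: q[i])  # exploit: best estimate
        reward = rng.gauss(true_means[a], 1.0)     # evaluative feedback
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]             # incremental sample mean
    return q, n

q, n = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

After enough steps the agent pulls the best arm (the last one here) far more often than the others, even though it only ever sees noisy rewards, never the true means.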
### Markov Decision Processes (4 Sessions)

This is a type of reinforcement learning problem that is commonly studied and well documented. It provides a formal environment within which the agent can operate. Possible subtopics include:
- Finite Markov Decision Processes
- Goals and Rewards
- Returns and Episodes
- Optimality and Approximation
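
The notion of a return can be made concrete with a short sketch. Assuming the discounted return from the textbook, G = r_1 + gamma\*r_2 + gamma^2\*r_3 + ..., a hypothetical helper over one episode's reward sequence might look like:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    for one episode's list of rewards."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = r + gamma * G_{t+1}
        g = r + gamma * g
    return g

discounted_return([1, 1, 1], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

Working backwards through the episode exploits the recursive form G_t = r_{t+1} + gamma\*G_{t+1}, the same recursion the value-function definitions build on.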
### Dynamic Programming (3 Sessions)

Dynamic Programming refers to a collection of algorithms that can be used to compute optimal policies given a complete model of the environment. Subtopics we will go over include:
- Policy Evaluation
- Policy Improvement
- Policy Iteration
- Value Iteration
- Asynchronous DP
- Generalized Policy Iteration
- Bellman Expectation Equations
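
As a sketch of how these pieces fit together, here is a minimal value-iteration loop run on a hypothetical two-state MDP; the states, transitions, and rewards are made up for illustration and are not from the textbook.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until convergence:
    V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a) + gamma * V(s')).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Made-up two-state MDP: staying in A pays reward 1; B is an absorbing
# zero-reward state, so the optimal policy is to stay in A forever.
states = ["A", "B"]
actions = ["stay", "go"]
P = {"A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "go": {"B": 1.0}}}
R = {"A": {"stay": 1.0, "go": 0.0},
     "B": {"stay": 0.0, "go": 0.0}}
V = value_iteration(states, actions, P, R)  # V["A"] -> 1/(1 - 0.9) = 10
```

Sweeping states in place like this (rather than from a frozen copy of V) is the Gauss-Seidel-style variant; both converge, and updating states in any order at all is the idea behind asynchronous DP.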
### Monte Carlo Methods (3 Sessions)

Now we move on to problems where we do not have complete knowledge of the environment. This section will cover estimating value functions and discovering optimal policies from sampled experience. Possible subtopics include:
- Monte Carlo Prediction
- Monte Carlo Control
- Importance Sampling
- Incremental Implementation
- Off-Policy Monte Carlo Control
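
A rough sketch of first-visit Monte Carlo prediction, assuming episodes arrive as lists of (state, reward) pairs where the reward follows the state; this simplified representation is my own, not the textbook's notation.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction.

    episodes: list of episodes, each a list of (state, reward) pairs.
    Returns V(s) = average return observed after the first visit to s.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_g = {}
        # Compute returns backwards; overwriting means the earliest
        # (first) visit's return is what survives for each state.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_g[state] = g
        for state, g0 in first_visit_g.items():
            returns[state].append(g0)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

V = first_visit_mc([[("s1", 1), ("s2", 1)],
                    [("s1", 0), ("s2", 2)]])
```

No transition model appears anywhere: value estimates come purely from averaging sampled returns, which is what separates this section from Dynamic Programming.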
### Temporal-Difference Learning (4-5 Sessions)

Temporal-Difference learning combines ideas from Monte Carlo methods and Dynamic Programming. It allows methods to learn directly from raw experience without a model of the environment. Subtopics will include:
- TD Prediction
- Sarsa: On-Policy TD Control
- Q-Learning: Off-Policy TD Control
- Function Approximation
- Eligibility Traces
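
As one concrete example, the Q-learning update can be sketched on a made-up chain environment; the environment, reward scheme, and hyperparameters below are illustrative assumptions, not anything prescribed by the textbook.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a small deterministic chain.

    States 0..n_states-1; actions: 0 = left, 1 = right. Reaching the
    last state ends the episode with reward 1; all other steps pay 0.
    Update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    """
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy behavior policy (ties broken toward "right")
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = int(Q[s][1] >= Q[s][0])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Bootstrap from max over next actions (terminal state: 0)
            target = r if s2 == n_states - 1 else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning_chain()
```

Bootstrapping from `max(Q[s2])` regardless of the action actually taken next is what makes Q-learning off-policy: it learns the greedy policy's values while behaving epsilon-greedily. After training, the greedy policy moves right in every non-terminal state.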