Author: Jiayi Chen
Time: Oct 2020
- Bandit:
  - Multi-armed Bandit (see the UCB1 sketch below):
    - epsilon-greedy
    - upper confidence bound (UCB)
    - Thompson Sampling (TS)
    - Perturbed-history Exploration (PHE)
  - Contextual Linear Bandit (see the LinUCB sketch below):
    - LinUCB
    - LinTS
    - LinPHE
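All of the multi-armed bandit algorithms above trade off exploration against exploitation when pulling one of K arms with unknown reward distributions. As one example, here is a minimal UCB1 sketch; the class name and interface are illustrative, not the actual code in "/bandit/lib":

```python
import numpy as np

class UCB1:
    """Minimal UCB1 sketch: pull the arm with the highest upper confidence bound."""
    def __init__(self, n_arms, alpha=2.0):
        self.counts = np.zeros(n_arms)  # number of pulls per arm
        self.means = np.zeros(n_arms)   # empirical mean reward per arm
        self.t = 0                      # total number of pulls so far
        self.alpha = alpha              # width of the confidence bonus

    def select_arm(self):
        self.t += 1
        if np.any(self.counts == 0):            # pull each arm once first
            return int(np.argmin(self.counts))
        bonus = np.sqrt(self.alpha * np.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

epsilon-greedy, TS, and PHE differ mainly in how `select_arm` explores: random exploration with probability epsilon, sampling from a posterior over arm means, or acting greedily on a perturbed reward history, respectively.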
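The contextual linear algorithms additionally assume the expected reward is linear in a feature vector of the (user, article) pair. A minimal LinUCB sketch with a single shared parameter vector follows; the names are hypothetical and the repo's interface may differ:

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: online ridge regression plus a UCB exploration bonus."""
    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.A = lam * np.eye(dim)  # regularized design matrix: sum of x x^T + lam*I
        self.b = np.zeros(dim)      # sum of reward-weighted contexts
        self.alpha = alpha          # exploration weight

    def select_arm(self, contexts):
        # contexts: (n_arms, dim) feature matrix for the current user
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b      # ridge-regression estimate of the parameter
        # score = predicted reward + alpha * sqrt(x^T A^{-1} x) per arm
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * bonus))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

LinTS replaces the bonus with a sample from the Gaussian posterior over theta, and LinPHE perturbs the regression targets instead of adding an explicit bonus.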
- Reinforcement Learning:
  - Dynamic programming solutions for a Markov Decision Process with a known environment (see the value iteration sketch below):
    - value iteration
    - policy iteration
  - Model-free control (see the Q-learning sketch below):
    - off-policy Monte Carlo (MC) control
    - off-policy Temporal Difference (TD) control (i.e., Q-learning)
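When the transition model is known, value iteration repeatedly applies the Bellman optimality backup until the values converge. A minimal sketch over a generic finite MDP; the `P[s][a]` transition format is an assumption for illustration, not necessarily what the /rl code uses:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """P[s][a] is a list of (prob, next_state, reward) triples for a finite MDP."""
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: max over actions of expected one-step return
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)])
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    # extract the greedy policy with respect to the converged values
    policy = [max(range(n_actions),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy
```

Policy iteration instead alternates full policy evaluation with greedy improvement until the policy stops changing.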
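Q-learning, the off-policy TD control method above, follows an epsilon-greedy behavior policy but bootstraps from the greedy (target-policy) value. A minimal tabular sketch; the `env.reset()`/`env.step()` interface is assumed here for illustration:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()                # assumed to return the initial state
        done = False
        while not done:
            # epsilon-greedy action selection (behavior policy)
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # assumed (state, reward, done) return
            # off-policy TD target uses the greedy (target-policy) value
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Off-policy MC control instead updates from full-episode returns, reweighted by importance sampling to correct for the behavior policy.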
- Python 3
- Actions: articles (each article is an arm)
- Users: the users receiving recommendations
- In each time step, we iterate over the users, recommend an article to each, and receive a reward for the recommended article, as outlined in the sketch below.
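In outline, the simulation loop looks like the following; all names here are hypothetical stand-ins, and the actual driver is "/bandit/SimulationComparison.py":

```python
import numpy as np

# Hypothetical sketch of the simulation loop; the real driver and its
# environment live in /bandit/SimulationComparison.py and may differ.
rng = np.random.default_rng(0)
n_users, n_articles, horizon = 10, 5, 1000
click_prob = rng.uniform(size=(n_users, n_articles))  # stand-in reward model
total_reward = 0.0

for t in range(horizon):
    for user in range(n_users):
        # a real run would ask the bandit algorithm for an arm here,
        # e.g. article = algorithm.select_arm(user)
        article = int(rng.integers(n_articles))  # placeholder: random arm
        reward = float(rng.random() < click_prob[user, article])  # Bernoulli click
        total_reward += reward
        # ...and feed the observation back: algorithm.update(user, article, reward)
```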
run "/bandit/SimulationComparison.py"
See "/bandit/lib/$ALGORITHM_NAME$.py" for each algorithm's implementation.
4-by-4 grid world. The agent's goal is to reach the goal cell (grid[3][3]) as quickly as possible while avoiding the pits (grid[1][1] and grid[2][1]).
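A minimal encoding of that grid world is sketched below; the reward values are assumptions for illustration, and the actual numbers live in the /rl code:

```python
# 4x4 grid world sketch: goal at grid[3][3], pits at grid[1][1] and grid[2][1].
# Reward values here are assumptions, not the repo's actual settings.
GOAL, PITS = (3, 3), {(1, 1), (2, 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic transition with walls; the episode ends at the goal or a pit."""
    r, c = state
    nr = min(max(r + action[0], 0), 3)  # clamp to the 4x4 grid
    nc = min(max(c + action[1], 0), 3)
    next_state = (nr, nc)
    if next_state == GOAL:
        return next_state, 0.0, True    # reaching the goal ends the episode
    if next_state in PITS:
        return next_state, -10.0, True  # falling into a pit is heavily penalized
    return next_state, -1.0, False      # per-step cost encourages short paths
```

The per-step cost of -1 is what makes "as quickly as possible" optimal under both the DP and model-free solvers.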
run "/rl/runDP.py"
run "/rl/runRL.py"