Reinforcement learning

Background

course: deep-rl-class

Basics

  • Loop: state0, action0, reward1, state1, action1, reward2 …
  • State: complete description of the world, no hidden information
  • Observation: partial description of the state of the world
  • Goal: maximize expected cumulative reward
  • Policy: tells what action to take given a state
  • Task:

    • episodic
    • continuing
  • Method:

    • policy-based (learn which action to take given a state)

      • deterministic: given a state s, it always selects the same action a: $a=\pi(s)$
      • stochastic: output a probability distribution over actions
    • value-based (maps a state to the expected value of being at that state)

      • rewards are discounted over time with a factor $\gamma < 1$
      • here the policy is not learned; it is a simple hand-defined rule (e.g. act greedily with respect to the learned values)
  • value-based methods

    • Monte Carlo: update the value function from a complete episode, so we use the actual discounted return of that episode as the target (see the sketch at the end of this list)

      • \[V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]\]
    • Temporal Difference: update the value function after a single step, so we replace the $G_t$ we don't yet have with an estimated return called the TD target (one step's reward plus the discounted estimate of the next state's value, based on the Bellman equation)

      • \[V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1})- V(S_t)]\]
  • we have 2 types of value-based functions:

    • state-value function $V_{\pi}(s)=E_{\pi}[G_t \mid S_t = s]$ : value of a state
    • action-value function $Q_{\pi}(s, a)=E_{\pi}[G_t \mid S_t = s, A_t = a]$ : value of state-action pair
  • Bellman equation:

    • \[V_{\pi}(s) = E_{\pi}[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s]\]
    • value of state s = expected immediate reward + discounted value of the next state
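
A minimal sketch of the two tabular value-update rules above, assuming a small discrete state space and a fixed policy; `ALPHA`, `GAMMA`, and the episode representation are illustrative choices, not from the course.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99          # learning rate and discount factor (illustrative)
V = defaultdict(float)            # V(s), initialized to 0 for unseen states

def td0_update(state, reward, next_state):
    """Temporal Difference: move V(S_t) toward the TD target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + GAMMA * V[next_state]
    V[state] += ALPHA * (td_target - V[state])

def monte_carlo_update(episode):
    """Monte Carlo: after a full episode of (state, reward) pairs, move each V(S_t)
    toward the actual discounted return G_t observed from that state onward."""
    g = 0.0
    for state, reward in reversed(episode):   # accumulate G_t backwards through the episode
        g = reward + GAMMA * g
        V[state] += ALPHA * (g - V[state])
```

The only difference between the two is the target: the observed return $G_t$ for Monte Carlo versus the bootstrapped estimate $R_{t+1} + \gamma V(S_{t+1})$ for TD.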

Q-learning

  • an off-policy, value-based method that uses a TD approach to train its action-value function
  • Trains a Q-function (an action-value function), which internally is a Q-table containing the value of every state-action pair

    • the "Q" stands for the Quality of the action at that state
  • Algorithm pseudocode: initialize the Q-table, then repeat: choose an action from the current state with an $\epsilon$-greedy policy, take the action, observe the reward and the next state, and update the Q-table entry
  • Update rule: note that the best state-action pair value at $S_{t+1}$ is taken as a greedy max (the updating policy), not via $\epsilon$-greedy (the acting policy), which is why Q-learning is off-policy (see the sketch below)

    • \[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)]\]
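
A minimal tabular Q-learning sketch of the loop and update rule above, assuming a Gymnasium-style discrete environment; `FrozenLake-v1` and the hyperparameters are placeholders, not the course's exact setup.

```python
import random
from collections import defaultdict

import gymnasium as gym

env = gym.make("FrozenLake-v1")                      # placeholder environment
Q = defaultdict(lambda: [0.0] * env.action_space.n)  # Q-table: state -> list of action values
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1               # illustrative hyperparameters

def epsilon_greedy(state):
    """Acting policy: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        return env.action_space.sample()
    return max(range(env.action_space.n), key=lambda a: Q[state][a])

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(state)               # epsilon-greedy acting policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Greedy (max) updating policy over the next state's actions -> off-policy
        td_target = reward + GAMMA * max(Q[next_state]) * (not terminated)
        Q[state][action] += ALPHA * (td_target - Q[state][action])
        state = next_state
```

The agent acts with `epsilon_greedy`, but the update always bootstraps from the greedy `max(Q[next_state])`, which is exactly the acting-policy versus updating-policy distinction noted above.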

Deep Q-learning

  • uses a loss function that compares the predicted Q-value with the Q-target, and updates the DQN's weights via gradient descent

  • Deep Q-learning has 2 phases:

    • Sampling: we perform actions and store the observed experience tuples in a replay memory.
    • Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step.
  • 3 solutions to stabilize training:

    • Experience Replay to make more efficient use of experiences: store them in a replay buffer and repeatedly sample batches from it, so each experience can be learned from several times
    • Fixed Q-Target to stabilize the training: because the TD target uses the estimated Q-value of next_state, both the prediction and the target keep shifting during training
      • split the model into an online network and a target network, and copy the online network's weights into the target network every n steps
    • Double Deep Q-Learning to handle the overestimation of Q-values: always taking the max over the next state's actions causes systematic overestimation, so the online network selects the action and the target network computes the target Q-value of next_state taking that action; since the two nets are unlikely to overestimate the same actions at the same time, the overestimation bias is reduced (see the sketch after this list)
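
A compressed sketch of one training step combining the three ideas above (experience replay, fixed Q-target, Double DQN), written against PyTorch; the network sizes, hyperparameters, and buffer layout are illustrative assumptions, not the course's implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA, BATCH_SIZE, SYNC_EVERY = 0.99, 64, 1000        # illustrative hyperparameters

def make_net(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

online_net = make_net(obs_dim=4, n_actions=2)          # placeholder dimensions
target_net = make_net(obs_dim=4, n_actions=2)
target_net.load_state_dict(online_net.state_dict())    # fixed Q-target starts as a copy
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)

replay_buffer = deque(maxlen=100_000)                   # experience replay: (s, a, r, s', done)

def train_step(step):
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)    # random batch breaks correlation
    s, a, r, s_next, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
    a = a.long()

    # Q-value prediction: Q_online(s, a)
    q_pred = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN target: online net picks the action, target net evaluates it
        best_next_a = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_next_a).squeeze(1)
        q_target = r + GAMMA * q_next * (1.0 - done)

    loss = nn.functional.mse_loss(q_pred, q_target)     # compare prediction vs Q-target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % SYNC_EVERY == 0:                          # fixed Q-target: periodic weight copy
        target_net.load_state_dict(online_net.state_dict())
```

Replacing the `argmax` from the online network with `target_net(s_next).max(dim=1).values` would turn this back into vanilla (non-double) DQN.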

Policy gradient

Written on October 9, 2025