Reinforcement learning

Background

Reinforcement learning

Basics

  • course: https://github.com/huggingface/deep-rl-class

  • loop: state0, action0, reward1, state1, action1, reward2 … (a minimal interaction-loop sketch follows this list)

  • state: complete description of the world, no hidden information

  • observation: partial description of the state of the world

  • goal: maximize expected cumulative reward

  • policy: tells what action to take given a state

  • task:

    • episodic

    • continuing

  • method:

    • policy-based (learn which action to take given a state)

      • deterministic: given a state s, the policy always returns the same action a

      • stochastic: outputs a probability distribution over actions

    • value-based (maps a state to the expected value of being at that state)

      • In value-based RL, rewards are discounted over time with a discount factor γ < 1

      • The policy here is a simple, hand-defined rule, e.g. acting greedily with respect to the value function

  • value-based methods

    • Monte Carlo: update the value function from a complete episode, so we use the actual discounted return Gt obtained over that episode

    • Temporal Difference: update the value function after a single step, so we replace the return Gt (which we don’t have yet) with an estimated return called the TD target: the one-step reward plus the discounted estimate of the next state’s value, based on the Bellman equation (see the value-update sketch after this list)

  • Bellman equation: V(s) = E[Rt+1 + γ V(St+1) | St = s], i.e. the value of a state is the expected immediate reward plus the discounted value of the state that follows
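
A minimal sketch of the state/action/reward loop above, assuming the gymnasium package and its CartPole-v1 environment as stand-ins (the notes do not prescribe either), with a random policy in place of a learned one:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")      # illustrative environment choice
state, info = env.reset()          # state0 (strictly, an observation of it)
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()   # stand-in for a real policy
    # step returns reward_{t+1} and state_{t+1} for the chosen action
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated       # episodic task: stop at the terminal state

env.close()
print("cumulative reward of this episode:", total_reward)
```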
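
A rough sketch contrasting the two value-update rules above, assuming a tabular value function V; `alpha`, `gamma`, and the (state, reward) episode format are illustrative choices, not from the course:

```python
from collections import defaultdict

gamma = 0.99            # discount factor, γ < 1
alpha = 0.1             # learning rate
V = defaultdict(float)  # V(s), every state starts at 0

def monte_carlo_update(episode):
    """episode: list of (state_t, reward_t+1) pairs for one complete episode."""
    G = 0.0
    # Walk backwards so G accumulates the actual discounted return from each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])       # move V(s) toward the real return Gt

def td_update(state, reward, next_state):
    """TD(0): update after a single step, using the Bellman equation."""
    td_target = reward + gamma * V[next_state]   # estimated return (TD target)
    V[state] += alpha * (td_target - V[state])   # move V(s) toward the TD target
```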

Q-learning

  • Trains a Q-function (an action-value function), which internally is a Q-table containing the value of every state-action pair

  • The Q-table is usually initialized to all zeros (see the sketch below)
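
A small illustrative sketch of tabular Q-learning; the environment sizes, hyperparameter values, and function names below are assumptions, not from the notes:

```python
import numpy as np

n_states, n_actions = 16, 4              # e.g. a tiny grid world (illustrative)
Q = np.zeros((n_states, n_actions))      # Q-table initialized to 0, as noted above

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

def choose_action(state):
    """Epsilon-greedy policy derived from the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action
    return int(np.argmax(Q[state]))           # exploit: best known action

def q_learning_update(state, action, reward, next_state, done):
    """Bellman-style TD update for one (state, action, reward, next_state) transition."""
    best_next = 0.0 if done else np.max(Q[next_state])
    td_target = reward + gamma * best_next
    Q[state, action] += alpha * (td_target - Q[state, action])
```

Each update only needs a single transition rather than a full episode, which is what makes Q-learning a TD method.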

Written on October 9, 2025