
1. Basic Concepts

This chapter covers some basic background knowledge for reinforcement learning; we keep the discussion brief.

Reward: \(r(s, a)\) is a function of the state \(s\) and the action \(a\).

Return: the sum of rewards along a trajectory.

Discounted return: for a trajectory, \(G = r_{1} + \gamma r_{2} + \gamma^{2} r_{3} + \cdots\), where \(\gamma \in [0, 1)\) is the discount factor.
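The discounted return above can be computed with a backward pass over the reward sequence, using the recursion \(G_t = r_t + \gamma G_{t+1}\). A minimal sketch (the function name and example rewards are illustrative, not from the original):

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... for one trajectory."""
    g = 0.0
    # Walk backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.9))  # 1 + 0.9 + 0.81 = 2.71
```

The backward recursion avoids recomputing powers of \(\gamma\), so it runs in one pass over the trajectory.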

MDP: Markov decision process

M: Markov property (memoryless property): the next state and reward depend only on the current state and action.

\[
p(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\ldots,s_0,a_0)=p(s_{t+1}|s_t,a_t),
\]
\[
p(r_{t+1}|s_t,a_t,s_{t-1},a_{t-1},\ldots,s_0,a_0)=p(r_{t+1}|s_t,a_t).
\]

D: Decision, made by a stochastic policy \(\pi(a|s)\) that gives the probability of choosing action \(a\) in state \(s\).
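A stochastic policy over a discrete action set can be sampled with inverse-CDF sampling over the action probabilities. A minimal sketch, assuming the policy for the current state is given as a list of probabilities (the function name and example distribution are hypothetical):

```python
import random

def sample_action(policy_probs):
    """Sample an action index from a stochastic policy pi(.|s),
    given as a list of action probabilities for the current state."""
    u = random.random()
    cumulative = 0.0
    # Walk the CDF until it exceeds the uniform draw u.
    for action, p in enumerate(policy_probs):
        cumulative += p
        if u < cumulative:
            return action
    return len(policy_probs) - 1  # guard against floating-point rounding

pi = [0.2, 0.5, 0.3]  # hypothetical pi(.|s) over 3 actions
print(sample_action(pi))
```

A deterministic policy is the special case where one action has probability 1.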