RL is best explained with a robot that is placed in an unfamiliar location and programmed to map the territory. The robot evaluates its current position and takes the step with the best evaluation. For example, suppose the robot can step left, right, forward, and backward, and the goal is to reach a charging station for an energy refill, whose GPS location has been made available to the robot. After each step, the robot compares its GPS location with that of the charging station to calculate the distance; the objective is to minimize this distance. There may be hurdles along the way that the robot is unaware of, so it may have to step back at times. With reinforcement learning, the robot is guaranteed to reach the charging station after some trial and error, provided a path without hurdles exists. The interesting thing is that, once it has attained the goal, the robot has learned a complete path from the starting point to the charging station; even the intermediate paths it tried and backtracked from contributed to the learning. In other words, the robot received negative reinforcement for some locations, resulting in backtracking, and positive reinforcement for others, leading to the realization of the goal. Do kids learn this way? According to researchers studying child psychology, if left to their own devices, this is as close as it gets.
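A minimal sketch of this scenario in Python may help make the state, action, and reward concrete. This is not the author's code; the grid, the hurdle positions, and names such as CHARGER and HURDLES are assumptions made purely for illustration.

```python
import random

CHARGER = (4, 4)                 # assumed location of the charging station
HURDLES = {(2, 2), (3, 1)}       # assumed obstacle cells the robot does not know about
ACTIONS = {"left": (-1, 0), "right": (1, 0), "forward": (0, 1), "backward": (0, -1)}

def reward(pos):
    # negative Manhattan distance to the charger: closer means higher reward
    return -(abs(pos[0] - CHARGER[0]) + abs(pos[1] - CHARGER[1]))

def step(pos, action):
    dx, dy = ACTIONS[action]
    nxt = (pos[0] + dx, pos[1] + dy)
    if nxt in HURDLES:           # blocked: stay put and take a small penalty
        return pos, reward(pos) - 1
    return nxt, reward(nxt)

pos, trace = (0, 0), []
for _ in range(1000):            # pure trial and error; the learning comes later
    action = random.choice(list(ACTIONS))
    pos, r = step(pos, action)
    trace.append((action, pos, r))
    if pos == CHARGER:
        break
```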
Biological Basis
While neural nets are based on the neuronal structure of the brain and genetic algorithms are inspired by evolution, RL is inspired by behavior, where behavior is viewed as the modification of a state based on the reward received from the environment. We know about the Pavlovian dog that responded to a stimulus. Behavioral modification in response to reward goes one step further. It is in human nature, especially among children, to modify behavior based on the reward or the lack of it.
Exploration vs. Exploitation
Exploration refers to the acquisition of knowledge, such as a robot trying to map a territory by wandering around and getting lost at times. Exploitation, on the other hand, is the application of previously acquired knowledge in a new situation, with some modification. Neither approach is optimal on its own, and we have to find a proper balance between them to get the maximum reward. This is the exploration vs. exploitation dilemma of reinforcement learning.
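One common way to strike that balance, shown here as a sketch rather than anything from the text above, is epsilon-greedy selection: explore with a small probability epsilon, otherwise exploit the best-known action.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> current value estimate."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try a random action
    return max(q_values, key=q_values.get)     # exploit: pick the best-known action

# e.g. the robot mostly moves right, but occasionally tries something else
print(epsilon_greedy({"left": 0.1, "right": 0.7, "forward": 0.4, "backward": 0.0}))
```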
The math behind reinforcement learning
Much of the math behind reinforcement learning is based on the Bellman equation:
$V^{*}(x_t) = r(x_t) + \gamma V^{*}(x_{t+1})$
where $V^*$ represents the ideal value function at state $x$, $t$ and $t+1$ are time steps, $\gamma$ is the discount factor in the range $[0,1]$ that causes immediate reinforcement to be weighted more heavily than future reinforcement, and $r(x_t)$ is the reinforcement from the environment. The difference between the actual value and the ideal value is represented as an error $e$ that can be shown to be
$e(x_t) = \gamma e(x_{t+1})$
Note that the error decays exponentially with time and is 0 at the goal state by definition, at which point learning is considered complete. This is much simpler than what one encounters with neural nets. But one can use a neural net to form an approximation $V(x_t, w_t)$ of $V^{*}(x_t)$ by updating the weights as
$\Delta w_t = -\alpha \left[ \max_{u}\left( r(x_t, u) + \gamma V(x_{t+1}, w_t) \right) - V(x_t, w_t) \right] \frac{\partial V(x_t, w_t)}{\partial w_t}$
where $\alpha$ is the learning rate.
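As a concrete illustration, here is a tabular TD-style sketch of the update these equations imply; the neural-net case multiplies the same bracketed error by the gradient $\partial V/\partial w$. The function name, the state tuples, and the default values of $\alpha$ and $\gamma$ below are assumptions for the example.

```python
def td_update(V, x_t, x_next, r, alpha=0.1, gamma=0.9):
    # bracketed term: the error e(x_t) = r(x_t) + gamma * V(x_{t+1}) - V(x_t)
    error = r + gamma * V.get(x_next, 0.0) - V.get(x_t, 0.0)
    V[x_t] = V.get(x_t, 0.0) + alpha * error   # move V(x_t) toward the Bellman target
    return error

V = {}
e = td_update(V, x_t=(0, 0), x_next=(1, 0), r=-7.0)
```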
Terminology in Reinforcement Learning
Policy - a mapping from states to actions. For the robot above, the actions are move left, right, forward, or backward.
Reinforcement - a scalar variable that communicates the change in the environment to the reinforcement learning system. For example, if an RL system is a controller for a missile, the reinforcement signal might be the distance between the missile and the target (In which case, the RL system would learn to minimize reinforcement).
Markov decision process - An MDP consists of a set of states X; a set of start states S that is a subset of X; a set of actions A; a reinforcement function R where R(x,a) is the expected immediate reinforcement for taking action a in state x; and an action model P where P(x'|x,a) gives the probability that executing action a in state x will lead to state x'. Note: It is a requirement that the choice of action be dependent solely on the current state observation x. If knowledge of prior actions or states affects the current choice of action then the decision process is not Markov.
Deterministic - In the context of states, there is a one-to-one mapping from state/action pairs to successor states. In other words, with probability one, the transition from state x after performing action a will always result in state x'.
Non-deterministic - In the context of states, there exists a probability distribution function P(x'|x,a) that gives the probability that executing action a in state x will lead to state x'.
State - The condition of a physical system as specified by a set of appropriate variables.
Unbiased estimate - The expected (mean) error in the estimate is zero.
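A minimal sketch of these MDP components in plain Python containers follows; the state and action names are purely illustrative, not from any particular library.

```python
states  = ["x0", "x1", "goal"]                 # X
starts  = ["x0"]                               # S, a subset of X
actions = ["left", "right"]                    # A

# R(x, a): expected immediate reinforcement for taking action a in state x
R = {("x0", "left"): -2.0, ("x0", "right"): -1.0,
     ("x1", "left"): -1.0, ("x1", "right"): 10.0}

# P(x' | x, a): probability that action a in state x leads to state x'
P = {
    ("x0", "left"):  {"x0": 1.0},              # deterministic transition
    ("x0", "right"): {"x1": 0.8, "x0": 0.2},   # non-deterministic transition
    ("x1", "left"):  {"x0": 1.0},
    ("x1", "right"): {"goal": 1.0},
}
```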
Applications
RL has been used for packet routing in dynamic networks, improving elevator performance, channel allocation in wireless networks, job-shop scheduling, missile guidance, etc. It is not clear whether RL is applicable to higher cognitive functions like planning and diagnosis, where a reward signal and feedback from the environment are not readily available.
Conclusion
- let us say our robot simulates the path to the charging station even before taking a step; in this case the robot needs to evaluate each and every state and compute the value function, which is so memory-intensive and time-consuming that the robot may never make a move.
- assuming the robot successfully met the goal, how do we determine which move made it possible? To illustrate, let us say the path involves forks where the robot has to choose between right and left; how do we determine which of these choices was crucial? This is the credit assignment problem.
- a number of alternatives to exhaustive search have been proposed, such as Q-learning and Advantage learning, which approximate the value function under simplifying assumptions; a minimal Q-learning update is sketched after this list.
- it seems that reinforcement learning with policy gradients (i.e., using a neural net to map states to action probabilities) has been successfully applied to ATARI games (https://en.wikipedia.org/wiki/Atari_Games), where the value of a state can be evaluated with respect to a win or loss; the rollouts leading up to a win or loss can then be used in the gradient computation for the given set of inputs, much as in supervised learning.
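As promised above, here is a minimal tabular Q-learning sketch; the exploration strategy is omitted for brevity, and all names, states, and rewards are illustrative assumptions rather than code from any of the referenced sources.

```python
def q_update(Q, x, a, r, x_next, actions, alpha=0.1, gamma=0.9):
    # move Q(x, a) toward the target r + gamma * max_b Q(x', b)
    best_next = max(Q.get((x_next, b), 0.0) for b in actions)
    Q[(x, a)] = Q.get((x, a), 0.0) + alpha * (r + gamma * best_next - Q.get((x, a), 0.0))

Q = {}
q_update(Q, x=(0, 0), a="right", r=-7.0, x_next=(1, 0),
         actions=["left", "right", "forward", "backward"])
```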
One can sing the following ditty:
Hinton set the baseline for neural networks
LeCun set the baseline for convolutional neural networks
ImageNet set the baseline for visual recognition competitions
DeepMind set the baseline for deep RL algorithms
OpenAI set the baseline for deep RL environments
Online References
http://old.nbu.bg/cogs/events/2000/Readings/Petrov/rltutorial.pdf for a basic RL tutorial
http://karpathy.github.io/2016/05/31/rl/ for RL applied to ATARI games