Reinforcement Learning with Parameterized Models
Multiple Q-learning for Online Reinforcement Learning
Master's project within: Computational Intelligence and Self-organisation
Student: Ben Haanstra
Supervisor: Hado van Hasselt
Second supervisor: Guszti Eiben
Over the past few decades, artificial intelligence (AI) has grown immensely as a field, with numerous successes in building intelligent systems. Nowadays, many parts of our lives are automated and assisted by such systems: for example, recognizing handwritten addresses on mail, detecting fraud, filtering spam, routing networks, and Google search.
In the future, we expect, and may already witness, their increasing application in social, exploratory, domestic, commercial, and industrial domains. For example, self-driving cars have been developed in recent years, and robots are expected to take over various jobs from humans in the coming decades.
However, developing intelligent systems can be complicated, because one has to specify how the system should behave in every possible scenario that it may encounter. More often than not, it is difficult to foresee all scenarios, and our understanding may differ from reality. In some cases, we may not even know yet how the system should behave. For example, we do not yet know how to play games like Go and Chess perfectly, nor is it always trivial to program how humans do things such as telling dogs apart from cats. This may result in the system performing faultily or inefficiently, potentially with harmful and critical consequences. It is therefore essential that the system is capable of adapting to the task at hand by itself; that is, it has the capability of learning.
Machine learning is a branch of AI that concerns itself with developing computational systems that learn to perform tasks from data. In this thesis, we focus on reinforcement learning: an area of machine learning that is concerned with learning behaviour for a sequential decision problem with unknown dynamics through feedback signals.
Reinforcement learning (RL) concerns an agent, a decision-making entity, that interacts with an environment, a sequential decision problem, in which it endeavours to accomplish a task through a series of actions. Problems of this type are prevalent in both the physical and digital world. Figure 1.1 shows a few examples.
Whenever an agent interacts by executing an action, it observes a feedback signal and a new state of the environment. This feedback signal indicates how good the executed action was and is usually referred to as a reward in the literature. In RL, one is concerned with how an agent should choose actions in succession so as to maximise the expected return: the expected sum of (discounted) rewards. Alternatively, a feedback signal corresponds to the incurred cost of an action and one is concerned with minimizing the expected sum of costs.
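To make the notion of the expected return concrete, the following minimal sketch computes a discounted return for a given reward sequence; the reward values and discount factor here are hypothetical placeholders, not quantities from this thesis.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma^t for its time step t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three rewards of 1 with gamma = 0.9 yield 1 + 0.9 + 0.81
g = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```

With a discount factor below one, rewards received sooner contribute more, which is exactly the trade-off between immediate and delayed benefit discussed below.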
Commonly, the environment is modelled as a Markov decision process (MDP). Such modelling choices involve assumptions about the environment, made by the designer of the agent, that make learning and decision making manageable. The policy of the agent, which defines its decision making and thus its behaviour, is in general a function of the history of observations and actions. For any MDP, there always exists an optimal policy that maximises the expected return.
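For concreteness, a small MDP can be written down as transition probabilities and expected rewards per state-action pair; the two-state example below is purely illustrative and not taken from this thesis.

```python
# A hypothetical two-state MDP: states {0, 1}, actions {0, 1}.
# P[(s, a)] lists (probability, next_state) pairs; R[(s, a)] is the expected reward.
P = {
    (0, 0): [(1.0, 0)],            # action 0 in state 0 stays in state 0
    (0, 1): [(0.8, 1), (0.2, 0)],  # action 1 usually moves to state 1
    (1, 0): [(1.0, 0)],
    (1, 1): [(1.0, 1)],
}
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}
gamma = 0.95  # discount factor

# Sanity check: outgoing probabilities sum to one for every state-action pair.
assert all(abs(sum(p for p, _ in outs) - 1.0) < 1e-9 for outs in P.values())
```

In RL, the tables P and R are exactly what the agent does not know in advance; it only observes sampled rewards and next states.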
Naturally, one is interested in this optimal policy, which encompasses the optimal actions, and in using it to control the environment. However, this optimal policy is typically not known in advance, as the dynamics of the environment are usually not perfectly known. Namely, one is uncertain about the rewards and states that follow one's actions. In some cases, these consequences can even be stochastic, resulting in different rewards and states every time the same action is tried.
Furthermore, the agent should not only consider the immediate reward of an action, but also its consequences in the long term due to the environment's sequential nature. For example, one can choose to withdraw one's money from the bank and benefit from it immediately, or one can choose to wait and benefit from a higher interest over the initial deposit at a later time. This idea also applies to sequential decision problems.
Learning to Control
The goal is thus to find the optimal policy. Assuming that the unknown environment can be interacted with, we can proceed by having the agent try out actions, according to its behaviour policy, to gather samples about the actions' consequences. From these samples, we can learn about the environment and, more importantly, learn to control. However, unlike in supervised learning, a different area of machine learning, the feedback signals in RL do not explicitly inform the agent about the optimal actions. Instead, they merely indicate numerically how good the tried actions were, and the agent has to discover by itself which course of actions will yield the most reward.
At present, there exists a vast number of RL algorithms that can learn good and even optimal policies from interaction. These algorithms can be classified as follows. First, whether they learn a model of the environment (model-based) or not (model-free). Second, whether they memorize all samples to compute a solution with every interaction (batch), or update an existing solution and discard samples immediately afterwards (online). Third, whether they learn about the behaviour policy that the agent follows in the environment (on-policy) or about an estimation policy while following a different behaviour policy for interaction (off-policy). Finally, whether they explicitly learn a policy, a value function, or both.
A value function maps states, or actions in states, to values for a given policy. The value of a state, or the action-value of an action in a state, corresponds to the expected return that the agent will accrue when it is in that state, or takes that action in that state, and follows the policy afterwards. Many RL algorithms learn value functions to determine which actions and policies are better by comparing values. Ultimately, an optimal policy can easily be derived from the optimal action-values.
Temporal-difference (TD) learning is a family of model-free value-based RL algorithms and one of the most popular topics in the field of RL. TD methods estimate values by bootstrapping: they use value samples that are based on other value estimates. Although bootstrapping makes learning intrinsically biased and non-stationary in MDPs, TD methods tend to be computationally cheap and can conveniently learn optimal policies. Q-learning is arguably the most famous TD algorithm; it can learn an optimal policy regardless of the behaviour policy by directly estimating the optimal action-values in a model-free, online, off-policy manner.
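The standard tabular Q-learning update can be sketched as follows; the states, actions, and step-size values below are illustrative placeholders, and the environment itself is not shown.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(Q[(s_next, b)] for b in actions)  # bootstrapped target value
    td_error = r + gamma * best_next - Q[(s, a)]       # temporal-difference error
    Q[(s, a)] += alpha * td_error
    return Q

Q = defaultdict(float)  # unseen state-action pairs default to a value of 0
Q = q_learning_update(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```

Note how the target bootstraps on the maximum over the next state's own estimates, which is what makes the update off-policy: it learns about the greedy policy regardless of how the action was actually chosen.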
Focus of this Thesis: Learning to Behave
In this thesis we are interested in online reinforcement learning (ORL). Here, we consider the case that an agent is deployed in an unknown environment and has to maximise its online performance: the expected return of the agent's behaviour policy. Ergo, it matters how the agent behaves during its lifetime in the environment.
Given that the environment is initially unknown to the agent, it will unlikely know how to behave optimally from the start and therefore has to learn from interaction. However, just learning about good behaviour is not sufficient, as we want the agent to perform well while still learning.
Obviously, if we have the agent randomly roam through the environment, then it may gather samples that it can use to learn control, but it will also jeopardize its online performance. The goal of ORL is thus to develop an algorithm that a learning agent can use to choose actions that will consistently accrue a lot of reward during its lifetime.
The challenge that accompanies this is that the agent can only learn about actions by trying them out. In doing so, it risks squandering valuable interaction time when the chosen actions do not amount to good behaviour nor help it learn to improve its behaviour. Moreover, if the agent sticks to choosing the same actions, it will always experience the same outcomes and risks missing out on learning better behaviour. This is known as the exploration-vs-exploitation dilemma: the agent has to balance the exploration of learning about good actions with the exploitation of known good actions.
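A common baseline for this balance is epsilon-greedy action selection; the sketch below is a generic illustration of the dilemma, not a strategy proposed by this thesis, and the Q-values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon explore a random action; otherwise exploit the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore: uniform random
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit: highest estimate

action = epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1)
```

Setting epsilon to one recovers the purely random roaming described above, while epsilon of zero is pure exploitation; neither extreme resolves the dilemma by itself.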
In the literature, there are many approaches that aim to explore efficiently. Commonly, they employ reinforcement learning algorithms to learn from samples and guide exploration through heuristics or statistical methods. Contemporary state-of-the-art algorithms are mainly model-based, and tend to be very computationally demanding as they account for every foreseeable future of actions and consider many plausible environments that the agent may be in.
We believe that autonomous agents in the future will have to be completely self-reliant and will rarely have access to an abundance of computational resources. For example, robots that work remotely in the deep sea or in space cannot rely on a supercomputer doing all the heavy work. Therefore, we focus on online model-free value-based algorithms for ORL in MDPs that are computationally economical and estimate action-values.
Early online model-free attempts at exploring more efficiently than random have based action selection on combinations of action-value estimates and other statistics such as the number of times that an action was tried before.
More recent and sophisticated alternatives pursue efficient exploration by accounting for one's uncertainty over the optimal action-values in some way. For example, they use Bayesian statistics to quantify uncertainty explicitly, or use interval estimation or probably approximately correct (PAC) learning to estimate the optimal action-values optimistically, directing exploration towards poorly understood actions that are potentially optimal.
Unfortunately, these online model-free approaches involve assumptions that usually do not hold in MDPs, have parameters that are not easily tuned for online performance, or do not properly account for the biased and non-stationary nature of learning that bootstrapping incurs. As a result, these approaches may work well for certain MDPs, but are not suitable as an 'off-the-shelf' algorithm for ORL.
Our observation is that Q-learning, like other temporal-difference learning algorithms, does not involve such restrictive assumptions or critical parameter tuning. On the other hand, it only provides a single estimate, which does not easily lend itself to balancing exploration with exploitation.
We hypothesize that using multiple estimates of the optimal action-values can improve the agent's action selection, as the disagreement among the estimates can signify our (un)certainty over the optimal action-values, and hence which actions are potentially optimal. That is, when an optimal action-value is estimated at different values, we are uncertain, and this can serve as an incentive for exploration. We call our approach Multiple Q-learning (MQ), as the estimation is based on Q-learning.
However, if we update all the estimates identically, then action selection will be no different than using a single estimate. Thus, the addition of more estimates poses several challenges such as how the multiple estimates should be updated, how many to use, and how they are best used for ORL.
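As a rough sketch of the idea, the class below maintains several tabular Q-estimates; the particular choices made here, updating a single randomly selected estimate and acting greedily with respect to the maximum over estimates, are illustrative assumptions only, and are exactly the kind of design questions this thesis investigates.

```python
import random

class MultipleQ:
    """Maintain K independent tabular Q-estimates; their disagreement signals uncertainty."""
    def __init__(self, n_actions, k=5, alpha=0.1, gamma=0.95, seed=0):
        self.n_actions, self.k = n_actions, k
        self.alpha, self.gamma = alpha, gamma
        self.rng = random.Random(seed)
        self.Q = [dict() for _ in range(k)]  # one table per estimate

    def q(self, i, s, a):
        return self.Q[i].get((s, a), 0.0)

    def update(self, s, a, r, s_next):
        # Illustrative choice: update one randomly chosen estimate per sample,
        # so the estimates see different data and stay diverse rather than identical.
        i = self.rng.randrange(self.k)
        target = r + self.gamma * max(self.q(i, s_next, b) for b in range(self.n_actions))
        self.Q[i][(s, a)] = self.q(i, s, a) + self.alpha * (target - self.q(i, s, a))

    def act(self, s):
        # Illustrative optimistic choice: act greedily on the maximum over estimates,
        # so actions whose estimates disagree (high uncertainty) look attractive.
        values = [max(self.q(i, s, a) for i in range(self.k))
                  for a in range(self.n_actions)]
        return max(range(self.n_actions), key=values.__getitem__)
```

If all K estimates were instead updated identically on every sample, they would remain equal forever and the agent would behave exactly as with a single estimate, which is the degenerate case noted above.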
Our main research question and related questions can be formulated as follows:
1. "Is it beneficial to use multiple action-value estimates for model-free online reinforcement learning?"
(a) How should the multiple estimates be updated?
(b) How many estimates should one use?
(c) Which of the considered exploration strategies, which translate the multiple estimates into an actual decision, works best?