Beyond Reward

In a previous post I mentioned an interesting paper claiming that much of human intelligence can be viewed through the lens of reward maximization; you can see that blog post here. This point of view may not be the most common among either psychologists or computer scientists, but it would be great news for reinforcement learning researchers interested in building very smart systems by training them to maximize reward. That earlier post focused on the training side of reward maximization and what I believe it can achieve. In this post, instead of discussing how we train artificial intelligence, and reinforcement learning agents specifically, I want to talk about how we assess RL agents.

The most common way to assess the performance of RL agents is their episodic or long-term reward, usually displayed as a learning curve over the course of training and compared against alternative RL models in a given environment. This differs from other ML techniques like classification and regression, two of the main applications of ML, which are typically quantified in terms of error: the difference between what the model predicts and the correct answer. This is in part due to the unique position of RL among ML and AI techniques, learning as it does from scalar reward feedback rather than labelled examples. We could in principle plot the training loss of the Q-network or policy network that instantiates our RL agent, but that is often less informative than the reward itself. For many applications reward maximization is a good metric for performance, such as playing chess or Go, or controlling the electrical output of a factory.
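To make the standard assessment concrete, here is a minimal sketch of how an episodic-reward learning curve is typically produced. The `env` and `agent` objects are hypothetical stand-ins (any Gym-style environment and any agent exposing `act` and `update` methods would do), and the classic four-tuple `step` return is an assumption about the environment API.

```python
import numpy as np
import matplotlib.pyplot as plt

def run_training(env, agent, n_episodes=5000):
    # Collect the total reward earned in each episode of training.
    episodic_rewards = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.update((obs, action, reward, next_obs, done))
            obs, total = next_obs, total + reward
        episodic_rewards.append(total)
    return np.array(episodic_rewards)

def plot_learning_curve(rewards, window=100):
    # Smooth the noisy per-episode rewards with a moving average,
    # which is how RL learning curves are usually reported.
    smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
    plt.plot(smoothed)
    plt.xlabel("Episode")
    plt.ylabel(f"Mean episodic reward (window={window})")
    plt.show()
```

The point of the plot is comparative: two models run in the same environment are judged by which curve rises faster or plateaus higher, not by any notion of prediction error.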

However, if we are interested in human-like reinforcement learning, as I am, then reward maximization may not be the best or even a particularly useful metric for judging how well we are achieving our goals. To use a specific example, I will talk about the multi-agent particle environment suite, which was used in one of my previous papers that tried to apply human-like learning to RL agents. Those environments are shown in Figure 1, and citations to both my paper and the original paper that introduced them are listed in the reference section. These are very interesting environments from the perspective of designing human-like RL, since they require multi-agent interaction with a mix of communication in cooperative, competitive, and mixed settings. These communicative tasks are simple while still revealing many important and interesting aspects of human behaviour.

Figure 1 from my paper [1], describing the Multi-Agent Particle Environments [2]. Adversary: 2 good agents and 1 adversary are rewarded by closeness to a target; the good agents must not reveal which object is the target, spreading to cover both the target and the distractor. Crypto: 1 good agent communicates the target landmark to another good agent over a public communication channel, while 1 adversary attempts to decode the communicated target. Push: 1 good agent moves towards the target landmark while avoiding 1 adversary. Reference: 2 mobile good agents communicate to determine which landmark is the target. Speaker: 1 static good agent communicates to 1 mobile good agent which landmark is the target. Spread: 3 good agents spread to cover all landmarks. Tag: 1 good agent moves to distance itself from 3 adversaries, using obstacles to slow their approach. World: 2 good agents move to gather food, hide within trees, and avoid 4 adversaries; 1 adversary leader can observe good agents hiding in trees and communicate their location.

In this paper, we measured performance the same way as many papers that use this suite of environments: by taking two models and comparing their rewards against each other. In cooperative games all agents share the same model, and in competitive games one team of agents is controlled by the proposed model and the other by a baseline. This direct reward comparison showed modest improvements, especially in the competitive tasks, which amplify any difference in ability between the two sides. However, reviewers often regard improvements of this size as within the range one might expect from hyper-parameter tuning alone, and thus not enough to claim strongly that one model is superior to another. While that is true, I think these assessments partially miss the point of human-like RL, which seeks to make the behaviour of RL agents more human-like where possible. Still, it is understandable that one would come to this conclusion given how we compared performance in the paper.
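For readers unfamiliar with this evaluation style, here is a rough sketch of the head-to-head comparison described above. The environment API (reset and step returning dictionaries keyed by agent name) is an assumption loosely modelled on multi-agent suites like MPE, and `act` is a hypothetical policy method; it is not the exact code from the paper.

```python
import numpy as np

def evaluate_head_to_head(env, good_policies, adversary_policies, n_episodes=1000):
    """Average per-team episodic reward when one team is controlled by the
    proposed model and the other by a baseline (assumed dict-based env API)."""
    team_rewards = {"good": [], "adversary": []}
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        totals = {"good": 0.0, "adversary": 0.0}
        while not done:
            actions = {}
            for name, ob in obs.items():
                policy = good_policies[name] if name in good_policies else adversary_policies[name]
                actions[name] = policy.act(ob)
            obs, rewards, dones, _ = env.step(actions)
            for name, r in rewards.items():
                team = "good" if name in good_policies else "adversary"
                totals[team] += r
            done = all(dones.values())
        for team, total in totals.items():
            team_rewards[team].append(total)
    return {team: float(np.mean(vals)) for team, vals in team_rewards.items()}
```

Because both teams' rewards are averaged over many randomized episodes, any behavioural differences that only appear in specific situations get washed out, which is exactly the limitation discussed below.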

Reward results in the MPE suite comparing our information-constrained model with a baseline over the final 1K episodes of training.

Instead of continuing to rely on rewards averaged over thousands of randomized trials, I believe hand-constructed evaluation scenarios may be more useful, and perhaps necessary, for evaluating human-like behaviour in environments that require complex social interaction and communication. Take for example the Adversary environment, which requires the two good agents to split up between the target landmark and the distractor so that the adversary cannot determine which is the target. If we imagine a constructed starting state with deterministic good agents that always move towards the target for a few steps before splitting up, it should be easy for the adversary to determine which landmark is the one of interest. However, the model I proposed in the previous paper would have no way to infer this directly from the previously revealed goals and behaviour of these good agents. This type of constructed example could therefore be ideal for assessing human-like psychologizing in our agents.
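As a sketch of what such a constructed probe might look like, the code below scripts the two good agents to head toward the true target for a few steps before one peels off to the decoy, then measures whether a learned adversary exploits that early cue. Everything here is an illustrative assumption: `reset_to` is a hypothetical hook for starting from a hand-built state, and treating the first two observation entries as an agent's position is likewise assumed.

```python
import numpy as np

def scripted_good_action(agent_pos, target_pos, decoy_pos, step,
                         split_step=10, is_decoy_agent=False):
    """Scripted good-agent behaviour: both agents head toward the true target
    for the first few steps, then one peels off toward the decoy."""
    goal = decoy_pos if (step >= split_step and is_decoy_agent) else target_pos
    direction = goal - agent_pos
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 1e-8 else np.zeros_like(direction)

def run_probe_episode(env, adversary_policy, target_pos, decoy_pos, horizon=50):
    # env.reset_to(...) is a hypothetical wrapper for starting from a
    # hand-constructed state; most suites would need a small shim for this.
    obs = env.reset_to(target=target_pos, decoy=decoy_pos)
    distances_to_target = []
    for step in range(horizon):
        actions = {
            "good_0": scripted_good_action(obs["good_0"][:2], target_pos, decoy_pos, step),
            "good_1": scripted_good_action(obs["good_1"][:2], target_pos, decoy_pos, step,
                                           is_decoy_agent=True),
            "adversary_0": adversary_policy.act(obs["adversary_0"]),
        }
        obs, _, _, _ = env.step(actions)
        # Track how close the adversary gets to the true target over time.
        distances_to_target.append(np.linalg.norm(obs["adversary_0"][:2] - target_pos))
    return distances_to_target
```

A human-like adversary should close this distance quickly after observing the early, revealing movement; an adversary that cannot reason about the good agents' goals should not, and that difference is what the probe is designed to expose.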

I hope to use these ideas as I begin working on card games that require theory-of-mind reasoning, like poker, Hanabi, mahjong, and bridge. All of these are interesting domains that allow for the analysis of complex psychological phenomena, but simply averaging over thousands or even millions of trials may produce average rewards that make different models look more similar than they truly are. If we instead construct specific environment states and opponent or teammate behaviours, we could hopefully get a more fruitful assessment of performance that doesn't miss the forest for the trees.

References

[1] T. Malloy, C. R. Sims, T. Klinger, M. Liu, M. Riemer and G. Tesauro, “Capacity-Limited Decentralized Actor-Critic for Multi-Agent Games,” 2021 IEEE Conference on Games (CoG), 2021, pp. 1-8, doi: 10.1109/CoG52621.2021.9619081.

[2] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6382–6393.