arxiv:2303.17156

MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations

Published on Mar 30, 2023

Abstract

We study a new paradigm for sequential decision making, called offline Policy Learning from Observation (PLfO). Offline PLfO aims to learn policies from datasets of substandard quality: 1) only a subset of trajectories is labeled with rewards, 2) labeled trajectories may not contain actions, 3) labeled trajectories may not be of high quality, and 4) the overall data may not have full coverage. Such imperfections are common in real-world learning scenarios, so offline PLfO encompasses many existing offline learning setups, including offline imitation learning (IL), imitation learning from observations (ILfO), and reinforcement learning (RL). In this work, we present a generic approach for offline PLfO, called Modality-agnostic Adversarial Hypothesis Adaptation for Learning from Observations (MAHALO). Built upon the pessimism concept in offline RL, MAHALO optimizes the policy using a performance lower bound that accounts for uncertainty due to the dataset's insufficient coverage. We implement this idea by adversarially training data-consistent critic and reward functions during policy optimization, which forces the learned policy to be robust to data deficiencies. We show that MAHALO consistently outperforms or matches specialized algorithms across a variety of offline PLfO tasks, both in theory and in experiments.
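
The pessimism idea in the abstract can be illustrated with a toy example. The sketch below is not the MAHALO algorithm: it is a minimal, assumed setup with a one-step problem, a finite list of candidate reward functions, and a finite list of candidate policies. All names (`consistent_hypotheses`, `pessimistic_value`, `pessimistic_policy`) are hypothetical. It only shows how scoring each policy by its worst case over data-consistent reward hypotheses yields a performance lower bound, which steers the choice toward actions the labeled data actually covers.

```python
from typing import Dict, List


def consistent_hypotheses(
    hypotheses: List[List[float]],
    labeled: Dict[int, float],
    tol: float = 1e-6,
) -> List[List[float]]:
    """Keep only reward hypotheses that agree with every labeled reward."""
    return [
        h for h in hypotheses
        if all(abs(h[a] - r) <= tol for a, r in labeled.items())
    ]


def pessimistic_value(policy: List[float], hypotheses: List[List[float]]) -> float:
    """Worst-case expected reward of a policy over the given hypotheses."""
    if not hypotheses:
        raise ValueError("no data-consistent hypothesis available")
    return min(
        sum(p * h[a] for a, p in enumerate(policy)) for h in hypotheses
    )


def pessimistic_policy(
    policies: List[List[float]],
    hypotheses: List[List[float]],
    labeled: Dict[int, float],
) -> List[float]:
    """Pick the policy with the highest worst-case (lower-bound) value."""
    version_space = consistent_hypotheses(hypotheses, labeled)
    return max(policies, key=lambda pi: pessimistic_value(pi, version_space))


# Toy data (assumed for illustration): three actions, reward labeled only for action 0.
labeled = {0: 0.5}
hypotheses = [
    [0.5, 1.0, 0.0],  # consistent with the label
    [0.5, 0.0, 1.0],  # consistent with the label
    [0.9, 0.2, 0.2],  # contradicts the label, so it is discarded
]
policies = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

best = pessimistic_policy(policies, hypotheses, labeled)
```

In this toy case the unlabeled actions could each be worth 1.0 or 0.0 under some consistent hypothesis, so their lower bound is 0.0, while the labeled action has a guaranteed value of 0.5. The abstract describes replacing this enumeration with adversarially trained critic and reward functions; the sketch makes no attempt to reproduce that training procedure.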
