This post was originally published by Facebook Research
Embodied agents operating in human spaces must learn how their environment works: what objects can the agent use, and how can it use them? We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new, unmapped 3D environment (such as an unfamiliar kitchen). Given an egocentric RGB-D camera and a high-level action space, the agent is rewarded for maximizing successful interactions while simultaneously training an image-based affordance segmentation model. The former yields a policy for acting efficiently in new environments to prepare for downstream interaction tasks, while the latter yields a convolutional neural network that maps image regions to the likelihood they permit each action, densifying the rewards for exploration. We demonstrate our idea with AI2-iTHOR. The results show agents can learn to use new home environments intelligently and that doing so prepares them to rapidly address various downstream tasks like “find a knife and put it in the drawer.” Project page: http://vision.cs.utexas.edu/projects/interaction-exploration/
The ability to interact with the environment is an essential skill for embodied agents operating in human spaces. Interaction gives agents the capacity to modify their environment, allowing them to move from semantic navigation tasks (e.g., “go to the kitchen; find the coffee cup”) towards complex tasks involving interactions with their surroundings (e.g., “heat some coffee and bring it to me”).
Today’s embodied agents are typically trained to perform specific interactions in a supervised manner. For example, an agent learns to navigate to specified objects, a dexterous hand learns to solve a Rubik’s cube, a robot learns to manipulate a rope. In these cases and many others, it is known a priori what objects are relevant for the interactions and what the goal of the interaction is, whether expressed through expert demonstrations or a reward crafted to elicit the desired behavior.
Despite exciting results, the resulting agents remain specialized to the target interactions and objects on which they were trained. In contrast, we envision embodied agents that can enter a novel 3D environment, move around to encounter new objects, and autonomously discern the affordance landscape—what are the interactable objects, what actions are relevant to use them, and under what conditions will these interactions succeed? Such an agent could then enter a new kitchen (say), and be primed to address tasks like “wash my coffee cup in the sink.” These capabilities would mimic humans’ ability to efficiently discover the functionality of even unfamiliar objects through a mixture of learned visual priors and exploratory manipulation.
To this end, we introduce the exploration for interaction problem: a mobile agent in a 3D environment must autonomously discover the objects with which it can physically interact, and what actions are valid as interactions with them.
Figure 1: Main idea. We train interaction exploration agents to quickly discover what objects can be used and how to use them. Given a new, unseen environment, our agent can infer its visual affordance landscape and efficiently interact with all the objects present. The resulting exploration policy and affordance model prepare the agent for downstream tasks that involve multiple object interactions.

Exploring for interaction presents a challenging search problem over the product of all objects, actions, agent positions, and action histories. Furthermore, many objects are hidden (e.g., in drawers) and need to be discovered, and their interaction dynamics are not straightforward (e.g., an already open door cannot be opened again; an apple can be sliced only after a knife is picked up). In contrast, exploration for navigating a static environment involves relatively small action spaces and dynamics governed solely by the presence/absence of obstacles [12, 50, 51, 18, 11, 47].
Towards addressing these challenges, we propose a deep reinforcement learning (RL) approach in which the agent discovers the affordance landscape of a new, unmapped 3D environment. The result is a strong prior for where to explore and what interactions to try. Specifically, we consider an agent equipped with an egocentric RGB-D camera and an action space comprising navigation and manipulation actions (turn left, open, toggle, etc.), whose effects are initially unknown to the agent.
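To make the setup concrete, here is a minimal sketch of such an agent's observation/action interface. The specific action names, the `Observation` container, and the random baseline policy are our own illustration, not the paper's exact interface:

```python
import random
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    """Hypothetical high-level action set: navigation plus manipulation."""
    MOVE_FORWARD = auto()
    TURN_LEFT = auto()
    TURN_RIGHT = auto()
    TAKE = auto()
    PUT = auto()
    OPEN = auto()
    CLOSE = auto()
    TOGGLE = auto()
    SLICE = auto()

@dataclass
class Observation:
    """Egocentric RGB-D frame (placeholder list types for illustration)."""
    rgb: list    # H x W x 3 color image
    depth: list  # H x W depth map

def random_policy(obs: Observation) -> Action:
    """Uniform-random baseline; the learned exploration policy replaces this."""
    return random.choice(list(Action))
```

Crucially, the effects of the manipulation actions are unknown at the start: the agent must try them to find out where (and whether) they succeed.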
We reward the agent for quickly interacting with all objects in an environment. In parallel, we train an affordance model online to segment images according to the likelihood of each of the agent’s actions succeeding there, using the partially observed interaction data generated by the exploration policy. The two models work in concert to functionally explore the environment. See Figure 1.

Our experiments with AI2-iTHOR demonstrate the advantages of interaction exploration. Our agents can quickly seek out new objects to interact with in new environments, matching the performance of the best existing exploration method in 42% fewer timesteps and surpassing it to discover 1.33× more interactions when fully trained. Further, we show our agent and affordance model help train multi-step interaction policies (e.g., washing objects at a sink), improving success rates by up to 16% on various tasks, with fewer training samples, despite sparse rewards and no human demonstrations.
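The reward structure can be sketched in a few lines. The snippet below is an illustrative simplification under two assumptions: the agent receives a novelty bonus only for the first successful execution of each (object, action) pair, and the paper's pixel-wise affordance CNN is stood in for by simple per-pair success counts (the real model generalizes across environments from image features, which counts cannot):

```python
from collections import defaultdict

class InteractionExplorer:
    """Toy bookkeeping for interaction exploration (illustration only)."""

    def __init__(self):
        self.discovered = set()  # (object, action) pairs that have succeeded
        # (object, action) -> [successes, attempts]
        self.counts = defaultdict(lambda: [0, 0])

    def reward(self, obj: str, action: str, success: bool) -> float:
        """Novelty bonus for the first successful use of each pair."""
        self.counts[(obj, action)][1] += 1
        if success:
            self.counts[(obj, action)][0] += 1
            if (obj, action) not in self.discovered:
                self.discovered.add((obj, action))
                return 1.0
        return 0.0

    def affordance(self, obj: str, action: str) -> float:
        """Laplace-smoothed estimate that `action` succeeds on `obj`."""
        successes, attempts = self.counts[(obj, action)]
        return (successes + 1) / (attempts + 2)
```

In the paper, the learned affordance predictions play the role of `affordance()` here: because the novelty reward alone is sparse, the model's per-pixel success likelihoods densify the signal and steer the policy toward promising interactions.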
To read the paper, click the link below.